Open bmschmidt opened 8 years ago
I agree that it is better just to trim the range than stop entirely. Can you try the version I just pushed? Note that Orestes doesn't appear until 1828, so this might be a better test:
gender("Orestes",years=c(1920-20,1920+20),method="ipums")
Super, thanks. Edge case note: the behavior is now unclear when both dates are outside the allowed range.
> gender("James",years=c(1930,1930),method="ipums")
Source: local data frame [1 x 6]
name proportion_male proportion_female gender year_min year_max
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 James 0.9902 0.0098 male 1930 1930
> gender("James",years=c(1960,1980),method="ipums")
Source: local data frame [0 x 6]
Variables not shown: name <chr>, proportion_male <dbl>, proportion_female <dbl>, gender <lgl>, year_min
<dbl>, year_max <dbl>.
Warning message:
In gender("James", years = c(1960, 1980), method = "ipums") :
The year range provided has been trimmed to fit within 1789 to 1930.
Hmm. Good point. As it stands, dates which are completely outside the range of the method will be reset to the entire range of the method. But I suppose it is possible that someone could pass nonsensical dates and get nonsensical answers. I should just report what the dates given were and what the dates actually used are. For that matter, this whole thing should be refactored.
On Sat, Sep 10, 2016 at 3:58 PM, Benjamin Schmidt notifications@github.com wrote:
Super, thanks. Edge case note: the behavior is now unclear when both dates are outside the allowed range.
Lincoln Mullen, Assistant Professor, Department of History & Art History, George Mason University
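To make the "report what the dates given were and what the dates actually used are" idea concrete, here is a minimal sketch in R. The helper name `report_years` and the message wording are hypothetical and are not part of the gender package; the sketch only shows the shape such a warning could take.

```r
# Hypothetical helper (not part of the gender package): warn with both the
# years the caller supplied and the years that were actually used.
report_years <- function(years_given, years_used) {
  if (any(years_given != years_used)) {
    warning(
      paste0(
        "Years given: ", years_given[1], " to ", years_given[2],
        "; years used: ", years_used[1], " to ", years_used[2], "."
      ),
      call. = FALSE
    )
  }
  # Return the range that was used so the caller can carry on with it.
  invisible(years_used)
}
```

Called as `report_years(c(1960, 1980), c(1789, 1930))`, this would warn with both ranges, so a nonsensical input at least produces a message that shows what happened to it.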
Based on the output of gender("James", years = c(1960, 1980), method = "ipums"), I think it's currently being trimmed to years=c(1960,1910), which sails through because it runs after the check whether years is ordered. Guaranteed to return nothing, which isn't the worst possible option.
Yeah, I wasn't thinking clearly about how the range was set for odd inputs. The whole code for setting ranges should be refactored. Will fix.
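One way a refactored range check might treat a request that lies wholly outside the data is to test for overlap before clipping, so an inverted range is never produced. The sketch below is only an illustration: `overlap_years` is a hypothetical name, and the default bounds of 1789 and 1930 are taken from the warning message quoted in this thread rather than from the package source.

```r
# Hypothetical helper (not the gender package's code): return the part of
# `years` that overlaps the allowed range, or NULL when there is no overlap.
overlap_years <- function(years, range_min = 1789, range_max = 1930) {
  stopifnot(length(years) == 2, years[1] <= years[2])
  # Check for "entirely outside" first, so clipping can never invert the range.
  if (years[2] < range_min || years[1] > range_max) {
    warning(
      "The year range provided lies entirely outside ",
      range_min, " to ", range_max, ".",
      call. = FALSE
    )
    return(NULL)
  }
  c(max(years[1], range_min), min(years[2], range_max))
}
```

With this ordering, a request such as `c(1960, 1980)` yields `NULL` and an explicit warning rather than a silently inverted range, while a partially overlapping request is simply clipped.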
If I have someone named "Orestes" in 1831, I can't match it in the IPUMS sample
No problem, right? Just broaden the net when you have a rare name
Super. But if I want to do a batch test on many names, I'd like to be able to just set the years for each of them at c(year-30,year+30). But this is going to raise loads of errors for anyone near the edge of the range.
Of course I can muck up my code with a lot of maxes and mins for each of the datasets I'm using. But why not just clip c(1788,1818) to c(1789, 1818) and write a warning instead of raising an error?
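The clipping proposed here could look roughly like the following R sketch. The name `clip_years` is hypothetical, and the default bounds (1789 and 1930) are assumed from the IPUMS range mentioned in the warning elsewhere in this thread; none of this is the package's actual implementation.

```r
# Hypothetical helper (not part of the gender package): clip a two-element
# year range to the allowed range and warn instead of raising an error.
clip_years <- function(years, range_min = 1789, range_max = 1930) {
  stopifnot(length(years) == 2, years[1] <= years[2])
  clipped <- c(max(years[1], range_min), min(years[2], range_max))
  if (any(clipped != years)) {
    warning(
      "The year range provided has been trimmed to fit within ",
      range_min, " to ", range_max, ".",
      call. = FALSE
    )
  }
  clipped
}
```

For the batch case, each call could then pass something like `years = clip_years(c(year - 30, year + 30))` to `gender()`, which keeps the per-dataset maxes and mins in one place. Note that this simple version does not handle a range entirely outside the bounds; that case needs a separate check.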