lmullen / gender

Predict Gender from Names Using Historical Data

Be more forgiving of exterior ranges? #42

Open bmschmidt opened 8 years ago

bmschmidt commented 8 years ago

If I have someone named "Orestes" in 1831, I can't match it in the IPUMS sample

> gender("Orestes",years=c(1831),method="ipums")
Source: local data frame [0 x 6]

Variables not shown: name <chr>, proportion_male <dbl>, proportion_female <dbl>, gender <lgl>, year_min

No problem, right? Just broaden the net when you have a rare name

> gender("Orestes",years=c(1821,1841),method="ipums")
Source: local data frame [1 x 6]

     name proportion_male proportion_female gender year_min year_max
    <chr>           <dbl>             <dbl>  <chr>    <dbl>    <dbl>
1 Orestes               1                 0   male     1821     1841

Super. But if I want to do a batch test on many names, I'd like to be able to just set the years for each of them at c(year-30,year+30). But this is going to raise loads of errors for anyone near the edge of the range.

> gender("Orestes",years=c(1803-15,1803+15),method="ipums")
Error in gender("Orestes", years = c(1803 - 15, 1803 + 15), method = "ipums") : 
  Please provide a year range between 1789 and 1930.

Of course I can muck up my code with a lot of maxes and mins for each of the datasets I'm using. But why not just clip c(1788,1818) to c(1789,1818) and issue a warning instead of raising an error?
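For illustration, the clip-and-warn behavior proposed here could be sketched roughly as follows. `clip_years` is a hypothetical helper, not a function in the gender package, and the 1789 and 1930 bounds are simply the ones quoted in the error message above. Stopping when the window misses the range entirely is one possible policy among several, chosen here only to keep the sketch unambiguous.

```r
# Hypothetical helper, not part of the gender package.
# Assumes the ipums bounds quoted in the error message (1789 to 1930).
clip_years <- function(years, lower = 1789, upper = 1930) {
  stopifnot(is.numeric(years), length(years) == 2, years[1] <= years[2])
  # A window with no overlap cannot be clipped meaningfully, so fail loudly.
  if (years[2] < lower || years[1] > upper) {
    stop(sprintf("Year range %.0f-%.0f does not overlap %.0f-%.0f.",
                 years[1], years[2], lower, upper))
  }
  clipped <- c(max(years[1], lower), min(years[2], upper))
  # Warn only when clipping actually changed something.
  if (any(clipped != years)) {
    warning(sprintf("Year range trimmed from %.0f-%.0f to %.0f-%.0f.",
                    years[1], years[2], clipped[1], clipped[2]))
  }
  clipped
}

# Batch use: build a +/-15 year window around each year, then clip it.
birth_years <- c(1803, 1850, 1920)
windows <- lapply(birth_years, function(y) {
  suppressWarnings(clip_years(c(y - 15, y + 15)))
})
```

Each element of `windows` could then be passed as the `years` argument of `gender()` without any per-dataset `max`/`min` bookkeeping at the call site.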

lmullen commented 8 years ago

I agree that it is better just to trim the range than stop entirely. Can you try the version I just pushed? Note that Orestes doesn't appear until 1828, so this might be a better test:

gender("Orestes",years=c(1920-20,1920+20),method="ipums")

bmschmidt commented 8 years ago

Super, thanks. Edge case note: the behavior is now unclear when both dates are outside the allowed range.

> gender("James",years=c(1930,1930),method="ipums")
Source: local data frame [1 x 6]

   name proportion_male proportion_female gender year_min year_max
  <chr>           <dbl>             <dbl>  <chr>    <dbl>    <dbl>
1 James          0.9902            0.0098   male     1930     1930
> gender("James",years=c(1960,1980),method="ipums")
Source: local data frame [0 x 6]

Variables not shown: name <chr>, proportion_male <dbl>, proportion_female <dbl>, gender <lgl>, year_min
  <dbl>, year_max <dbl>.
Warning message:
In gender("James", years = c(1960, 1980), method = "ipums") :
  The year range provided has been trimmed to fit within 1789 to 1930.

lmullen commented 8 years ago

Hmm. Good point. As it stands, dates that are completely outside the range of the method will be reset to the entire range of the method. But I suppose it is possible that someone could pass nonsensical dates and get nonsensical answers. I should just report which dates were given and which dates were actually used. For that matter, this whole thing should be refactored.

On Sat, Sep 10, 2016 at 3:58 PM, Benjamin Schmidt notifications@github.com wrote:

Super, thanks. Edge case note: the behavior is now unclear when both dates are outside the allowed range.


Lincoln Mullen Assistant Professor, Department of History & Art History George Mason University

bmschmidt commented 8 years ago

Based on the output of gender("James", years = c(1960, 1980), method = "ipums"), I think it's currently being trimmed to years=c(1960,1930), which sails through because the trimming runs after the check that years is ordered. That's guaranteed to return nothing, which isn't the worst possible option.
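The ordering problem described here can be reproduced with a small stand-in. This is only a sketch of the suspected behavior, not the package's actual code: `naive_trim` and `toy_years` are made up for illustration, and the bounds are the 1789 and 1930 quoted in the warning message.

```r
# Illustrative stand-in for the suspected range handling; not the
# gender package's real implementation.
naive_trim <- function(years, lower = 1789, upper = 1930) {
  stopifnot(years[1] <= years[2])    # the ordering check runs first
  years[1] <- max(years[1], lower)   # then each bound is clamped separately;
  years[2] <- min(years[2], upper)   # nothing re-checks the order afterwards
  years
}

# The lower bound stays at 1960 while the upper bound drops to 1930.
trimmed <- naive_trim(c(1960, 1980))

# Toy data standing in for the years covered by the dataset: filtering
# on an inverted range keeps no rows.
toy_years <- 1789:1930
matched <- toy_years[toy_years >= trimmed[1] & toy_years <= trimmed[2]]
```

Because the inverted range never satisfies both comparisons at once, `matched` is empty, which would be consistent with the zero-row result shown earlier in the thread. Re-running the ordering check after trimming would be one way to surface the problem.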

lmullen commented 8 years ago

Yeah, I wasn't thinking clearly about how the range was set for odd inputs. The whole code for setting ranges should be refactored. Will fix.