Open aalexandersson opened 2 years ago
OK @aalexandersson, I understand the situation.
I will add this feature as soon as possible; it will be ready in the next version of the package.
Thank you very much for collaborating :)
Thank you :-) Sex/gender classification is difficult. For more background, this Julia Discourse topic discusses the two Julia packages: https://discourse.julialang.org/t/rfc-genderinference-jl/22294
I am aware of these two active areas of research, and I will try to stay on top of them: 1) The North American Association of Central Cancer Registries (NAACCR) has a Sex/Gender Classification Workgroup, which is putting together a proposal for a standardized data set: https://narrative.naaccr.org/wp-content/uploads/2022/01/Winter-2022-for-PDF.pdf
2) The University of Washington (UW), together with the U.S. Census Bureau, is putting together standardized fake datasets for record linkage: https://www.census.gov/newsroom/blogs/research-matters/2021/10/four-cooperative-agreements.html
The changes have been added to Faker 0.3.5. Thank you very much for your collaboration.
Describe the bug
First names are not gender-specific, so generated profiles are often unrealistic. This is a problem when combining name and sex in Faker.profile(). For example, Barbara is typically a female name, whereas Jonathan is typically a male name. A Stack Overflow post suggested this solution for Python Faker:
fake.first_name_male() if gender=="M" else fake.first_name_female()
But I prefer Julia.

To Reproduce
Steps to reproduce the behavior:

Expected behavior
I expected this:

Screenshots
Not applicable because not all profiles are problematic.

Environment

Additional context
The U.S. Social Security Administration (SSA) provides national, state-specific, and territory-specific baby-name data that could perhaps be used: https://www.ssa.gov/oact/babynames/limits.html
Personally, I need realistic fake datasets for testing record linkage in my work at the Florida cancer registry. The Faker output is only one record (observation), not a file (dataset). Is there an easy way to generate several profiles and save them as a dataset? In my case, I need two datasets: say, one with 100,000 records and the other with 1 million records. If I could create and read a dataset with, for example, just three records, then it should be trivial to repeat the procedure for varying numbers of observations and datasets.
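To make the request above concrete, here is a minimal sketch of the idea in plain Python (standard library only, not the Faker.jl API): draw a sex first, pick the first name from a pool that matches it, and write n such records to a CSV file. The name pools, the column names, and the helpers `make_profile`, `write_profiles`, and `read_profiles` are all hypothetical and exist only for illustration.

```python
import csv
import random

# Hypothetical, tiny name pools for illustration only; a real run would draw
# from a much larger source such as the SSA data mentioned above.
FIRST_NAMES = {
    "F": ["Barbara", "Maria", "Linda"],
    "M": ["Jonathan", "James", "Robert"],
}
LAST_NAMES = ["Smith", "Garcia", "Johnson"]
FIELDS = ["sex", "first_name", "last_name"]


def make_profile(rng):
    """Return one fake profile whose first name matches its sex."""
    sex = rng.choice(["F", "M"])
    return {
        "sex": sex,
        "first_name": rng.choice(FIRST_NAMES[sex]),
        "last_name": rng.choice(LAST_NAMES),
    }


def write_profiles(path, n, seed=None):
    """Write n fake profiles to a CSV file, one record per row."""
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for _ in range(n):
            writer.writerow(make_profile(rng))


def read_profiles(path):
    """Read the CSV file back as a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Under these assumptions, `write_profiles("small.csv", 3, seed=1)` would produce a three-record dataset, and the same call with `n=100_000` or `n=1_000_000` would produce the larger ones.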
Edit 1: The SSA data requires lots of merging. It would be good enough for me to have just one approximate dataset such as "name_gender.csv" from data.world. The dataset has 95,025 rows and 3 columns: "name", "gender", and "probability". According to the dataset, the example names Barbara and Jonathan have probabilities of 1 and 0.9957, respectively. The dataset can be accessed here: https://data.world/howarder/gender-by-name
There is a Julia package which also might help: NameToGender.jl
Edit 2: There is also another Julia package which seems to be even more useful here: GenderInference.jl
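As a rough illustration of how a file with the "name", "gender", and "probability" columns described in Edit 1 could be used, the sketch below builds a name-to-gender lookup in plain Python and keeps only rows at or above a probability threshold. The `load_name_gender` helper, the 0.9 default threshold, and the "F"/"M" gender codes are assumptions; only the column names and the two probabilities come from the description above.

```python
import csv
import io


def load_name_gender(lines, min_probability=0.9):
    """Map each name to its gender, keeping rows at or above the threshold.

    If a name appears more than once, the last qualifying row wins.
    """
    lookup = {}
    for row in csv.DictReader(lines):
        if float(row["probability"]) >= min_probability:
            lookup[row["name"]] = row["gender"]
    return lookup


# Two-row stand-in for the real file; the "F"/"M" codes are an assumption.
SAMPLE = io.StringIO(
    "name,gender,probability\n"
    "Barbara,F,1\n"
    "Jonathan,M,0.9957\n"
)

lookup = load_name_gender(SAMPLE)
print(lookup["Barbara"], lookup["Jonathan"])  # prints: F M
```

With the real file, the same function could be passed an open file handle instead of the in-memory sample, and the resulting lookup could be inverted into per-gender name pools for generating profiles.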