neomatrixcode / Faker.jl

Generator of fake data for Julia
https://faker.vercel.app/
MIT License

First names are not gender-specific #30

Open aalexandersson opened 2 years ago

aalexandersson commented 2 years ago

Describe the bug
First names are not gender-specific and are therefore often unrealistic. This is a problem when combining name and sex in Faker.profile(). For example, Barbara is typically a female name, whereas Jonathan is typically a male name. A Stack Overflow post suggested this solution for Python Faker:

fake.first_name_male() if gender=="M" else fake.first_name_female()

But I prefer Julia.
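For illustration only, the same conditional idea can be sketched in plain Julia without relying on any Faker.jl function. The name pools and the function name `consistent_profile` are assumptions made up for this sketch; only the names Barbara and Jonathan come from this report.

```julia
using Random

# Hypothetical name pools (assumption): a real generator would hold many names.
const FEMALE_NAMES = ["Barbara"]
const MALE_NAMES = ["Jonathan"]

# Draw the sex first, then pick the first name from the matching pool,
# so the two fields can never contradict each other.
function consistent_profile(rng::AbstractRNG = Random.default_rng())
    sex = rand(rng, ["F", "M"])
    name = sex == "F" ? rand(rng, FEMALE_NAMES) : rand(rng, MALE_NAMES)
    return Dict("name" => name, "sex" => sex)
end

profile = consistent_profile()
```

Drawing the sex before the name is one possible design; sampling the name first and deriving the sex from a lookup table would work equally well.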

To Reproduce
Steps to reproduce the behavior:

julia> using Faker
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "M"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "F"

Expected behavior
I expected this:

julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "F"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "M"

Screenshots
Not applicable because not all profiles are problematic.

Environment

Additional context
The U.S. Social Security Administration (SSA) provides national, state-specific, and territory-specific baby-name data that could perhaps be used: https://www.ssa.gov/oact/babynames/limits.html

Personally, I need realistic fake datasets for testing record linkage in my work at the Florida cancer registry. Faker returns only one record (observation) at a time, and not as a file (dataset). Is it easy to generate several profile observations and save them as a dataset? In my case, I need two datasets: say, one with 100,000 records and the other with 1 million records. If I could create and read a dataset with, for example, just three records, then it should be trivial to repeat the procedure for varying numbers of observations and datasets.
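As a rough sketch of that three-record workflow, the snippet below writes records to a CSV file and reads them back using only base Julia. The helpers `write_dataset` and `read_dataset` and the file name are hypothetical, and the record generator passed in is a fixed placeholder; in practice it could wrap a call such as Faker.profile("name", "sex").

```julia
# Write n records to a CSV file. `make_record` is any zero-argument function
# returning a (name, sex) pair; it is a placeholder for a real generator.
function write_dataset(make_record, path::AbstractString, n::Integer)
    open(path, "w") do io
        println(io, "name,sex")              # header row
        for _ in 1:n
            name, sex = make_record()
            println(io, name, ",", sex)      # assumes fields contain no commas
        end
    end
    return path
end

# Read the file back as a vector of [name, sex] rows, skipping the header.
function read_dataset(path::AbstractString)
    lines = readlines(path)
    return [split(line, ",") for line in lines[2:end]]
end

path = joinpath(mktempdir(), "profiles.csv")  # hypothetical file name
write_dataset(() -> ("Barbara", "F"), path, 3)
records = read_dataset(path)
```

Changing the last argument of `write_dataset` from 3 to 100_000 or 1_000_000 would produce the larger datasets; for files of that size a dedicated CSV package may be a better fit than this minimal version.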

Edit 1: The SSA data requires a lot of merging. It would be good enough for me to have just one approximate dataset such as "name_gender.csv" from data.world. The dataset has 95,025 rows and 3 columns: "name", "gender", and "probability". According to the dataset, the example names Barbara and Jonathan have probabilities 1 and 0.9957, respectively. The dataset can be accessed here: https://data.world/howarder/gender-by-name
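One way such a three-column table might be used is to keep only names whose probability clears a threshold for the requested gender. The sketch below uses a tiny in-memory table rather than the real file; the helper `names_for`, the 0.9 threshold, and the third row are assumptions invented for illustration, while the Barbara and Jonathan rows use the values quoted above.

```julia
# Rows mirror the columns "name", "gender", and "probability".
const NAME_GENDER = [
    (name = "Barbara", gender = "F", probability = 1.0),
    (name = "Jonathan", gender = "M", probability = 0.9957),
    # Invented row (not from the dataset) to show a name being filtered out.
    (name = "AmbiguousName", gender = "M", probability = 0.55),
]

# Return the names listed for `gender` with at least `min_probability`.
function names_for(gender::AbstractString; min_probability::Real = 0.9)
    return [row.name for row in NAME_GENDER
            if row.gender == gender && row.probability >= min_probability]
end

female_names = names_for("F")
male_names = names_for("M")
```

The resulting pools could then feed a generator that picks a name only from the pool matching the sampled sex.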

There is a Julia package which also might help: NameToGender.jl

Edit 2: There is also another Julia package which seems to be even more useful here: GenderInference.jl

neomatrixcode commented 2 years ago

OK @aalexandersson, I understand the situation.

I will add this feature as soon as possible; it will be ready in the next version of the package.

Thank you very much for collaborating :)

aalexandersson commented 2 years ago

Thank you :-) Sex/gender classification is difficult. For more background, this Julia Discourse topic discusses the two Julia packages: https://discourse.julialang.org/t/rfc-genderinference-jl/22294

I am aware of two active areas of research and will try to stay on top of them:

1) The North American Association of Central Cancer Registries (NAACCR) has a Sex/Gender Classification Workgroup, which is putting together a proposal for a standardized data set: https://narrative.naaccr.org/wp-content/uploads/2022/01/Winter-2022-for-PDF.pdf

2) The University of Washington (UW), together with the U.S. Census Bureau, is putting together standardized fake datasets for record linkage: https://www.census.gov/newsroom/blogs/research-matters/2021/10/four-cooperative-agreements.html

neomatrixcode commented 2 years ago

https://github.com/JuliaRegistries/General/pull/55751

neomatrixcode commented 2 years ago

The changes have been added in Faker 0.3.5. Thank you very much for your collaboration.