Closed davidam closed 2 years ago
@davidam, there are still some issues here.
Is it right to call this "Names Corpus Version 2.0"? What happens if someone makes an extension to the existing "Names Corpus Version 1.3" that is more in keeping with its simple structure? Perhaps we need a better name and independent versioning.
I think that the README should document the data format, or at least point to available documentation. I looked at https://github.com/davidam/damegender but didn't see anything obvious.
I'm confused about the files male.txt and female.txt, which have the same names as files in the existing Names corpus but different contents. Were they meant to be included? Can you please document their relationship to interall.csv?
Note that the zip file still unpacks into a folder "names", which conflicts with the existing corpus, and that it contains an old readme file, README~.
I trust your point of view. Can you send a patch to my branch? I could then make the pull request to nltk_data, so that we avoid errors in interpreting your comments.
The third point is about how I built the names files. You can see this in the source, but yes, I can document it in the README.
I have made a new pull request. Feel free to accept it or to add new changes to my branch as you see fit.
Best wishes!
Hi @stevenbird,
Can you give some feedback?
Thanks in advance!
A further comment: I have reached an accuracy of 87.56% with this dataset, using the scientific dataset of 7000 names by Lucía Santamaría and Helena Mihaljevic as ground truth.
I am working on improving these results by including non-Latin alphabets. But remember that releasing fast is a good Open Source philosophy.
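For readers unfamiliar with how an accuracy figure like the one above is obtained, here is a minimal sketch. It assumes a list of predicted labels and a matching list of ground-truth labels. The function name `accuracy` and the sample labels are hypothetical and are not taken from damegender or from the 7000-name dataset.

```python
def accuracy(predicted, truth):
    """Return the fraction of items whose predicted label equals the true label."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Hypothetical labels for four names; not real data.
truth = ["f", "m", "f", "m"]
predicted = ["f", "m", "m", "m"]
print(f"{accuracy(predicted, truth):.2%}")  # 75.00%
```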
@davidam: The PR has all of the same issues that I identified previously. I recommend that you distribute this data from your own site, and that you consider working with the authors of the existing corpus to produce an updated version of the names corpus.
NB Today is coincidentally non-binary people’s day: https://en.m.wikipedia.org/wiki/International_Non-Binary_People's_Day
Perhaps a moment to reflect on the validity of this classification task.
Ok, thanks for the feedback
My contribution is my dataset https://raw.githubusercontent.com/davidam/damegender/master/src/damegender/files/names/names_inter/interall.csv
Perhaps you will be happier with this classification if you think of male and female as votes from which you derive percentages. You can discover non-binary trends in this way.
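A minimal sketch of the "votes" idea, assuming each name comes with a count of male and female records. The counts, the names, and the function `gender_shares` below are hypothetical; the actual column layout of interall.csv is not described in this thread.

```python
def gender_shares(male_count, female_count):
    """Turn raw male/female counts for one name into proportions."""
    total = male_count + female_count
    if total == 0:
        raise ValueError("no records for this name")
    return male_count / total, female_count / total


# Hypothetical counts per name: (male records, female records).
counts = {"Alex": (600, 400), "Maria": (10, 990)}

for name, (males, females) in counts.items():
    male_share, female_share = gender_shares(males, females)
    print(f"{name}: {male_share:.1%} male, {female_share:.1%} female")
```

Under this reading, a name whose shares are close to 50/50 is simply reported as mixed rather than being forced into one class.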
Regards.
On Thu, 14 Jul 2022 at 4:15, Steven Bird wrote:
NB Today is coincidentally non-binary people’s day: https://en.m.wikipedia.org/wiki/International_Non-Binary_People's_Day
Perhaps a moment to reflect on the validity of this classification task.
-- David Arroyo Menéndez http://www.davidam.com
New corpora package; see: https://github.com/nltk/nltk_data/pull/157#issuecomment-1001092230