nltk / nltk_data

NLTK Data

packages/corpora/names2.zip, packages/corpora/names2.xml: creation #181

Closed davidam closed 2 years ago

davidam commented 2 years ago

New corpora package; see https://github.com/nltk/nltk_data/pull/157#issuecomment-1001092230

stevenbird commented 2 years ago

@davidam, there are still some issues here.

  1. Is it right to call this "Names Corpus Version 2.0"? What happens if someone makes an extension to the existing "Names Corpus Version 1.3" which is more in keeping with its simple structure? Perhaps we need a better name, and independent versioning.

  2. I think that the README should document the data format, or at least point to available documentation. I looked at https://github.com/davidam/damegender but didn't see anything obvious.

  3. I'm confused about the files male.txt and female.txt, which have the same name but different contents to files in the existing Names corpus. Were they meant to be included? Can you please document their relationship to interall.csv?

  4. Note that the zip file still unpacks into a folder "names", conflicting with the existing corpus, and there is an old readme file, README~.
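The packaging problems in point 4 can be checked mechanically before a pull request is opened. The following is only a minimal sketch using Python's standard `zipfile` module; the function name `check_package_zip` and the expected folder name `names2` are assumptions made for illustration and are not part of the nltk_data tooling.

```python
import zipfile


def check_package_zip(source, expected_root):
    """Return a list of layout problems found in a corpus zip archive.

    `source` is a path or a binary file-like object. `expected_root` is the
    single top-level folder the archive should unpack into (hypothetical
    value here: "names2").
    """
    problems = []
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            # Archive member names always use "/" as the separator.
            root = name.split("/", 1)[0]
            if root != expected_root:
                problems.append(f"unexpected top-level entry: {name}")
            # Editor backup files such as README~ end with a tilde.
            if name.endswith("~"):
                problems.append(f"editor backup file: {name}")
    return problems
```

An empty list means that every entry sits under the expected folder and no backup files were found; it says nothing about whether the corpus contents themselves are correct.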

davidam commented 2 years ago

I trust your point of view. Could you send a patch to my branch? I could then make the pull request to nltk_data, so that we avoid errors in interpreting your comments.

The third point is about how I built the names. You can see the source, but yes, I can document this in the README.

davidam commented 2 years ago

I have made a new pull request. Feel free to accept it or to add new changes to my branch as you see fit.

Best wishes!

davidam commented 2 years ago

Hi @stevenbird,

Can you give some feedback?

Thanks in advance!

davidam commented 2 years ago

I have reached an accuracy of 87.56% with this dataset, using the scientific dataset of 7000 names by Lucía Santamaría and Helena Mihaljevic as ground truth.

I am working on improving these results by including non-Latin alphabets. But remember that releasing fast is a good Open Source philosophy.

stevenbird commented 2 years ago

@davidam: The PR has all of the same issues that I identified previously. I recommend that you distribute this data from your own site, and that you consider working with the authors of the existing corpus to produce an updated version of the names corpus.

stevenbird commented 2 years ago

NB Today is coincidentally non-binary people’s day: https://en.m.wikipedia.org/wiki/International_Non-Binary_People's_Day

Perhaps a moment to reflect on the validity of this classification task.

davidam commented 2 years ago

Ok, thanks for the feedback

davidam commented 1 year ago

My contribution is my dataset https://raw.githubusercontent.com/davidam/damegender/master/src/damegender/files/names/names_inter/interall.csv

Perhaps you will be happier with this classification if you think of male and female as votes from which you obtain percentages. In this way you can discover current non-binary trends.

Regards.

-- David Arroyo Menéndez http://www.davidam.com
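
The idea of treating male and female as votes that yield percentages could be sketched as below. This is only an illustration under stated assumptions: the function name `gender_shares` and the example counts are hypothetical, and the sketch does not assume anything about the actual column layout of interall.csv.

```python
def gender_shares(male_count, female_count):
    """Turn two vote counts for one name into percentage shares.

    The counts are hypothetical inputs (for example, how many people
    registered as male and as female bear the name). Returns None when
    there are no votes at all.
    """
    total = male_count + female_count
    if total == 0:
        return None
    return {
        "male": 100.0 * male_count / total,
        "female": 100.0 * female_count / total,
    }
```

For example, `gender_shares(30, 10)` gives 75% male and 25% female, so a name is described by a split rather than by a single label.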