nltk / nltk_data

NLTK Data

names.zip, names.xml: adding more names #157

Closed davidam closed 2 years ago

davidam commented 3 years ago

I've downloaded many open-data lists of names from official statistics offices (https://github.com/davidam/damegender/tree/master/src/damegender/files/names) and merged them in the same format as the names corpus currently released with NLTK.

stevenbird commented 2 years ago

@davidam sorry for the delay. Thanks for your work, which looks really comprehensive. There are some things to fix before proceeding please.

First, I'd like to clarify the relationship between this dataset and the original. Is it a superset of the original? Is there any systematic description of how you obtained the data?

Second, the version numbering is confusing. You're proposing this as 2.0 in the XML file, but the readme says otherwise.

I note that there's a backup of the readme file in the zipfile, which should be removed.

Thanks for any more information.

davidam commented 2 years ago

Hello,

Thanks for your interest in the issue. I can give you a new version with improvements in a few weeks. I've continued working on an open dataset of names, gender and frequency, and I can send you a new pull request.

Best wishes.

davidam commented 2 years ago

> @davidam sorry for the delay. Thanks for your work, which looks really comprehensive. There are some things to fix before proceeding please.

Yes, I have a good dataset with names, gender and frequency. The NLTK dataset consists of one file of male names and one file of female names. You can decide whether to keep that layout or update it along these lines. If you keep one file for males and one for females, we must decide on what basis a name is classified as male or female. For example, David is classified as male because 99.7% of the people with that name are recorded as male in the merged (international) dataset; Isa could be undefined because 61.1% are recorded as male and the rest as female; Tracy is classified as female because 81.1% are recorded as female. In summary, we can define ranges for male, female and undefined if we want to create a males file and a females file for NLTK from the datasets retrieved from official statistical institutions.
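As a rough illustration of the range idea above, here is a minimal sketch that labels a name from the share of its bearers recorded as male. The function name `classify` and the 0.8 cut-off are assumptions made for this example; the thread does not fix a threshold, and 0.8 was chosen only because it reproduces the three cases mentioned (David, Isa, Tracy).

```python
from typing import Literal

Label = Literal["male", "female", "undefined"]


def classify(male_share: float, threshold: float = 0.8) -> Label:
    """Label a name from the share of its bearers recorded as male.

    `threshold` is a hypothetical cut-off: a name is assigned to a gender
    only if at least that share of its bearers is recorded with it.
    """
    if not 0.0 <= male_share <= 1.0:
        raise ValueError("male_share must be between 0 and 1")
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.5, 1]")
    if male_share >= threshold:
        return "male"
    if 1.0 - male_share >= threshold:
        return "female"
    return "undefined"


# Male shares taken from the percentages quoted in the comment above
# (Tracy: 81.1% female, hence 18.9% male).
shares = {"David": 0.997, "Isa": 0.611, "Tracy": 0.189}
labels = {name: classify(share) for name, share in shares.items()}
```

A stricter or looser threshold simply widens or narrows the "undefined" band, so the cut-off is a policy decision rather than a property of the data.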

> First, I'd like to clarify the relationship between this dataset and the original. Is it a superset of the original?

Yes, it's a superset.

> Is there any systematic description of how you obtained the data?

Yes. At https://github.com/davidam/damegender/tree/master/src/damegender/files/names you can browse the subfolders, which contain README files where I explain how I retrieved the data.

> Second, the version numbering is confusing. You're proposing this as 2.0 in the XML file, but the readme says otherwise.
>
> I note that there's a backup of the readme file in the zipfile, which should be removed.
>
> Thanks for any more information.

Ok.

tomaarsen commented 2 years ago

Hello @davidam !

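Some more comments:

  • The (newer) README states that the data is alphabetical, but it is only alphabetical for the first 897 male names. After that, it says GESAMTZAHL PERSONEN (MÄNNER) and continues with names in some non-alphabetical ordering.
  • The same goes for women. There seem to be about 923 alphabetical names, and then it says GESAMTZAHL PERSONEN (FRAUEN) and continues with a more arbitrary order.

Beyond that, all names have been normalized to upper case, while I believe this wasn't the case previously. Is there a reason for this change?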

I suspect that we won't be able to merge this until some of these comments (also by @stevenbird above) are resolved.

fcbond commented 2 years ago

Hi,

If you have data on the percentages, it might be nice to introduce a new file, 'firstnames' or some such, with the ratios.

Something like:

Name   Male  Female
David  .997  .003
Isa    .611  .389
Tracy  .089  .911

That way if people are interested in the distribution then they can use it.
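To make the suggestion concrete, the following is a minimal sketch of how such a ratio file could be read. The whitespace-separated layout with a single header row is an assumption based on the example table above; the function name `read_ratios` is hypothetical and is not part of NLTK.

```python
import io

# Sample content mirroring the illustrative table above (assumed file layout).
SAMPLE = """\
Name Male Female
David .997 .003
Isa .611 .389
Tracy .089 .911
"""


def read_ratios(lines):
    """Parse 'Name Male Female' rows into {name: (male_ratio, female_ratio)}."""
    ratios = {}
    rows = iter(lines)
    next(rows, None)  # skip the header row
    for line in rows:
        if not line.strip():
            continue  # ignore blank lines
        # Split from the right so that multi-word names stay intact.
        name, male, female = line.rsplit(maxsplit=2)
        ratios[name] = (float(male), float(female))
    return ratios


ratios = read_ratios(io.StringIO(SAMPLE))
```

Keeping both ratios, rather than a single label, leaves the choice of cut-off to whoever uses the data.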

Yours

On Thu, Nov 18, 2021 at 9:26 PM Tom Aarsen @.***> wrote:

Hello!

Some more comments:

  • The (newer) README states that the data is alphabetical, but it is only alphabetical for the first 897 male names. After that, it says GESAMTZAHL PERSONEN (MÄNNER) and continues with names in some non-alphabetical ordering.
  • The same goes for women. There seem to be about 923 alphabetical names, and then it says GESAMTZAHL PERSONEN (FRAUEN) and continues with a more arbitrary order.
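The ordering problem described in these two points can be checked mechanically. Below is a minimal sketch that reports the first position at which a name list stops being sorted. The helper name `first_unsorted_index` and the sample names are made up for illustration (only the GESAMTZAHL marker line comes from the comment), and the comparison uses plain case-folded string order, not locale-aware collation.

```python
def first_unsorted_index(names):
    """Return the index of the first entry that sorts before its predecessor.

    Returns None if the whole list is in non-decreasing case-folded order.
    """
    for i in range(1, len(names)):
        if names[i].casefold() < names[i - 1].casefold():
            return i
    return None


# Hypothetical excerpt: sorted names, then a stray header line and more names.
sample = [
    "AARON",
    "ADAM",
    "ZOE",
    "GESAMTZAHL PERSONEN (MÄNNER)",
    "MARIA",
    "ANNA",
]
break_at = first_unsorted_index(sample)
```

Running such a check on each names file before release would flag both the stray header line and the unsorted tail.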

I suspect that we won't be able to merge this until some of these comments (also by @stevenbird https://github.com/stevenbird above) are resolved.


-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

davidam commented 2 years ago

Today, I've generated a new pull request taking into account the comments.

Thanks in advance!

stevenbird commented 2 years ago

@davidam – thanks for all your work... this is a significant advance on the existing corpus. I wonder about leaving the old one in place and deprecating it (via its corpus reader), and providing a new reader for the new corpus, with more functionality?
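One way to picture this proposal is sketched below in plain Python. Both classes are hypothetical stand-ins, not NLTK corpus readers: `LegacyNames` keeps working but emits a `DeprecationWarning`, while `RichNames` exposes the extra per-name ratio information. How NLTK would actually wire this into its corpus readers is not specified in the thread.

```python
import warnings


class LegacyNames:
    """Hypothetical stand-in for the old reader: still usable, but warns."""

    def __init__(self, words):
        self._words = list(words)

    def words(self):
        warnings.warn(
            "this corpus is deprecated; consider the newer names corpus",
            DeprecationWarning,
            stacklevel=2,
        )
        return list(self._words)


class RichNames:
    """Hypothetical new reader offering ratios in addition to plain words."""

    def __init__(self, male_ratios):
        self._male_ratios = dict(male_ratios)

    def words(self):
        return sorted(self._male_ratios)

    def male_ratio(self, name):
        return self._male_ratios[name]


# Capture the warning so the demonstration works under any warning filter.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_words = LegacyNames(["David", "Tracy"]).words()

new_reader = RichNames({"David": 0.997, "Isa": 0.611})
```

Existing code that calls `words()` keeps working under this arrangement, which is the main attraction of deprecating rather than replacing the old corpus.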

davidam commented 2 years ago

Thanks for your attention,

From my point of view, you can accept the patch in its current state, and yes, later you can rename the corpus as a new one and include more ideas from Damegender.

Best wishes.

On Mon, Dec 6, 2021 at 12:19 PM, Steven Bird < @.***> wrote:

@davidam https://github.com/davidam – thanks for all your work... this is a significant advance on the existing corpus. I wonder about leaving the old one in place and deprecating it (via its corpus reader), and providing a new reader for the new corpus, with more functionality?


davidam commented 2 years ago

This is just a reminder. Is it possible to accept the pull request?

stevenbird commented 2 years ago

@davidam sorry for the delay. I was tripped up by a couple of things concerning the relationship to the existing corpus: version number is still 2.0, the permissions are changed even though the existing corpus is apparently included, the added file is in quite a different format, providing richer information. Isn't this just a different corpus?

davidam commented 2 years ago

@stevenbird OK, please propose a new name for the corpus and I can make a new pull request.

Merry Christmas

stevenbird commented 2 years ago

How about names2 (cf existing corpora udhr2, ptb3).

Feliz Navidad!

davidam commented 2 years ago

OK, @stevenbird, thanks for the mentoring. I've opened the pull request.

davidam commented 2 years ago

This is just a reminder. Can you accept the pull request?

These days I am improving reproducibility in Damegender: I am writing a command (orig2.py) that downloads and processes the files from official sources to produce the current Damegender files (including the international datasets).