Closed davidam closed 2 years ago
@davidam sorry for the delay. Thanks for your work, which looks really comprehensive. There are some things to fix before proceeding please.
First, I'd like to clarify the relationship between this dataset and the original. Is it a superset of the original? Is there any systematic description of how you obtained the data?
Second, the version numbering is confusing. You're proposing this as 2.0 in the XML file, but the readme says otherwise.
I note that there's a backup of the readme file in the zipfile, which should be removed.
Thanks for any more information.
Hello,
Thanks for your interest in the issue. I will be able to give you a new version with improvements in a few weeks. I've continued working on an open dataset of names, gender and frequency, and I can send you a new pull request.
Best wishes.
@davidam sorry for the delay. Thanks for your work, which looks really comprehensive. There are some things to fix before proceeding please.
Yes, I have a good dataset with names, gender and frequency. The NLTK dataset consists of one file of male names and one file of female names. You can decide whether you prefer to keep it that way or to update it along these lines. If you keep one file for males and one for females, we must decide on what basis a name is classified as male or as female. For example, David is classified as male because 99.7% of the people with that name are recorded as male in the merged (international) dataset; Isa could be undefined, because 61.1% are recorded as male and the rest as female; Tracy is classified as female because 81.1% are recorded as female. In summary, we can define ranges for male, female and undefined if we want to create a males file and a females file for NLTK from the datasets retrieved from official statistical institutions.
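To make the range idea concrete, here is a minimal sketch of such a classification. The 0.75 cut-off and the function name are assumptions chosen for illustration; the thread does not fix a threshold, only that 61.1% should count as undefined and 81.1% as decided.

```python
from typing import Dict

# Hypothetical cut-off: a name needs at least this share of one gender
# to be assigned to that gender's file. Not a value agreed in this thread.
THRESHOLD = 0.75


def classify(male_share: float, threshold: float = THRESHOLD) -> str:
    """Return 'male', 'female' or 'undefined' for a name.

    male_share is the fraction (0..1) of people with the name recorded as
    male; the female share is assumed to be the remainder.
    """
    if not 0.0 <= male_share <= 1.0:
        raise ValueError("male_share must be between 0 and 1")
    if male_share >= threshold:
        return "male"
    if 1.0 - male_share >= threshold:
        return "female"
    return "undefined"


# Male shares derived from the percentages quoted in the comment above.
examples: Dict[str, float] = {"David": 0.997, "Isa": 0.611, "Tracy": 0.189}
labels = {name: classify(share) for name, share in examples.items()}
print(labels)
```

Any threshold between 0.611 and 0.811 would reproduce the three examples; where exactly to draw the line is a design decision for the corpus maintainers.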
First, I'd like to clarify the relationship between this dataset and the original. Is it a superset of the original?
Yes, it's a superset.
Is there any systematic description of how you obtained the data?
Yes. At https://github.com/davidam/damegender/tree/master/src/damegender/files/names you can access subfolders with README files in which I explain how I retrieved the data.
Second, the version numbering is confusing. You're proposing this as 2.0 in the XML file, but the readme says otherwise.
I note that there's a backup of the readme file in the zipfile, which should be removed.
Thanks for any more information.
Ok.
Hello @davidam !
Some more comments:
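- The (newer) README states that the data is alphabetical, but it is only alphabetical for the first 897 male names. After that, it says GESAMTZAHL PERSONEN (MÄNNER) and continues with names in some non-alphabetical ordering.
- The same goes for women. There seem to be about 923 alphabetical names, and then it says GESAMTZAHL PERSONEN (FRAUEN) and continues with a more arbitrary order.
Beyond that, all names have been normalized to upper case, while I believe this wasn't the case previously. Is there a reason for this change?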
I suspect that we won't be able to merge this until some of these comments (also by @stevenbird above) are resolved.
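For anyone wanting to reproduce the ordering check, a rough sketch follows. It assumes the names are available as a Python list, one entry per line of the file, and it uses a simple case-insensitive comparison rather than any locale-aware collation. The function name and the toy data are illustrative and not taken from the pull request.

```python
from typing import List, Optional


def first_unsorted_index(names: List[str]) -> Optional[int]:
    """Return the index of the first name that sorts before its predecessor,
    or None if the whole list is in non-decreasing order."""
    for i in range(1, len(names)):
        if names[i].casefold() < names[i - 1].casefold():
            return i
    return None


# Toy data for illustration only.
names = ["ADAM", "BEN", "CARL", "ZOE", "ANNA", "MARK"]
print(first_unsorted_index(names))
```

On the toy list the function reports the position of ANNA, the first entry that breaks the alphabetical run.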
Hi,
If you have data on the percentages, it might be nice to introduce a new file 'firstnames' or some such, with the ratios.
Something like:
Name   Male  Female
David  .997  .003
Isa    .611  .389
Tracy  .089  .911
That way if people are interested in the distribution then they can use it.
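As a sketch of how such a file could be consumed, assuming a whitespace-separated layout with one header row as in the example above (the function name and the sample string are hypothetical, and the final file format would be up to the maintainers):

```python
from typing import Dict, Tuple

# Sample content in the layout suggested above.
SAMPLE = """\
Name Male Female
David .997 .003
Isa .611 .389
Tracy .089 .911
"""


def parse_ratios(text: str) -> Dict[str, Tuple[float, float]]:
    """Map each name to its (male, female) ratios, skipping the header row."""
    ratios: Dict[str, Tuple[float, float]] = {}
    lines = [line for line in text.splitlines() if line.strip()]
    for line in lines[1:]:
        # Split from the right so that names containing spaces stay intact.
        name, male, female = line.rsplit(None, 2)
        ratios[name] = (float(male), float(female))
    return ratios


ratios = parse_ratios(SAMPLE)
print(ratios["Isa"])
```

Keeping the raw ratios in the file leaves the choice of any male/female/undefined cut-off to the user.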
Yours
On Thu, Nov 18, 2021 at 9:26 PM Tom Aarsen @.***> wrote:
Hello!
Some more comments:
- The (newer) README states that the data is alphabetical, but it is only alphabetical for the first 897 male names. After that, it says GESAMTZAHL PERSONEN (MÄNNER) and continues with names in some non-alphabetical ordering.
- The same goes for women. There seem to be about 923 alphabetical names, and then it says GESAMTZAHL PERSONEN (FRAUEN) and continues with a more arbitrary order.
I suspect that we won't be able to merge this until some of these comments (also by @stevenbird above) are resolved.
-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University
Today, I've generated a new pull request taking into account the comments.
Thanks in advance!
@davidam – thanks for all your work... this is a significant advance on the existing corpus. I wonder about leaving the old one in place and deprecating it (via its corpus reader), and providing a new reader for the new corpus, with more functionality?
Thanks for your attention,
From my point of view, you can accept the patch in its current state and, yes, later you can rename the corpus as a new one and include more ideas from Damegender.
Best wishes.
On Mon, Dec 6, 2021 at 12:19 PM, Steven Bird <@.***> wrote:
@davidam – thanks for all your work... this is a significant advance on the existing corpus. I wonder about leaving the old one in place and deprecating it (via its corpus reader), and providing a new reader for the new corpus, with more functionality?
This is just a reminder. Is it possible to accept the pull request?
@davidam sorry for the delay. I was tripped up by a couple of things concerning the relationship to the existing corpus: version number is still 2.0, the permissions are changed even though the existing corpus is apparently included, the added file is in quite a different format, providing richer information. Isn't this just a different corpus?
@stevenbird OK, please suggest a new name for the corpus and I will make a new pull request.
Merry Christmas
How about names2 (cf. existing corpora udhr2, ptb3).
Feliz Navidad!
OK, @stevenbird, thanks for the mentoring. I've made the pull request.
This is just a reminder. Can you accept the pull request?
These days I am improving reproducibility in Damegender. I am writing a command (orig2.py) that downloads and processes the files from official sources to produce the current Damegender files (including the international datasets).
I've downloaded many open-data lists of names from official statistics offices (https://github.com/davidam/damegender/tree/master/src/damegender/files/names) and I've merged them into the dataset that is being released with NLTK.