nltk / nltk_data

NLTK Data

Hinglish and Hindi stop-words #120

Closed TrigonaMinima closed 2 years ago

TrigonaMinima commented 6 years ago

Continuing the conversation from the issue https://github.com/nltk/nltk/issues/2087.

I have a list of Hindi stopwords gathered from a number of online sources. I can list those sources if needed.

I transliterated this list of Hindi stopwords into English, including every spelling variant I could think of for each word. Using the transliterated list together with NLTK's English stopwords list, I created a Hinglish stopwords list.

How can I submit it here? I see a zip of stopwords in nltk_data/packages/corpora/. Do I just put it there? Or are there other changes to be done?

Also, I understand there needs to be a consensus on this list. Any suggestions of what might work as a consensus on this?

s-arvind commented 5 years ago

Hi, thank you for doing this. Have you uploaded this to the NLTK packages? Could you please tell me how I can access the zip, or share the zip file you created?

NischayaSharma commented 4 years ago

For @s-arvind: in the previous issue, @TrigonaMinima gave a link to the list of all Hinglish and Hindi stopwords: https://github.com/TrigonaMinima/HinglishNLP/blob/master/data/assets/stop_hinglish

Also, @TrigonaMinima, can you help me incorporate your stopword list into my NLTK package so that I can just use "stopwords.words('Hinglish')"? I am new to this, so it would help a lot if you could explain step by step what to do. Thanks!
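Until a list ships with NLTK, one option is to read the linked file directly rather than patching the installed package. The sketch below is only an illustration: it assumes the file holds one stopword per line, and `load_stopwords` and `remove_stopwords` are hypothetical helper names, not NLTK functions.

```python
from pathlib import Path


def load_stopwords(path):
    """Read a one-word-per-line stopword file into a set of lowercase words."""
    text = Path(path).read_text(encoding="utf-8")
    # Skip blank lines and normalise case so lookups are case-insensitive.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stopwords]
```

After saving the linked file locally (for example as `stop_hinglish`, an assumed filename), `load_stopwords("stop_hinglish")` would give a set that can be passed to `remove_stopwords` along with a token list.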

stevenbird commented 2 years ago

Resolved in https://github.com/nltk/nltk_data/commit/aa54613807a97886516d5f0d13c1374d29bf4257. Sorry for the long delay.

Diogovpam commented 1 year ago

This breaks some multi-language applications I was using NLTK for, in particular determining a string's language based on the frequency of stopwords from each language.
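A minimal sketch of that kind of stopword-frequency detection shows why an overlapping list causes trouble. The word lists here are tiny toy sets invented for illustration, not the real NLTK lists, and `stopword_counts` is a hypothetical helper.

```python
def stopword_counts(text, stopword_lists):
    """Count, per language, how many whitespace tokens are stopwords."""
    tokens = text.lower().split()
    return {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in stopword_lists.items()
    }


# Toy lists (assumptions): the "hinglish" set is a superset of "english".
english = {"the", "is", "and", "of", "a"}
hinglish = english | {"hai", "ka", "aur"}

counts = stopword_counts(
    "the cat is on the mat",
    {"english": english, "hinglish": hinglish},
)
```

For plain English input, both lists score the same, so a detector that picks the highest count can no longer separate the two languages.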

Since the "hinglish" stopword list contains 98.8% of the English stopword list, wouldn't it be best to have an exclusive list of transliterated Hindi stopwords and have the user create the union of the two for Hinglish analysis?
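As a rough sketch of this proposal, the transliterated-only list can be derived with a set difference and recombined with a union when needed. The sets below are toy data invented for illustration. With NLTK installed, the inputs would presumably come from `stopwords.words("english")` and `stopwords.words("hinglish")`, assuming the installed stopwords corpus includes the `hinglish` file. The helper names are hypothetical.

```python
def transliterated_only(hinglish, english):
    """Hinglish stopwords that are not also English stopwords."""
    return set(hinglish) - set(english)


def english_coverage(hinglish, english):
    """Fraction of the English list that also appears in the Hinglish list."""
    english = set(english)
    return len(english & set(hinglish)) / len(english)


# Toy data (assumptions), not the real lists.
english = {"the", "is", "and", "of"}
hinglish = {"the", "is", "and", "hai", "ka", "aur"}

hindi_only = transliterated_only(hinglish, english)
coverage = english_coverage(hinglish, english)

# Users who want the combined behaviour can rebuild it explicitly.
combined = hindi_only | english
```

Keeping the two lists disjoint leaves the choice to the caller: language detection can use `hindi_only`, while Hinglish text cleaning can use `combined`.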