TrigonaMinima closed this issue 2 years ago
Hi, thank you for doing this. Have you uploaded this to the NLTK packages? Could you please tell me how I can access the zip, or could you share the zip file you created?
For @s-arvind: in the previous issue, @TrigonaMinima gave a link to the list of all Hinglish and Hindi stopwords: https://github.com/TrigonaMinima/HinglishNLP/blob/master/data/assets/stop_hinglish
Also, @TrigonaMinima, can you help me incorporate your stopword list into my NLTK package so that I can just use "stopwords.words('Hinglish')"? I am new to this, so it would help a lot if you could explain step by step what to do. Thanks!
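One way to use the list without modifying the NLTK installation is to load the file directly. This is only a minimal sketch: it assumes the stop_hinglish file linked above has been saved locally and contains one stopword per line, and the helper names `load_stopwords` and `remove_stopwords` are made up for illustration, not part of NLTK.

```python
from pathlib import Path


def load_stopwords(path):
    """Read a stopword file, assuming one word per line (blank lines are skipped)."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in the stopword set (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in stopwords]
```

With the file saved as `stop_hinglish` in the working directory (an assumed location), usage would look like `hinglish = load_stopwords("stop_hinglish")` followed by `remove_stopwords(sentence.split(), hinglish)`.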
Resolved in https://github.com/nltk/nltk_data/commit/aa54613807a97886516d5f0d13c1374d29bf4257. Sorry for the long delay.
This breaks some multi-language applications I was using NLTK for, in particular determining a string's language based on the frequency of stopwords from each language.
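To illustrate the kind of detection described here, a rough sketch follows. The stopword sets are tiny made-up samples (not the actual NLTK lists), and `stopword_scores` and `guess_language` are hypothetical names, so treat this as an approximation of the approach rather than the actual application code.

```python
def stopword_scores(text, stopword_lists):
    """Count, per language, how many tokens of `text` are stopwords of that language."""
    tokens = text.lower().split()
    return {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in stopword_lists.items()
    }


def guess_language(text, stopword_lists):
    """Return the language with the highest stopword count, or None if nothing matches."""
    scores = stopword_scores(text, stopword_lists)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Illustrative toy lists: the "hinglish" set contains every "english" entry.
TOY_LISTS = {
    "english": {"the", "is", "a", "of", "and"},
    "hinglish": {"the", "is", "a", "of", "and", "hai", "ka", "aur"},
}

scores = stopword_scores("the cat is a friend of the dog", TOY_LISTS)
```

For this plain English sentence both languages receive the same score, so the count alone cannot separate them. That is the failure mode when one list is nearly a superset of another.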
Since the "hinglish" stopword list contains 98.8% of the English stopword list, wouldn't it be best to have an exclusive list of transliterated Hindi stopwords and have the user create the union of the two for Hinglish analysis?
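The proposal could look roughly like this on the user's side. It is a sketch under assumptions: both inputs are plain iterables of words, and `overlap_fraction` and `build_hinglish` are hypothetical helper names, not existing NLTK functions.

```python
def overlap_fraction(reference, candidate):
    """Fraction of the words in `reference` that also appear in `candidate`."""
    reference, candidate = set(reference), set(candidate)
    if not reference:
        return 0.0
    return len(reference & candidate) / len(reference)


def build_hinglish(english, transliterated_hindi):
    """Union of an English list and a transliterated-Hindi-only list, sorted."""
    return sorted(set(english) | set(transliterated_hindi))
```

Keeping the transliterated Hindi list separate means `overlap_fraction(english, transliterated_hindi)` stays low, while anyone who needs Hinglish can still build the combined list with a single call.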
Continuing the conversation from the issue https://github.com/nltk/nltk/issues/2087.
I have a list of Hindi stopwords gathered from a number of online sources. I can list those sources if needed.
I transliterated this list of Hindi stopwords into English, including every spelling variant I could think of for each word. Combining this list with NLTK's English stopwords list, I created a Hinglish stopwords list.
How can I submit it here? I see a zip of stopwords in `nltk_data/packages/corpora/`. Do I just put it there? Or are there other changes to be done?
Also, I understand there needs to be a consensus on this list. Any suggestions of what might work as a consensus on this?