Add a Corpus for the names of medicines in india

nltk / nltk_data

NLTK Data

Add a Corpus for the names of medicines in india #78

Closed abeepathak96 closed 7 years ago

abeepathak96 commented 7 years ago

drugs.txt: This is a text file that contains the names of medicines prescribed by doctors in India. Our team updates the list constantly. Please add this list as a corpus in nltk_data so that it can be used in clinical data parsing.

alvations commented 7 years ago

Why is this file added to the non-breaking prefix?

abeepathak96 commented 7 years ago

That is the list of names of medicines in India, which I want to add to NLTK as a corpus so that drug name recognition can be performed.

alvations commented 7 years ago

It should be added to another corpus, not to nonbreaking_prefixes. Also, there are other steps that need to be done, like updating the indices for all.xml, corpora.xml, etc.

If you're suggesting a new corpus, it should be an issue instead of a pull request (PR) =) https://github.com/nltk/nltk_data/issues

I see that there's already an issue at nltk_data#77; perhaps someone else will pick it up and add it to nltk_data.

stevenbird commented 7 years ago

As noted in #77, we don't have capacity to add and maintain custom wordlists like this, sorry. They can be distributed outside of NLTK.