nltk / nltk_data

NLTK Data

Added stop words for Nepali language #83

Closed sndsabin closed 6 years ago

sndsabin commented 7 years ago

Stop words for the Nepali language were added.

alvations commented 7 years ago

@sndsabin Thank you for suggesting the list of Nepali stopwords. Do you have a reference for the stopwords so that the source of the stopword list can be properly documented?

Also, to add to the nltk_data repo, you would need to regenerate the index.xml so that the hash for the new zipball is recorded appropriately.

E.g.:

# Create a new directory to avoid clashes with the actual nltk_data directory
# that the nltk code uses, and with old versions of the nltk_data repo.
mkdir git-repos && cd git-repos

# Re-clone the GitHub repo; this might take some time.
git clone https://github.com/sndsabin/nltk_data.git

# Enter the clone and check out the gh-pages branch
cd nltk_data
git checkout gh-pages

# Move to the corpora subdirectory (relative to the clone, not the filesystem root)
cd packages/corpora/

# Replace the stopwords.zip with the new zipball
rm stopwords.zip
cp /path/to/new/with/nepali/stopwords.zip .

# Recreate the index.xml from the repository root
cd ../..
make

# Git add, commit, push.
git add packages/corpora/stopwords.zip
git add index.xml
git commit -m 'Added nepali stopwords'
git push
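
Before committing, it may be worth confirming that the regenerated index.xml really records the hash of the new zipball. The following is only a rough sketch: the helper name `index_has_checksum` is hypothetical, and it assumes that index.xml stores the digest as a plain hex string and that the digest is MD5 (pass a different command such as `sha256sum` as the third argument if that assumption is wrong).

```shell
#!/bin/sh
# Hypothetical helper, not part of nltk_data: succeeds if the hex digest
# of ZIP appears anywhere in INDEX.
# Usage: index_has_checksum ZIP INDEX [DIGEST_CMD]
index_has_checksum() {
    zip_path=$1
    index_path=$2
    # Assumption: the index records an MD5 digest; override if it does not.
    digest_cmd=${3:-md5sum}

    # Digest tools print "<hex>  <filename>"; keep only the hex part.
    digest=$("$digest_cmd" "$zip_path" | awk '{print $1}')
    [ -n "$digest" ] || return 1

    # -F: match the digest literally, -q: report via exit status only.
    grep -q -F -- "$digest" "$index_path"
}

# Example (run from the repository root after `make`):
#   index_has_checksum packages/corpora/stopwords.zip index.xml \
#       && echo "index.xml is up to date" \
#       || echo "index.xml does not mention this zipball's digest"
```

If the check fails, the likely causes are that `make` was run before the zip was replaced, or that the index uses a different digest than assumed here.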

sndsabin commented 7 years ago

The 'index.xml' file was regenerated.

The stop words for the Nepali language were compiled from various sources, and some were added manually.

alvations commented 7 years ago

@sndsabin You'll have to commit and push the regenerated index.xml too =)

It would be helpful if you could list the various sources so that the people who created them can be credited.

stevenbird commented 6 years ago

Note that committing a modified zipfile like this will clobber another recent addition to the stopwords corpus for Arabic.
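
One possible way to avoid the clobbering is to add the new language file to the latest upstream zip instead of replacing the whole archive. This is a sketch under assumptions, not the project's documented procedure: `merge_stopwords` is a hypothetical helper, it assumes the archive keeps one file per language under a top-level `stopwords/` directory, and it uses Python's standard `zipfile` command-line interface so that no separate zip tool is required.

```shell
#!/bin/sh
# Hypothetical helper: copy NEW_FILE into the contents of UPSTREAM_ZIP and
# write the result to OUT_ZIP, keeping every file already in the archive.
# Usage: merge_stopwords UPSTREAM_ZIP NEW_FILE OUT_ZIP
merge_stopwords() {
    upstream_zip=$1
    new_file=$2
    out_zip=$3

    workdir=$(mktemp -d) || return 1

    # Unpack the current upstream archive (including other recent additions).
    python3 -m zipfile -e "$upstream_zip" "$workdir" || return 1

    # Assumption: language files live in a top-level stopwords/ directory.
    cp "$new_file" "$workdir/stopwords/" || return 1

    # Re-create the archive with stopwords/ as its top-level entry.
    python3 -m zipfile -c "$out_zip" "$workdir/stopwords" || return 1

    rm -rf "$workdir"
}

# Example, with placeholder paths:
#   merge_stopwords packages/corpora/stopwords.zip /path/to/nepali stopwords-new.zip
#   python3 -m zipfile -l stopwords-new.zip   # list contents to check both languages
```

Listing the resulting archive before committing gives a quick check that both the earlier addition and the new file are present.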

stevenbird commented 6 years ago

Is there a more authoritative source for these stopwords?

sndsabin commented 6 years ago

@stevenbird Yes, Madan Puraskar Pustakalaya is one.

stevenbird commented 6 years ago

@sndsabin - is there a URL for the stopwords data?

sndsabin commented 6 years ago

@stevenbird Unfortunately, no. The stopword list was compiled from various sources, research project documentation being one.

stevenbird commented 6 years ago

Thanks @sndsabin

sndsabin commented 6 years ago

welcome @stevenbird :)