Added tools/build_collections.py, which automatically builds the all, all-nltk and all-corpora collections, and used it to update these collections.
Edited the popular collection to add omw-1.4 and wordnet2021.
Rebuilt index.xml with the most recent NLTK version, which ensures that index.xml is fully sorted.
The collection building script
The script is really simple. It uses Python's standard-library glob module to find all .xml files in certain folders, creates an XML object in exactly the desired format, and writes it to the collections folder.
I've added it to the Makefile so it's run every time we perform an update.
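For reviewers who want the gist without opening the file, the approach can be sketched roughly as follows. This is a minimal illustration, not the contents of tools/build_collections.py: the function names, the assumption that a package id equals its file name without the .xml extension, and the `<collection>`/`<item ref="..."/>` layout are all assumptions made for this example.

```python
import glob
import os
import xml.etree.ElementTree as ET


def build_collection(collection_id, name, package_dirs):
    """Return a <collection> element with one <item> per package .xml file.

    Hypothetical helper: the element and attribute names are assumed here.
    """
    refs = set()
    for directory in package_dirs:
        for path in glob.glob(os.path.join(directory, "*.xml")):
            # Assumption: the package id is the file name without ".xml",
            # e.g. "omw-1.4.xml" gives "omw-1.4".
            refs.add(os.path.splitext(os.path.basename(path))[0])

    collection = ET.Element("collection", id=collection_id, name=name)
    for ref in sorted(refs):
        ET.SubElement(collection, "item", ref=ref)
    return collection


def write_collection(collection, out_dir):
    """Write the collection to <out_dir>/<collection id>.xml and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, collection.get("id") + ".xml")
    ET.ElementTree(collection).write(
        out_path, encoding="utf-8", xml_declaration=True
    )
    return out_path
```

Because the items are derived from whatever .xml files exist on disk, a newly added package cannot be forgotten in a collection, which is the point of automating this step.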
I noticed this was necessary when I was pushing NLTK 3.6.7 and found that omw-1.4 was not included in the all collection. The script should ensure that all .xml files in the packages folder are included in all and all-nltk, and that all corpora are included in all-corpora. This has caused the following changes:
The collection updates
The popular collection:
Added omw-1.4
Added wordnet2021
The all collection:
Added basque_grammars
Added bllip_wsj_no_aux
Added mte_teip5
Added omw-1.4
Added pe08
Added wordnet2021
The all-nltk collection:
Added basque_grammars
Added bllip_wsj_no_aux
Added dolch
Added mte_teip5
Added omw-1.4
Added pe08
Added wordnet2021
The all-corpora collection:
Added comparative_sentences
Added europarl_raw
Added omw-1.4
Added opinion_lexicon
Added pe08
Added product_reviews_1
Added product_reviews_2
Added pros_cons
Added sentence_polarity
Added smultron
Added subjectivity
Added twitter_samples
Added wordnet2021
The popular collection was modified manually, while the others were generated automatically by tools/build_collections.py.
The index.xml
From now on, the index should be sorted by id. This should make small changes to index.xml easy to recognise in a diff, preventing packages from being quietly and accidentally removed from the index.
This PR is somewhat big, so I'll leave it open for others to review before it's merged.
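If it helps during review, the sorted-by-id property could be checked with something like the sketch below. It is only an illustration: the helper name is hypothetical, and it assumes that packages appear in index.xml as `<package id="...">` elements and that plain string comparison is the intended ordering.

```python
import xml.etree.ElementTree as ET


def out_of_order_ids(index_xml):
    """Return adjacent (previous, current) package id pairs that break sorting.

    Assumption: every package is a <package> element carrying an "id" attribute.
    """
    root = ET.fromstring(index_xml)
    ids = [package.get("id", "") for package in root.iter("package")]
    return [(prev, cur) for prev, cur in zip(ids, ids[1:]) if prev > cur]


# A made-up index fragment, used only to show the call.
sample = """
<nltk_data>
  <packages>
    <package id="basque_grammars"/>
    <package id="wordnet2021"/>
    <package id="omw-1.4"/>
  </packages>
</nltk_data>
"""
print(out_of_order_ids(sample))  # [('wordnet2021', 'omw-1.4')]
```

An empty list means the packages are in order under that assumption.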