nltk / nltk_data

NLTK Data

Add script to automatically build critical collections #182

Closed tomaarsen closed 2 years ago

tomaarsen commented 2 years ago

Hello!

Pull request overview

The collection building script

The script is really simple. It uses Python's standard-library glob module to find all .xml files in certain folders, builds an XML object in exactly the desired collection format, and writes it to the collections folder. I've added it to the Makefile so it runs every time we perform an update.
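For readers who want a feel for the approach, here is a minimal sketch of the idea, not the actual tools/build_collections.py. The function names, the `<collection>`/`<item ref="...">` layout, and the assumption that each package file has an `id` attribute on its root element are illustrative guesses rather than details confirmed in this PR.

```python
import glob
import os
import xml.etree.ElementTree as ET


def build_collection(collection_id, name, package_dirs):
    """Return a <collection> element listing every package found in package_dirs.

    Hypothetical helper: the element and attribute names are assumptions.
    """
    package_ids = []
    for directory in package_dirs:
        # Assumption: each package is described by one .xml file whose
        # root element carries an "id" attribute.
        for path in glob.glob(os.path.join(directory, "*.xml")):
            package_id = ET.parse(path).getroot().get("id")
            if package_id is not None:
                package_ids.append(package_id)

    collection = ET.Element("collection", {"id": collection_id, "name": name})
    # Sort and de-duplicate so the output is stable between runs.
    for package_id in sorted(set(package_ids)):
        ET.SubElement(collection, "item", {"ref": package_id})
    return collection


def write_collection(collection, out_dir):
    """Write the collection to <out_dir>/<id>.xml and return the path."""
    if hasattr(ET, "indent"):  # ET.indent is only available on Python 3.9+
        ET.indent(collection)
    out_path = os.path.join(out_dir, collection.get("id") + ".xml")
    ET.ElementTree(collection).write(out_path, encoding="utf-8")
    return out_path
```

Because the item list is derived from whatever files exist on disk, a newly added package file is picked up the next time the script runs, without anyone having to remember to edit the collection by hand.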

I found this was necessary when pushing NLTK 3.6.7: omw-1.4 was not included in the all collection. This script should ensure that all .xml files in the packages folder are included in all and all-nltk, and it does the same for all-corpora. This has caused the following changes:

The collection updates

The popular collection:

The all collection:

The all-nltk collection:

The all-corpora collection:

The popular collection was modified manually, while the others were regenerated automatically through tools/build_collections.py.

The index.xml

From now on, the index should be sorted by id. Small changes to index.xml will then be easy to recognise, which should prevent packages from quietly being removed from the index by accident.
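As a rough illustration of the sorting idea (a sketch only, not the code used in this PR), the children of an XML element can be reordered by their id attribute with the standard library. The helper name `sort_children_by_id` is hypothetical, and how index.xml nests its packages is not shown here.

```python
import xml.etree.ElementTree as ET


def sort_children_by_id(parent):
    """Reorder the direct children of parent by their "id" attribute, in place."""
    # Children without an "id" sort first, since "" compares lowest.
    ordered = sorted(parent, key=lambda child: child.get("id", ""))
    # Slice assignment replaces the children while keeping the parent element.
    parent[:] = ordered
```

With a stable order, adding or removing one package changes only the lines for that package in a diff.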

This PR is somewhat big, so I'll leave it open for others to review before it's merged.

stevenbird commented 2 years ago

👍 @tomaarsen