nltk / nltk_data

NLTK Data

Some corpora are unnecessarily unzipped #187

Closed: ekaf closed this issue 2 years ago

ekaf commented 2 years ago

The following corpora are marked as unzip="1" in their respective *.xml files, although unzipping them is unnecessary: they all rely on the WordNet corpus reader, which works well on zipped data:

omw-1.4.zip, omw.zip, extended_omw.zip, wordnet.zip, wordnet2021.zip, wordnet31.zip

Wouldn't it be better to mark these packages as unzip="0" then? This would make it easier to use them on smaller devices with limited storage space, which are becoming increasingly popular.
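
To make the proposed change concrete, here is a minimal sketch of flipping the flag with Python's standard library. The sample XML string is a simplified stand-in I made up for illustration (the real package files carry more attributes), and set_unzip is a hypothetical helper, not something that exists in the nltk_data repository:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a package index file (assumption: the real
# files have a <package> root element carrying the unzip attribute).
SAMPLE = '<package id="wordnet" name="WordNet" unzip="1" />'


def set_unzip(xml_text: str, value: str = "0") -> str:
    """Return the package XML with its unzip attribute set to `value`."""
    root = ET.fromstring(xml_text)
    if root.tag != "package":
        raise ValueError(f"expected a <package> root, got <{root.tag}>")
    root.set("unzip", value)
    return ET.tostring(root, encoding="unicode")


updated = set_unzip(SAMPLE)
print(updated)
```

Applied to each of the six *.xml files listed above, something along these lines would be the whole change on the data side.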

Also, since the generic CorpusReader class supports seamless access to zipped files, other corpora may already work without unzipping. Some of the remaining corpus readers could perhaps be adapted to support zipped data: this would be particularly useful for the two huge FrameNet corpora (846 MB framenet_v17 and 580 MB framenet_v15).
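
For readers unfamiliar with the idea, the sketch below shows the general principle of reading a file straight out of a zip archive without unpacking it to disk. It uses only the standard zipfile module and an in-memory demo archive; the member name demo_corpus/words.txt and both helper functions are made up for illustration, and this is not how the NLTK corpus readers are actually implemented:

```python
import io
import zipfile


def build_demo_zip() -> bytes:
    """Create a tiny in-memory archive standing in for a corpus package."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Hypothetical member name, chosen only for this demo.
        zf.writestr("demo_corpus/words.txt", "dog\ncat\n")
    return buf.getvalue()


def read_member(zip_bytes: bytes, member: str) -> str:
    """Read one archive member as text; nothing is extracted to disk."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        with zf.open(member) as fh:
            return fh.read().decode("utf-8")


text = read_member(build_demo_zip(), "demo_corpus/words.txt")
words = text.split()
print(words)
```

A reader that only needs sequential or whole-file access maps onto this pattern easily; readers that depend on seeking within large files or on walking real directories would presumably need more work, which may be where the FrameNet readers stand.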