nltk / nltk_data

NLTK Data

Resolve critical installation and usage issue of inaugural data #174

Closed (tomaarsen closed 2 years ago)

tomaarsen commented 2 years ago

Hello!

I'll keep this brief. #169 added another speech to the Inaugural dataset, but it also changed inaugural.zip so that the files sit at the root of the archive, rather than inside a folder called inaugural. The latter is how all corpora ought to be packaged; see https://github.com/nltk/nltk_data/issues/173#issuecomment-984970634. As it turns out, the most recent nltk can still install inaugural, but cannot use it in code.

It is not possible to force the downloader to install inaugural from e.g. tomaarsen/nltk_data, so this PR is quite tricky to test. That said, the current package simply does not work, so I feel obligated to merge this in the hope that it does indeed resolve the issue.

The new inaugural.zip contains a folder with the files, rather than the files directly. The line endings of the new 2021-Biden.txt were also converted to Unix style (LF).
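For anyone who wants to check an archive's layout locally, here is a minimal sketch using only the Python standard library. It is not the script used for this PR: the helper names `has_top_level_folder` and `repack_with_folder` are hypothetical, and the repacking step assumes every entry is a plain text file, so converting CRLF to LF everywhere is safe.

```python
import io
import zipfile
from pathlib import PurePosixPath


def has_top_level_folder(zip_bytes: bytes, folder: str) -> bool:
    """Return True if every entry in the archive sits under `folder/`."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = zf.namelist()
    # An empty archive does not count as correctly packaged.
    return bool(names) and all(
        PurePosixPath(name).parts[0] == folder for name in names
    )


def repack_with_folder(zip_bytes: bytes, folder: str) -> bytes:
    """Rewrite a flat archive so its files sit under `folder/`.

    Assumption: all entries are text files, so CRLF line endings
    are converted to LF for every file.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # directory entries are implied by the file paths
            data = src.read(info).replace(b"\r\n", b"\n")
            dst.writestr(f"{folder}/{info.filename}", data)
    return out.getvalue()
```

Reading a local inaugural.zip as bytes and passing it to `has_top_level_folder(data, "inaugural")` should report whether the archive follows the folder-inside-zip layout described above.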

tomaarsen commented 2 years ago

Success! The changes seem to work. I no longer experience any issues on Windows or on Google Colab (Linux).

cc-ing some relevant devs as this might be of interest to you all: @stevenbird @nimbusaeta @pratos