nltk / nltk_data

NLTK Data

PunktParameters stored as tab files #215

Closed ekaf closed 2 months ago

ekaf commented 2 months ago

This package replaces the pickled Punkt models with PunktParameters stored in tab files.

It seems that nltk.data loads YAML and JSON safely, but the tab format may be preferable: it is more concise, easier to read, and probably even safer.
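To illustrate the safety point, here is a minimal sketch of a tab-separated round trip that uses only the standard library. The file names and the one-record-per-line layout are assumptions made for illustration and are not necessarily the layout used in this package; `save_tab` and `load_tab` are hypothetical helpers, not NLTK functions. Loading only splits text on tabs, so nothing in the file gets executed, unlike unpickling.

```python
import tempfile
from pathlib import Path


def save_tab(path, rows):
    """Write tuples of strings as tab-separated lines, sorted for stable diffs."""
    with open(path, "w", encoding="utf-8") as f:
        for row in sorted(rows):
            f.write("\t".join(row) + "\n")


def load_tab(path):
    """Read tab-separated lines back into a set of tuples of strings."""
    with open(path, encoding="utf-8") as f:
        return {tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()}


# Toy stand-ins for two kinds of Punkt parameters (illustrative values only).
abbrev_types = {("dr",), ("e.g",), ("mr",)}
collocations = {("u.s", "army"), ("st", "louis")}

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, one file per parameter set.
    abbrev_path = Path(tmp) / "abbrev_types.tab"
    colloc_path = Path(tmp) / "collocations.tab"

    save_tab(abbrev_path, abbrev_types)
    save_tab(colloc_path, collocations)

    loaded_abbrevs = load_tab(abbrev_path)
    loaded_collocations = load_tab(colloc_path)
```

The sketch assumes that no field contains a tab or newline character; a format that had to handle such fields would need escaping.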

ekaf commented 2 months ago

@stevenbird, this package doesn't disturb anything, and it is needed for testing the new Punkt Tokenizer.

ekaf commented 2 months ago

@stevenbird, the index was not rebuilt after this PR was merged. As a consequence, the plaintext corpus reader fails to initialize a sent_tokenizer, so NLTK can't even start.