nltk / nltk_data


Stale pickled PunktSentenceTokenizer in nltk_data/ #123

Open · advgiarc opened this issue 6 years ago


It appears the pickled tokenizers are stale and do not reflect the current code.

https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip

The .zip that is downloaded is older than the source code:

https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py

There have been a few changes to punkt.py since the .zip was created that seem to improve sentence tokenization around abbreviations. The snippet below compares the pickled English tokenizer with a freshly constructed PunktSentenceTokenizer:

```python
import os
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

s = "Alabama Gov. Kay Ivey was asked this morning if she supported the confirmation of U.S. Circuit Judge Brett Kavanaugh to the Supreme Court. Ivey spoke with reporters this morning after a press conference about the state's new Security Operations Center and cybersecurity website."
abbrevs = ['u.s', 'gov']

nltk.data.path.append(f'{os.getcwd()}/nltk_data')
nltk.download('punkt', 'nltk_data')

# Tokenizer loaded from the pickled model shipped in punkt.zip
pickled_tokenizer = nltk.data.load('nltk_data/tokenizers/punkt/PY3/english.pickle')
pickled_tokenizer._params.abbrev_types.update(abbrevs)
print(pickled_tokenizer.sentences_from_text(s))

# Tokenizer constructed from the current punkt.py source
fresh_tokenizer = PunktSentenceTokenizer()
fresh_tokenizer._params.abbrev_types.update(abbrevs)
print(fresh_tokenizer.sentences_from_text(s))
```
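
For what it's worth, one way to narrow down whether the difference comes from the pickled object's trained state or from the class code itself is to pair a tokenizer built from the current punkt.py with the parameters taken from the pickle. This is just a sketch continuing from the snippet above; `hybrid_tokenizer` is an illustrative name, not anything from NLTK:

```python
# Sketch only: reuse the trained parameters from the pickled model with a
# tokenizer constructed from the current punkt.py, to see whether the output
# difference is due to the pickled state or the class code.
hybrid_tokenizer = PunktSentenceTokenizer()
hybrid_tokenizer._params = pickled_tokenizer._params
hybrid_tokenizer._params.abbrev_types.update(abbrevs)
print(hybrid_tokenizer.sentences_from_text(s))
```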