It appears the pickled tokenizers are old and do not reflect the current code.
https://github.com/nltk/nltk_data/blob/gh-pages/packages/tokenizers/punkt.zip
The .zip that is downloaded is older than the source code:
https://github.com/nltk/nltk/blob/develop/nltk/tokenize/punkt.py
There are a few changes in punkt.py since the .zip was created that seem to improve the tokenization of sentences around abbreviations.
```python
import os

import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

s = "Alabama Gov. Kay Ivey was asked this morning if she supported the confirmation of U.S. Circuit Judge Brett Kavanaugh to the Supreme Court. Ivey spoke with reporters this morning after a press conference about the state's new Security Operations Center and cybersecurity website."
abbrevs = ['u.s', 'gov']

nltk.data.path.append(f'{os.getcwd()}/nltk_data')
nltk.download('punkt', 'nltk_data')

# Tokenizer loaded from the downloaded pickle.
pickled_tokenizer = nltk.data.load('nltk_data/tokenizers/punkt/PY3/english.pickle')
pickled_tokenizer._params.abbrev_types.update(abbrevs)
print(pickled_tokenizer.sentences_from_text(s))

# Tokenizer instantiated from the current source code.
fresh_tokenizer = PunktSentenceTokenizer()
fresh_tokenizer._params.abbrev_types.update(abbrevs)
print(fresh_tokenizer.sentences_from_text(s))
```
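Until the shipped pickles are regenerated, one possible workaround (a sketch, assuming the standard `PunktTrainer` API and a toy corpus of my own) is to skip the pickle entirely and build a tokenizer from the installed source code, training it on your own text:

```python
from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

# Sketch: train Punkt from scratch instead of loading the stale pickle.
# The corpus here is a hypothetical stand-in; real use needs far more text.
corpus = (
    "Gov. Kay Ivey spoke to reporters. "
    "U.S. Circuit Judge Brett Kavanaugh was confirmed. "
    "The press conference covered the state's new cybersecurity website. "
)

trainer = PunktTrainer()
trainer.train(corpus, finalize=False)
trainer.finalize_training()

# Build the tokenizer from the freshly trained parameters,
# so it runs the current punkt.py code end to end.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Known abbreviations can still be added by hand afterwards.
tokenizer._params.abbrev_types.update(['u.s', 'gov'])

print(tokenizer.sentences_from_text(
    "Gov. Ivey held a press conference. It was brief."))
```

This avoids the downloaded data entirely, so any fixes in `punkt.py` apply immediately; the trade-off is that you must supply training text or seed abbreviations yourself.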