nltk / nltk_data

NLTK Data

Extra blank space at the end of line 103 of nonbreaking_prefix.en #85

Open advpetc opened 7 years ago

advpetc commented 7 years ago

When tokenizing a sentence like "1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .", 'No' and the dot are split apart. I checked nonbreaking_prefix.en and removed the blank at the end of line 103; after that it works as expected.

alvations commented 7 years ago

@advpetc Thank you for reporting the issue.

Could you explain it with a little more detail? It's a little unclear what changes you are proposing.

advpetc commented 7 years ago

Note: If you print the recognized NUMERIC_ONLY_PREFIXES, you won't see No in the list.

alvations commented 7 years ago

Yes, the MosesTokenizer output in NLTK doesn't match the output from Moses, so the NLTK output is not the expected behavior:

~/mosesdecoder/scripts/tokenizer$ perl tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .
1 Official Records of the General Assembly , Fifty-sixth Session , Supplement No. 21 .

The nonbreaking_prefix.en files in Moses and nltk_data are the same, so the problem should be in the MosesTokenizer code or in nltk.corpus.nonbreaking_prefixes:

>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']

>>> nbp.words('en').index('No #NUMERIC_ONLY#')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: 'No #NUMERIC_ONLY#' is not in list
>>> nbp.words('en').index('No #NUMERIC_ONLY# ')
88

It seems like there's an extra space at the end of that line =(

advpetc commented 7 years ago

I'm not sure whether the same problem exists in the other nonbreaking_prefix files. If you have time, it might be worth taking a look.
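A quick way to check the other files is to scan each one for lines that end in spaces or tabs. The sketch below is only an illustration: the function names and the `nonbreaking_prefix.*` glob pattern are assumptions, and the directory you pass to `scan_directory` depends on where nltk_data is installed locally.

```python
from pathlib import Path


def find_trailing_whitespace(lines):
    """Return (line_number, line_content) for lines ending in spaces or tabs."""
    hits = []
    for number, line in enumerate(lines, start=1):
        content = line.rstrip("\r\n")  # drop only the line terminator
        if content != content.rstrip():
            hits.append((number, content))
    return hits


def scan_directory(directory):
    """Map each nonbreaking_prefix.* file name to its offending lines.

    The glob pattern is an assumption about how the files are named.
    """
    report = {}
    for path in sorted(Path(directory).glob("nonbreaking_prefix.*")):
        with path.open(encoding="utf-8") as handle:
            hits = find_trailing_whitespace(handle)
        if hits:
            report[path.name] = hits
    return report


# In-memory sample that mimics the problem line from this issue.
sample = ["Nos\n", "No #NUMERIC_ONLY# \n", "Art #NUMERIC_ONLY#\n"]
print(find_trailing_whitespace(sample))
```

Calling `scan_directory` on the folder that holds the prefix files would list every file with the same kind of stray whitespace.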

alvations commented 7 years ago

Yes, removing the extra space in nonbreaking prefix for the No #NUMERIC_ONLY# line solves the problem.

After removing the extra space:

>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer()
>>> moses.tokenize('1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .' )
[u'1', u'Official', u'Records', u'of', u'the', u'General', u'Assembly', u',', u'Fifty-sixth', u'Session', u',', u'Supplement', u'No.', u'21', u'.']
>>> tokens = moses.tokenize('1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .')
>>> 'No.' in tokens
True

This is because the boolean regex below is applied to the exact string, and the nonbreaking_prefixes word corpus doesn't strip the line:

def has_numeric_only(self, text):
    return bool(re.search(r'(.*)[\s]+(\#NUMERIC_ONLY\#)', text))

Thank you for reporting this!!
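For anyone reproducing this, here is a small standalone check of how the regex and an exact string comparison treat the padded entry. It copies `has_numeric_only` as a plain function (no class) and uses two literal strings; it does not import NLTK, so it says nothing definite about where in moses.py the trailing space causes the mismatch.

```python
import re


def has_numeric_only(text):
    # Same pattern as the method quoted above, written as a free function.
    return bool(re.search(r'(.*)[\s]+(\#NUMERIC_ONLY\#)', text))


clean = 'No #NUMERIC_ONLY#'
padded = 'No #NUMERIC_ONLY# '  # entry as read from the file, trailing space kept

# The regex search accepts both forms...
print(has_numeric_only(clean), has_numeric_only(padded))

# ...but exact string comparison only succeeds after stripping.
print(padded == clean, padded.strip() == clean)
```

Since the search itself accepts the padded entry, the failure presumably comes from exact-string handling of the unstripped line elsewhere, which is consistent with the `index` lookup shown earlier.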

alvations commented 7 years ago

A regression test of Moses vs NLTK implementation would be good to test all these kinks =)

I've just checked nonbreaking_prefix.en in Moses (https://github.com/alvations/mosesdecoder/blob/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en#L103) and there's a trailing space there too. So removing the space from the version in nltk_data might not be the right fix.

This line in mosesdecoder strips the nonbreaking_prefixes before searching for the regex: https://github.com/alvations/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L548

Let's make the changes in the nltk code instead to reflect the perl chomp =)
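A minimal sketch of what that change could look like, assuming the goal is to normalize each line before matching so that the data file can stay identical to the Moses one. `load_prefixes` is a hypothetical helper, not the existing NLTK reader, and skipping blank lines and lines starting with `#` is an assumption about the file format.

```python
import re

NUMERIC_ONLY = re.compile(r'(.*)[\s]+(\#NUMERIC_ONLY\#)')


def load_prefixes(lines):
    """Split prefix-file lines into (regular, numeric_only) sets.

    Each line is stripped first, so trailing whitespace in the data
    file no longer affects which prefix is recognized.
    """
    regular, numeric_only = set(), set()
    for line in lines:
        item = line.strip()
        if not item or item.startswith('#'):
            continue  # assumed: blank lines and comments carry no prefix
        match = NUMERIC_ONLY.search(item)
        if match:
            numeric_only.add(match.group(1))
        else:
            regular.add(item)
    return regular, numeric_only


sample = ["# comment\n", "Nos\n", "No #NUMERIC_ONLY# \n", "pp #NUMERIC_ONLY#\n", "\n"]
regular, numeric_only = load_prefixes(sample)
print(sorted(regular), sorted(numeric_only))
```

With this kind of normalization, `No` ends up among the numeric-only prefixes whether or not line 103 keeps its trailing space.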

advpetc commented 7 years ago

Yes, I tested it with the Perl script and checked the diff. Actually, there is one more modification that needs to be made. When p.m. or a.m. appears at the end of a sentence, the last dot gets split off. I checked the source of the handles_nonbreaking_prefixes function in moses.py, and it appears that token_ends_with_period = re.search(r'^(\S+)\.$', token) on line 271 doesn't do its job. Please check this at your convenience.
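To narrow this down, it may help to look at what that regex captures for the tokens in question. The snippet below runs only the quoted pattern in isolation; `prefix_before_period` is a hypothetical helper name, and nothing here reproduces the rest of handles_nonbreaking_prefixes.

```python
import re


def prefix_before_period(token):
    """Return the part before a final period, or None if there is none."""
    match = re.search(r'^(\S+)\.$', token)
    return match.group(1) if match else None


for token in ('p.m.', 'a.m.', 'No.', 'Session'):
    print(repr(token), '->', repr(prefix_before_period(token)))
```

The pattern does match `p.m.` and `a.m.` (capturing `p.m` and `a.m`), so if the final dot is still split off, the cause may lie in the conditions applied to the captured prefix afterwards rather than in this line itself. That is a guess from the regex alone, not a verified diagnosis.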