Open advpetc opened 7 years ago
@advpetc Thank you for reporting the issue.
Could you explain it with a little more detail? It's a little unclear what changes you are proposing.
The input sentence is: 1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .
The expected output: u'1 Official Records of the General Assembly , Fifty-sixth Session , Supplement No. 21 .'
I'm using the tokenize function of MosesTokenizer.
Real output: u'1 Official Records of the General Assembly , Fifty-sixth Session , Supplement No . 21 .'
Notice the "No ." at the end (I am expecting "No." instead).
Note: If you print the recognized NUMERIC_ONLY_PREFIXES, you won't see No in the list.
Yes, the MosesTokenizer output in NLTK doesn't correspond to the one from Moses; the NLTK output shouldn't be the expected behavior:
~/mosesdecoder/scripts/tokenizer$ perl tokenizer.perl -l en
Tokenizer Version 1.1
Language: en
Number of threads: 1
1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .
1 Official Records of the General Assembly , Fifty-sixth Session , Supplement No. 21 .
The nonbreaking_prefix.en file is the same in Moses and nltk_data, so the problem should come from the MosesTokenizer code or nltk.corpus.nonbreaking_prefixes:
>>> from nltk.corpus import nonbreaking_prefixes as nbp
>>> nbp.words('en')
[u'A', u'B', u'C', u'D', u'E', u'F', u'G', u'H', u'I', u'J', u'K', u'L', u'M', u'N', u'O', u'P', u'Q', u'R', u'S', u'T', u'U', u'V', u'W', u'X', u'Y', u'Z', u'Adj', u'Adm', u'Adv', u'Asst', u'Bart', u'Bldg', u'Brig', u'Bros', u'Capt', u'Cmdr', u'Col', u'Comdr', u'Con', u'Corp', u'Cpl', u'DR', u'Dr', u'Drs', u'Ens', u'Gen', u'Gov', u'Hon', u'Hr', u'Hosp', u'Insp', u'Lt', u'MM', u'MR', u'MRS', u'MS', u'Maj', u'Messrs', u'Mlle', u'Mme', u'Mr', u'Mrs', u'Ms', u'Msgr', u'Op', u'Ord', u'Pfc', u'Ph', u'Prof', u'Pvt', u'Rep', u'Reps', u'Res', u'Rev', u'Rt', u'Sen', u'Sens', u'Sfc', u'Sgt', u'Sr', u'St', u'Supt', u'Surg', u'v', u'vs', u'i.e', u'rev', u'e.g', u'No #NUMERIC_ONLY# ', u'Nos', u'Art #NUMERIC_ONLY#', u'Nr', u'pp #NUMERIC_ONLY#', u'Jan', u'Feb', u'Mar', u'Apr', u'Jun', u'Jul', u'Aug', u'Sep', u'Oct', u'Nov', u'Dec']
>>> nbp.words('en').index('No #NUMERIC_ONLY#')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: 'No #NUMERIC_ONLY#' is not in list
>>> nbp.words('en').index('No #NUMERIC_ONLY# ')
88
It seems like there's an extra space in the line =(
I'm not sure whether the other nonbreaking_prefix files have the same problem; if you have time, it might be worth taking a look.
Yes, removing the extra space in the nonbreaking prefix file for the No #NUMERIC_ONLY# line solves the problem.
After removing the extra space:
>>> from nltk.tokenize.moses import MosesTokenizer
>>> moses = MosesTokenizer()
>>> moses.tokenize('1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .' )
[u'1', u'Official', u'Records', u'of', u'the', u'General', u'Assembly', u',', u'Fifty-sixth', u'Session', u',', u'Supplement', u'No.', u'21', u'.']
>>> tokens = moses.tokenize('1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .')
>>> 'No.' in tokens
True
This is because this boolean regex matches the string exactly as stored, and the nonbreaking_prefix word corpus didn't strip the line:
def has_numeric_only(self, text):
    return bool(re.search(r'(.*)[\s]+(\#NUMERIC_ONLY\#)', text))
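To illustrate how the trailing space could keep No out of NUMERIC_ONLY_PREFIXES, here is a minimal standalone sketch. The has_numeric_only pattern is the one quoted above; numeric_only_prefixes is a hypothetical helper that assumes the prefix is taken as everything before the last space, which is not confirmed anywhere in this thread.

```python
import re

def has_numeric_only(text):
    # Same pattern as quoted above. It is not anchored at the end,
    # so an entry with a trailing space after the marker still matches.
    return bool(re.search(r'(.*)[\s]+(\#NUMERIC_ONLY\#)', text))

def numeric_only_prefixes(entries):
    # ASSUMPTION (hypothetical helper): the prefix is whatever precedes
    # the last space in the entry. A trailing space shifts that split point.
    return [e.rpartition(' ')[0] for e in entries if has_numeric_only(e)]

raw = ['No #NUMERIC_ONLY# ', 'Art #NUMERIC_ONLY#', 'Nos']
stripped = [e.strip() for e in raw]

print(numeric_only_prefixes(raw))       # ['No #NUMERIC_ONLY#', 'Art']
print(numeric_only_prefixes(stripped))  # ['No', 'Art']
```

Under that assumption, stripping each entry before extracting the prefix is enough to recover No, without touching the data file.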
Thank you for reporting this!!
A regression test of Moses vs NLTK implementation would be good to test all these kinks =)
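Such a regression test might be sketched as below. The single reference pair is the tokenizer.perl input and output shown earlier in this thread; find_mismatches and MOSES_REFERENCE are hypothetical names, and the tokenizer is passed in as a plain callable so the sketch does not depend on any particular NLTK version.

```python
# Reference pairs: (input, output of Moses tokenizer.perl -l en).
# Only the pair shown in this thread is included; more would be needed.
MOSES_REFERENCE = [
    ("1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .",
     "1 Official Records of the General Assembly , Fifty-sixth Session , Supplement No. 21 ."),
]

def find_mismatches(tokenize, cases=MOSES_REFERENCE):
    """Return (input, expected, actual) for every case where the
    given tokenize callable disagrees with the Moses reference."""
    mismatches = []
    for text, expected in cases:
        actual = tokenize(text)
        if actual != expected:
            mismatches.append((text, expected, actual))
    return mismatches
```

With NLTK, the callable could wrap MosesTokenizer().tokenize and join the returned tokens with single spaces; an empty result would mean the two implementations agree on every reference case.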
I've just checked the nonbreaking_prefixes.en from https://github.com/alvations/mosesdecoder/blob/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en#L103 and there's a space there too. So maybe removing the space from the version in nltk_data might not be a good thing.
This line in mosesdecoder strips the nonbreaking_prefixes before searching for the regex: https://github.com/alvations/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L548
Let's make the changes in the nltk code instead to reflect the perl chomp =)
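A rough Python sketch of what such a change could look like, leaving the data file untouched: drop the line ending, then take the prefix from the regex capture group, so whitespace after the marker no longer matters. load_prefixes and the flag values 1 and 2 are illustrative assumptions, not the actual NLTK or Moses code.

```python
import re

# Same pattern as has_numeric_only; group(1) is the bare prefix.
NUMERIC_ONLY = re.compile(r'(.*)[\s]+(\#NUMERIC_ONLY\#)')

def load_prefixes(lines):
    """Map each prefix to 1 (always nonbreaking) or 2 (nonbreaking only
    before a number). The flag values are illustrative assumptions."""
    prefixes = {}
    for line in lines:
        item = line.rstrip('\n')  # rough equivalent of chomp
        if not item or item.startswith('#'):
            continue  # skip blank lines and comment lines
        match = NUMERIC_ONLY.match(item)
        if match:
            prefixes[match.group(1)] = 2
        else:
            prefixes[item] = 1
    return prefixes

print(load_prefixes(['# comment\n', '\n', 'No #NUMERIC_ONLY# \n', 'Nos\n']))
# {'No': 2, 'Nos': 1}
```

Because the prefix comes from the capture group rather than from the raw line, the entry with the trailing space still yields No.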
Yes, I tested it with the perl script and looked at the diff. Actually, there is one more modification that needs to be made.
When p.m. or a.m. appears at the end of a sentence, the last dot is split off. I checked the source code of the same function, handles_nonbreaking_prefixes in moses.py, and it appears that token_ends_with_period = re.search(r'^(\S+)\.$', token)
on line 271 doesn't do its job. Please check this at your convenience.
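As a quick check of what that pattern captures on its own, independent of NLTK, here is a small sketch; prefix_before_period is a hypothetical helper name.

```python
import re

def prefix_before_period(token):
    # Same pattern as token_ends_with_period quoted above.
    match = re.search(r'^(\S+)\.$', token)
    return match.group(1) if match else None

for token in ['p.m.', 'a.m.', 'No.', 'word']:
    print(token, '->', prefix_before_period(token))
# p.m. -> p.m
# a.m. -> a.m
# No. -> No
# word -> None
```

The pattern itself does match p.m. and captures p.m, so the split may come from the checks applied to the captured prefix afterwards rather than from this regex; that is only a guess from this isolated test.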
When testing a sentence like:
1 Official Records of the General Assembly, Fifty-sixth Session, Supplement No. 21 .
'No' and the dot are split apart. I checked nonbreaking_prefix.en and removed the blank at the end of line 103; after that it works as expected.