Closed: Doragd closed this issue 3 years ago.
Hi, if I use the XLMR tokenizer, I get some unexpected results.
```python
json = {
    "text": "Anyhow the man comes in.",
    "tokens": ["Anyhow", "the", "man", "comes", "in", "."],
    "upos": ["INTJ", "DET", "NOUN", "VERB", "ADP", "PUNCT"],
    "head": [4, 3, 4, 0, 4, 4],
    "deprel": ["discourse", "det", "nsubj", "root", "compound:prt", "punct"]
}
```
For mBERT:

```python
process_sentence(json['tokens'], json['head'], json['upos'], json['deprel'], bert_tokenizer)
>>> (['Any', '##how', 'the', 'man', 'comes', 'in', '.'],
     [5, 1, 4, 5, 0, 5, 5],
     ['INTJ', 'INTJ', 'DET', 'NOUN', 'VERB', 'ADP', 'PUNCT'],
     ['discourse', 'discourse', 'det', 'nsubj', 'root', 'compound:prt', 'punct'])
```
All is well. However, when using XLMR:

```python
process_sentence(json['tokens'], json['head'], json['upos'], json['deprel'], xlmr_tokenizer)
>>> (['▁Any', 'how', '▁the', '▁man', '▁comes', '▁in', '▁', '.'],
     [5, 1, 4, 5, 0, 5, 5, 7],
     ['INTJ', 'INTJ', 'DET', 'NOUN', 'VERB', 'ADP', 'PUNCT', 'PUNCT'],
     ['discourse', 'discourse', 'det', 'nsubj', 'root', 'compound:prt', 'punct', 'punct'])
```
There are two repeated PUNCT labels because '.' is tokenized into '▁', '.'. It really confuses me.
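The duplication follows from how word-level labels are usually projected onto subtokens: every subtoken of a word inherits that word's UPOS and deprel, and each continuation subtoken is headed by the first subtoken of its own word. A minimal, self-contained sketch of that alignment (a hypothetical `align_to_subtokens`, not the repo's actual `process_sentence`, with a stand-in tokenizer instead of a real SentencePiece model) reproduces the XLMR output above:

```python
# Hypothetical sketch: copy each word's labels to all of its subtokens,
# and point every continuation subtoken at the first subtoken of its own
# word. Indices are 1-based; 0 marks the root, as in the outputs above.
def align_to_subtokens(tokens, heads, upos, deprel, tokenize):
    sub_tokens, raw_heads, new_upos, new_deprel = [], [], [], []
    first_sub = {}  # 1-based word index -> 1-based index of its first subtoken
    for i, word in enumerate(tokens, start=1):
        pieces = tokenize(word)
        first_sub[i] = len(sub_tokens) + 1
        for j, piece in enumerate(pieces):
            sub_tokens.append(piece)
            new_upos.append(upos[i - 1])
            new_deprel.append(deprel[i - 1])
            # Negative marker: continuation piece, attach to word i's first subtoken.
            raw_heads.append(heads[i - 1] if j == 0 else -i)
    # Second pass: remap word-level head indices to subtoken indices
    # (forward-pointing heads need the completed first_sub map).
    new_heads = [0 if h == 0 else first_sub[abs(h)] for h in raw_heads]
    return sub_tokens, new_heads, new_upos, new_deprel

# Stand-in for the XLM-R SentencePiece tokenizer: splits '.' into '▁', '.'
# and 'Anyhow' into '▁Any', 'how', matching the behavior reported above.
def fake_xlmr_tokenize(word):
    if word == "Anyhow":
        return ["\u2581Any", "how"]
    if word == ".":
        return ["\u2581", "."]
    return ["\u2581" + word]

sub, new_heads, new_upos, new_deprel = align_to_subtokens(
    ["Anyhow", "the", "man", "comes", "in", "."],
    [4, 3, 4, 0, 4, 4],
    ["INTJ", "DET", "NOUN", "VERB", "ADP", "PUNCT"],
    ["discourse", "det", "nsubj", "root", "compound:prt", "punct"],
    fake_xlmr_tokenize,
)
print(sub)        # ['▁Any', 'how', '▁the', '▁man', '▁comes', '▁in', '▁', '.']
print(new_heads)  # [5, 1, 4, 5, 0, 5, 5, 7]
print(new_upos)   # two PUNCT entries, one per subtoken of '.'
```

So the doubled PUNCT is not a bug in the labels themselves: since SentencePiece yields two pieces for '.', the alignment step simply emits one labeled row per piece.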
XLMR is not supported in the current codebase.