wasiahmad / Syntax-MBERT

Official code of our work, Syntax-augmented Multilingual BERT for Cross-lingual Transfer [ACL 2021].
GNU General Public License v3.0

Can this method support XLM-roberta? #3

Closed Doragd closed 3 years ago

Doragd commented 3 years ago

Hi, if I tokenize with the XLM-R tokenizer, I get some unexpected results.

json = {
        "text": "Anyhow the man comes in.", 
        "tokens": ["Anyhow", "the", "man", "comes", "in", "."], 
        "upos": ["INTJ", "DET", "NOUN", "VERB", "ADP", "PUNCT"], 
        "head": [4, 3, 4, 0, 4, 4], 
        "deprel": ["discourse", "det", "nsubj", "root", "compound:prt", "punct"]
}

For mBERT:

process_sentence(
    json['tokens'],
    json['head'],
    json['upos'],
    json['deprel'], bert_tokenizer)
>>>
(['Any', '##how', 'the', 'man', 'comes', 'in', '.'],
 [5, 1, 4, 5, 0, 5, 5],
 ['INTJ', 'INTJ', 'DET', 'NOUN', 'VERB', 'ADP', 'PUNCT'],
 ['discourse', 'discourse', 'det', 'nsubj', 'root', 'compound:prt', 'punct'])

All is well. However, when using XLM-R:

process_sentence(
    json['tokens'],
    json['head'],
    json['upos'],
    json['deprel'], bert_tokenizer)
>>>
(['▁Any', 'how', '▁the', '▁man', '▁comes', '▁in', '▁', '.'],
 [5, 1, 4, 5, 0, 5, 5, 7],
 ['INTJ', 'INTJ', 'DET', 'NOUN', 'VERB', 'ADP', 'PUNCT', 'PUNCT'],
 ['discourse', 'discourse', 'det', 'nsubj', 'root', 'compound:prt',  'punct', 'punct'])

There are two repeated 'punct' entries because '.' is tokenized into two pieces, '▁' and '.'. This really confuses me.
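The duplicated label is consistent with how word-level annotations are usually expanded to subword level: every piece of a word inherits the word's tag, the first piece takes the word's head, and later pieces attach to the first piece. The following is only a minimal sketch of that idea, not the repository's `process_sentence`; the helper name `expand_to_subwords` is hypothetical, and the subword pieces are hard-coded from the outputs above rather than produced by a real tokenizer.

```python
def expand_to_subwords(pieces_per_word, head, upos, deprel):
    """Expand word-level annotations to subword level (illustrative only).

    pieces_per_word: list of lists, the subword pieces of each word.
    head: 1-indexed head word for each word (0 marks the root).
    """
    # 1-indexed position of the first piece of every word
    first_piece = []
    pos = 1
    for pieces in pieces_per_word:
        first_piece.append(pos)
        pos += len(pieces)

    tokens, new_head, new_upos, new_deprel = [], [], [], []
    for i, pieces in enumerate(pieces_per_word):
        for j, piece in enumerate(pieces):
            tokens.append(piece)
            if j == 0:
                # first piece keeps the word's head, remapped to a subword index
                new_head.append(0 if head[i] == 0 else first_piece[head[i] - 1])
            else:
                # later pieces attach to the first piece of the same word
                new_head.append(first_piece[i])
            # every piece inherits the word's tags
            new_upos.append(upos[i])
            new_deprel.append(deprel[i])
    return tokens, new_head, new_upos, new_deprel


head = [4, 3, 4, 0, 4, 4]
upos = ["INTJ", "DET", "NOUN", "VERB", "ADP", "PUNCT"]
deprel = ["discourse", "det", "nsubj", "root", "compound:prt", "punct"]

# Pieces copied from the two outputs in this issue (assumed, not recomputed)
mbert_pieces = [["Any", "##how"], ["the"], ["man"], ["comes"], ["in"], ["."]]
xlmr_pieces = [["▁Any", "how"], ["▁the"], ["▁man"], ["▁comes"], ["▁in"], ["▁", "."]]

mbert_out = expand_to_subwords(mbert_pieces, head, upos, deprel)
xlmr_out = expand_to_subwords(xlmr_pieces, head, upos, deprel)

print(mbert_out[1])  # [5, 1, 4, 5, 0, 5, 5]
print(xlmr_out[1])   # [5, 1, 4, 5, 0, 5, 5, 7]
```

Under this scheme the bare '▁' piece is treated as the first subword of the word '.', so it receives the word's head (5) and 'punct' label, while the actual '.' piece attaches to it (head 7) and repeats the label.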

wasiahmad commented 3 years ago

XLMR is not supported in the current codebase.