NeuSpell: A Neural Spelling Correction Toolkit
MIT License

Code breaks when using a different model like roberta or xlm-roberta in finetuning #61

Open Tacacs-1101 opened 2 years ago

Tacacs-1101 commented 2 years ago

The code breaks when using any model other than BERT. I debugged the code and found that it is written for the BERT tokenizer only, while other transformer models use different tokenizers. Here is the relevant snippet from helpers.py:

if BERT_TOKENIZER is None:  # gets initialized during the first call to this method
    if bert_pretrained_name_or_path:
        BERT_TOKENIZER = transformers.BertTokenizer.from_pretrained(bert_pretrained_name_or_path)
        BERT_TOKENIZER.do_basic_tokenize = True
        BERT_TOKENIZER.tokenize_chinese_chars = False
    else:
        BERT_TOKENIZER = transformers.BertTokenizer.from_pretrained('bert-base-cased')
        BERT_TOKENIZER.do_basic_tokenize = True
        BERT_TOKENIZER.tokenize_chinese_chars = False
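
For reference, a more model-agnostic initialization is possible. The sketch below is a suggestion, not code from this repository: it uses transformers.AutoTokenizer, which resolves the correct tokenizer class from the checkpoint name, and only sets the BERT-specific flags when the resolved tokenizer actually has them.

import transformers

BERT_TOKENIZER = None  # module-level cache, as in helpers.py

def get_tokenizer(bert_pretrained_name_or_path=None):
    # Sketch: AutoTokenizer picks BertTokenizer, XLMRobertaTokenizer, etc.
    # from the checkpoint name, so roberta/xlm-roberta names no longer break.
    global BERT_TOKENIZER
    if BERT_TOKENIZER is None:
        name = bert_pretrained_name_or_path or 'bert-base-cased'
        BERT_TOKENIZER = transformers.AutoTokenizer.from_pretrained(name)
        # These two flags exist only on BERT-style (WordPiece) tokenizers;
        # guard them so SentencePiece tokenizers such as XLM-R are unaffected.
        if hasattr(BERT_TOKENIZER, 'do_basic_tokenize'):
            BERT_TOKENIZER.do_basic_tokenize = True
        if hasattr(BERT_TOKENIZER, 'tokenize_chinese_chars'):
            BERT_TOKENIZER.tokenize_chinese_chars = False
    return BERT_TOKENIZER

Note that this only fixes tokenizer construction; as described above, any other code written against the BERT tokenizer (for example, logic that assumes WordPiece's "##" subword prefix) would still need adjusting.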
UthpalaIsiru commented 1 year ago

I ran into the same issue. Has anyone resolved this?

RK-BAKU commented 1 year ago

I faced the same issue. A quick fix: in helpers.py, change

BERT_TOKENIZER = transformers.BertTokenizer.from_pretrained(bert_pretrained_name_or_path)

to use XLMRobertaTokenizer instead of BertTokenizer:

BERT_TOKENIZER = transformers.XLMRobertaTokenizer.from_pretrained(bert_pretrained_name_or_path)
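
Note that this hard-codes XLMRobertaTokenizer, so it works for xlm-roberta checkpoints but would in turn break for plain BERT names. For context on why one tokenizer class cannot serve both, here is a small illustration (the outputs shown in comments are indicative, not exact): BERT's WordPiece marks word-internal pieces with "##", while XLM-RoBERTa's SentencePiece marks word-initial pieces with "▁", so code keyed to one convention misreads the other's output.

from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained('bert-base-cased')
xlmr_tok = AutoTokenizer.from_pretrained('xlm-roberta-base')

# WordPiece: continuation pieces carry a "##" prefix.
print(bert_tok.tokenize('spellchecking'))  # e.g. ['spell', '##check', '##ing']

# SentencePiece: word-initial pieces carry a "▁" prefix instead.
print(xlmr_tok.tokenize('spellchecking'))  # e.g. ['▁spell', 'check', 'ing']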