huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Fine-tune BERTForMaskedLM #7432

Closed naturecreator closed 3 years ago

naturecreator commented 3 years ago

Hello,

I am working on a spelling-correction project. I used the pre-trained "bert-base-cased" model, but the results are not very accurate, so I planned to fine-tune BERT for the masked LM task. I couldn't find any examples of fine-tuning a BERT model for masked LM, so I tried to use "run_language_modeling.py". However, I ran into the following error:

C:\Users\ravida6d\spell_correction\transformers\examples\language-modeling>python run_language_modeling.py --output_dir ="C:\\Users\\ravida6d\\spell_correction\\contextualSpellCheck\\fine_tune\\" --model_type = bert --model_name_or_path = bert-base-cased --do_train --train_data_file =$TRAIN_FILE --do_eval --eval_data_file =$TEST_FILE –mlm

C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\contextualSpellCheck\lib\site-packages\transformers\training_args.py:291: FutureWarning: The `evaluate_during_training` argument is deprecated in favor of `evaluation_strategy` (which has more options)
  FutureWarning,

Traceback (most recent call last):
  File "run_language_modeling.py", line 313, in <module>
    main()
  File "run_language_modeling.py", line 153, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\contextualSpellCheck\lib\site-packages\transformers\hf_argparser.py", line 151, in parse_args_into_dataclasses

    raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
ValueError: Some specified arguments are not used by the HfArgumentParser: ['bert', 'bert-base-cased']

I don't understand how to use this script. Can anyone provide some guidance on fine-tuning BERT for masked LM?

LysandreJik commented 3 years ago

Can you try removing the spaces between --model_type, =, and bert? Same for --model_name_or_path, =, and bert-base-cased.
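
With the extra spaces, argparse takes the literal "=" as the option value and leaves the real values ("bert", "bert-base-cased") dangling, which is exactly what the HfArgumentParser error lists. The corrected invocation should look roughly like this (same paths and data files as in your command, and note the plain double dash in --mlm):

python run_language_modeling.py --output_dir="C:\\Users\\ravida6d\\spell_correction\\contextualSpellCheck\\fine_tune" --model_type=bert --model_name_or_path=bert-base-cased --do_train --train_data_file=$TRAIN_FILE --do_eval --eval_data_file=$TEST_FILE --mlm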

naturecreator commented 3 years ago

@LysandreJik Yes, it works now. Thank you :).

I tried the example as-is, with the same dataset specified, but now I am running into a GPU out-of-memory issue. Do you know how I can change the batch size in "run_language_modeling.py"? Here is a snippet of the error:

09/29/2020 13:11:35 - INFO - filelock - Lock 2508759984840 acquired on C:\\Users\\ravida6d\\Desktop\\spellcheck\\wikitext\cached_lm_BertTokenizer_510_wiki.train.raw.lock
09/29/2020 13:11:35 - INFO - filelock - Lock 2508759984840 released on C:\\Users\\ravida6d\\Desktop\\spellcheck\\wikitext\cached_lm_BertTokenizer_510_wiki.train.raw.lock
09/29/2020 13:11:35 - INFO - filelock - Lock 2508759984560 acquired on C:\\Users\\ravida6d\\Desktop\\spellcheck\\wikitext\cached_lm_BertTokenizer_510_wiki.test.raw.lock
09/29/2020 13:11:36 - INFO - filelock - Lock 2508759984560 released on C:\\Users\\ravida6d\\Desktop\\spellcheck\\wikitext\cached_lm_BertTokenizer_510_wiki.test.raw.lock
C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\spellcheck\lib\site-packages\transformers\trainer.py:266: FutureWarning: Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead.
  FutureWarning,
You are instantiating a Trainer but Tensorboard is not installed. You should consider installing it.
Epoch: 0%| | 0/3 [00:00<?, ?it/s]
Iteration: 0%| | 0/583 [00:00<?, ?it/s]
Iteration: 0%|▏ | 1/583 [00:01<11:16, 1.16s/it]
Traceback (most recent call last):
  File "fine_tune.py", line 313, in <module>
    main()
  File "fine_tune.py", line 277, in main
    trainer.train(model_path=model_path)
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\spellcheck\lib\site-packages\transformers\trainer.py", line 755, in train
    tr_loss += self.training_step(model, inputs)
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\spellcheck\lib\site-packages\transformers\trainer.py", line 1081, in training_step
    loss.backward()
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\spellcheck\lib\site-packages\torch\tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\spellcheck\lib\site-packages\torch\autograd\__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 454.00 MiB (GPU 0; 11.00 GiB total capacity; 8.60 GiB already allocated; 132.32 MiB free; 8.70 GiB reserved in total by PyTorch) (malloc at ..\c10\cuda\CUDACachingAllocator.cpp:289)
(no backtrace available)
Epoch: 0%| | 0/3 [00:01<?, ?it/s]
Iteration: 0%|▏ | 1/583 [00:01<13:12, 1.36s/it]

I would also like to know which argument determines whether we are training from scratch or fine-tuning in "run_language_modeling.py".
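
Edit: it looks like the script just forwards the standard TrainingArguments, so the batch size can be lowered straight from the command line, e.g. with something like

--per_gpu_train_batch_size 2 --gradient_accumulation_steps 8

(gradient accumulation keeps the effective batch size up while using less GPU memory per step; depending on the transformers version the flag may be called --per_device_train_batch_size instead).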

naturecreator commented 3 years ago

Hello @LysandreJik,

I reduced --per_gpu_train_batch_size to 1, and then I could fine-tune the BERT model. The result was stored as pytorch_model.bin. I wanted to load the model using the AutoTokenizer.from_pretrained class method, but I ran into this error:

Traceback (most recent call last):
  File "C:/Users/ravida6d/Desktop/Darshan/spell_correction/contextualSpellCheck/contextualSpellCheck.py", line 587, in <module>
    checker = ContextualSpellCheck(model_name="C:/Users/ravida6d/Desktop/Darshan/spell_correction/contextualSpellCheck/pytorch_model.bin", debug=True, max_edit_dist=3)
  File "C:/Users/ravida6d/Desktop/Darshan/spell_correction/contextualSpellCheck/contextualSpellCheck.py", line 113, in _init_
    self.BertTokenizer = AutoTokenizer.from_pretrained(self.model_name)
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\contextualSpellCheck\lib\site-packages\transformers\tokenization_auto.py", line 210, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\contextualSpellCheck\lib\site-packages\transformers\configuration_auto.py", line 303, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\contextualSpellCheck\lib\site-packages\transformers\configuration_utils.py", line 357, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\contextualSpellCheck\lib\site-packages\transformers\configuration_utils.py", line 439, in _dict_from_json_file
    text = reader.read()
  File "C:\Users\ravida6d\AppData\Local\Continuum\anaconda3\envs\contextualSpellCheck\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

Can you please help me with this?

naturecreator commented 3 years ago

I got it working. The following files must be in the same folder, and the path should point to that folder (not to pytorch_model.bin):

vocab.txt - the vocabulary file
pytorch_model.bin - the PyTorch-compatible (converted) model weights
config.json - the JSON model configuration
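
For anyone hitting the same thing, loading then looks roughly like this (the directory below is just a placeholder for wherever those three files live):

from transformers import AutoTokenizer, BertForMaskedLM

model_dir = "path/to/fine_tune"  # placeholder: folder containing config.json, pytorch_model.bin, vocab.txt

tokenizer = AutoTokenizer.from_pretrained(model_dir)  # reads vocab.txt (and config.json to pick the tokenizer class)
model = BertForMaskedLM.from_pretrained(model_dir)    # reads config.json and pytorch_model.bin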

naturecreator commented 3 years ago

While fine-tuning, we can only see the loss and perplexity, which is useful. Is it also possible to see the model's accuracy, and to get TensorBoard logging, when using the "run_language_modeling.py" script? It would also be really helpful if anyone could explain how the loss is calculated for the BertForMaskedLM task (since no labels are provided while fine-tuning).
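
For reference, here is a rough sketch (my own, not the exact example-script code) of how the masked-LM loss comes about: the script's data collator masks a random ~15% of the input tokens and uses the original token ids at those positions as the labels, with every other position set to -100 so it is ignored; BertForMaskedLM then computes a cross-entropy loss over just the masked positions, and perplexity is exp(loss).

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
labels = inputs["input_ids"].clone()

# Mask one token by hand (DataCollatorForLanguageModeling does this randomly).
masked_index = 4
inputs["input_ids"][0, masked_index] = tokenizer.mask_token_id

# The label at the masked position is the original token id; all other
# positions are set to -100 so the cross-entropy loss ignores them.
labels[0, :masked_index] = -100
labels[0, masked_index + 1:] = -100

loss = model(**inputs, labels=labels)[0]            # first output is the MLM loss
print(loss.item(), torch.exp(loss).item())          # loss and the corresponding perplexity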

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ucas010 commented 1 year ago

Hi, how can I use this repo for spelling error correction? Could you please help me?