grammarly/gector

Official implementation of the papers "GECToR – Grammatical Error Correction: Tag, Not Rewrite" (BEA-20) and "Text Simplification by Tagging" (BEA-21)
Apache License 2.0

Model is downloaded every time despite using --pretrain and --pretrain_folder #149

Closed: mina1460 closed this issue 2 years ago

mina1460 commented 2 years ago

Hello from Egypt! First of all, I want to thank you for the amazing code and repository you have here.

I have a problem continuing training from the pre-trained models you uploaded. Here is my command:

!python train.py --model_dir "/content/gdrive/MyDrive/Gector code/models" --train_set "/content/gdrive/MyDrive/Gector code/gector/a1_shuf_train" --dev_set "/content/gdrive/MyDrive/Gector code/gector/a1_shuf_dev" --pretrain_folder "/content/gdrive/MyDrive/Gector code/pretrained_models/" --pretrain bert_0_gectorv2 --special_tokens_fix 0 --transformer_model bert --tune_bert 1 --skip_correct 1 --skip_complex 0 --max_len 50 --batch_size 64 --tag_strategy keep_one --cold_steps_count 0 --cold_lr 1e-3 --lr 1e-5 --predictor_dropout 0.0 --lowercase_tokens 0 --pieces_per_token 5 --vocab_path data/output_vocabulary --label_smoothing 0.0 --patience 3 --n_epoch 20

[screenshot of the training log]

Downloading: 100% 570/570 [00:00<00:00, 678kB/s]
Downloading: 100% 208k/208k [00:00<00:00, 1.76MB/s]
Downloading: 100% 426k/426k [00:00<00:00, 2.91MB/s]
WARNING:root:vocabulary serialization directory /content/gdrive/MyDrive/Gector code/models/vocabulary is not empty
Data is loaded

Downloading: 100% 416M/416M [00:11<00:00, 39.2MB/s]

Can anyone tell me why this is happening? Why is a base model still being downloaded from Hugging Face even though I provided the --pretrain_folder and the model name in --pretrain?

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']

Thank you so much

mina1460 commented 2 years ago

Also, I don't quite understand updates_per_epoch if I have already specified a batch size. From what I understand, if I have 200 samples and a batch size of 5, I will go through 40 batches per epoch, i.e. 40 updates per epoch. How is it possible to specify both parameters?
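
As a worked example of that arithmetic (illustrative only, not code from this repository), assuming an epoch means one full pass over the data:

    import math

    # With full passes over the data, the number of optimizer updates per
    # epoch follows from the dataset size and the batch size.
    num_samples = 200
    batch_size = 5
    updates_per_epoch = math.ceil(num_samples / batch_size)
    print(updates_per_epoch)  # 40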

mina1460 commented 2 years ago

[screenshot of the prediction log]

The same thing happens when predicting.

skurzhanskyi commented 2 years ago

Hi @mina1460, thanks for your interest in the repository. Regarding your questions:

  1. The script still uses the pretrained model you passed. It loads the base transformer weights (bert-base-cased in your run) first and then overwrites them with the weights from the path you provided. It should download them only the first time, because of caching (see the caching sketch below this list).
  2. It's not an error; it's a warning that appears when some parts of the model are initialized randomly. In our case, it's the final prediction layers.
  3. When the amount of data is very big (as during the pre-training stage), we use updates_per_epoch to treat a fixed number of updates as one epoch instead of a full pass over the data.
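
For Colab specifically, one way to make that caching survive across sessions is to point the Hugging Face cache at a persistent Drive folder. A minimal sketch, using the transformers library's TRANSFORMERS_CACHE environment variable and a hypothetical Drive path; run it in a cell before the train/predict commands:

    # Minimal sketch (not part of GECToR): keep the Hugging Face cache on
    # Google Drive so the base transformer weights survive Colab restarts.
    # The Drive folder below is hypothetical -- adjust it to your own layout.
    import os

    cache_dir = "/content/gdrive/MyDrive/hf_cache"
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = cache_dir
    # Subsequent !python train.py / !python predict.py cells inherit this
    # variable and reuse the cached files instead of downloading them again.
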
mina1460 commented 2 years ago

Thank you for your reply.

I just have one more question. If I am using the pre-trained models from your repo, I suppose these models are the result of the three training stages. So why do I need to download a fresh model from Hugging Face if it is right here (tuned and ready) and I am giving the code the path to load it from?

Also, can you please tell me: if I wanted to use a T5 model instead of BERT or RoBERTa, would it be as simple as using the T5 tokenizer, or are there other complications that I can't see?

Again thank you so much for your help

skurzhanskyi commented 2 years ago
  1. As I mentioned before, this is rather a bug: the script downloads the default transformer weights first, even though they are then overwritten by the checkpoint you provide.
  2. T5 is an encoder-decoder (NLG) model, while here we use encoder-only (NLU) models, so I don't think it would fit (see the sketch below).
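
To make the encoder-only point concrete (a minimal sketch, assuming torch and transformers are installed; this is not GECToR's actual model code, and the tag vocabulary size is hypothetical): the encoder already emits one hidden state per input token, so a single linear head is enough to predict an edit tag for every token, with no decoder involved.

    # Minimal sketch of token-level tagging on top of an encoder-only model.
    # Not GECToR's code; the tag vocabulary size below is hypothetical.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    encoder = AutoModel.from_pretrained("roberta-base")
    num_tags = 5000                      # hypothetical edit-tag vocabulary size
    tag_head = torch.nn.Linear(encoder.config.hidden_size, num_tags)

    inputs = tokenizer("She are happy .", return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    tag_logits = tag_head(hidden)                  # one tag distribution per token
    print(tag_logits.shape)
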
mina1460 commented 2 years ago

Thank you so much

arvindpdmn commented 2 years ago

The answer above addresses training, I think. What about prediction? Why do we get these downloads? My guess is that they come from a model hub (PyTorch Hub?), but I'm not sure what exactly is downloaded. I then searched the folder, but it's not clear what is installed or where; or is it kept in memory? (A sketch for inspecting the cache follows the log below.) My prediction command in Google Colab:

!(python predict.py --model_path ./roberta_1_gector.th \
    --vocab_path ./data/output_vocabulary/ \
    --input_file myfile.txt \
    --output_file myfile.corr \
    --transformer_model roberta \
    --special_tokens_fix 1)

What I get:

Downloading: 100% 481/481 [00:00<00:00, 885kB/s]
Downloading: 100% 878k/878k [00:00<00:00, 28.1MB/s]
Downloading: 100% 446k/446k [00:00<00:00, 20.5MB/s]
Downloading: 100% 1.29M/1.29M [00:00<00:00, 29.0MB/s]
Downloading: 100% 478M/478M [00:07<00:00, 68.0MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Produced overall corrections: 48
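
Regarding where these files end up (a sketch based on the transformers library's defaults, not anything GECToR-specific): the Downloading lines are the roberta-base config, vocabulary/merges, tokenizer files, and weights being fetched from the Hugging Face hub, and they are cached on disk rather than kept in memory. With the transformers versions of that era, the default cache is typically under ~/.cache/huggingface/transformers unless TRANSFORMERS_CACHE is set:

    # Sketch: list the Hugging Face cache to see what was downloaded and how
    # large each entry is. The default path is the library's convention at
    # the time; TRANSFORMERS_CACHE overrides it when set.
    import os

    cache_dir = os.environ.get(
        "TRANSFORMERS_CACHE",
        os.path.expanduser("~/.cache/huggingface/transformers"),
    )
    if os.path.isdir(cache_dir):
        for name in sorted(os.listdir(cache_dir)):
            size = os.path.getsize(os.path.join(cache_dir, name))
            print(f"{size:>12,}  {name}")
    else:
        print(f"No cache directory at {cache_dir}")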