prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit
MIT License

RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed #35

Open raullese opened 2 years ago

raullese commented 2 years ago

Hi, I am using train_mbart_model.sh to continue pretraining from mBART-large-50 (https://huggingface.co/facebook/mbart-large-50).

Training on a single GPU works without problems, but when I configure it to run on 2 GPUs as follows:

```bash
export CUDA_VISIBLE_DEVICES=0,1 # Change to the GPU IDs corresponding to GPUs that are free.
nohup python pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en --mono_src examples/test_data/test_mbart_train.en --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 128 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --shard_files > gen_model/mbart/run_train.log 2>&1 &
```

I get the following error:

```
Number of model parameters: 610879488
Total number of params to be optimized are: 610879488
Percentage of parameters to be optimized: 100.0
Initial LR is: 1.25e-07
Training from official pretrained model
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
  File "pretrain_nmt.py", line 968, in <module>
    run_demo()
  File "pretrain_nmt.py", line 965, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "*/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "**/mt_mbart/yanmtt/pretrain_nmt.py", line 313, in model_create_load_run_save
    checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location=map_location)
  File "**/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 853, in _load
    result = unpickler.load()
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/serialization.py", line 833, in load_tensor
    storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading file data/94862161006912: file read failed
```

Can you help me with this problem? I have spent a long time on it and have not been able to solve it.

prajdabre commented 2 years ago

Hi

It's likely that you have a corrupted model file. This can happen if training was terminated while the checkpoint was being saved.

Solutions:

  1. On the terminal, try using torch.load on the model checkpoint to see if it loads (a minimal check is sketched after this list). If it doesn't, that's the issue.
  2. Inspect your previous pretraining logs to see whether everything was saved properly. Check the model file sizes and see if they look ok.
  3. Try to use the model with the pure_model suffix. Hopefully it's not corrupt.
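
For reference, here is a minimal sketch of the check in point 1, assuming the checkpoint path from the command above (adjust it to wherever your checkpoint actually lives):

```python
import torch

# Hypothetical path: point this at the checkpoint your run produced.
CHECKPOINT_PATH = "gen_model/mbart/mbart-50-v1"

try:
    # map_location="cpu" avoids needing a GPU just to verify the file.
    checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location="cpu")
    print("Checkpoint loaded fine. Keys:", list(checkpoint_dict.keys()))
except RuntimeError as err:
    # A truncated or corrupted save typically fails here with the same
    # "PytorchStreamReader failed reading file ..." error seen above.
    print("Checkpoint appears corrupted:", err)
```
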
raullese commented 2 years ago

@prajdabre Thanks for your patience. In fact, I had already guessed that the problem was the model not being fully saved. I saw that when saving succeeds, the pretrain_nmt.py script writes two checkpoints: one is MyModel, which contains the state_dict, optimizer, scheduler, and ctr, and is quite large (6.9 GB after continued pretraining from mBART-50); the other is MyModel.pure_model, which contains only the state_dict, in the same form as the open-source pretrained model, and is 2.3 GB.

I have confirmed that when I run train_mbart_model.sh on multiple GPUs, the big model (6.9 GB) is not saved completely, but the small one is saved completely and I can reload it from pure_model.

As I understand it, the big model (6.9 GB) does not really need to be saved? Can I save and reload only the pure_model every 1000 steps, and keep using the optimizer and scheduler from the previous step? Maybe that would resolve my error on multiple GPUs; finally, I would save the big model only at the last step. Can I make these changes, or do you have other suggestions?

Actually I don't understand the reason for saving this large model

prajdabre commented 2 years ago

Hi

I'm not sure of the exact problem but I typically pass the flag: --save_intermediate_checkpoints

This will save a separate checkpoint every 10k iterations, and that interval can be changed with another flag, --long_save_every. I then use an appropriate checkpoint.

To be honest, I have never used the last checkpoint; I should check whether there's a bug in saving it. That being said, using the .pure_model for fine-tuning is not a problem. I designed the training so that the big checkpoint, with the optimizer, scheduler, and counter, can be used to resume a failed run. The pure_model checkpoint is the one to use for fine-tuning on a downstream task where optimizer params are not needed, for sharing with someone, or for uploading to Hugging Face.

Hope this makes sense.
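
As an aside, a rough sketch of how the two files differ, assuming the key names raullese observed ('model', 'optimizer', 'scheduler', 'ctr') and a ".pure_model" suffix on the same path; the exact names used by pretrain_nmt.py may differ:

```python
import torch

# Hypothetical paths; substitute your own --model_path value.
BIG_CKPT = "gen_model/mbart/mbart-50-v1"
PURE_CKPT = BIG_CKPT + ".pure_model"

# The big checkpoint is a dict bundling everything needed to resume a failed run.
checkpoint_dict = torch.load(BIG_CKPT, map_location="cpu")
print(list(checkpoint_dict.keys()))  # expected: ['model', 'optimizer', 'scheduler', 'ctr']

# The pure_model file is just a plain state_dict (weights only), which is the one
# to load for downstream fine-tuning or to share/upload to Hugging Face.
pure_state_dict = torch.load(PURE_CKPT, map_location="cpu")
print(len(pure_state_dict), "weight tensors")
```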

raullese commented 2 years ago

@prajdabre Thank you for your reply. After many adjustments, I finally decided to remove the intermediate model-loading code, i.e. these lines:

```python
checkpoint_dict = torch.load(CHECKPOINT_PATH, map_location=map_location)
model.load_state_dict(checkpoint_dict['model'])
optimizer.load_state_dict(checkpoint_dict['optimizer'])
scheduler.load_state_dict(checkpoint_dict['scheduler'])
del checkpoint_dict
```

and to keep only the torch.save() part. I hope this is ok.

Another point I haven't understood: where is the stopping criterion of this pretraining stage set? I can't seem to find a parameter that controls the number of training steps.

prajdabre commented 2 years ago

Hi

Your modification is ok. There's no real reason to load a saved model. I just wanted to make sure that everything is hard synchronized.

The argument you are looking for is: --num_batches

raullese commented 2 years ago

> Hi
>
> Your modification is ok. There's no real reason to load a saved model. I just wanted to make sure that everything is hard synchronized.
>
> The argument you are looking for is: --num_batches

Thanks a lot

raullese commented 2 years ago

@prajdabre By the way, when I pretrain on more than one language, say 8 languages, how should I set things up? Are there any other parameters to be aware of? For example, I see `parser.add_argument('--num_domains_for_domain_classifier', type=int, default=1, help='If we have multiple domains then we should set this to a value higher than one.')`.

Is it necessary to adjust this parameter num_domains_for_domain_classifier to be the same as the number of languages?

I ask because my script errored out again when I switched from monolingual to multilingual pretraining (from one language to 8 languages), and this time the error is:

```
****/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:247: UserWarning: To get the last learning rate computed by the scheduler, please use `get_last_lr()`.
  warnings.warn("To get the last learning rate computed by the scheduler, "
****miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:216: UserWarning: Please also save or load the state of the optimizer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)
Traceback (most recent call last):
  File "pretrain_nmt_new.py", line 970, in <module>
    run_demo()
  File "pretrain_nmt_new.py", line 967, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "*/miniconda3/envs/yanmtt/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "**/mt_mbart/yanmtt/pretrain_nmt_new.py", line 523, in model_create_load_run_save
    lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
  File "****/mt_mbart/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
    smooth_loss.masked_fill_(pad_mask, 0.0)
RuntimeError: The expanded size of the tensor (27) must match the existing size (26) at non-singleton dimension 1. Target sizes: [15, 27, 1]. Tensor sizes: [15, 26, 1]
```

and my script setting is:

```bash
nohup python pretrain_nmt_new.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en,es,vi,id,th,pt,zh,ko --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 512 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --save_intermediate_checkpoints --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &
```

prajdabre commented 2 years ago

No you don't need the domain classifier flags.

The reason for the failure is that the language IDs you use must correspond to the ones used in mBART.

en should be en_XX

Look at the official mbart model repo and find the ids for other languages.

raullese commented 2 years ago

@prajdabre Many thanks for your quick reply.

This is the official list of language codes in mBART-large-50, from https://huggingface.co/facebook/mbart-large-50:

Arabic (ar_AR), Czech (cs_CZ), German (de_DE), English (en_XX), Spanish (es_XX), Estonian (et_EE), Finnish (fi_FI), French (fr_XX), Gujarati (gu_IN), Hindi (hi_IN), Italian (it_IT), Japanese (ja_XX), Kazakh (kk_KZ), Korean (ko_KR), Lithuanian (lt_LT), Latvian (lv_LV), Burmese (my_MM), Nepali (ne_NP), Dutch (nl_XX), Romanian (ro_RO), Russian (ru_RU), Sinhala (si_LK), Turkish (tr_TR), Vietnamese (vi_VN), Chinese (zh_CN), Afrikaans (af_ZA), Azerbaijani (az_AZ), Bengali (bn_IN), Persian (fa_IR), Hebrew (he_IL), Croatian (hr_HR), Indonesian (id_ID), Georgian (ka_GE), Khmer (km_KH), Macedonian (mk_MK), Malayalam (ml_IN), Mongolian (mn_MN), Marathi (mr_IN), Polish (pl_PL), Pashto (ps_AF), Portuguese (pt_XX), Swedish (sv_SE), Swahili (sw_KE), Tamil (ta_IN), Telugu (te_IN), Thai (th_TH), Tagalog (tl_XX), Ukrainian (uk_UA), Urdu (ur_PK), Xhosa (xh_ZA), Galician (gl_ES), Slovene (sl_SI)

What I don't understand is that when I pretrained on a single language, I just used "en", not "en_XX", and it ran successfully; also, the examples I saw in example/train_mbart_model.sh all use a format like "--langs hi,en,vi".

prajdabre commented 2 years ago

That's because coincidentally the token en existed in the Mbart tokenizer. I'm betting that the token zh isn't present in the tokenizer and is split as "z# h". Whenever a language token is split into multiple parts my code crashes (intentionally).
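
A quick way to check this, as a sketch with the Hugging Face tokenizer (the tokenizer class yanmtt actually instantiates may differ):

```python
from transformers import MBart50TokenizerFast

tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

for lang in ["en", "zh", "en_XX", "zh_CN"]:
    pieces = tok.tokenize(lang)
    status = "single token" if len(pieces) == 1 else "split into pieces (will crash)"
    print(f"{lang!r} -> {pieces} ({status})")
```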

raullese commented 2 years ago

> That's because coincidentally the token en existed in the Mbart tokenizer. I'm betting that the token zh isn't present in the tokenizer and is split as "z# h". Whenever a language token is split into multiple parts my code crashes (intentionally).

OK, thanks, I will try it. By the way, besides --langs, does the format of --mono_src also need to strictly follow codes like zh_CN? For example, renaming train.zh to train.zh_CN?

prajdabre commented 2 years ago

No the training files can have any suffix.

Only during tokenizer training should the training files have proper suffixes, since those act as the language indicator tokens you plan to use for model training and decoding.

raullese commented 2 years ago

> No the training files can have any suffix.
>
> Only during tokenizer training should the training files have proper suffixes, since those act as the language indicator tokens you plan to use for model training and decoding.

Thank you very much. After some adjustments to the settings, it is now working.

By the way, does that mean that if I continue pretraining with a new language, I have to pretrain from scratch? For example, I continue pretraining from mBART-50, but mBART-50 only has Simplified Chinese, not Traditional Chinese; if I want to add Traditional Chinese, do I have to pretrain from scratch? @prajdabre

prajdabre commented 2 years ago

To my understanding, mbart-50 does not officially support Traditional Chinese. First, you will have to check whether the mbart-50 tokenizer can handle all traditional characters.

If it does, then you can train directly. If not, you have 2 strategies (a rough sketch follows the list):

  1. Convert from traditional to simplified using some mapping table.
  2. Pretrain from scratch.
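
A rough sketch of that check, plus strategy 1 via the third-party OpenCC package (the sample text and the `opencc-python-reimplemented` dependency are assumptions for illustration, not part of yanmtt):

```python
from transformers import MBart50TokenizerFast

tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50")

sample = "繁體中文的測試句子"  # some Traditional Chinese text
pieces = tok.tokenize(sample)
print(pieces)
# Many unknown pieces would indicate poor coverage of traditional characters.
print("unknown pieces:", sum(p == tok.unk_token for p in pieces))

# Strategy 1: map Traditional to Simplified before training
# (pip install opencc-python-reimplemented).
from opencc import OpenCC
print(OpenCC("t2s").convert(sample))
```
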
raullese commented 2 years ago

> To my understanding, mbart-50 does not officially support Traditional Chinese. First, you will have to check whether the mbart-50 tokenizer can handle all traditional characters.
>
> If it does, then you can train directly. If not, you have 2 strategies:
>
> 1. Convert from traditional to simplified using some mapping table.
> 2. Pretrain from scratch.

Thanks a lot @prajdabre

raullese commented 2 years ago

@prajdabre I have another question. I'm running the continued pretraining task train_mbart_model.sh on 2 GPUs with 8 languages (Training for: ['en_XX', 'es_XX', 'vi_VN', 'id_ID', 'th_TH', 'pt_XX', 'zh_CN', 'ko_KR']), with the following settings:

```bash
nohup python pretrain_nmt.py -n 1 -nr 0 -g 2 --model_path gen_model/mbart_v1/mbart-50-v1 --tokenizer_name_or_path pretrain_model/mbart-50 --langs en_XX,es_XX,vi_VN,id_ID,th_TH,pt_XX,zh_CN,ko_KR --mono_src hyb_train_v1/train.en,hyb_train_v1/train.es,hyb_train_v1/train.vi,hyb_train_v1/train.id,hyb_train_v1/train.th,hyb_train_v1/train.pt,hyb_train_v1/train.zh,hyb_train_v1/train.ko --encoder_layers 12 --decoder_layers 12 --encoder_attention_heads=12 --decoder_attention_heads=12 --encoder_ffn_dim=128 --decoder_ffn_dim=4096 --d_model=1024 --batch_size 512 --use_official_pretrained --pretrained_model pretrain_model/mbart-50 --no_reload_optimizer_ctr_and_scheduler --long_save_every 50000 --save_intermediate_checkpoints --shard_files > gen_model/mbart_v1/run_train.log 2>&1 &
```

The strange thing is that after 380k batches, run_train.log shows training for only one language (ko_KR); the other languages do not appear in the log at all, which suggests they have not participated in training yet:

```
Finished epoch 10 for language: ko_KR
379500 6.06 42.11 seconds for 100 batches. Memory used post forward / backward passes: 11.9 / 13.32 GB.
379600 5.85 42.44 seconds for 100 batches. Memory used post forward / backward passes: 11.92 / 13.33 GB.
379700 5.74 42.34 seconds for 100 batches. Memory used post forward / backward passes: 11.98 / 13.36 GB.
379800 5.44 38.15 seconds for 100 batches. Memory used post forward / backward passes: 10.72 / 12.69 GB.
379900 6.02 42.54 seconds for 100 batches. Memory used post forward / backward passes: 12.08 / 13.37 GB.
```

do you know the reason?

raullese commented 2 years ago


By the way, some supplementary information about the data (lines per language):

| lang | raws |
|------|------|
| en | 84747184 |
| es | 7923542 |
| vi | 21776227 |
| pt | 3865782 |
| th | 1343809 |
| id | 14221194 |
| zh_cn | 15662924 |
| ko | 38642 |

and --num_batches = 2000000, batch_size = 512

prajdabre commented 2 years ago

Hi

Your supplementary information about corpora sizes answers it all.

Since Korean has the smallest data it will finish far more epochs before others finish one epoch. This is because there is a data sampling hyperparam called --data_sampling_temperature which is set to 5. This means that smaller data will be seen more often to keep training from focusing on the higher resource languages. I think you will see 1 epoch for Thai after 20 or so epochs for Korean.
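
For intuition, this is the usual temperature-based sampling scheme from mBART-style multilingual pretraining, where a language holding a share q of the data is sampled with probability proportional to q^(1/T); treat it as a sketch, since yanmtt's exact implementation may differ in details:

```python
# Temperature-based language sampling at T = 5, using the corpus sizes reported above.
sizes = {
    "en": 84747184, "es": 7923542, "vi": 21776227, "pt": 3865782,
    "th": 1343809, "id": 14221194, "zh_cn": 15662924, "ko": 38642,
}
T = 5.0
total = sum(sizes.values())
weights = {lang: (n / total) ** (1.0 / T) for lang, n in sizes.items()}
norm = sum(weights.values())

for lang, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    # Sampling probability vs. raw share of the data.
    print(f"{lang}: sampled {w / norm:.1%} of the time vs {sizes[lang] / total:.2%} of the data")
```

With these numbers Korean ends up in roughly 4% of the batches despite being under 0.03% of the data, which is why it races through epochs while the larger languages are still on their first.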

raullese commented 2 years ago

> Hi
>
> Your supplementary information about corpora sizes answers it all.
>
> Since Korean has the smallest data it will finish far more epochs before others finish one epoch. This is because there is a data sampling hyperparam called --data_sampling_temperature which is set to 5. This means that smaller data will be seen more often to keep training from focusing on the higher resource languages. I think you will see 1 epoch for Thai after 20 or so epochs for Korean.

Oh, thank you very much. I see I had neglected the parameter --data_sampling_temperature. If that's the case, then I'll have to think about resetting --num_batches to a larger value. Before, I had assumed that without oversampling the training would cover about 7 epochs over all languages: batches per epoch = total lines (0.14 billion) / batch_size (512) ≈ 273437, so number of epochs = num_batches (2000000) / 273437 ≈ 7.

But now, after 380k batches, the other 7 languages haven't even finished one epoch, so after 2000000 batches some high-resource languages will probably not be fully trained. I don't know if I'm right in thinking this way @prajdabre

prajdabre commented 2 years ago

A correction: The batch size doesn't indicate lines but number of tokens. If your corpus contains paragraphs then each batch per GPU contains only 2 or 3 entries for a batch of 512 tokens. If it's sentences then probably 8 or 10 sentences. Note that you should set --hard_truncate_length to 512 and --max_length to 512 as well. Else your training will skip all the data in case of paragraphs.

Anyway, with just 2 GPUs, you are going to need several tens of millions of steps. I'm afraid that with the scale of data you want to work with you need more GPUs to get results quickly. I recommend filtering the dataset to a more manageable size like 14 million examples across all languages.
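
To make that concrete, a back-of-the-envelope estimate (the ~20 tokens per sentence figure is an assumption, not measured on your data):

```python
# Rough steps-per-epoch estimate with token-based batching.
total_sentences = 150_000_000   # ~0.15B lines across all 8 languages (from the table above)
avg_tokens_per_sentence = 20    # assumption: short sentences after subword tokenization
tokens_per_step = 512 * 2       # --batch_size 512 tokens per GPU, 2 GPUs

steps_per_epoch = total_sentences * avg_tokens_per_sentence / tokens_per_step
print(f"~{steps_per_epoch:,.0f} steps for a single pass over the data")  # ~2.9 million
```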

raullese commented 2 years ago

> A correction: The batch size doesn't indicate lines but number of tokens. If your corpus contains paragraphs then each batch per GPU contains only 2 or 3 entries for a batch of 512 tokens. If it's sentences then probably 8 or 10 sentences. Note that you should set --hard_truncate_length to 512 and --max_length to 512 as well. Else your training will skip all the data in case of paragraphs.
>
> Anyway, with just 2 GPUs, you are going to need several tens of millions of steps. I'm afraid that with the scale of data you want to work with you need more GPUs to get results quickly. I recommend filtering the dataset to a more manageable size like 14 million examples across all languages.

@prajdabre Thank you very much; it seems my previous settings weren't reasonable. My training data in all languages consists of sentences, not paragraphs. I will set --hard_truncate_length and --max_length to 512, use a higher value of --num_batches, and reduce the training set to less than half of its current size.