microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

StopIteration when continuing to train infoxlm from xlmr #807

Closed · SAI990323 closed this issue 2 years ago

SAI990323 commented 2 years ago

I am trying to continue training infoxlm from xlmr on my own dataset. After initializing the conda environment and preparing the training data, I run the following bash command to train, but it throws a StopIteration error. The command I used is:

python src-infoxlm/train.py ${MLM_DATA_DIR} \
  --task infoxlm --criterion xlco \
  --tlm_data ${TLM_DATA_DIR} \
  --xlco_data ${XLCO_DATA_DIR} \
  --arch infoxlm_base --sample-break-mode complete --tokens-per-sample 512 \
  --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 1.0 \
  --lr-scheduler polynomial_decay --lr 0.0002 --warmup-updates 10000 \
  --total-num-update 200000 --max-update 200000 \
  --dropout 0.0 --attention-dropout 0.0 --weight-decay 0.01 \
  --max-sentences 8 --update-freq 8 \
  --log-format simple --log-interval 1 --disable-validation \
  --save-interval-updates 10000 --no-epoch-checkpoints \
  --seed 1 \
  --save-dir ${SAVE_DIR}/ \
  --tensorboard-logdir ${SAVE_DIR}/tb-log \
  --roberta-model-path $HOMEPATH/xlmr.base/model.pt \
  --num-workers 4 --ddp-backend=c10d --distributed-no-spawn \
  --xlco_layer 8 --xlco_queue_size 256 --xlco_lambda 1.0 \
  --xlco_momentum constant,0.9999 --use_proj
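For context, a StopIteration in a training run like this usually surfaces when the training loop calls next() on a data iterator that has run out of batches. A minimal sketch of that failure mode (hypothetical names, not the actual fairseq/infoxlm code):

```python
# Minimal sketch (hypothetical names, not the actual fairseq/infoxlm code)
# of how a StopIteration surfaces in a training loop: the trainer pulls
# batches with next(), and once the data iterator is exhausted the call
# raises StopIteration instead of returning a batch.

def train(batches, num_updates):
    it = iter(batches)
    for step in range(num_updates):
        batch = next(it)  # raises StopIteration when the data runs out
        # ... forward / backward / optimizer step would go here ...
        print(f"step {step}: consumed batch {batch}")

if __name__ == "__main__":
    tiny_dataset = list(range(5))          # pretend each element is one batch
    train(tiny_dataset, num_updates=100)   # prints 5 steps, then raises StopIteration
```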

stvhuang commented 2 years ago

Hi @SAI990323, did you solve the problem? I am encountering the same error.

SAI990323 commented 2 years ago

It's been a long time and I can't remember the details. The error may be attributed to two points.

Hope this helps.

stvhuang commented 2 years ago

Thanks for your reply!

SAI990323 commented 2 years ago

> Thanks for your reply!

I met this problem again and now remember what happened last time. For me, the problem was that there was not enough data to fill the xlco_queue. This only happened when I was testing the bash script.
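As a rough sanity check derived from the flags in the command above (an approximation of the queue behavior, not the exact infoxlm implementation), the XLCO queue alone needs --xlco_queue_size examples before training can proceed, so a very small test dataset is exhausted before the queue is full:

```python
# Rough back-of-envelope check based on the flags in the command above
# (an assumption about the queue behavior, not the exact infoxlm code):
# the contrastive (XLCO) queue must hold --xlco_queue_size examples, so a
# test dataset has to supply at least that many sentence pairs on top of
# the regular training batches.

xlco_queue_size = 256   # --xlco_queue_size
max_sentences   = 8     # --max-sentences (sentences per GPU per step)
update_freq     = 8     # --update-freq (gradient accumulation)

batches_to_fill_queue = xlco_queue_size // max_sentences
sentences_per_update  = max_sentences * update_freq

print(f"batches needed just to fill the queue: {batches_to_fill_queue}")  # 32
print(f"sentences consumed per optimizer update: {sentences_per_update}") # 64
# A toy dataset with only a handful of batches is exhausted before the
# queue is full, which matches the StopIteration reported in this issue.
```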

stvhuang commented 2 years ago

Yes, I also solved this problem by using a larger amount of training data. :)