Closed: cjwen15 closed this issue 3 years ago.
Hi,
I guess this is an error related to distributed training. I'm afraid I don't know how to solve it either, since I haven't used distributed training in a long time. Sorry!
If I don't want to use distributed training, which code do I need to edit or refactor? I only have a single GPU.
Sure, you can do that, and it's very easy. Read the comments in the code and remove the parts for distributed training.
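For example, the single-GPU version usually boils down to something like this (a minimal sketch with a stand-in model and dataset, not the exact code in Trainer.py):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Single-GPU setup: no init_process_group, no DistributedDataParallel,
# and no DistributedSampler.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for the real model and dataset built in Trainer.py:
model = nn.Linear(10, 2).to(device)
train_dataset = TensorDataset(torch.randn(100, 10))

# Where the distributed code uses DistributedSampler(train_dataset),
# the single-GPU version just shuffles with RandomSampler:
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=4)

for (batch,) in train_dataloader:
    batch = batch.to(device)   # move each batch to the single device
    logits = model(batch)      # forward pass; backward/optimizer as usual
```

You would then start Trainer.py directly with python, without torch.distributed.launch and without the --local_rank flag.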
Here's the error from the run:
```
Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/wen/anaconda3/envs/pytorch/bin/python3', '-u', 'Trainer.py', '--local_rank=0', '--do_train', '--do_eval', '--do_predict', '--evaluate_during_training', '--data_dir=data/dataset/COIE/origin', '--output_dir=data/result/COIE/origin/lebertcrf', '--config_name=data/berts/bert/config.json', '--model_name_or_path=data/berts/bert/pytorch_model.bin', '--vocab_file=data/berts/bert/vocab.txt', '--word_vocab_file=data/vocab/tencent_vocab.txt', '--max_scan_num=1000000', '--max_word_num=5', '--label_file=data/dataset/COIE/origin/labels.txt', '--word_embedding=data/embedding/Tencent_AILab_ChineseEmbedding.txt', '--saved_embedding_dir=data/dataset/COIE/origin', '--model_type=WCBertCRF_Token', '--seed=106524', '--per_gpu_train_batch_size=4', '--per_gpu_eval_batch_size=16', '--learning_rate=1e-5', '--max_steps=-1', '--max_seq_length=256', '--num_train_epochs=20', '--warmup_steps=190', '--save_steps=600', '--logging_steps=100']' died with <Signals.SIGKILL: 9>.
```
And here's the log:
```
2021-07-07 15:40:29:INFO: Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2021-07-07 15:40:29:INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, config_name='data/berts/bert/config.json', data_dir='data/dataset/COIE/origin', default_label='O', device=device(type='cuda', index=0), do_eval=True, do_predict=True, do_shuffle=True, do_train=True, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, label_file='data/dataset/COIE/origin/labels.txt', learning_rate=1e-05, local_rank=0, logging_dir='data/log', logging_steps=100, max_grad_norm=1.0, max_scan_num=1000000, max_seq_length=256, max_steps=-1, max_word_num=5, model_name_or_path='data/berts/bert/pytorch_model.bin', model_type='WCBertCRF_Token', n_gpu=1, no_cuda=False, nodes=1, num_train_epochs=20, output_dir='data/result/COIE/origin/lebertcrf', overwrite_cache=True, per_gpu_eval_batch_size=16, per_gpu_train_batch_size=4, save_steps=600, save_total_limit=50, saved_embedding_dir='data/dataset/COIE/origin', seed=106524, sgd_momentum=0.9, vocab_file='data/berts/bert/vocab.txt', warmup_steps=190, weight_decay=0.0, word_embed_dim=200, word_embedding='data/embedding/Tencent_AILab_ChineseEmbedding.txt', word_vocab_file='data/vocab/tencent_vocab.txt')
```
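A side note on why this log reports distributed training: True even with n_gpu: 1: torch.distributed.launch passes --local_rank=0 to the script (visible in the command above), and the usual transformers-style device setup treats any local_rank other than -1 as distributed mode. A minimal reconstruction of that pattern, as an assumption about what Trainer.py does rather than its exact code:

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
parser.add_argument("--no_cuda", action="store_true")
args = parser.parse_args()

if args.local_rank == -1 or args.no_cuda:
    # Plain single-process mode: local_rank stays -1 when the script is
    # started directly with `python Trainer.py ...`.
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
else:
    # torch.distributed.launch sets --local_rank (0 here), so this branch
    # runs even on a single GPU and initializes the process group.
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")

print(f"device: {device}, distributed training: {args.local_rank != -1}")
```

So starting the script without the launcher (and therefore without --local_rank) should keep it on the plain single-GPU path.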
Hope you can reply. Thanks.