Closed: cjwen15 closed this issue 3 years ago.
Hi,
I guess this is an error related to distributed training. I'm afraid I don't know how to solve it either, since I haven't used distributed training in a long time. Sorry!
If I don't want to use distributed training, which code do I need to edit or refactor? I only have a single GPU.
Sure, you can do that, and it's very easy. Read the comments in the code and remove the parts for distributed training.
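For example, the single-GPU version usually boils down to something like this (a minimal sketch with a stand-in model and dataset, not the exact code in Trainer.py):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, RandomSampler, TensorDataset

# Single-GPU setup: no init_process_group, no DistributedDataParallel,
# and no DistributedSampler.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-ins for the real model and dataset built in Trainer.py:
model = nn.Linear(10, 2).to(device)
train_dataset = TensorDataset(torch.randn(100, 10))

# Where the distributed code uses DistributedSampler(train_dataset),
# the single-GPU version just shuffles with RandomSampler:
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=4)

for (batch,) in train_dataloader:
    batch = batch.to(device)   # move each batch to the single device
    logits = model(batch)      # forward pass; backward/optimizer as usual
```

You would then start Trainer.py directly with python, without torch.distributed.launch and without the --local_rank flag.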
Here's the error from the run:
```
Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
Traceback (most recent call last):
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/wen/anaconda3/envs/pytorch/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/wen/anaconda3/envs/pytorch/bin/python3', '-u', 'Trainer.py', '--local_rank=0', '--do_train', '--do_eval', '--do_predict', '--evaluate_during_training', '--data_dir=data/dataset/COIE/origin', '--output_dir=data/result/COIE/origin/lebertcrf', '--config_name=data/berts/bert/config.json', '--model_name_or_path=data/berts/bert/pytorch_model.bin', '--vocab_file=data/berts/bert/vocab.txt', '--word_vocab_file=data/vocab/tencent_vocab.txt', '--max_scan_num=1000000', '--max_word_num=5', '--label_file=data/dataset/COIE/origin/labels.txt', '--word_embedding=data/embedding/Tencent_AILab_ChineseEmbedding.txt', '--saved_embedding_dir=data/dataset/COIE/origin', '--model_type=WCBertCRF_Token', '--seed=106524', '--per_gpu_train_batch_size=4', '--per_gpu_eval_batch_size=16', '--learning_rate=1e-5', '--max_steps=-1', '--max_seq_length=256', '--num_train_epochs=20', '--warmup_steps=190', '--save_steps=600', '--logging_steps=100']' died with <Signals.SIGKILL: 9>.
```
And here's the log:
```
2021-07-07 15:40:29:INFO: Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2021-07-07 15:40:29:INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, config_name='data/berts/bert/config.json', data_dir='data/dataset/COIE/origin', default_label='O', device=device(type='cuda', index=0), do_eval=True, do_predict=True, do_shuffle=True, do_train=True, evaluate_during_training=True, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, label_file='data/dataset/COIE/origin/labels.txt', learning_rate=1e-05, local_rank=0, logging_dir='data/log', logging_steps=100, max_grad_norm=1.0, max_scan_num=1000000, max_seq_length=256, max_steps=-1, max_word_num=5, model_name_or_path='data/berts/bert/pytorch_model.bin', model_type='WCBertCRF_Token', n_gpu=1, no_cuda=False, nodes=1, num_train_epochs=20, output_dir='data/result/COIE/origin/lebertcrf', overwrite_cache=True, per_gpu_eval_batch_size=16, per_gpu_train_batch_size=4, save_steps=600, save_total_limit=50, saved_embedding_dir='data/dataset/COIE/origin', seed=106524, sgd_momentum=0.9, vocab_file='data/berts/bert/vocab.txt', warmup_steps=190, weight_decay=0.0, word_embed_dim=200, word_embedding='data/embedding/Tencent_AILab_ChineseEmbedding.txt', word_vocab_file='data/vocab/tencent_vocab.txt')
```
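A side note on why this log reports distributed training: True even with n_gpu: 1: torch.distributed.launch passes --local_rank=0 to the script (visible in the command above), and the usual transformers-style device setup treats any local_rank other than -1 as distributed mode. A minimal reconstruction of that pattern, as an assumption about what Trainer.py does rather than its exact code:

```python
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
parser.add_argument("--no_cuda", action="store_true")
args = parser.parse_args()

if args.local_rank == -1 or args.no_cuda:
    # Plain single-process mode: local_rank stays -1 when the script is
    # started directly with `python Trainer.py ...`.
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
else:
    # torch.distributed.launch sets --local_rank (0 here), so this branch
    # runs even on a single GPU and initializes the process group.
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    torch.distributed.init_process_group(backend="nccl")

print(f"device: {device}, distributed training: {args.local_rank != -1}")
```

So starting the script without the launcher (and therefore without --local_rank) should keep it on the plain single-GPU path.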
Hope you can reply. Thanks.