huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Use `run_language_modeling.py` to fine-tune GPT-2, but it core dumps unexpectedly. #7020

Closed: Abbyyan closed this issue 3 years ago

Abbyyan commented 3 years ago

❓ Questions & Help

I am using run_language_modeling.py to fine-tune GPT-2, but it core dumps unexpectedly. How can I find the cause of the crash, please?

Details

I'm using the transformers run_language_modeling.py script to fine-tune GPT-2 as follows.

(1) With GPU

python3 run_language_modeling.py --output_dir=/home/xxx/transformers/examples/language-modeling/output_dir --model_type=gpt2 --model_name_or_path=gpt2 --per_gpu_train_batch_size=1 --do_train --train_data_file=/home/xxx/data_info/transformer.data --block_size=512 --save_steps=500 --overwrite_output_dir

But it core dumps unexpectedly.

terminate called after throwing an instance of 'std::runtime_error'                                                                                  | 0/1348 [00:00<?, ?it/s]
  what():  NCCL Error 1: unhandled cuda error
Aborted

And the return code is 134.

(2) With CPU

python3 run_language_modeling.py --output_dir=/home/xxx/transformers/examples/language-modeling/output_dir --model_type=gpt2 --model_name_or_path=gpt2 --do_train --train_data_file=/home/xxx/data_info/transformer.data --block_size=512 --save_steps=500 --overwrite_output_dir --no_cuda

At first the training is normal, but it quits after a while with return code 137.

(3) Question

The dataset I use is just a data.txt: a single file combining multiple articles, with an <|endoftext|> token appended at the end of each article (a rough sketch of how it is built is below). How can I find the cause of the crash, please? Hope for your help. Thanks a lot.
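For reference, a minimal sketch of how such a file can be assembled; "articles" is just a placeholder for the real list of article strings:

# Rough sketch of how data.txt is assembled; "articles" stands in for the
# real list of article strings.
articles = ["First article text ...", "Second article text ..."]

with open("data.txt", "w", encoding="utf-8") as f:
    for article in articles:
        f.write(article.strip())
        f.write("<|endoftext|>\n")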

LysandreJik commented 3 years ago

Hi, do you mind copy-pasting the text in your terminal instead of putting images? It'll be easier for us to understand, to debug, and for other users to search for similar issues. Thanks!

Abbyyan commented 3 years ago

> Hi, do you mind copy-pasting the text in your terminal instead of putting images? It'll be easier for us to understand, to debug, and for other users to search for similar issues. Thanks!

I've figured out the cause of the GPU core dump. According to https://github.com/pytorch/pytorch/issues/31285, my GPU card is not supported by the prebuilt PyTorch binaries (see the quick check sketched at the end of this comment), so I'm trying to build PyTorch from source. But it's weird that when I run run_language_modeling.py with --no_cuda, there is no error message. The command I used is

python3 run_language_modeling.py --output_dir=/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir --model_type=gpt2 --model_name_or_path=gpt2 --per_device_train_batch_size=1 --do_train --train_data_file=/home/xxx/gpt_model/data_info/transformer.data --block_size=512 --save_steps=500 --overwrite_output_dir --no_cuda

The output message is

09/09/2020 16:23:23 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
09/09/2020 16:23:23 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir='/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, evaluate_during_training=False, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Sep09_16-23-23_TENCENT64.site', logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=True, seed=42, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1, run_name=None, disable_tqdm=False, remove_unused_columns=True)
/home/xxx/anaconda3/envs/transformers/lib/python3.6/site-packages/transformers/modeling_auto.py:821: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
  FutureWarning,
/home/xxx/anaconda3/envs/transformers/lib/python3.6/site-packages/transformers/tokenization_utils_base.py:1321: FutureWarning: The `max_len` attribute has been deprecated and will be removed in a future version, use `model_max_length` instead.
  FutureWarning,
09/09/2020 16:23:31 - INFO - filelock -   Lock 140250526322472 acquired on /home/xxx/gpt_model/data_info/cached_lm_GPT2Tokenizer_512_transformer.data.lock
09/09/2020 16:23:32 - INFO - filelock -   Lock 140250526322472 released on /home/xxx/gpt_model/data_info/cached_lm_GPT2Tokenizer_512_transformer.data.lock

/home/xxx/anaconda3/envs/transformers/lib/python3.6/site-packages/transformers/trainer.py:247: FutureWarning: Passing `prediction_loss_only` as a keyword argument is deprecated and won't be possible in a future version. Use `args.prediction_loss_only` instead.
  FutureWarning,
<library>In get_train_dataloader ,tarin_batch_size =  1
Epoch:   0%|                                                                                                                                            | 0/3 

Killedion:   4%|████▎                                                                                                                    | 194/5390 [04:30<2:58:02,  2.06s/it]  

And from echo $?, the return code is 137. Thanks a lot.
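For reference, the quick check mentioned above is roughly this, i.e. a minimal sketch (not specific to transformers) of whether the installed PyTorch build can actually use the card:

import torch

# Minimal check: can the installed PyTorch build see and use the local GPU?
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    # A tiny op on the GPU; on an unsupported card this is where it fails.
    print(torch.zeros(1).cuda() + 1)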

LysandreJik commented 3 years ago

Return code 137 means that you hit an out-of-memory error. Do you get the same error if you use distilgpt2 with --block_size=64? (Just for testing purposes.)
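As an aside, exit codes above 128 mean the process was killed by signal (code - 128); here 137 - 128 = 9, i.e. SIGKILL, which is what the kernel's OOM killer sends. A one-liner to confirm the mapping:

import signal

# 137 = 128 + 9; signal 9 is SIGKILL, the signal the OOM killer sends.
print(signal.Signals(137 - 128).name)  # prints "SIGKILL"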

We've also recently patched a memory error in the Trainer; could you install from source to benefit from the fix? You can do so like this:

pip install git+https://github.com/huggingface/transformers
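Once installed, a quick sanity check that the source install is the one actually being picked up:

import transformers

# Sanity check: print the version string and install path of the library in use.
print(transformers.__version__)
print(transformers.__file__)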

Abbyyan commented 3 years ago

> pip install git+https://github.com/huggingface/transformers

Yes, I found an Out of memory entry in /var/log/messages, and it turns out the process uses a lot of memory.

Sep  9 17:10:44 centos kernel: Out of memory: Kill process 126138 (python3) score 939 or sacrifice child
Sep  9 17:10:44 centos kernel: Killed process 126170 (python3) total-vm:325690732kB, anon-rss:125105968kB, file-rss:0kB

And this is my machine info (captured in the shell with the top command).

top - 17:47:55 up 303 days,  2:35,  4 users,  load average: 1.14, 2.98, 3.44
Tasks: 468 total,   1 running, 467 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13132300+total, 12383201+free,  3294300 used,  4196688 buff/cache
KiB Swap:  2088956 total,       12 free,  2088944 used. 12565272+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                 
  2868 root      20   0 3006004  19072      0 S   0.3  0.0 225:15.55 dockerd                                                                                                 

Then I used pip uninstall transformers; pip install git+https://github.com/huggingface/transformers to reinstall the transformers library, and ran run_language_modeling.py again.

python3 run_language_modeling.py --output_dir=/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir --model_type=gpt2 --model_name_or_path=distilgpt2 --per_device_train_batch_size=1 --do_train --train_data_file=/home/xxx/gpt_model/data_info/data.txt --block_size=64 --save_steps=500 --overwrite_output_dir --no_cuda

This is the memory info shown by top after I ran run_language_modeling.py on CPU.

top - 17:51:40 up 303 days,  2:38,  4 users,  load average: 23.42, 9.30, 5.50
Tasks: 463 total,   2 running, 461 sleeping,   0 stopped,   0 zombie
%Cpu(s): 34.4 us, 49.5 sy,  0.0 ni, 16.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13132300+total, 12142824+free,  5283664 used,  4611096 buff/cache
KiB Swap:  2088956 total,       12 free,  2088944 used. 12365893+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                 
 25414 xxx   20   0  0.188t 1.945g  83876 R  3356  1.6  47:31.03 python3   

It has already been running for about two hours and works well. Thanks a lot.

Abbyyan commented 3 years ago

By the way, how can I use the fine-tuned model, please? There are 6 files generated under output_dir/checkpoint: config.json, log_history.json, optimizer.pt, pytorch_model.bin, scheduler.pt, and training_args.bin. How should I use them? Should I just load them as follows? Does the fine-tuned model need to be renamed? Which level of the checkpoint directory should I specify? Thanks a lot.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir/checkpoint-15000')

LysandreJik commented 3 years ago

Cool! Yes, the way you load the model is correct. I'm guessing you want to use the resulting model, so given that you passed --output_dir=/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir, you should be able to load it like this:

from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir')
model = GPT2LMHeadModel.from_pretrained('/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir')
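And then, to actually generate with it, roughly something like this (a minimal sketch continuing from the lines above; the prompt is just an example):

# Continuing from the tokenizer/model loaded above: encode a prompt,
# sample a continuation, and decode it back to text.
input_ids = tokenizer.encode("Once upon a time", return_tensors="pt")
output_ids = model.generate(input_ids, max_length=50, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))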
Abbyyan commented 3 years ago

> Cool! Yes, the way you load the model is correct. I'm guessing you want to use the resulting model, so given that you passed --output_dir=/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir, you should be able to load it like this:
>
> tokenizer = GPT2Tokenizer.from_pretrained('/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir')
> model = GPT2LMHeadModel.from_pretrained('/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir')

Got it! I used --output_dir=/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir and there are many checkpoints generated under it. I just chose one of them, for example checkpoint-14000, and downloaded 'vocab.json' and 'merges.txt' into /home/xxx/gpt_model/transformers/examples/language-modeling/output_dir/checkpoint-14000.

/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir/checkpoint-14000 >>> wget https://cdn.huggingface.co/distilgpt2-vocab.json -O vocab.json
/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir/checkpoint-14000 >>> wget https://cdn.huggingface.co/distilgpt2-merges.txt -O merges.txt
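(Presumably the same files could also be written there directly with the tokenizer API instead of wget; a sketch of that alternative, assuming the same checkpoint path:)

from transformers import GPT2Tokenizer

# Alternative to wget: save the distilgpt2 tokenizer files (vocab.json, merges.txt, ...)
# straight into the checkpoint directory.
ckpt_dir = '/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir/checkpoint-14000'
GPT2Tokenizer.from_pretrained('distilgpt2').save_pretrained(ckpt_dir)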

Then I ran run_generation.py as follows and it generated text as expected.

python run_generation.py --model_type=gpt2 --model_name_or_path=/home/xxx/gpt_model/transformers/examples/language-modeling/output_dir/checkpoint-14000  --no_cuda

Thank you for your help!

LysandreJik commented 3 years ago

Very cool, glad you got it to work! Let us know if you face any other issues.

zeyuyun1 commented 3 years ago

It seems like I still have the same issue. I'm trying to use run_language_modeling.py to train a small BERT model (6 layers) from scratch. The process is killed after about 3 hours of training. The error message is simply "Killed", and I observe constantly increasing memory usage, so I think the issue is also OOM.

top - 16:36:33 up  5:29,  1 user,  load average: 1.07, 1.19, 1.16
Tasks: 353 total,   3 running, 349 sleeping,   0 stopped,   1 zombie
%Cpu(s): 10.4 us,  1.2 sy,  0.0 ni, 88.1 id,  0.1 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  16010.9 total,    151.5 free,   3386.3 used,  12473.2 buff/cache
MiB Swap:   2048.0 total,   1550.2 free,    497.8 used.  12181.5 avail Mem 

Is this normal? I can't think of a reason why it would use this much memory.
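For what it's worth, this is roughly how the growth can be watched from inside the training process (a sketch, assuming psutil is installed; it is not part of run_language_modeling.py):

import os
import psutil

# Print the resident set size of the current process; calling this every few
# hundred training steps shows whether RSS keeps climbing.
def log_rss():
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)
    print(f"RSS: {rss_mb:.1f} MiB")

log_rss()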