Confused on how to train the model.

gdet commented 3 years ago

Hello,

Sorry if this is a silly question. I am trying to fine-tune the English GPT-2 model on my language. I first ran the script without the "while loop" and got an out-of-memory (OOM) error, and the run crashed. So then I tried running it like this:

export TRAIN_FILE=/ru_transformers/fulljan21
export CUDA_VISIBLE_DEVICES=2
export MODEL_SIZE=gpt2-large
export OUTPUT=output_yt/lfeb
export BS=1
export LR=1e-5

while true
do
    python run_lm_finetuning.py \
        --output_dir=$OUTPUT \
        --model_type=gpt2 \
        --model_name_or_path=$OUTPUT \
        --do_train \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size $BS \
        --save_steps=10000 \
        --logging_steps=10 \
        --fp16 \
        --fp16_opt_level O2 \
        --warmup_samples 16000 \
        --learning_rate $LR \
        --overwrite_output_dir \
        --tokenizer_class YTEncoder \
        --tokenizer_name bpe/yt.model \
        --do_eval \
        --evaluate_during_training \
        --eval_steps 1000 \
        --eval_data_file=./data/classic/valid \
        --save_total_limit 30 \
        --num_train_epochs 10.0 \
        --unfreeze_level 0

    sleep 1
done

The model has now been running for about two weeks, and it keeps evaluating. I cannot see the epoch progress percentage or the sample.txt file that I could see when I ran the model on a very small data sample (2 GB). My files are now around 40 GB. I am using an NVIDIA Corporation GV102 graphics card, and it is working at around 40%. Am I doing something wrong?

Currently I have this output:

 02/10/2021 14:35:22 - INFO - __main__ -   Loading features from ./evaluate_dir
 ./evaluate_dir

1675459███████████████████████████████████████████████████████████████████████████████████████████████| 100.00% [2259/2259 00:17<00:00]

 02/10/2021 14:35:39 - INFO - __main__ -   ***** Running evaluation checkpoint-27000 *****
 02/10/2021 14:35:39 - INFO - __main__ -     Num examples = 1675459
 02/10/2021 14:35:39 - INFO - __main__ -     Batch size = 4
 Evaluating:  17%|███████████████████████████████▏                                                                                                                                                   | 72983/418865 [58:22<4:35:44, 20.91it/s]

Thank you

mgrankin commented 3 years ago

Something is probably wrong. It should be training, and it should evaluate after every 1000 training steps.
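As a rough check, the numbers in the evaluation log above already suggest why the run appears to "keep evaluating". The sketch below only redoes the arithmetic from that log (1675459 examples, batch size 4, about 20.91 it/s) and assumes the rate stays roughly constant over the whole pass.

```python
import math

# Values copied from the evaluation log in the first comment.
num_examples = 1_675_459
eval_batch_size = 4
iters_per_second = 20.91  # assumed to stay roughly constant

# Number of evaluation batches; matches the 418865 shown in the progress bar.
eval_iterations = math.ceil(num_examples / eval_batch_size)

# Approximate wall-clock time of one full evaluation pass, in hours.
eval_hours = eval_iterations / iters_per_second / 3600

print(eval_iterations)       # 418865
print(round(eval_hours, 1))  # 5.6
```

If evaluation really is triggered every 1000 training steps, as `--eval_steps 1000` requests, each trigger would cost about five and a half hours on an evaluation set of this size, so a much smaller evaluation file may be worth trying.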

gdet commented 3 years ago

I am sorry, I copy-pasted the wrong loop. This is the one I am running:

while true
do
    python3 run_lm_finetuning_jan2021.py \
        --output_dir=$OUTPUT \
        --model_type=gpt2 \
        --model_name_or_path=$MODEL_SIZE \
        --do_train \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size $BS \
        --save_steps=10000 \
        --logging_steps=10 \
        --fp16 \
        --fp16_opt_level O2 \
        --warmup_samples 16000 \
        --learning_rate $LR \
        --overwrite_output_dir \
        --tokenizer_class YTEncoder \
        --tokenizer_name modeljan2021/yt.model \
        --do_eval \
        --evaluate_during_training \
        --eval_steps 1000 \
        --eval_data_file=./evaluate_dir \
        --save_total_limit 30 \
        --num_train_epochs 10.0 \
        --block_size=64 \
        --unfreeze_level 0

    sleep 1
done

Is there maybe a problem with block_size=64? I added block_size because without it the loop below did not work: len(tokenized_text)-block_size was negative.

for i in range(rnd_shift, len(tokenized_text) - block_size + 1, block_size):
    examples.append(tokenizer.add_special_tokens_single_sentence(tokenized_text[i:i + block_size]))

What should I look for? Any information will be helpful. Thank you
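To make the behaviour of that `range(...)` call concrete, here is a minimal sketch that reproduces only its index arithmetic. `chunk_starts` is a hypothetical helper written for illustration and is not part of run_lm_finetuning.py. It shows that a text shorter than `block_size` produces no start indices, and therefore no training examples.

```python
def chunk_starts(num_tokens, block_size, rnd_shift=0):
    """Start indices produced by the range() call in the snippet above.

    Hypothetical helper for illustration only; it mirrors just the
    index arithmetic, not the tokenizer call.
    """
    return list(range(rnd_shift, num_tokens - block_size + 1, block_size))


# A text shorter than block_size: the stop bound is not above the start,
# so the range is empty and no examples are appended.
print(chunk_starts(50, 64))   # []

# A text of exactly block_size tokens yields one block.
print(chunk_starts(64, 64))   # [0]

# A longer text yields full blocks only; the trailing 8 tokens are dropped.
print(chunk_starts(200, 64))  # [0, 64, 128]
```

A smaller `block_size` lets shorter texts contribute examples, which is consistent with the loop starting to work once `--block_size=64` was added.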

mgrankin commented 3 years ago

Run the program in VS Code and look at what is going on in the training procedure. It should loop over the data.
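As a point of comparison while stepping through the code, the expected control flow can be sketched as below. This is a simplified skeleton under stated assumptions, not the actual code of run_lm_finetuning.py: `run_training`, `train_step` and `evaluate` are hypothetical names, and the only behaviour it illustrates is "train on every batch, evaluate once every `eval_steps` steps".

```python
def run_training(num_batches, eval_steps, train_step, evaluate):
    """Hypothetical skeleton of a train loop with periodic evaluation."""
    global_step = 0
    evaluated_at = []
    for batch_idx in range(num_batches):
        train_step(batch_idx)  # one optimizer step per batch
        global_step += 1
        # Evaluation should fire only on multiples of eval_steps.
        if eval_steps > 0 and global_step % eval_steps == 0:
            evaluate(global_step)
            evaluated_at.append(global_step)
    return global_step, evaluated_at


# Dummy callbacks stand in for the real training and evaluation work.
steps, evals = run_training(
    num_batches=2500,
    eval_steps=1000,
    train_step=lambda batch_idx: None,
    evaluate=lambda step: None,
)
print(steps, evals)  # 2500 [1000, 2000]
```

If a breakpoint inside the real loop shows evaluation running without the step counter advancing between evaluations, that would point to the training part not looping over the data.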

gdet commented 3 years ago

It is an Ubuntu server without a UI, so I cannot use VS Code. How could I inspect this on Ubuntu?

mgrankin commented 3 years ago

Run VS Code on your laptop. It has a Remote - SSH extension; install it from the Extensions side panel. Use it to connect to your server and run the code there.

gdet commented 3 years ago

OK, thank you very much. I will check what is happening and write again if needed.


mgrankin / ru_transformers

Confused on how to train the model. #45