mgrankin / ru_transformers

Apache License 2.0

Can you provide some advice on how to fine-tune your pretrained models on my own dataset? #3

Closed Create789 closed 4 years ago

Create789 commented 4 years ago

Hello! Your work is amazing, thank you. Can you provide some instructions on how I can fine-tune your models on a specific corpus? How long would such a process take?

mgrankin commented 4 years ago

Hello, I'm in the process of writing the README for that.

mgrankin commented 4 years ago

I've updated the README. If something isn't clear, let me know.

piegu commented 4 years ago

Hello,

I'm re-opening this thread because I'm training the 124M GPT-2 in Portuguese using your script, and I have some questions on how to do it effectively.

My questions:

  1. When starting to train GPT-2 (from the English GPT-2) in a language other than English, what is the best strategy: unfreeze everything (-1), since switching the vocabulary to another language means the pretrained parameter values mean nothing, or unfreeze layer by layer, starting with layer 0 (the last one)?
  2. What is the best strategy for the learning rate? Decay (as I did with --lr_decay) or not?
  3. There are frequent evaluations during training: great for following how training is going, but how should I use this information? Can I stop the training (CTRL+C), change hyperparameter values (the learning rate, for example), and restart from that point? If yes, how? (See the hedged resume sketch after the script below.)

Thanks in advance.

export TRAIN_FILE=./corpus/data/train.txt
export TEST_FILE=./corpus/data/test.txt
export CUDA_VISIBLE_DEVICES=0
export MODEL_SIZE=gpt2
export OUTPUT=output_yt/s
export BS=4
export LR=5e-5
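# fine-tune the English GPT-2 with the Portuguese YTTM tokenizer: fp16, unfreeze level 0, LR decay, eval every 1000 steps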
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --overwrite_output_dir \
    --model_type=gpt2 \
    --model_name_or_path=$MODEL_SIZE \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size=$BS \
    --save_steps=10000 \
    --logging_steps=1 \
    --fp16 \
    --fp16_opt_level=O2 \
    --warmup_samples=16000 \
    --learning_rate=$LR \
    --tokenizer_class=YTEncoder \
    --tokenizer_name=bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps=1000 \
    --eval_data_file=$TEST_FILE \
    --unfreeze_level=0 \
    --lr_decay
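A minimal, untested sketch of one way to resume after an interruption (question 3), assuming the fork behaves like the upstream Hugging Face run_lm_finetuning.py, where --save_steps writes checkpoint-<step> folders under the output dir and --model_name_or_path accepts such a folder; the checkpoint-50000 path, the new output dir and the lower learning rate below are made up for illustration.

# resume from a checkpoint written by --save_steps, with a new (hypothetical) learning rate
export CHECKPOINT=output_yt/s/checkpoint-50000   # hypothetical checkpoint directory
python run_lm_finetuning.py \
    --output_dir=output_yt/s_resumed \
    --model_type=gpt2 \
    --model_name_or_path=$CHECKPOINT \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size=$BS \
    --save_steps=10000 \
    --learning_rate=1e-5 \
    --tokenizer_class=YTEncoder \
    --tokenizer_name=bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps=1000 \
    --eval_data_file=$TEST_FILE \
    --unfreeze_level=0 \
    --lr_decay
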
mgrankin commented 4 years ago

it can be because of my "wrong" change in your run_lm_finetuning.py script as written here?

Yes, it is.

piegu commented 4 years ago

Thank you, Mikhail. I will go back to your original code, even with the warning. The most important things for me now are:

  1. to understand how to fine-tune properly (unfreeze all layers or one by one? learning rate with decay or not?)
  2. how to generate text after training (issue 12)

Regarding point 2, I don't understand in particular how to inspect my new vocab (the Portuguese one that I created with your command yttm bpe --data ./corpus/data/train.txt --model bpe/yt.model --vocab_size 50257 --coverage 0.9999).

I would like to get (after tokenization with YTTM) a file like gpt2-vocab.json.

Thanks in advance for any help.
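Not an official answer, but one way to inspect the YTTM vocabulary is through the youtokentome Python API. The sketch below dumps a token-to-id mapping in the spirit of gpt2-vocab.json; the bpe/yt-vocab.json output path is made up, and the file is only for inspection, since the training script above loads bpe/yt.model directly via YTEncoder.

# dump the YTTM vocabulary to a JSON file for inspection (hypothetical output path)
python - <<'PY'
import json
import youtokentome as yttm

bpe = yttm.BPE(model="bpe/yt.model")
# map each subword to its id, analogous to gpt2-vocab.json
vocab = {bpe.id_to_subword(i): i for i in range(bpe.vocab_size())}
with open("bpe/yt-vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)
print("wrote", len(vocab), "tokens")
PY
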

mgrankin commented 4 years ago

I've trained the model using gradual unfreezing with the '--unfreeze_level' parameter. The sequence was 0, 1, 2, 7, -1 (as in the table with results). When the loss didn't improve for a day, I switched to the next value (e.g. from 2 to 7). You can find my exact scripts in tpu/schedule_small.txt and tpu/schedule_medium.txt.
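A rough shell sketch of that schedule for readers following along; this is not the exact tpu/schedule_*.txt scripts, it reuses the variables from piegu's script above, and in practice each stage is stopped manually once the eval loss plateaus rather than being run once.

# illustrative gradual-unfreezing schedule: 0 -> 1 -> 2 -> 7 -> -1
finetune_stage () {
    # $1 = model (or checkpoint dir) to start from, $2 = unfreeze level
    python run_lm_finetuning.py \
        --output_dir=$OUTPUT \
        --overwrite_output_dir \
        --model_type=gpt2 \
        --model_name_or_path=$1 \
        --do_train \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size=$BS \
        --save_steps=10000 \
        --learning_rate=$LR \
        --tokenizer_class=YTEncoder \
        --tokenizer_name=bpe/yt.model \
        --do_eval \
        --evaluate_during_training \
        --eval_steps=1000 \
        --eval_data_file=$TEST_FILE \
        --unfreeze_level=$2
}

finetune_stage $MODEL_SIZE 0   # start from the pretrained model, train only the top of the network
finetune_stage $OUTPUT 1       # each later stage continues from the previous stage's output
finetune_stage $OUTPUT 2
finetune_stage $OUTPUT 7
finetune_stage $OUTPUT -1      # finally unfreeze everything
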

nikhilno1 commented 4 years ago

Hi @piegu, were you able to resolve your second question? How do you get back the vocab.json and merges.txt files that the run_generation.py script requires?
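Not a confirmed answer, but YTTM never produces vocab.json/merges.txt at all, so one workaround is to skip run_generation.py and sample directly with the transformers API plus the YTTM tokenizer. A rough sketch, assuming a fine-tuned checkpoint in output_yt/s, a transformers version that provides model.generate, and an illustrative Portuguese prompt:

# sample from the fine-tuned model using the YTTM tokenizer (paths and prompt are illustrative)
python - <<'PY'
import torch
import youtokentome as yttm
from transformers import GPT2LMHeadModel

bpe = yttm.BPE(model="bpe/yt.model")
model = GPT2LMHeadModel.from_pretrained("output_yt/s")
model.eval()

prompt = "Era uma vez"
input_ids = torch.tensor([bpe.encode([prompt], output_type=yttm.OutputType.ID)[0]])

with torch.no_grad():
    output = model.generate(input_ids, max_length=100, do_sample=True, top_k=50, top_p=0.95)

print(bpe.decode([output[0].tolist()])[0])
PY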