VilhelmHovland opened this issue 5 months ago
I suggest giving up on the reproduction, my friend.
Code tastes bitter and Truth goes opaque.
Hi @VilhelmHovland
[…] (we used gradient_accumulation_steps=4 and truncated the maximum input length to 160 tokens). You can probably go even beyond that by using reduced precision.
The training data should be a tab-separated file with two columns, examples and definitions. The validation dataset should be in the same format, of course.
Any other questions are welcome.
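For reference, in terms of the fine-tuning script's command-line flags (as used later in this thread), those settings correspond roughly to the additions below. This is a sketch only; the reduced-precision option would be --bf16=True on Ampere GPUs or --fp16=True on V100s:
--gradient_accumulation_steps=4 \
--max_source_length=160 \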
> I suggest giving up on the reproduction, my friend. Code tastes bitter and Truth goes opaque.
@jacklanda I am not sure what you mean by that?
I'm guessing poetry generation :wine_glass:
@akutuzov Thank you for your response. These are the parameters I have been testing with:
--model_name_or_path="google/flan-t5-xl" \
--cache_dir="/vilhelm/.cache/" \
--do_train \
--do_eval \
--dataset_name="marksverdhei/wordnet-definitions-en-2021" \
--output_dir="/vilhelm/finetune_output/" \
--overwrite_output_dir \
--evaluation_strategy=epoch \
--logging_strategy=epoch \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--predict_with_generate \
--save_total_limit=5 \
--max_source_length=5 \
--max_target_length=5 \
--fp16=True \
--num_train_epochs=1 \
--save_strategy=epoch \
--load_best_model_at_end=True \
--metric_for_best_model=eval_rouge1 \
--ddp_find_unused_parameters=False \
--optim=adafactor \
I have been running it on four 32 GB V100 GPUs on the Puhti supercomputer, on a single node.
@VilhelmHovland I believe the root of your troubles is this line:
--dataset_name="marksverdhei/wordnet-definitions-en-2021"
You are trying to use the WordNet dataset directly as it is on HF. We didn't try that, and I doubt the fine-tuning script deals with it well. As mentioned before, we fine-tune on tab-separated files with two columns, examples and definitions, without directly using the datasets library. This allows much more flexibility. You should point to the training and validation data files with these arguments:
--train_file ${TRAIN_DATASET} \
--validation_file ${VAL_DATASET} \
(see the example here)
Note that the examples should be already augmented with the instruction prompt ("What is the definition of TARGET_WORD?" or whatever prompt you are using).
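For illustration, a prepared training file could start like this (a sketch only: the columns are separated by a single tab character, and the exact prompt wording is up to you):
example	definition
He sharpened the knife before dinner. What is the definition of knife?	edge tool used as a cutting instrument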
@akutuzov I see, thank you. Is the exact data you used available anywhere, or do I need to process the CoDWoE and naacl data?
"naacl data" means the datasets from Ishiwatari et al. (2019), right?
Then yes, you'll have to convert them to the tab-separated format I described above. Same with CoDWoE: it comes as json files, but it's trivial to convert them to .tsv.
We did not publish our converted versions, since we felt it would not be polite to re-distribute datasets created by others (simply saved in another format). Again, it should be trivial to convert these datasets to .tsv and add the instruction prompt.
If you encounter any difficulties with that, get in touch with me, I'll share our preprocessed files privately.
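Something along these lines should work for the CoDWoE json files (a rough sketch, not our actual preprocessing script; the input file name and the JSON field names are assumptions, so check the keys in the files you downloaded and adjust):
import json

# Assumed field names -- adjust to whatever the downloaded files actually use.
WORD_KEY, EXAMPLE_KEY, GLOSS_KEY = "word", "example", "gloss"
PROMPT = "What is the definition of {word}?"  # or whatever prompt you use

with open("en.train.json", encoding="utf-8") as f:
    records = json.load(f)  # assuming one JSON array per file; switch to per-line json.loads for JSON Lines

with open("train.tsv", "w", encoding="utf-8") as out:
    out.write("example\tdefinition\n")
    for rec in records:
        # Collapse whitespace so no stray tabs or newlines break the TSV.
        example = " ".join(str(rec[EXAMPLE_KEY]).split())
        gloss = " ".join(str(rec[GLOSS_KEY]).split())
        prompt = PROMPT.format(word=rec[WORD_KEY])
        out.write(f"{example} {prompt}\t{gloss}\n")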
Hello again, I have now changed my data, but I am still getting the same error. I am using the same parameters, except with direct data files. I formatted them like this in .tsv files; does it look correct? What else could be causing issues?
example	definition
cranial pressure What is the definition of cranial?	of or relating to the cranium which encloses the brain
an easy job What is the definition of easy?	posing no difficulty
@VilhelmHovland did you try to fine-tune a smaller model (flan-t5-base, for example), and/or removing the --fp16=True argument?
@VilhelmHovland I've just tried to fine-tune the flan-t5-base model on the few lines you quoted above. I repeated them multiple times, so that in the end I got a file with 12 instances (the file is here).
On this toy dataset, fine-tuning with batch size 4 and 2 epochs completed without any issues. I used one A100 GPU with 40GB of RAM. Here is the exact command:
python3 finetune_flan.py \
--model_name_or_path google/flan-t5-base \
--do_train \
--do_eval \
--train_file example_dataset.tsv \
--validation_file example_dataset.tsv \
--output_dir test_model \
--overwrite_output_dir \
--evaluation_strategy=epoch \
--logging_strategy=epoch \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--predict_with_generate \
--save_total_limit=5 \
--max_source_length=192 \
--max_target_length=128 \
--bf16=False \
--num_train_epochs=2 \
--save_strategy=epoch \
--load_best_model_at_end=True \
--metric_for_best_model=eval_rouge1 \
--ddp_find_unused_parameters=False \
--optim=adafactor \
--report_to=none \
Okay, I tried as well, it does work now, thank you. What would be the bottleneck for finetuning the larger models then? Is there any way I could get it to work for those as well?
Well, the usual procedure: set the per-device batch size to 1, and then increase it until you hit out-of-memory error again. This will be your ceiling in terms of RAM. Often, you can increase the batch size even more by using gradient accumulation (at the cost of slower training). Using more than one GPU (within one node) will also naturally allow you to have a larger global batch size, which is usually a good thing.
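As a sketch (not the exact recipe we used, and the accumulation value here is arbitrary): with four V100s on one node, you could launch the same script through torchrun, which lets the HF Trainer handle the distributed setup, and stack gradient accumulation on top of the per-device batch size, keeping the remaining arguments from the command above:
# Effective (global) batch size = n_gpus * per_device_batch * grad_accum = 4 * 1 * 8 = 32 here.
torchrun --nproc_per_node=4 finetune_flan.py \
--model_name_or_path google/flan-t5-xl \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=8 \
--optim=adafactor \
--train_file ${TRAIN_DATASET} \
--validation_file ${VAL_DATASET} \
--output_dir xl_test_model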
Hello, I want to try fine-tuning your model with my own data, but I have two questions:
Thank you for any assistance here.