VilhelmHovland opened this issue 5 months ago
I suggest giving up on the reproduction, my friend.
Code tastes bitter and Truth goes opaque.
Hi @VilhelmHovland
[…] (we used gradient_accumulation_steps=4 and truncated the maximum input length to 160 tokens). You can probably go even beyond that by using reduced precision.
The training data should be a tab-separated file with two columns, examples and definitions. The validation dataset should be in the same format, of course.
Any other questions are welcome.
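For reference, in terms of the fine-tuning script's command-line flags (as used later in this thread), those settings correspond roughly to the additions below. This is a sketch only; the reduced-precision option would be --bf16=True on Ampere GPUs or --fp16=True on V100s:
--gradient_accumulation_steps=4 \
--max_source_length=160 \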
> I suggest giving up on the reproduction, my friend. Code tastes bitter and Truth goes opaque.
@jacklanda I am not sure what you mean by that?
I'm guessing poetry generation :wine_glass:
@akutuzov Thank you for your response. These are the parameters I have been testing with:
--model_name_or_path="google/flan-t5-xl" \
--cache_dir="/vilhelm/.cache/" \
--do_train \
--do_eval \
--dataset_name="marksverdhei/wordnet-definitions-en-2021" \
--output_dir="/vilhelm/finetune_output/" \
--overwrite_output_dir \
--evaluation_strategy=epoch \
--logging_strategy=epoch \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--predict_with_generate \
--save_total_limit=5 \
--max_source_length=5 \
--max_target_length=5 \
--fp16=True \
--num_train_epochs=1 \
--save_strategy=epoch \
--load_best_model_at_end=True \
--metric_for_best_model=eval_rouge1 \
--ddp_find_unused_parameters=False \
--optim=adafactor \
I have been running it on four 32 GB V100 GPUs on the Puhti supercomputer, on a single node.
@VilhelmHovland I believe the root of your troubles is this line:
--dataset_name="marksverdhei/wordnet-definitions-en-2021"
You are trying to use the WordNet dataset directly as it is on HF. We didn't try that, and I doubt the fine-tuning script deals with it well. As mentioned before, we fine-tune on tab-separated files with two columns, examples and definitions, without directly using the datasets library. This allows much more flexibility. You should point to the training and validation data files with these arguments:
--train_file ${TRAIN_DATASET} \
--validation_file ${VAL_DATASET} \
(see the example here)
Note that the examples should be already augmented with the instruction prompt ("What is the definition of TARGET_WORD?" or whatever prompt you are using).
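For illustration, a prepared training file could start like this (a sketch only: the columns are separated by a single tab character, and the exact prompt wording is up to you):
example	definition
He sharpened the knife before dinner. What is the definition of knife?	edge tool used as a cutting instrument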
@akutuzov I see, thank you. Is the exact data you used available anywhere, or do I need to process the CoDWoE and naacl data?
"naacl data" means the datasets from Ishiwatari et al. (2019), right?
Then yes, you'll have to convert them to the tab-separated format I described above. Same with CoDWoE: it comes as json files, but it's trivial to convert them to .tsv.
We did not publish our converted versions, since we felt it would not be polite to re-distribute datasets created by others (simply saved in another format). Again, it should be trivial to convert these datasets to .tsv and add the instruction prompt.
If you encounter any difficulties with that, get in touch with me, I'll share our preprocessed files privately.
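Something along these lines should work for the CoDWoE json files (a rough sketch, not our actual preprocessing script; the input file name and the JSON field names are assumptions, so check the keys in the files you downloaded and adjust):
import json

# Assumed field names -- adjust to whatever the downloaded files actually use.
WORD_KEY, EXAMPLE_KEY, GLOSS_KEY = "word", "example", "gloss"
PROMPT = "What is the definition of {word}?"  # or whatever prompt you use

with open("en.train.json", encoding="utf-8") as f:
    records = json.load(f)  # assuming one JSON array per file; switch to per-line json.loads for JSON Lines

with open("train.tsv", "w", encoding="utf-8") as out:
    out.write("example\tdefinition\n")
    for rec in records:
        # Collapse whitespace so no stray tabs or newlines break the TSV.
        example = " ".join(str(rec[EXAMPLE_KEY]).split())
        gloss = " ".join(str(rec[GLOSS_KEY]).split())
        prompt = PROMPT.format(word=rec[WORD_KEY])
        out.write(f"{example} {prompt}\t{gloss}\n")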
Hello again, I have now changed my data, but I am still getting the same error. I am using the same parameters, except with direct data files. I formatted them like this in .tsv files; does it look correct? What else could be causing issues?
example	definition
cranial pressure What is the definition of cranial?	of or relating to the cranium which encloses the brain
an easy job What is the definition of easy?	posing no difficulty
@VilhelmHovland did you try to fine-tune a smaller model (flan-t5-base, for example), and/or removing the --fp16=True argument?
@VilhelmHovland I've just tried to fine-tune the flan-t5-base model on the few lines you quoted above. I repeated them multiple times, so that in the end I got a file with 12 instances (the file is here).
On this toy dataset, fine-tuning with batch size 4 and 2 epochs completed without any issues. I used one A100 GPU with 40GB of RAM. Here is the exact command:
python3 finetune_flan.py \
--model_name_or_path google/flan-t5-base \
--do_train \
--do_eval \
--train_file example_dataset.tsv \
--validation_file example_dataset.tsv \
--output_dir test_model \
--overwrite_output_dir \
--evaluation_strategy=epoch \
--logging_strategy=epoch \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--predict_with_generate \
--save_total_limit=5 \
--max_source_length=192 \
--max_target_length=128 \
--bf16=False \
--num_train_epochs=2 \
--save_strategy=epoch \
--load_best_model_at_end=True \
--metric_for_best_model=eval_rouge1 \
--ddp_find_unused_parameters=False \
--optim=adafactor \
--report_to=none \
Okay, I tried as well, it does work now, thank you. What would be the bottleneck for finetuning the larger models then? Is there any way I could get it to work for those as well?
Well, the usual procedure: set the per-device batch size to 1, and then increase it until you hit out-of-memory error again. This will be your ceiling in terms of RAM. Often, you can increase the batch size even more by using gradient accumulation (at the cost of slower training). Using more than one GPU (within one node) will also naturally allow you to have a larger global batch size, which is usually a good thing.
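As a sketch (not the exact recipe we used, and the accumulation value here is arbitrary): with four V100s on one node, you could launch the same script through torchrun, which lets the HF Trainer handle the distributed setup, and stack gradient accumulation on top of the per-device batch size, keeping the remaining arguments from the command above:
# Effective (global) batch size = n_gpus * per_device_batch * grad_accum = 4 * 1 * 8 = 32 here.
torchrun --nproc_per_node=4 finetune_flan.py \
--model_name_or_path google/flan-t5-xl \
--per_device_train_batch_size=1 \
--per_device_eval_batch_size=1 \
--gradient_accumulation_steps=8 \
--optim=adafactor \
--train_file ${TRAIN_DATASET} \
--validation_file ${VAL_DATASET} \
--output_dir xl_test_model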
Hello, I want to try fine-tuning your model with my own data, but I have two questions:
Thank you for any assistance here.