huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

BART fine-tuning doesn't work and produces a fixed output for each input #12237

Closed. sajastu closed this issue 3 years ago.

sajastu commented 3 years ago

I'm getting stuck fine-tuning the BART model on the reddit_tifu dataset. When I use a pre-trained BART model such as bart-large-xsum without fine-tuning, it works fine and produces reasonably sensible output for each input, but as soon as I start fine-tuning, it predicts the same irrelevant text for every input, as if it had overfit to the training data. Overfitting doesn't seem plausible to me, though, since reddit_tifu has over 30k training samples. I'm wondering if there's a problem with my bash script or with the fine-tuning scripts, since I've been following the instructions at https://github.com/huggingface/transformers/tree/master/examples/pytorch/summarization. Below is my bash script for fine-tuning the bart-large-xsum model.

DS_BASE_DIR=/home/code-base/user_space/packages/summarization_datasets/reddit_tifu/
python -m torch.distributed.launch --nproc_per_node=4 examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path facebook/bart-large-xsum \
    --do_train \
    --do_eval \
    --train_file $DS_BASE_DIR/train.json \
    --validation_file $DS_BASE_DIR/val.json \
    --output_dir /home/code-base/user_space/saved_models/bart/reddit-1024-tuned \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2  \
    --overwrite_output_dir \
    --predict_with_generate \
    --num_train_epochs 15 \
    --text_column text \
    --summary_column summary \
    --learning_rate 3e-5 \
    --weight_decay 0.01 \
    --adam_beta2 0.98 \
    --warmup_steps 5000 

I used these hyperparameters to match the performance reported in https://arxiv.org/pdf/2008.03156v1.pdf

Outputs, after 1 training epoch:

input:

so this happened when i was in like third grade but it continued to bother me throughout high school. i had actually forgotten about this till i read one of the other posts on here. the original fuck up happened when as i said we were playing football in the backyard. his backyard was surrounded by a metal fence. we had decided to use the top of a hill right before the fence should be plenty of leeway before the fence right? wrong. i was running in for a touchdown had just gotten past my friend for the touchdown when he jumped and tangled up my legs. i ended up sliding down the hill and fell tooth first into his fence. somehow even though 2/3rds of my tooth was in the fence i managed to avoid all nerves and felt no pain. i came up laughing so hard i was crying which i think made it worse because my friend goes dude your tooth is missing. which of course made me laugh even harder. his mom hears the commotion and comes out sees my missing tooth and me crying and starts freaking out. she partially blamed herself because she's the one that sent us out because before we were just inside playing video games. my dad comes to pick me up she apologizes profusely and i still didn't think it was a big deal. this was on a saturday so we eventually get the dentist to come in on sunday, that place was awesome, to fix the tooth. since i'm so young they only put a temporary cap on. now i also played hockey, soccer and later lacrosse. of course the temporary cap didn't last all that long and came off. this happened several times and there were hockey games i'd start with the cap on lose it halfway through and would confuse everyone. i always had fun with this but it was getting old, and expensive, so eventually the dentist put on a permanent cap. haven't had a problem since. if you guys want i'll see if i can find the young picture of me without the tooth. edit: found it

fine-tuned bart prediction:

tried to impress a girl, ended up getting kicked out of the house by her dad and her mom for being late to a party.

input:

hi reddit. typical disclaimer, this didn't actually happen today. it happened a couple months ago, but it's still impacting me today. my kids are typical kids, they don't pick up their stuff and they get scolded for it. i was getting pretty sick of seeing their pok\u00e9mon cards lying all over the place because to me it looked like all of the money that came out of my pocket getting slowly turned into trash. my wife on the other hand went crazy because of the mess. one night it all came to a head. after weeks of ineffectually threatening to take the stupid cards away if they left them all over the floor, and my wife demanding that they clean the room before bedtime, she lost it when going in to tuck them in. i got tired of hearing it, so i went in, saw all of the expensive pok\u00e9mon cards strewn about and lost it too. i immediately started grabbing up all the cards and piling them into boxes then left the room with both arms full. i went stomping angrily through the living room to put them away in the front bedroom that i use for storage. that's when the f u happened. earlier that evening, my older child had noticed my younger child smearing chapstick all over a section of wood laminate flooring...

fine-tuned bart prediction:

tried to impress a girl, ended up getting kicked out of the house by her dad and her mom. i'm a dumbass.

Environment info

@patrickvonplaten, @patil-suraj, @sgugger

Information

Model I am using (Bert, XLNet ...): BART

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

  1. Run the above script, which is taken from the official example.
  2. After a few training steps, the model learns to predict a single fixed output for every given input text.

Expected behavior

After fine-tuning for a few steps/epochs, I expect the model to learn to generate at least different outputs for varying input texts.
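For reference, a minimal sketch of how this could be checked on a saved checkpoint (assumptions: the checkpoint path is just the output_dir from the launch command above, the two inputs are the truncated examples quoted earlier, and the generation settings are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: the output_dir from the launch command above
ckpt = "/home/code-base/user_space/saved_models/bart/reddit-1024-tuned"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt).eval()

texts = [
    "so this happened when i was in like third grade ...",            # first example above (truncated)
    "hi reddit. typical disclaimer, this didn't actually happen ...",  # second example above (truncated)
]

with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        summary_ids = model.generate(**inputs, num_beams=4, max_length=60)
        print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

# If the two printed summaries are (near-)identical, the collapse described
# in this issue is reproduced.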

@patrickvonplaten @patil-suraj

sajastu commented 3 years ago

Hi,

following up on this @patrickvonplaten and @patil-suraj

patrickvonplaten commented 3 years ago

Hey @sajastu,

It's pretty difficult for us to debug the script - from a first look, the hyperparameter settings look good to me. An effective batch size of 8 (4 * 2) seems rather small to me, but you can see from your loss curves whether 8 is enough I guess.

Also note that the XSum dataset has a rather special distribution which is not really the same as reddit data IMO. XSum is extreme summarization and has very dense sentences as summaries. Not sure if this works well with reddit.
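(If the effective batch size is the concern: run_summarization.py builds a Seq2SeqTrainingArguments object from these flags, so gradient accumulation can raise the effective batch size without more GPU memory. A minimal sketch with illustrative values; on the command line the equivalent flag would be --gradient_accumulation_steps.)

from transformers import Seq2SeqTrainingArguments

# 2 samples/GPU x 4 GPUs x 4 accumulation steps = effective batch size of 32
args = Seq2SeqTrainingArguments(
    output_dir="bart-reddit-tuned",      # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=3e-5,
    weight_decay=0.01,
    adam_beta2=0.98,
    warmup_steps=5000,
    num_train_epochs=15,
    predict_with_generate=True,
)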

sajastu commented 3 years ago

Hey @patrickvonplaten,

Thanks for your response!

The problem I'm facing is this: when I run the generation phase of facebook/bart-large-xsum as-is (i.e., without fine-tuning), I get comparably high scores (22.43 / 7.21 / 17.65); however, and interestingly, when I fine-tune it for just a few training steps (say 10) and then run the fine-tuned model on the same test set, the scores drop much lower (15.32 / 2.35 / 9.78). This doesn't make sense to me. I would expect the scores to stay close to the original model's, if not surpass them, especially after only a very few training steps...

Do you have any thoughts on this? Is this behaviour expected?

Also, do you think the model is overfitting, or stuck in a local minimum such that it produces the same single output regardless of the input it gets?

phhei commented 3 years ago

I'm stuck at the same point - the output of the generate method of a fine-tuned BART seems to be independent of the input.

Interestingly, this holds only for the generate method. If I call the fine-tuned model directly, as in tokenizer.batch_decode(torch.argmax(model(input_ids=input_ids)[0], axis=-1)), the output is perfectly related to the input, i.e. it differs from input to input. Therefore, I assume there is a bug in BART.generate(), or, to be more precise about my assumption, in the specific modeling_tf_bart.prepare_inputs_for_generation(). I tried to verify my assumption (I guess fine-tuning somehow freezes the past/cache value, which disconnects the output from the input), but I can't find the point that triggers this particular generate-method behaviour.
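A minimal sketch of the comparison described above (the checkpoint path is a placeholder for the fine-tuned model; passing use_cache=False to generate() is one way to probe the past/cache hypothesis):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

ckpt = "path/to/finetuned-bart"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt).eval()

input_ids = tokenizer("some reddit post ...", return_tensors="pt").input_ids

with torch.no_grad():
    # 1) Regular autoregressive decoding via generate()
    gen = model.generate(input_ids, num_beams=4, max_length=60)
    # 2) The single-forward-pass check from the comment above: argmax over the
    #    logits of one forward call (the decoder sees the shifted input, so this
    #    is not real generation, but it should still vary with the input)
    greedy = torch.argmax(model(input_ids=input_ids).logits, dim=-1)
    # 3) The same generate() call with the key/value cache disabled, to test
    #    whether the past/cache path is what pins the output
    gen_nocache = model.generate(input_ids, num_beams=4, max_length=60, use_cache=False)

print(tokenizer.batch_decode(gen, skip_special_tokens=True))
print(tokenizer.batch_decode(greedy, skip_special_tokens=True))
print(tokenizer.batch_decode(gen_nocache, skip_special_tokens=True))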

sajastu commented 3 years ago

Hi @phhei,

I think the code is probably correct. If there is any flaw, it must be in the tokenization module, since I'm not getting this "fixed" output on other datasets that I've been using to fine-tune BART. For my particular case here, I changed the dataset (i.e., reddit_tifu), ran the same code, and was finally able to get it working.
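A minimal way to probe the tokenization hypothesis (a sketch; it assumes train.json is a JSON-lines file with "text"/"summary" fields as run_summarization.py expects, and the path is a placeholder):

import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-xsum")

# Placeholder path: the reddit_tifu train file from the launch command above
with open("train.json") as f:
    examples = [json.loads(line) for line in f][:5]

for ex in examples:
    enc = tokenizer(ex["text"], truncation=True, max_length=1024)
    print(len(enc["input_ids"]), enc["input_ids"][:10])

# If every example yields (nearly) identical ids, or suspiciously short
# sequences, the problem is upstream of the model.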

@patrickvonplaten might be of some help here.

phhei commented 3 years ago

Hi @sajastu,

thanks for your reply. However, if the tokenization module caused this behavior, then tokenizer.batch_decode(torch.argmax(model(input_ids=input_ids)[0], axis=-1)) (where input_ids is produced by tokenizer.encode - the same variable I pass to BART.generate(input_ids)) would also always produce the same output. I have already inspected the raw tensor output of both approaches, and the pattern is the same: generate(input_ids) always produces the same tensor, while torch.argmax(model(input_ids=input_ids)[0], axis=-1) depends on the input_ids.

I'm asking myself why changing the dataset (without changing anything else) would solve this issue. In my case, I have a non-huggingface dataset, preprocessed by tokenizer calls, so a bug in a huggingface dataset can't be the cause either.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.