huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

❓ Difficulties to reproduce BART results on CNN/DM by fine-tuning bart-large #5654

Closed. astariul closed this issue 3 years ago.

astariul commented 4 years ago

❓ Help

I'm trying to fine-tune BART on CNN/DM by myself (so, starting from facebook/bart-large checkpoint).

However, I can't reproduce the results so far... The BART authors report an R1 score of 44.16 in their paper, but my best checkpoint so far only reaches 42.53.

It's not an issue with the eval script, as I can reproduce the authors' results from the checkpoint facebook/bart-large-cnn. I get a score of 44.09 using this checkpoint.

I tried several sets of hyper-parameters: the ones provided in the examples folder, but also the ones used in the fairseq repo. It doesn't change anything...


I'm a bit at a loss on how to reproduce this fine-tuning score...
Has anyone fine-tuned BART successfully using the transformers repo? If so, could you share your parameters? Any help would be greatly appreciated!

@sshleifer

manojpreveen commented 4 years ago

Does your best checkpoint after fine-tuning produce proper outputs, or are they truncated at the end? I fine-tuned t5-small on CNN/DM, but the best checkpoint was producing outputs that were truncated at the end (for a sample output, see the issue I just raised), and this was also leading to reduced R1 scores. I just wanted to know whether you faced the same issue, and if not, what the reason might be, as I couldn't figure out why.

Thanks.

sshleifer commented 4 years ago

@cola I haven't tried finetuning bart-large. I could take a pass if you have a command you are running that I can reproduce. Without code, I can only speculate on ideas and can't check whether you are already doing them, so sorry if this is useless:

(1) @tromedlov22 's idea reminds me that you should make sure you set config.task_specific_params

def use_task_specific_params(model, task):
    # update config with summarization specific params
    task_specific_params = model.config.task_specific_params
    if task_specific_params is not None:
        model.config.update(task_specific_params.get(task, {}))
use_task_specific_params(model, 'summarization')
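
(For illustration, a minimal usage sketch of the helper above; it assumes a checkpoint whose config ships task_specific_params, e.g. facebook/bart-large, and is not taken from the examples/ scripts:)

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
print(model.config.task_specific_params)   # e.g. {'summarization': {...}} if the checkpoint ships them
use_task_specific_params(model, "summarization")
print(model.config.num_beams, model.config.max_length)  # should now reflect the summarization params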

(2) Another idea: I suspect the authors checked ROUGE every epoch and stopped at the best validation ROUGE (roughly what finetune.py does), and that might help results.

For reference, the params I see are:

 {'early_stopping': True,
  'length_penalty': 2.0,
  'max_length': 142,
  'min_length': 56,
  'no_repeat_ngram_size': 3,
  'num_beams': 4}

(3) IIRC, the authors use label_smoothing_cross_entropy; do you?
(4) For cnn, truncation parameters matter on the target side (see the truncation sketch below).
(5) If you are purely interested in reproducing finetuning performance, I would experiment with xsum, since it trains 30% faster than cnn (shorter targets) (and make sure to use the AutoConfig.from_pretrained('facebook/bart-large-xsum') params).

You could also use wandb and then share your logs, which would allow me to give better advice.
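
(A sketch for point (4), with illustrative max_length values rather than the examples/ defaults: if the target-side max_length is too low, the training labels themselves get cut off and ROUGE suffers.)

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
article_text = "Some long CNN/DailyMail article ..."
summary_text = "The reference highlights for that article ..."

source = tokenizer(article_text, max_length=1024, truncation=True, return_tensors="pt")  # source-side truncation
target = tokenizer(summary_text, max_length=142, truncation=True, return_tensors="pt")   # target-side truncation
labels = target["input_ids"]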

astariul commented 4 years ago

@tromedlov22 Thanks for the answer. I checked, but the outputs seem fine, not truncated at the end. I guess we are having different problems.

@sshleifer Thanks for the very detailed answer! I can't give you a single command for reproducing: I modified the example code to add missing details from the fairseq repo, such as label smoothing!


(3) IIRC, the authors use label_smoothing_cross_entropy; do you?

Yes, I do.

Another idea: I suspect the authors checked ROUGE every epoch and stopped at the best validation ROUGE (roughly what finetune.py does), and that might help results.

Indeed, I'm only saving at the end of training. I will try that.

(5) If you are purely interested in reproducing finetuning performance, I would experiment with xsum, since it trains 30% faster than cnn (shorter targets) (and make sure to use the AutoConfig.from_pretrained('facebook/bart-large-xsum') params). You could also use wandb and then share your logs, which would allow me to give better advice.

Thanks for the advice!

(4) For cnn, truncation parameters matter on the target side.

What do you mean?

sshleifer commented 4 years ago

That would be a very useful PR, @cola!

astariul commented 4 years ago

I could improve my results by using early stopping, thank you very much for the idea @sshleifer!

Now I get an R1 of 43.68, almost the 44.16 from the paper!

I'm trying to find what could cause this small difference, and I would love to hear your opinion on this:

I'm training with batch size 1 (I can't fit more in my 16 GB of memory). The authors fine-tuned with batch size 2 (with 32 GB of memory).

Could the difference come from this? For example, does batch normalization act differently with single-sample batches?
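
(For reference: BART's layers use LayerNorm, which does not depend on batch statistics, but the effective batch size still changes gradient noise. One way to emulate the authors' batch size of 2 on a 16 GB card is gradient accumulation; a minimal sketch, assuming model, optimizer, and train_dataloader are already set up and that each batch contains labels:)

accumulation_steps = 2          # two batches of size 1 approximate one batch of size 2
optimizer.zero_grad()
for step, batch in enumerate(train_dataloader):
    loss = model(**batch).loss / accumulation_steps   # scale so the summed gradient matches the larger batch
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()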

sshleifer commented 4 years ago

I'm in a similar place with machine translation. The things I know to be different from fairseq are:

If you have all of those squared away, the only other thing I can think of is that the embeddings (we use model.model.shared, they don't) somehow become untied or get different gradients.

Let me know if any of these have mattered, because I'm trying to prioritize what to implement in transformers.

astariul commented 4 years ago

Here is what I did so far:

Implementing the first one seems complicated, so I didn't try.

Thanks for the help; the detailed list of things to try is awesome!

So far I'm satisfied with the results; they're really close to the paper's. Maybe some tiny difference in the code is responsible for the gap? If I have more time I will try the other things I haven't tried so far :)

alexgaskell10 commented 4 years ago

I am having similar problems with this myself. @Colanim, do you know which of your above changes had the largest impact, so I can begin with those?

@sshleifer I think there is a bug in label_smoothed_nll_loss. I have tried using it with current master and I am getting infinite losses because the bs term is zero, and it is the denominator in line 45 (return loss / bs, nll_loss / bs).

sshleifer commented 4 years ago

Wow, great catch. This line I wrote is broken in so many ways:

bs = pad_mask.long().sum()  # pad_mask has 1 where labels.eq(pad_token_id), so this is the number of pad tokens in the batch, not the batch size...

I would delete the denominator if I were you.
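
(For reference, a sketch of the sum-reduced variant that drops that denominator entirely; this is one possible fix, not necessarily what ended up in the repo. It assumes lprobs are log-probabilities of shape (N, vocab_size) and that target uses the pad token id, not -100, at padded positions:)

def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index):
    # Sum-reduced label-smoothed NLL: padded positions are zeroed out, nothing is divided by a pad count.
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
    pad_mask = target.eq(ignore_index)
    nll_loss.masked_fill_(pad_mask, 0.0)
    smooth_loss.masked_fill_(pad_mask, 0.0)
    nll_loss, smooth_loss = nll_loss.sum(), smooth_loss.sum()
    eps_i = epsilon / lprobs.size(-1)
    loss = (1.0 - epsilon) * nll_loss + eps_i * smooth_loss
    return loss, nll_loss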

In my experience, warmup_updates can help a lot, as well as playing with gradient_accumulation_batches (this matters more for MT; lower -> better). But I'm interested in @Colanim's experience.
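
(A minimal warmup sketch, assuming model is already defined and using 20000 total steps as an illustrative number; the linear schedule here is just a stand-in for whichever scheduler you actually use:)

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=500, num_training_steps=20000)
# call scheduler.step() after every optimizer.step()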

BTW, thanks to @stas00 you can now pass --dropout, --attention_dropout, --decoder_layerdrop, and --encoder_layerdrop through the command line.

sshleifer commented 4 years ago

@Colanim can you rerun evaluation on your 43.68 R1 model? I hope that #6526 might have helped close the gap! It doesn't help for bart-large-cnn, but it does help bart-large-xsum.

astariul commented 4 years ago

Will try as soon as I can! I have to find my checkpoint... ^^

swethmandava commented 4 years ago

What command are you using, @Colanim? I get OOM even with BS=1 on a 32 GB V100 GPU. @sshleifer

python finetune.py \
    --data_dir=data/cnn_dm/ \
    --output_dir=${RESULTS_DIR} \
    --learning_rate=3e-5 \
    --fp16 \
    --gpus 8 \
    --do_train \
    --do_predict \
    --n_val 1000 \
    --val_check_interval 0.1 \
    --train_batch_size=1 --gradient_accumulation_steps=4 \
    --eval_batch_size=1 \
    --max_steps 20000 --warmup_steps=500 \
    --eval_max_gen_length=142 --max_source_length=1042 --max_target_length=56 \
    --sortish_sampler \
    --lr_scheduler polynomial \
    --label_smoothing 0.1 \
    --weight_decay 0.01 \
    --dropout 0.1 --attention_dropout 0.1 --gradient_clip_val=0.1 --early_stop_callback=1

and initializing the model without AutoConfig as:

config = BartConfig(**json.load(open(args.config_path, "r")))
model = BartForConditionalGeneration(config)
tokenizer = BartTokenizer.from_pretrained(
    'facebook/bart-large-cnn')  # Downloads vocab and merges file automatically
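
(One thing that may be worth double-checking, sketched below with illustrative names: constructing BartForConditionalGeneration(config) gives a randomly initialized model, so unless the pretrained weights are loaded somewhere else in the script, fine-tuning would not start from bart-large. The usual pattern is from_pretrained with an optional config override:)

from transformers import BartConfig, BartForConditionalGeneration, BartTokenizer

config = BartConfig.from_pretrained("facebook/bart-large")   # or BartConfig(**json.load(...))
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", config=config)
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
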
sshleifer commented 4 years ago
swethmandava commented 4 years ago

That works initially but fails after ~15k steps. What eval_max_gen_length are you using? I'm not sure whether you also froze the embeddings, as mentioned in #6711, for BART CNN/DM.

Traceback (most recent call last):
  File "finetune.py", line 446, in <module>
    main(args)
  File "finetune.py", line 421, in main
    logger=logger,
  File "/workspace/bart/lightning_base.py", line 369, in generic_train
    trainer.fit(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 331, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 661, in evaluation_forward
    output = model(*args)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 174, in forward
    output = self.module.validation_step(*inputs[0], **kwargs[0])
  File "finetune.py", line 175, in validation_step
    return self._generative_step(batch)
  File "finetune.py", line 218, in _generative_step
    max_length=self.eval_max_length,
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/bart/generation_utils.py", line 469, in generate
    model_specific_kwargs=model_specific_kwargs,
  File "/workspace/bart/generation_utils.py", line 648, in _generate_beam_search
    outputs = self(**model_inputs)  # (batch_size * num_beams, cur_len, vocab_size)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/bart/modeling_bart.py", line 1037, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/bart/modeling_bart.py", line 909, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/bart/modeling_bart.py", line 570, in forward
    output_attentions=output_attentions,
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/bart/modeling_bart.py", line 443, in forward
    x = self.activation_fn(self.fc1(x))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1676, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 14.56 GiB already allocated; 11.44 MiB free; 14.79 GiB reserved in total by PyTorch)
sshleifer commented 4 years ago

Definitely use --freeze_embeds. I have never seen it hurt metrics. I have actually never tried to finetune on cnn_dm, but I'm interested to hear your results!
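
(For intuition, a rough sketch of what freezing the embeddings amounts to for BART; not necessarily byte-for-byte what finetune.py's --freeze_embeds does:)

def freeze_params(module):
    for p in module.parameters():
        p.requires_grad = False

def freeze_embeds(model):
    # Freeze the shared token embeddings plus encoder/decoder positional embeddings.
    freeze_params(model.model.shared)
    for part in (model.model.encoder, model.model.decoder):
        freeze_params(part.embed_positions)
        freeze_params(part.embed_tokens)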

swethmandava commented 4 years ago

Still OOMs even with eval_beams=1. #7004 works for me; I will confirm once I have it working e2e.

astariul commented 4 years ago

Unfortunately I'm not working with BART anymore these days... I didn't try any more experiments.

JJJJane commented 3 years ago

Hi @Colanim, could you share your eval script that gets a score of 44.09 with facebook/bart-large-cnn? Thanks!

astariul commented 3 years ago

Basically I use the nlp package to get the cnn_dm data, then run generation with:

preds = model.generate(samples['article'],
                       num_beams=4, length_penalty=2,
                       max_length=142, min_length=56,
                       early_stopping=True,
                       no_repeat_ngram_size=3)

and save the predictions and gold summaries in text files. Then I use the files2rouge package to get the ROUGE scores.

Also, don't forget to tokenize the predictions and gold summaries with Stanford CoreNLP!

JJJJane commented 3 years ago

Hi @Colanim, I tried to reproduce the paper's results from the checkpoint facebook/bart-large-cnn, but somehow my ROUGE-1 score is only 42.62. I tried the following steps; could you help me figure out what's wrong? Thanks!

Infer:

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
source_pwd = './test.source'
input_sents = open(source_pwd, 'r', encoding='utf8').readlines()
with open('./test.pred', 'w', encoding='utf8') as out:
    inputs = tokenizer(input_sents, max_length=1024, return_tensors='pt', truncation=True, padding=True)
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, length_penalty=2, max_length=142,
                                 min_length=56, early_stopping=True, no_repeat_ngram_size=3)
    for summary_id in summary_ids:
        out.write(tokenizer.decode(summary_id, skip_special_tokens=True,
                                   clean_up_tokenization_spaces=False).strip() + '\n')

Eval:

cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.target.tokenized
cat test.pred | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.pred.tokenized
files2rouge test.pred.tokenized test.target.tokenized
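
(For reference, a batched variant of the inference snippet above; the batch size is arbitrary and this is only a sketch, not a claim about what closes the ROUGE gap. It passes attention_mask explicitly and processes the test set in chunks instead of one giant padded batch:)

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn').eval()
input_sents = [line.strip() for line in open('./test.source', encoding='utf8')]

batch_size = 8
with open('./test.pred', 'w', encoding='utf8') as out, torch.no_grad():
    for i in range(0, len(input_sents), batch_size):
        batch = tokenizer(input_sents[i:i + batch_size], max_length=1024,
                          truncation=True, padding=True, return_tensors='pt')
        summary_ids = model.generate(batch['input_ids'], attention_mask=batch['attention_mask'],
                                     num_beams=4, length_penalty=2.0, max_length=142,
                                     min_length=56, early_stopping=True, no_repeat_ngram_size=3)
        for sid in summary_ids:
            out.write(tokenizer.decode(sid, skip_special_tokens=True,
                                       clean_up_tokenization_spaces=False).strip() + '\n')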

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

zyxnlp commented 3 years ago

(Quoting @sshleifer's earlier reply above about setting config.task_specific_params, checking validation ROUGE, label smoothing, target-side truncation, and trying xsum.)

Hi @sshleifer, I'm trying to evaluate my best fine-tuned summarization model on the CNN/DM dataset. It seems like I need to use args.use_task_specific_params, but I can't enable it by simply adding --task_specific_params. Is there a solution for that?