astariul closed this issue 3 years ago
Are the outputs produced by your best checkpoint after fine-tuning proper, or are they truncated at the end? I fine-tuned t5-small on CNN/DM, but the best checkpoint was producing outputs that were truncated at the end (for a sample output, see the issue I just raised), and this was also reducing the R1 scores. I just wanted to know whether you faced the same issue, and if not, what the reason might be, as I couldn't figure out why.
Thanks.
@cola I haven't tried fine-tuning bart-large. I could take a pass if you have a command you are running that I can reproduce. Without code, I can speculate on ideas but I can't check whether you are already doing them, so sorry if this is useless:
(1) @tromedlov22's idea reminds me that you should make sure you set config.task_specific_params:
def use_task_specific_params(model, task):
    # update config with summarization-specific params
    task_specific_params = model.config.task_specific_params
    if task_specific_params is not None:
        model.config.update(task_specific_params.get(task, {}))

use_task_specific_params(model, 'summarization')
(2) Another idea: I suspect the authors checked rouge every epoch and stopped at the best validation rouge (roughly what finetune.py does), and that might help results. (A rough callback sketch follows after point (5) below.)
For reference, the params I see are:
{'early_stopping': True,
 'length_penalty': 2.0,
 'max_length': 142,
 'min_length': 56,
 'no_repeat_ngram_size': 3,
 'num_beams': 4}
(3) IIRC, the authors use label_smoothing_cross_entropy, do you?
(4) for cnn, truncation parameters matter on the target side.
(5) if you are purely interested in reproducing finetuning performance, I would experiment with xsum since it trains 30% faster than cnn (shorter targets) (and make sure to use the AutoConfig.from_pretrained('facebook/bart-large-xsum') params). You could also use wandb and then share your logs, which would allow me to give better advice.
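As a rough illustration of idea (2) above with plain pytorch-lightning callbacks (the metric name val_rouge2 and the exact Trainer wiring are assumptions here; finetune.py may log a different key and wire the callbacks differently):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# keep the checkpoint with the best validation ROUGE instead of the last one
checkpoint_cb = ModelCheckpoint(monitor='val_rouge2', mode='max', save_top_k=1)
# stop training once validation ROUGE stops improving
early_stop_cb = EarlyStopping(monitor='val_rouge2', mode='max', patience=3)

# older pytorch-lightning versions take the checkpoint callback via checkpoint_callback=
trainer = pl.Trainer(max_epochs=5, callbacks=[checkpoint_cb, early_stop_cb])
trainer.fit(lightning_module)  # lightning_module: your seq2seq LightningModule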
@tromedlov22 Thanks for the answer. I checked, but the output seems fine, not truncated at the end. I guess we are having different problems.
@sshleifer Thanks for the very detailed answer!
I can't give you a one-command repro; I modified the example code to add missing details from the fairseq repo, such as label smoothing!
(3) IIRC, the authors use label_smoothing_cross_entropy, do you?
Yes I do
Another idea: I suspect the authors checked rouge every epoch and stopped at the best validation rouge (roughly what finetune.py does), and that might help results.
Indeed I'm saving only at the end of training. I will try that.
(5) if you are purely interested in reproducing finetuning performance, I would experiment with xsum since it trains 30% faster than cnn (shorter targets) (and make sure to use the AutoConfig.from_pretrained('facebook/bart-large-xsum') params). You could also use wandb and then share your logs, which would allow me to give better advice.
Thanks for the advice!
(4) for cnn, truncation parameters matter on the target side.
What do you mean?
That would be a very useful PR, @cola!
I could improve my results by using early stopping, thank you very much for the idea @sshleifer!
Now I get an R1 of 43.68, almost the 44.16 from the paper!
I'm trying to find what can cause this small remaining difference, and I would love to hear your opinion about it:
I'm training with batch size 1 (I can't fit more in my 16GB of memory), while the authors fine-tuned with batch size 2 (on 32GB of memory).
Could the difference come from there? Do normalization layers behave differently with single-sample batches, for example?
I'm in a similar place with machine translation. The things I know to be different from fairseq are:
gradient_accumulation_steps
If you have all those squared away, the only other thing I can think of is that the embeddings (we use model.model.shared, they don't) somehow become untied or get different gradients.
Let me know if any of these have mattered, because I'm trying to prioritize what to implement in transformers.
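For reference, gradient accumulation amounts to something like the following generic PyTorch loop (a sketch, not the actual finetune.py implementation). With a per-step batch size of 1 and two accumulation steps, the effective batch size matches the authors' batch size of 2:

def train_with_accumulation(model, optimizer, train_dataloader, accumulation_steps=2):
    # accumulate gradients over `accumulation_steps` mini-batches before each optimizer update
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(train_dataloader):
        loss = model(**batch)[0] / accumulation_steps  # assumes the model returns the loss first; scale so summed grads average out
        loss.backward()  # gradients add up across the accumulated mini-batches
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()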
Here is what I did so far:
Implementing the first one seems complicated, so I didn't try.
Thanks for the help, the detailed list of things to try is awesome!
So far I'm satisfied with the results; they're really close to the paper's. Maybe some tiny difference in the code is responsible for the gap? If I have more time I will try the other things I haven't tried so far :)
I am having similar problems with this myself. @Colanim do you know which of your above changes had the largest impact, so I can begin with those?
@sshleifer I think there is a bug with label_smoothed_nll_loss. I tried using it with current master and I am getting infinite losses because the bs term is zero, and it is the denominator in line 45 (return loss / bs, nll_loss / bs).
Wow, great catch. This line I wrote is broken in so many ways:
bs = pad_mask.long().sum()  # pad_mask has 1 where labels.eq(pad_token_id). This is the number of pad tokens in the batch...
I would delete the denominator if I were you.
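A sketch of one way to patch it, following the suggestion above: either drop the denominator entirely, or normalize by the number of non-pad tokens rather than the pad count (this PyTorch version is an illustration, not the exact examples/seq2seq code):

def label_smoothed_nll_loss(lprobs, target, epsilon, ignore_index):
    # lprobs: (N, vocab) log-probabilities; target: (N,) token ids padded with `ignore_index`
    # (ignore_index must be a valid vocab id, e.g. the tokenizer's pad_token_id)
    if target.dim() == lprobs.dim() - 1:
        target = target.unsqueeze(-1)
    nll_loss = -lprobs.gather(dim=-1, index=target)
    smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
    pad_mask = target.eq(ignore_index)
    nll_loss.masked_fill_(pad_mask, 0.0)
    smooth_loss.masked_fill_(pad_mask, 0.0)
    num_tokens = (~pad_mask).long().sum()  # count of non-pad tokens, NOT pad tokens
    nll_loss = nll_loss.sum()
    smooth_loss = smooth_loss.sum()
    eps_i = epsilon / lprobs.size(-1)
    loss = (1.0 - epsilon) * nll_loss + eps_i * smooth_loss
    return loss / num_tokens, nll_loss / num_tokens  # or simply return loss, nll_loss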
In my experience, warmup_updates can help a lot, as can playing with gradient_accumulation_batches (more for MT; lower -> better). But I'm interested in @Colanim's experience.
BTW, thanks to @stas00 you can now pass --dropout, --attention_dropout, --decoder_layerdrop, and --encoder_layerdrop through the command line.
@Colanim can you rerun evaluation on your 43.68 R1 model? I hope that #6526 might have helped close the gap! It doesn't help for bart-large-cnn, but it does help bart-large-xsum.
Will try as soon as I can! I have to find my checkpoint... ^^
What command are you using, @Colanim? I get OOM even with BS=1 on a 32GB V100 GPU. @sshleifer
python finetune.py \
--data_dir=data/cnn_dm/ \
--output_dir=${RESULTS_DIR} \
--learning_rate=3e-5 \
--fp16 \
--gpus 8 \
--do_train \
--do_predict \
--n_val 1000 \
--val_check_interval 0.1 \
--train_batch_size=1 --gradient_accumulation_steps=4 \
--eval_batch_size=1 \
--max_steps 20000 --warmup_steps=500 \
--eval_max_gen_length=142 --max_source_length=1042 --max_target_length=56 \
--sortish_sampler \
--lr_scheduler polynomial \
--label_smoothing 0.1 \
--weight_decay 0.01 \
--dropout 0.1 --attention_dropout 0.1 --gradient_clip_val=0.1 --early_stop_callback=1
and initializing the model without AutoConfig as:
config = BartConfig(**json.load(open(args.config_path, "r")))
model = BartForConditionalGeneration(config)
tokenizer = BartTokenizer.from_pretrained(
    'facebook/bart-large-cnn')  # Downloads vocab and merges file automatically
Try --num_sanity_val_steps=0 --eval_beams 2. Also initialize with
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
which will do better than random init.

That works initially but fails after ~15k steps. What eval_max_gen_length are you using? Not sure if you froze embeds as mentioned in #6711 for BART CNN/DM as well.
Traceback (most recent call last):
File "finetune.py", line 446, in <module>
main(args)
File "finetune.py", line 421, in main
logger=logger,
File "/workspace/bart/lightning_base.py", line 369, in generic_train
trainer.fit(model)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in fit
results = self.accelerator_backend.spawn_ddp_children(model)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 123, in spawn_ddp_children
results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/accelerators/ddp_backend.py", line 224, in ddp_train
results = self.trainer.run_pretrain_routine(model)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
self.train()
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
self.run_training_epoch()
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
self.run_evaluation(test_mode=False)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 331, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 661, in evaluation_forward
output = model(*args)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/pytorch_lightning/overrides/data_parallel.py", line 174, in forward
output = self.module.validation_step(*inputs[0], **kwargs[0])
File "finetune.py", line 175, in validation_step
return self._generative_step(batch)
File "finetune.py", line 218, in _generative_step
max_length=self.eval_max_length,
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
return func(*args, **kwargs)
File "/workspace/bart/generation_utils.py", line 469, in generate
model_specific_kwargs=model_specific_kwargs,
File "/workspace/bart/generation_utils.py", line 648, in _generate_beam_search
outputs = self(**model_inputs) # (batch_size * num_beams, cur_len, vocab_size)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/bart/modeling_bart.py", line 1037, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/bart/modeling_bart.py", line 909, in forward
return_dict=return_dict,
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/bart/modeling_bart.py", line 570, in forward
output_attentions=output_attentions,
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/bart/modeling_bart.py", line 443, in forward
x = self.activation_fn(self.fc1(x))
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 577, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 87, in forward
return F.linear(input, self.weight, self.bias)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1676, in linear
output = input.matmul(weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 14.56 GiB already allocated; 11.44 MiB free; 14.79 GiB reserved in total by PyTorch)
Definitely use --freeze_embeds. I have never seen it hurt metrics. I have actually never tried to finetune on cnn_dm, but I'm interested to hear your results!
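For reference, --freeze_embeds does roughly the following for a BART-style model (a sketch based on the examples/seq2seq helpers; attribute names may differ across versions):

def freeze_params(module):
    for p in module.parameters():
        p.requires_grad = False

def freeze_embeds(model):
    freeze_params(model.model.shared)  # shared input/output token embeddings
    for part in (model.model.encoder, model.model.decoder):
        freeze_params(part.embed_positions)  # positional embeddings
        freeze_params(part.embed_tokens)     # token embedding views of `shared`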
Still OOMs even with eval_beams=1. #7004 works for me, will confirm when I have it working e2e
Unfortunately I'm not working with BART anymore these days... I didn't try more experiments
Hi @Colanim, could you share your eval script that gets a score of 44.09 with facebook/bart-large-cnn? Thanks!
Basically I use the nlp package to get the cnn_dm data, then run generation with:
# note: samples['article'] is assumed to already be encoded to input ids (model.generate expects token ids, not raw text)
preds = model.generate(samples['article'],
                       num_beams=4, length_penalty=2,
                       max_length=142, min_length=56,
                       early_stopping=True,
                       no_repeat_ngram_size=3)
and save the predictions and gold summaries in text files. Then I use the files2rouge package to get ROUGE scores. Also don't forget to tokenize the predictions and gold with Stanford CoreNLP!
Hi @Colanim, I tried to reproduce the paper's results from the facebook/bart-large-cnn checkpoint, but somehow my ROUGE-1 score is only 42.62. I followed the steps below, could you help me figure out what's wrong? Thanks!

Infer:
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

source_pwd = './test.source'
input_sents = open(source_pwd, 'r', encoding='utf8').readlines()

with open('./test.pred', 'w', encoding='utf8') as out:
    inputs = tokenizer(input_sents, max_length=1024, return_tensors='pt', truncation=True, padding=True)
    summary_ids = model.generate(inputs['input_ids'], num_beams=4, length_penalty=2, max_length=142, min_length=56, early_stopping=True, no_repeat_ngram_size=3)
    for summary_id in summary_ids:
        out.write(tokenizer.decode(summary_id, skip_special_tokens=True, clean_up_tokenization_spaces=False).strip() + '\n')
Eval:
cat test.target | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.target.tokenized
cat test.pred | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.pred.tokenized
files2rouge test.pred.tokenized test.target.tokenized
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @sshleifer, I'm trying to test the best fine-tuned SUMM model on the CNN/DM dataset. It seems like I need to use args.use_task_specific_params, but I can't use it by simply adding --task_specific_params. Is there a solution for that?
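One possible workaround, since the script does not seem to expose such a flag, is to apply the params directly in code before running evaluation, re-using the helper shown earlier in this thread (a sketch; the checkpoint name is a placeholder for your own fine-tuned model):

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')  # or your fine-tuned checkpoint
task_specific_params = model.config.task_specific_params
if task_specific_params is not None:
    model.config.update(task_specific_params.get('summarization', {}))
# model.generate(...) now defaults to the summarization beam-search settings from the config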
❓ Help

I'm trying to fine-tune BART on CNN/DM by myself (so, starting from the facebook/bart-large checkpoint). However, I can't reproduce the results so far... The BART authors report an R1 score of 44.16 in their paper, but my best checkpoint so far is only 42.53.

It's not an issue with the eval script, as I can reproduce the authors' results from the facebook/bart-large-cnn checkpoint: I get a score of 44.09 with it.

I tried several hyper-parameters: the ones provided in the example folder, but also the ones used in the fairseq repo. It doesn't change anything...

I'm a bit at a loss on how to reproduce these fine-tuning scores...

Could anyone fine-tune BART successfully using the transformers repo? If yes, can you share your parameters? Any help would be greatly appreciated! @sshleifer