salesforce / ctrl-sum

Resources for the "CTRLsum: Towards Generic Controllable Text Summarization" paper
https://arxiv.org/abs/2012.04281
BSD 3-Clause "New" or "Revised" License

Qs on finetuning by myself on CNN dataset #4

Closed: shakeley closed this issue 3 years ago

shakeley commented 3 years ago

❓ Questions on finetuning

I finetuned the model myself (from the fairseq BART.large checkpoint) using the train_bart.sh in the repo with src=oraclewordsource, and got a quite strange ROUGE score compared to that of the released checkpoint.

Code

I use 4 V100 GPUs, so I changed update_freq to 16 to match the original effective batch size (max_tokens 1024 x 8 GPUs x update_freq 8; see the sanity check after the script). The other finetuning parameters are unchanged. The exact train_bart.sh I used is as follows:

DATE=`date +%Y%m%d`
data_bin="cnndm"
dropout=0.1
label_smoothing=0.1
GPU=2,3,4,7
train_steps=30000
warmup_updates=500
lr=3e-05
src='oraclewordsource'
cstring=''
tgt='target'
update_freq=16  # 8 for 8 GPUs
max_tokens=1024
save_interval_updates=2000
keep_interval_updates=1
log_interval=200

criterion='label_smoothed_cross_entropy'
checkpoint="checkpoint_best.pt"

...

export CUDA_VISIBLE_DEVICES=${GPU} 
fairseq-train data-bin/${data_bin} \
    --restore-file ${restore_file} \
    --max-tokens ${max_tokens} \
    --task translation \
    --source-lang ${src} \
    --target-lang ${tgt} \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr ${lr} --total-num-update ${train_steps} --warmup-updates ${warmup_updates} \
    --max-update ${train_steps} \
    --update-freq ${update_freq} \
    --skip-invalid-size-inputs-valid-test \
    --find-unused-parameters \
    --log-format simple --log-interval ${log_interval} \
    --best-checkpoint-metric ppl \
    --save-dir ${SAVE} \
    --save-interval-updates ${save_interval_updates} --tensorboard-logdir ${TENSORBOARD} \
    --validate-interval 1000 --keep-interval-updates ${keep_interval_updates} --save-interval 1000 --no-epoch-checkpoints \
    ${add_load_string} \
    | tee -a ${SAVE}/stdout.log
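
For reference, here is the effective-batch-size arithmetic behind the update_freq change (a back-of-envelope sketch; in fairseq, max_tokens is applied per GPU per forward pass):

# tokens per optimizer step = max_tokens x n_gpus x update_freq
# original setup: 1024 x 8 x 8  = 65536 tokens/update
# my setup:       1024 x 4 x 16 = 65536 tokens/update  (matches)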

The score on the valid set

After 40k steps of finetuning on CNN (actually not necessary, thanks for the comment below), I got the following scores on the valid set:

---------------------------------------------
1 ROUGE-1 Average_R: 0.46306 (95%-conf.int. 0.46130 - 0.46489)
1 ROUGE-1 Average_P: 0.46201 (95%-conf.int. 0.46011 - 0.46389)
1 ROUGE-1 Average_F: 0.45487 (95%-conf.int. 0.45328 - 0.45639)
---------------------------------------------
1 ROUGE-2 Average_R: 0.16004 (95%-conf.int. 0.15858 - 0.16146)
1 ROUGE-2 Average_P: 0.15922 (95%-conf.int. 0.15780 - 0.16062)
1 ROUGE-2 Average_F: 0.15693 (95%-conf.int. 0.15560 - 0.15829)
---------------------------------------------
1 ROUGE-L Average_R: 0.42004 (95%-conf.int. 0.41827 - 0.42180)
1 ROUGE-L Average_P: 0.41830 (95%-conf.int. 0.41649 - 0.42013)
1 ROUGE-L Average_F: 0.41230 (95%-conf.int. 0.41076 - 0.41381)

The scores I obtained with the released checkpoint from the repo are:

---------------------------------------------
1 ROUGE-1 Average_R: 0.65854 (95%-conf.int. 0.65601 - 0.66100)
1 ROUGE-1 Average_P: 0.58624 (95%-conf.int. 0.58347 - 0.58886)
1 ROUGE-1 Average_F: 0.60919 (95%-conf.int. 0.60702 - 0.61135)
---------------------------------------------
1 ROUGE-2 Average_R: 0.39357 (95%-conf.int. 0.39065 - 0.39653)
1 ROUGE-2 Average_P: 0.35346 (95%-conf.int. 0.35042 - 0.35649)
1 ROUGE-2 Average_F: 0.36590 (95%-conf.int. 0.36306 - 0.36879)
---------------------------------------------
1 ROUGE-L Average_R: 0.62027 (95%-conf.int. 0.61776 - 0.62274)
1 ROUGE-L Average_P: 0.55265 (95%-conf.int. 0.54987 - 0.55533)
1 ROUGE-L Average_F: 0.57412 (95%-conf.int. 0.57185 - 0.57647)

This difference leaves me very confused; I must be missing some important details.

I noticed that the released tar.gz contains some extra files, like dict.extwordssourcetrunclead.txt and dict.targettrunclead.txt, which seem to be used for checkpoint evaluation but not for my own finetuning. Is this one of the reasons for my problem? What are these two txt files?

It would be very kind of you to help me. Thanks!

shakeley commented 3 years ago

My finetuning curves are as follows. The blue line represents the valid set.

[Training curve plots: loss and ppl]

jxhe commented 3 years ago

Hi,

The dict.extwordssourcetrunclead.txt and dict.targettrunclead.txt files are the vocab files used by the model; they should be identical to your datasets/cnndm/dict.txt if you processed the dataset correctly.
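
A quick way to check (a sketch, not from the repo; the cnndm_ckpt/ extraction path is illustrative):

# compare the shipped vocab against the one your preprocessing produced;
# diff exits 0 (and prints nothing) when the files are identical
diff cnndm_ckpt/dict.extwordssourcetrunclead.txt datasets/cnndm/dict.txt \
    && echo "vocabs match"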

Your training curve does seem incorrect, though: our validation ppl eventually converges to ~3, but yours seems much larger. Can you post your log file here as well?

BTW, you don't need to double train_steps; it counts optimizer update steps and is therefore unrelated to update_freq. (This is a small point, just FYI, to save training time.)
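
In other words (a sketch of the bookkeeping, not taken from the repo):

# --max-update counts optimizer steps, not forward passes:
#   optimizer updates = train_steps                (30000, for any update_freq)
#   forward passes    = train_steps x update_freq  (this is what update_freq scales)
# raising update_freq already preserves tokens-per-update, so doubling
# train_steps on top of that would just double total compute.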

shakeley commented 3 years ago

Thanks for the explanations! It seems that I only have stdout.log, which contains some information about the training process.

jxhe commented 3 years ago

I think I found the issue. The log says the BART checkpoint was not loaded:

2021-02-28 10:47:44 | INFO | fairseq.trainer | no existing checkpoint found /home/kelixie/ctrlsum/bart.large

The BART checkpoint path passed to the script should point to a .pt file, not a directory.
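
For anyone hitting the same log message, the fix is a one-line path change (a sketch; model.pt is the filename shipped inside fairseq's bart.large archive, and the directory is the one from the log above):

# wrong: restore_file points at the extracted directory
restore_file=/home/kelixie/ctrlsum/bart.large
# right: point at the checkpoint file inside it
restore_file=/home/kelixie/ctrlsum/bart.large/model.pt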

shakeley commented 3 years ago

Ohhhh... that was a careless mistake :( Thanks for your patient and detailed reply. I can reproduce the results now!