Closed fseasy closed 4 years ago
Hi, I did encounter this problem, though most of the time it works fine. What you said about distributed data loading may be the reason for this. If what you said is true, does this mean that setting -train_steps to a very large number would avoid the problem (since no GPU could reach the end of all the data)?
For the reproduction part, could you be more specific about your settings?
Thanks for the response!
Finetune:
set -x
VISIBLE_GPUS="4,5,6,7"
# Totally align!!
python src/train.py \
--task abs \
--mode train \
--vocab_path data/pytorch-bert-base-uncased/bert-base-uncased-vocab.txt \
--config_path data/pytorch-bert-base-uncased/bert-base-uncased-config.json \
--bert_model_path data/pytorch-bert-base-uncased/bert-base-uncased-pytorch_model.bin \
--bert_data_path data/bert_data_cnndm_final/cnndm \
--log_file output/run2/train_abs_full_align.log \
--model_path output/run2/model_full_align \
--sep_optim true \
--lr_bert 0.002 \
--lr_dec 0.2 \
--dec_dropout 0.2 \
--train_steps 200000 \
--save_checkpoint_steps 2000 \
--warmup_steps_bert 20000 \
--warmup_steps_dec 10000 \
--batch_size 140 \
--accum_count 5 \
--report_every 50 \
--max_pos 512 \
--use_bert_emb true \
--use_interval true \
--visible_gpus $VISIBLE_GPUS \
--distributed_port 10008
Validate & Test
set -x
VISIBLE_GPUS="0"
python src/train.py \
--task abs \
--mode validate \
--test_all True \
--batch_size 3000 \
--test_batch_size 500 \
--vocab_path data/pytorch-bert-base-uncased/bert-base-uncased-vocab.txt \
--config_path data/pytorch-bert-base-uncased/bert-base-uncased-config.json \
--bert_model_path data/pytorch-bert-base-uncased/bert-base-uncased-pytorch_model.bin \
--bert_data_path data/bert_data_cnndm_final/cnndm \
--log_file output/run2/validate_abs_big_batch_size.log \
--model_path output/run2/model_full_align \
--result_path output/run2/predict_model_full_align/cnn_dm \
--sep_optim true \
--use_interval true \
--visible_gpus "$VISIBLE_GPUS" \
--max_pos 512 \
--max_length 200 \
--alpha 0.95 \
--min_length 50
and got results:
validation top3:
(2.15112486930981, 'output/run2/model_full_align/model_step_140000.pt', 69),
(2.1529659433971706, 'output/run2/model_full_align/model_step_156000.pt', 77),
(2.1541826476652934, 'output/run2/model_full_align/model_step_138000.pt', 68)
test results of the validate-top3:
model_step_140000.pt:
ROUGE-F(1, 2, l): 41.49/19.05/38.56
ROUGE-R(1, 2, l): 46.26/21.18/42.97
ROUGE-P(1, 2, l): 39.68/18.29/36.90
model_step_156000.pt:
ROUGE-F(1, 2, l): 41.41/18.98/38.46
ROUGE-R(1, 2, l): 46.01/21.03/42.71
ROUGE-P(1, 2, l): 39.72/18.29/36.91
model_step_138000.pt:
ROUGE-F(1, 2, l): 41.50/19.11/38.60
ROUGE-R(1, 2, l): 45.76/21.02/42.55
ROUGE-P(1, 2, l): 40.11/18.55/37.32
AVG: ROUGE1, 2, L = 41.47, 19.05, 38.54
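The averaging above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the F1 numbers and the paper's figures are copied from this thread, and the helper name `column_means` is made up for illustration.

```python
# F1 scores (ROUGE-1, ROUGE-2, ROUGE-L) of the three checkpoints listed above.
scores = {
    "model_step_140000.pt": (41.49, 19.05, 38.56),
    "model_step_156000.pt": (41.41, 18.98, 38.46),
    "model_step_138000.pt": (41.50, 19.11, 38.60),
}
# Figures quoted from the paper in this thread.
paper = (41.72, 19.39, 38.76)


def column_means(rows):
    """Average each metric (column) over all checkpoints (rows)."""
    rows = list(rows)
    return tuple(sum(col) / len(rows) for col in zip(*rows))


avg = column_means(scores.values())
gap = tuple(p - a for p, a in zip(paper, avg))

print("avg F1:      ", ", ".join(f"{x:.2f}" for x in avg))
print("gap to paper:", ", ".join(f"{x:.2f}" for x in gap))
```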
This is a bit lower than what the paper reports (41.72, 19.39, 38.76).
What's more, I found that the trigram blocking in this code is not the same as in UniLM or fairseq: this code blocks a candidate that already contains a duplicated trigram (on full tokens), whereas the others prevent a duplicated trigram from being generated at each step (on sub-tokens). I haven't tested the effect of this difference. I would appreciate it if you know anything about this.
Thanks! Not being able to reproduce the results really hurts; I don't know what is wrong...
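To make the two flavours of trigram blocking concrete, here is a minimal sketch. It is not the PreSumm, UniLM, or fairseq code; both function names are hypothetical, and whether the tokens are full words or sub-tokens is left to the caller.

```python
def has_repeated_trigram(tokens):
    """Post-hoc check: True if some trigram already occurs twice in `tokens`.

    A decoder using this style would discard (or penalise) a hypothesis
    after the duplicate has been produced.
    """
    seen = set()
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if trigram in seen:
            return True
        seen.add(trigram)
    return False


def banned_next_tokens(tokens):
    """Preventive check: tokens that would complete an already-seen trigram.

    A decoder using this style would mask these tokens before choosing
    the next step, so the duplicate is never generated.
    """
    if len(tokens) < 2:
        return set()
    last_bigram = tuple(tokens[-2:])
    return {
        tokens[i + 2]
        for i in range(len(tokens) - 2)
        if tuple(tokens[i:i + 2]) == last_bigram
    }


prefix = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(prefix))
print(has_repeated_trigram(prefix + ["sat"]))
```

The two checks agree on what counts as a duplicate; they differ in when it is applied (after versus before generation) and, per the comment above, in the token granularity they are run on.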
-1. Oh, I was wrong. The learning rate under the noam schedule is indeed not affected by the total number of steps.
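For reference, a sketch of the commonly used Noam formula shows why the total step count does not matter: the rate depends only on the current step, the base rate, and the warmup length. The function below is an illustration assuming that common form, not a copy of the repository's optimizer; the values 0.002 and 20000 are the --lr_bert and --warmup_steps_bert settings from the script above.

```python
def noam_lr(step, base_lr, warmup_steps):
    """Noam schedule: linear warmup, then inverse-square-root decay.

    Note there is no `train_steps` argument: the total number of
    training steps never enters the formula.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1000, 20000, 140000, 200000):
    print(step, noam_lr(step, base_lr=0.002, warmup_steps=20000))
```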
-3. Ok, I got it~
-2.
I compared ROUGE-1.5.5.pl with the one in files2rouge (rouge-perl) and there is no diff, so the ROUGE Perl script is the same, and for pyrouge I just use your code. The run that hung was actually using PyTorch 1.1, so it should be aligned. I just ran the released script. As far as I can tell, I kept everything consistent.
The only thing I changed is the validation script: I added --test_all True to the options because I ran validation after training had finished. From the code logic, the only difference is that test_all validates all checkpoints, keeps the top-K checkpoints (the code seems to use 5, the paper 3), then tests them and reports ROUGE, while test_all == False validates and tests every checkpoint.
I just want to make sure: is the selection of the top-3 checkpoints based on validation or on test? I found that validation and test may differ a bit in distribution, because I sometimes get a higher ROUGE on test from a checkpoint that is worse on validation (larger validation loss; this may also be because loss and ROUGE measure different things).
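The top-K selection described above can be sketched as follows. This is an illustration of the idea only, not the repository's validation code; `select_top_k` is a hypothetical helper, and the losses and paths are the three validation entries reported earlier in this thread.

```python
import heapq


def select_top_k(results, k=3):
    """Return the k (validation_loss, checkpoint_path) pairs with the lowest loss."""
    return heapq.nsmallest(k, results, key=lambda r: r[0])


# Validation results from this thread, deliberately listed out of order.
results = [
    (2.1541826476652934, "output/run2/model_full_align/model_step_138000.pt"),
    (2.15112486930981, "output/run2/model_full_align/model_step_140000.pt"),
    (2.1529659433971706, "output/run2/model_full_align/model_step_156000.pt"),
]

for loss, path in select_top_k(results, k=3):
    print(f"{loss:.4f}  {path}")
```

Only the checkpoints returned here would then be decoded on the test set and scored with ROUGE, which is much cheaper than testing every saved checkpoint.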
Thanks so much.
It is based on validation. A better ROUGE can be achieved if you select checkpoints by validation ROUGE instead of validation ppl (though that's not what we did). The code in this release changed slightly compared to the paper version, which reduces the real batch size, but I have made sure the results are nearly the same. If you still cannot reproduce the results, you could try enlarging accum_count or batch_size.
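As a rough guide to how those two knobs interact, the sketch below estimates the effective batch per optimizer update. It assumes plain data-parallel training in which every GPU accumulates accum_count batches before each update; the unit of batch_size is whatever the repository's batching function counts, and the helper name is hypothetical.

```python
def effective_batch(batch_size, accum_count, num_gpus):
    """Approximate amount of data per optimizer update.

    Assumption: each of the `num_gpus` workers accumulates gradients over
    `accum_count` batches of size `batch_size` before one synchronized update.
    """
    return batch_size * accum_count * num_gpus


# Settings from the finetuning script in this thread (4 visible GPUs).
print(effective_batch(batch_size=140, accum_count=5, num_gpus=4))
# Doubling accum_count doubles the effective batch without extra GPU memory.
print(effective_batch(batch_size=140, accum_count=10, num_gpus=4))
```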
Got it! I'll try to enlarge the batch size and total training step.
Hi, I'm trying to run BertAbs as well, but this repository only provides 4 trained models.
So I'm wondering what you put in MODEL_PATH:
--model_path output/run2/model_full_align \
Where did you download model_full_align?
thanks!
@matt9704 It's my own training result~
MODEL_PATH should be the output, not the input.
Thanks! So if I want to finetune a BertAbs model, I can just set MODEL_PATH to the path of an empty folder (to save the finetuned model). Is that right?
Maybe not. If you want to continue training from the pretrained BertAbs, you need to set the train_from argument, which is defined here:
https://github.com/nlpyang/PreSumm/blob/70b810e0f06d179022958dd35c1a3385fe87f28c/src/train.py#L107
and the loading logic is here
I think it's better to see the code if you want to know more beyond the README ;) @matt9704
Thanks, I'm just wondering where I can find the pretrained BertAbs, so that I can load it in and finetune it on my dataset...
ENV
Python: python 3.7.4
PyTorch: torch==1.1.0
Hang problem
When I ran BertAbs with the script from the README, training hung. Specifically, training BertAbs on 4 V100 32G cards hung near the end: I ran it twice, and both times it hung at about 170K / 200K steps (not exactly the same step). But when I use a bigger batch size (140 x 4) or a single card, it finishes correctly.
After reading the code, I suspect that the data loader for distributed multi-card training may cause the problem. The data loader doesn't guarantee that every card gets the same number of batches. In the edge case at the end of an epoch where only 3 batches are left, 3 cards get data but the 4th GPU doesn't, so it moves on to the next epoch and synchronizes parameters with the other cards, which are actually still in the previous epoch. As this accumulates over time, the offset becomes bigger and bigger, and training hangs at some step (for example, one process has gone through all its data while the other processes are still running, so the distributed sync hangs waiting for the finished process).
I'm just guessing... I haven't tested this suspicion yet.
Anyway, the actual reason may be something else. I'm here mostly to ask whether the author or anyone else has encountered this.
What's more, I'm anxious because I can't reproduce the paper's results (in one case because of this hang, and in the other, non-hanging settings, ROUGE-1 is about 0.36 lower than the paper), and I want to do follow-up work based on those results.
Please point me in the right direction or help me...
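The suspected imbalance can be illustrated with a toy simulation. It does not reproduce the repository's data loader: the round-robin assignment, the batch count of 11, and both function names are assumptions chosen only to show how ranks can end an epoch with different numbers of batches.

```python
def deal_batches(num_batches, world_size):
    """Round-robin assignment: batch i goes to rank i % world_size.

    Returns how many batches each rank receives in one epoch.
    """
    counts = [0] * world_size
    for i in range(num_batches):
        counts[i % world_size] += 1
    return counts


def even_batches(num_batches, world_size):
    """One possible remedy: drop the remainder so every rank takes the same number of steps."""
    return [num_batches // world_size] * world_size


uneven = deal_batches(num_batches=11, world_size=4)  # hypothetical sizes
print("per-rank batches:", uneven)
print("steps of drift per epoch:", max(uneven) - min(uneven))
print("after dropping the remainder:", even_batches(11, 4))
```

If every gradient synchronization is a collective call that all ranks must enter, a rank that runs out of batches one step early either starts the next epoch ahead of the others or stops calling the collective, and in both cases the ranks are no longer in lockstep. Equalizing the per-rank batch count (by dropping or padding the remainder) is one common way to rule this cause out.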