fseasy commented 4 years ago

ENV

Python: 3.7.4
PyTorch: torch==1.1.0

Hang problem

When I run BertAbs with the script from the README, training hangs. Specifically, when training BertAbs on 4 V100 32G cards, it hangs near the end: I ran it twice, and both times it hung at about 170K of 200K steps (not exactly the same step). But with a bigger batch size (140 x 4), or with 1 card, it finishes correctly.

After reading the code, I suspect the data loader for distributed multi-card training may be the cause: it does not guarantee that every card gets the same number of batches. Consider the edge case at the end of an epoch where only 3 batches are left: 3 cards get data, but the 4th GPU gets none, so it moves on to the next epoch and does parameter sync with the other cards, which are actually still in the previous epoch. As this accumulates over time, the offset grows bigger and bigger, and training hangs at some step (for example, one process has gone through all of its data while the other processes are still running, so the distributed sync hangs waiting for the finished process).
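
The edge case described above can be illustrated with a small simulation. This is only a sketch: it assumes batches are dealt to ranks round-robin, which is an assumption for illustration and not necessarily how the PreSumm data loader shards data, and the function names are hypothetical.

```python
def batches_per_rank(num_batches, world_size):
    """Count how many batches each rank receives when batches are dealt round-robin."""
    counts = [0] * world_size
    for batch_idx in range(num_batches):
        counts[batch_idx % world_size] += 1
    return counts


def ranks_short_of_batches(num_batches, world_size):
    """Return the ranks that receive fewer batches than the busiest rank."""
    counts = batches_per_rank(num_batches, world_size)
    most = max(counts)
    return [rank for rank, n in enumerate(counts) if n < most]


# Hypothetical epoch: 7 batches shared by 4 GPUs leaves only 3 batches for the last round.
print(batches_per_rank(7, 4))        # [2, 2, 2, 1]
print(ranks_short_of_batches(7, 4))  # [3]
```

A rank that comes up short would enter its next epoch one step early, so the collective sync calls of the ranks no longer line up on the same data position.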

I'm just guessing... I haven't tested this suspicion yet.

Anyway, the actual reason may be something else. I'm mostly here to ask whether the author or anyone else has encountered this.

What's more, I'm anxious because I can't reproduce the paper's results (partly because of this hang, and in other settings that do not hang, ROUGE-1 is about 0.36 lower than the paper) and I want to do follow-up work based on these results.

Please give me some pointers or help...

nlpyang commented 4 years ago

Hi, I did encounter this problem, though most of the time it works fine. What you said about distributed data loading may be the reason for this. If it is true, does this mean the problem will not occur if we set -train_steps to a very large number (since no GPU can reach the end of all data)?

For the reproduction part, could you be more specific about the setting?

fseasy commented 4 years ago

Thanks for the response!

I haven't tested this, but after I changed the data loader to an infinite, multiprocessing.Queue-based data provider, it no longer hung (although I also upgraded PyTorch from 1.1 to 1.4 at the same time). Enlarging the train steps would change the learning rate schedule, and I am afraid that may change the result.
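
The idea of a data provider that never runs out can be sketched as follows. This is a simplified, single-process illustration and not the multiprocessing.Queue-based provider mentioned above; `infinite_batches` and `toy_epoch` are hypothetical names.

```python
import itertools


def infinite_batches(make_epoch_iterator):
    """Yield batches forever by restarting the epoch iterator whenever it is exhausted."""
    while True:
        yielded = False
        for batch in make_epoch_iterator():
            yielded = True
            yield batch
        if not yielded:
            # Guard against spinning forever on an empty dataset.
            raise ValueError("the epoch iterator produced no batches")


def toy_epoch():
    """Hypothetical stand-in for one pass over a rank's shard of the data."""
    return iter(["batch_a", "batch_b", "batch_c"])


stream = infinite_batches(toy_epoch)
first_seven = list(itertools.islice(stream, 7))
print(first_seven)
```

With a stream like this, every rank stops only when the step counter reaches the configured number of training steps, so no rank finishes earlier than the others because its data ran out.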
I ran exactly your scripts:

Finetune:

set -x

VISIBLE_GPUS="4,5,6,7"

# Totally align!!

python src/train.py \
    --task abs \
    --mode train \
    --vocab_path data/pytorch-bert-base-uncased/bert-base-uncased-vocab.txt \
    --config_path data/pytorch-bert-base-uncased/bert-base-uncased-config.json \
    --bert_model_path data/pytorch-bert-base-uncased/bert-base-uncased-pytorch_model.bin \
    --bert_data_path data/bert_data_cnndm_final/cnndm \
    --log_file output/run2/train_abs_full_align.log \
    --model_path output/run2/model_full_align \
    --sep_optim true \
    --lr_bert 0.002 \
    --lr_dec 0.2 \
    --dec_dropout 0.2  \
    --train_steps 200000 \
    --save_checkpoint_steps 2000 \
    --warmup_steps_bert 20000 \
    --warmup_steps_dec 10000 \
    --batch_size 140 \
    --accum_count 5 \
    --report_every 50 \
    --max_pos 512 \
    --use_bert_emb true \
    --use_interval true \
    --visible_gpus $VISIBLE_GPUS \
    --distributed_port 10008

Validate & Test

set -x

VISIBLE_GPUS="0"

python src/train.py \
    --task abs \
    --mode validate \
    --test_all True \
    --batch_size 3000 \
    --test_batch_size 500 \
    --vocab_path data/pytorch-bert-base-uncased/bert-base-uncased-vocab.txt \
    --config_path data/pytorch-bert-base-uncased/bert-base-uncased-config.json \
    --bert_model_path data/pytorch-bert-base-uncased/bert-base-uncased-pytorch_model.bin \
    --bert_data_path data/bert_data_cnndm_final/cnndm \
    --log_file output/run2/validate_abs_big_batch_size.log \
    --model_path output/run2/model_full_align \
    --result_path output/run2/predict_model_full_align/cnn_dm \
    --sep_optim true \
    --use_interval true \
    --visible_gpus "$VISIBLE_GPUS" \
    --max_pos 512 \
    --max_length 200 \
    --alpha 0.95 \
    --min_length 50

and got results:

validation top3:

(2.15112486930981, 'output/run2/model_full_align/model_step_140000.pt', 69),
(2.1529659433971706, 'output/run2/model_full_align/model_step_156000.pt', 77),
(2.1541826476652934, 'output/run2/model_full_align/model_step_138000.pt', 68)

test results of the validate-top3: 

model_step_140000.pt:
    ROUGE-F(1, 2, l): 41.49/19.05/38.56
    ROUGE-R(1, 2, l): 46.26/21.18/42.97
    ROUGE-P(1, 2, l): 39.68/18.29/36.90

model_step_156000.pt:
    ROUGE-F(1, 2, l): 41.41/18.98/38.46
    ROUGE-R(1, 2, l): 46.01/21.03/42.71
    ROUGE-P(1, 2, l): 39.72/18.29/36.91

model_step_138000.pt:
    ROUGE-F(1, 2, l): 41.50/19.11/38.60
    ROUGE-R(1, 2, l): 45.76/21.02/42.55
    ROUGE-P(1, 2, l): 40.11/18.55/37.32

AVG: ROUGE-1, 2, L = 41.47, 19.05, 38.54

It is a bit lower than what the paper reports (41.72, 19.39, 38.76).
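
The average and its gap to the paper can be checked with a few lines of arithmetic. The numbers are copied from the ROUGE-F rows of the three checkpoints and from the paper figures quoted in this comment; the variable names are only for illustration.

```python
# ROUGE-F (1, 2, L) scores of the three checkpoints listed above.
rouge_f = {
    "model_step_140000.pt": (41.49, 19.05, 38.56),
    "model_step_156000.pt": (41.41, 18.98, 38.46),
    "model_step_138000.pt": (41.50, 19.11, 38.60),
}
# ROUGE-1, 2, L as reported in the paper (quoted in this comment).
paper = (41.72, 19.39, 38.76)

# Average each metric over the three checkpoints, then compare with the paper.
columns = list(zip(*rouge_f.values()))
average = tuple(round(sum(col) / len(col), 2) for col in columns)
gap = tuple(round(p - a, 2) for p, a in zip(paper, average))

print(average)  # (41.47, 19.05, 38.54)
print(gap)      # (0.25, 0.34, 0.22)
```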

What's more, I found that the trigram blocking in this code is not the same as in UniLM or fairseq: this code blocks a hypothesis once it already contains a duplicated trigram (over full tokens), while the others prevent a duplicated trigram from being generated at each step (over sub-tokens). I haven't tested the effect of this difference. I would appreciate it if you know more about this.
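
The two behaviours can be contrasted with a rough sketch. It assumes WordPiece-style sub-tokens with a "##" continuation prefix; the function names are hypothetical and this is not the code of PreSumm, UniLM, or fairseq.

```python
def merge_wordpieces(subtokens):
    """Join WordPiece sub-tokens ("play", "##ing") back into full tokens ("playing")."""
    return " ".join(subtokens).replace(" ##", "").split()


def last_trigram_repeats(tokens):
    """Full-token check: True if the newest trigram already occurred earlier."""
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    return len(trigrams) > 1 and trigrams[-1] in trigrams[:-1]


def banned_next_subtokens(subtokens):
    """Step-level check: sub-tokens that would complete an already generated trigram."""
    if len(subtokens) < 2:
        return set()
    prefix = tuple(subtokens[-2:])
    return {
        subtokens[i + 2]
        for i in range(len(subtokens) - 2)
        if tuple(subtokens[i:i + 2]) == prefix
    }


hyp = ["the", "cat", "sat", "and", "the", "cat", "sat"]
# First style: the duplicate is detected after it was generated, so the beam is dropped.
print(last_trigram_repeats(merge_wordpieces(hyp)))  # True
# Second style: the offending sub-token is banned before it can be generated.
print(banned_next_subtokens(hyp[:-1]))              # {'sat'}
```

In the first style a beam is lost once the duplicate appears, whereas in the second the beam survives and simply picks a different next sub-token.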

Thanks! Not being able to reproduce the results really hurts; I don't know what is wrong...

nlpyang commented 4 years ago

Enlarging the train steps will not change the learning rate schedule.
Could you confirm that all settings, including ROUGE and PyTorch, are consistent with my code? I can reproduce similar results here with different random seeds (ROUGE-1 difference < 0.1).
Trigram blocking can be different. I blocked full-token trigrams because that is what we intended, but it affects beam search (it reduces the number of available beams). Sub-token blocking seems to be a compromise for beam search.
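
On the first point, a minimal sketch of a Noam-style schedule shows why the total number of steps does not matter. It assumes the usual form (linear warm-up followed by inverse square-root decay, without a model-size factor); the repository's optimizer may scale it differently, and `noam_lr` is a hypothetical name. The learning rate and warm-up values are taken from the finetune script above.

```python
def noam_lr(step, base_lr, warmup_steps):
    """Noam schedule: linear warm-up, then decay proportional to step ** -0.5."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Note that the total number of training steps is not an input.
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


# BERT-side settings from the script: --lr_bert 0.002, --warmup_steps_bert 20000.
for step in (10000, 20000, 170000):
    print(step, noam_lr(step, base_lr=0.002, warmup_steps=20000))
```

Because the rate at a given step depends only on the step, the base rate, and the warm-up length, raising -train_steps leaves the rate at every earlier step unchanged.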

fseasy commented 4 years ago

1. Oh, I was wrong. The learning rate under the noam schedule is indeed not affected by the total number of steps.
3. OK, got it~
2. I compared ROUGE-1.5.5.pl with files2rouge-rouge-perl and found no diff, so the ROUGE Perl script is the same, and for pyrouge I just use your code. The run that hung was actually using PyTorch 1.1, so that should be aligned too. I simply ran the released script, so as far as I can tell everything should be consistent.

The only thing I changed is the validation script: I added --test_all True to the options because I run validation after training has finished. From the code logic, the only difference is that with test_all the script validates all checkpoints, keeps the top-K (the code seems to use 5, the paper uses 3), then tests those and reports ROUGE, while test_all == False validates and tests every checkpoint.

I just want to confirm: is the selection of the top-3 checkpoints based on validation or on test? I found that validation and test may differ a bit in distribution, because I sometimes get a higher ROUGE on test from a checkpoint that is worse on validation (larger validation loss). Could this also be due to the difference between loss and ROUGE?

Thanks so much.

nlpyang commented 4 years ago

It is based on validation. A better ROUGE can be achieved if you select checkpoints by validation ROUGE instead of validation ppl (though that's not what we did). The code had a slight change in this release compared to the paper version, which reduces the real batch size, but I have made sure the results are nearly the same. If you still cannot reproduce the results, you could try enlarging accum_count or batch_size.
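
Selection by validation loss can be sketched like this, using the three (loss, path, index) tuples reported earlier in the thread. `select_top_k` is a hypothetical helper, and reading the loss as a per-token cross-entropy in nats (so that ppl = exp(loss)) is an assumption.

```python
import math

# (validation loss, checkpoint path, checkpoint index), as reported earlier in the thread.
results = [
    (2.1541826476652934, "output/run2/model_full_align/model_step_138000.pt", 68),
    (2.15112486930981, "output/run2/model_full_align/model_step_140000.pt", 69),
    (2.1529659433971706, "output/run2/model_full_align/model_step_156000.pt", 77),
]


def select_top_k(results, k):
    """Keep the k checkpoints with the lowest validation loss."""
    return sorted(results, key=lambda item: item[0])[:k]


for loss, path, _ in select_top_k(results, k=3):
    # Perplexity under the stated assumption about the loss.
    print(f"{path}: loss={loss:.4f} ppl={math.exp(loss):.2f}")
```

Selecting by validation ROUGE instead would only change the sort key (and the sort direction), at the cost of decoding the validation set for every checkpoint.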

fseasy commented 4 years ago

Got it! I'll try to enlarge the batch size and total training step.

matt9704 commented 4 years ago

Finetune:

set -x

VISIBLE_GPUS="4,5,6,7"

# Totally align!!

python src/train.py \
    --task abs \
    --mode train \
    --vocab_path data/pytorch-bert-base-uncased/bert-base-uncased-vocab.txt \
    --config_path data/pytorch-bert-base-uncased/bert-base-uncased-config.json \
    --bert_model_path data/pytorch-bert-base-uncased/bert-base-uncased-pytorch_model.bin \
    --bert_data_path data/bert_data_cnndm_final/cnndm \
    --log_file output/run2/train_abs_full_align.log \
    --model_path output/run2/model_full_align \
    --sep_optim true \
    --lr_bert 0.002 \
    --lr_dec 0.2 \
    --dec_dropout 0.2  \
    --train_steps 200000 \
    --save_checkpoint_steps 2000 \
    --warmup_steps_bert 20000 \
    --warmup_steps_dec 10000 \
    --batch_size 140 \
    --accum_count 5 \
    --report_every 50 \
    --max_pos 512 \
    --use_bert_emb true \
    --use_interval true \
    --visible_gpus $VISIBLE_GPUS \
    --distributed_port 10008

Hi, I'm trying to run BertAbs as well, but this repository only provides 4 trained models. So I'm wondering what you put in MODEL_PATH (--model_path output/run2/model_full_align): where did you download model_full_align?

thanks!

fseasy commented 4 years ago

@matt9704 It's my own training result~ The MODEL_PATH should be the output, not the input.

matt9704 commented 4 years ago

Thanks! So if I want to finetune a BertAbs model, I can just set MODEL_PATH to the path of an empty folder (to save the finetuned model). Is that right?

fseasy commented 4 years ago

Maybe not. If you want to continue training from the pretrained BertAbs, you need to set the train_from argument, defined here:

https://github.com/nlpyang/PreSumm/blob/70b810e0f06d179022958dd35c1a3385fe87f28c/src/train.py#L107

and the loading logic is here:

https://github.com/nlpyang/PreSumm/blob/70b810e0f06d179022958dd35c1a3385fe87f28c/src/train_abstractive.py#L290-L293

I think it's better to read the code if you want to know more than the README covers ;) @matt9704

matt9704 commented 4 years ago

Thanks, I'm just wondering where I can find the pretrained BertAbs, so that I can load it in and finetune it on my dataset...

nlpyang / PreSumm

I'm encountering hanging when Running BertAbs training as the given script params && can't reproduce the paper ROUGE #135