nlpyang / PreSumm

code for EMNLP 2019 paper Text Summarization with Pretrained Encoders
MIT License
1.28k stars 465 forks

Multilingual same output sentence generated #69

Open robinsongh381 opened 4 years ago

robinsongh381 commented 4 years ago

I am trying to build an abstractive PreSumm model for Korean.

At the beginning, I used the bert-multilingual model, but I found its tokenizer was poor, so I decided to use a SentencePiece model trained on a large Korean corpus. Therefore, the code related to tokenizing and converting tokens to indices (and indices to tokens) has changed.
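For context, the tokenizer swap described above looks roughly like the following sketch; the korean.model file name is hypothetical, and the PreSumm preprocessing code would need to call helpers like these in place of the BERT tokenizer:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("korean.model")  # hypothetical SentencePiece model trained on the Korean corpus

def tokenize(text):
    return sp.EncodeAsPieces(text)            # text -> subword pieces

def tokens_to_ids(tokens):
    return [sp.PieceToId(t) for t in tokens]  # pieces -> vocab indices

def ids_to_tokens(ids):
    return [sp.IdToPiece(i) for i in ids]     # vocab indices -> pieces

print(tokens_to_ids(tokenize("한국어 요약 모델")))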

I did not change the embedding parts, since I passed False to the use_bert_embed argument.

During training (with ~1,500 articles and 50,000 steps), the accuracy and ppl gradually converged to around 40 and 20, respectively. These are roughly the values I obtained when training the original PreSumm model on the CNN/DM dataset, so I thought the model had trained reasonably well.

However, at the evaluation step, all the generated outputs (three sentences) are identical, regardless of the input. This conflicts with the reasonably low ppl, which I would not expect if the model had been producing the same output sentence during training.

Possible reasons, I suspect, would be:

(1) The model actually trained well, but somehow an untrained model was loaded during evaluation. I think this is unlikely, since I didn't change that part of the code (a quick checkpoint sanity check is sketched below).

(2) Poor token embeddings from bert-multilingual.

(3) Not enough training data.

Any advice would be helpful
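One quick way to rule out reason (1) is to compare a parameter from the saved checkpoint against the model object built for evaluation. A minimal sketch, assuming the checkpoint stores its state dict under a "model" key (an assumption, not verified against PreSumm's saver):

import torch

def check_checkpoint_loaded(model, ckpt_path):
    # compare one weight tensor from the saved checkpoint against the live model
    saved = torch.load(ckpt_path, map_location="cpu")["model"]  # assumed key
    name, saved_param = next(iter(saved.items()))
    loaded_param = model.state_dict()[name].cpu()
    print(name, "matches checkpoint:", torch.equal(saved_param, loaded_param))

Calling this right before decoding starts would immediately show whether the evaluation run is actually using the trained weights.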

astariul commented 4 years ago

I assume you are working on abstractive summarization.

First:

Second:

Also, why did you set use_bert_embed to False?
If you are not sharing the weights, the decoder part is not using the SentencePiece embeddings that you trained on the large Korean corpus.
It is trained from scratch, which might hurt performance (one way to initialize it from pretrained vectors is sketched below).
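As a hedged illustration (this is not PreSumm's actual code), the decoder's target embedding could be initialized from pretrained Korean vectors rather than from scratch; korean_vectors.npy is a hypothetical (vocab_size x hidden_size) matrix:

import numpy as np
import torch
import torch.nn as nn

# hypothetical pretrained vectors, e.g. from word2vec/fastText over the
# SentencePiece-tokenized Korean corpus
pretrained = torch.from_numpy(np.load("korean_vectors.npy")).float()
vocab_size, hidden_size = pretrained.shape

tgt_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
tgt_embeddings.weight.data.copy_(pretrained)
# optionally keep the embeddings frozen for the first training steps:
# tgt_embeddings.weight.requires_grad = False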


Did you look into the generated sentences? Are they all the same but clear, correct sentences, or are they garbage, like repeating the same token over and over?

robinsongh381 commented 4 years ago

@Colanim Thank you for the reply.

As for my answers to your interesting questions:

  1. Ok. I will increase the amount of data.

However, even with 1.5K articles and 50K steps, the model achieved "good" accuracy and "low" ppl, which would indicate that the model trained alright on 1.5K.

Is this an incorrect statement?

  2. I noticed that the bert-multilingual-base-uncased (or cased) model's tokenizer and its vocab list barely contain the tokens from my SentencePiece model.

Therefore, I assumed that using the bert_embed option would not be very beneficial (a rough vocab-overlap check is sketched below).
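For what it's worth, here is a rough sketch of how one could quantify that overlap (the SentencePiece file name is hypothetical; WordPiece marks continuations with "##" while SentencePiece marks word starts with "▁", so both markers are stripped before comparing):

import sentencepiece as spm
from transformers import BertTokenizer

# multilingual BERT WordPiece vocab, with the "##" continuation marker stripped
bert_vocab = {t.replace("##", "") for t in
              BertTokenizer.from_pretrained("bert-base-multilingual-cased").vocab}

sp = spm.SentencePieceProcessor()
sp.Load("korean.model")  # hypothetical SentencePiece model
sp_pieces = {sp.IdToPiece(i).replace("▁", "") for i in range(sp.GetPieceSize())}

overlap = len(bert_vocab & sp_pieces) / len(sp_pieces)
print("{:.1%} of SentencePiece pieces also appear in the BERT vocab".format(overlap))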

By the way, when you said

"If you are not sharing the weights, the decoder part is not using sentencepiece embeddings that you trained on large Korean corpus"

you might be confusing BERT with SentencePiece.

I don't think there are any parameters inside the SentencePiece model that I trained on the Korean corpus, whereas BERT obviously does have parameters for its embedding layers.

Please let me know if there is a way of "using sentencepiece embedding values" (one possible approach is sketched below).

Thanks
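For reference, since SentencePiece only stores the vocabulary and segmentation model (no embedding matrix), one hedged option is to train vectors over the SentencePiece-tokenized corpus and then copy them into the decoder embedding as in the earlier sketch. A minimal example with gensim (file names hypothetical):

import sentencepiece as spm
from gensim.models import Word2Vec

sp = spm.SentencePieceProcessor()
sp.Load("korean.model")  # hypothetical SentencePiece model

# tokenize the (hypothetical) Korean corpus into SentencePiece pieces
with open("korean_corpus.txt", encoding="utf-8") as f:
    sentences = [sp.EncodeAsPieces(line.strip()) for line in f]

# gensim 4.x uses vector_size=; older gensim versions use size= instead
w2v = Word2Vec(sentences=sentences, vector_size=768, window=5, min_count=1, workers=4)
w2v.wv.save_word2vec_format("korean_piece_vectors.txt")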

astariul commented 4 years ago

Indeed, I might be confused about BERT and SentencePiece: I never used anything other than the tokenizer provided by BERT's authors ^^

The model can achieve good results on a small dataset by memorization. If your goal is just to train the model for debugging, then 1.5k examples will indeed be enough to check that the model is learning something. But it will have really bad generalization capabilities (on unseen data).

Did you look at the predictions?

robinsongh381 commented 4 years ago

Alright, fair enough.

Do you mean the predictions for the training data or the validation data?

I think it would be great to print the predictions for the validation data during training, but is this possible with the given code?

astariul commented 4 years ago

It's not possible out of the box, but you can easily modify the code to display them at test time, during beam search.

Insert these lines:

print("Gold: {}".format(gold_sent))
print("Pred : {}".format(pred_sents))
input()

Right after this line: https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/models/predictor.py#L109

robinsongh381 commented 4 years ago

Thank you very much! Will try and see what happens :+1:

robinsongh381 commented 4 years ago

Hang on @Colanim

print("Gold: {}".format(gold_sent)) print("Pred : {}".format(pred_sents)) input()

This added code seems to print predictions during TEST or VALIDATION, not TRAINING?!

robinsongh381 commented 4 years ago

The repetition problem has not been solved.

Any comments would be highly appreciated

astariul commented 4 years ago

Yes, this prints predictions during test only.

It's the easiest way to see the predictions. If you want to see them during training/validation, you need to change more code.

It's possible to do, but cumbersome and not accurate: during training and validation, teacher forcing is used, so it does not reflect the exact behavior of inference.
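As a toy illustration of that last point (not PreSumm's code), with teacher forcing each decoding step is conditioned on the gold prefix, while at inference each step is conditioned on the model's own earlier outputs, so errors can compound:

def decode_step(prefix):
    # stand-in for a real decoder: deterministically maps a prefix of token ids to the next id
    return sum(prefix) % 100

gold = [5, 17, 42, 8, 23]

# teacher forcing (training/validation): step t is conditioned on the gold prefix gold[:t]
teacher_forced = [decode_step(gold[:t]) for t in range(1, len(gold))]

# free running (inference/beam search): step t is conditioned on the model's own prefix
free_running = [gold[0]]
for _ in range(1, len(gold)):
    free_running.append(decode_step(free_running))
free_running = free_running[1:]

print("teacher forcing:", teacher_forced)
print("free running:  ", free_running)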