Open robinsongh381 opened 5 years ago
I assume you are working on abstractive summarization.
First :
Second :
Also, why did you set `use_bert_embed` to False?
If you are not sharing the weights, the decoder part is not using the sentencepiece embeddings that you trained on a large Korean corpus.
It is trained from scratch, which might impact the performance.
Did you look into the generated sentences? Are they all the same but clear and correct sentences, or are they garbage, like the same token repeated over and over?
@Colanim Thank you for your reply.
As for the answers to your interesting questions:
However, even with 1.5K data and 50K steps, the model achieved "good" accuracy and "low" ppl, which would indicate that the model trained alright with 1.5K examples.
Is this an incorrect statement?
Therefore, I assumed that using the BERT embeddings would not be very beneficial.
By the way, when you said
"If you are not sharing the weights, the decoder part is not using the sentencepiece embeddings that you trained on a large Korean corpus",
you might be confusing BERT and sentencepiece.
I don't think there are any parameters within the sentencepiece model which I trained on the Korean corpus, whereas BERT obviously does have parameters for its embedding layers.
Please let me know if there is a way of "using sentencepiece embedding values".
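Just to make sure we mean the same thing, here is a minimal sketch of my understanding, assuming PyTorch (the sizes are made up): the sentencepiece model only maps text to token ids, the embedding matrix is an ordinary model parameter, and "sharing" would mean tying the decoder embedding to the encoder one.

```python
import torch.nn as nn

# A sentencepiece model only maps text <-> token ids and carries no
# trainable weights; the embedding matrices live inside the network.
vocab_size = 32000   # hypothetical sentencepiece vocab size
enc_embeddings = nn.Embedding(vocab_size, 768, padding_idx=0)
dec_embeddings = nn.Embedding(vocab_size, 768, padding_idx=0)

# "Sharing the weights" = tying the two embedding tables together:
dec_embeddings.weight = enc_embeddings.weight  # same Parameter object
```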
Thanks
Indeed, I might be confusing BERT and sentencepiece: I never used anything other than the tokenizer provided by BERT's authors ^^
The model can achieve good results on small datasets by memorization. If your goal is just to train the model for debugging, then 1.5K examples will indeed be enough to ensure the model is learning something. But it will have really bad generalization capabilities on unseen data.
Did you look at the predictions?
Alright, fair enough.
Do you mean the predictions for the training data or the validation data?
I think it would be great to print the predictions for the validation data during training, but is this possible with the given code?
It's not possible as-is, but you can easily modify the code to display them at test time, during beam search.
Insert these lines:

```python
print("Gold: {}".format(gold_sent))
print("Pred: {}".format(pred_sents))
input()  # pause until Enter so you can inspect each example
```
Right after this line: https://github.com/nlpyang/PreSumm/blob/ba17e95de8cde9d5ddaeeba01df7cace584511b2/src/models/predictor.py#L109
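If pausing with input() after every example is inconvenient, an alternative is to append the pairs to a file instead. This is just a sketch, assuming gold_sent and pred_sents are available at that point exactly as in the snippet above:

```python
# Alternative to the input() pause: log gold/pred pairs to a file.
# Assumes gold_sent is a string and pred_sents is a string or a list
# of strings, as used in the print statements above.
with open("debug_predictions.txt", "a", encoding="utf-8") as f:
    f.write("Gold: {}\n".format(gold_sent))
    f.write("Pred: {}\n\n".format(pred_sents))
```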
Thank you very much! Will try and see what happens :+1:
Hang on @Colanim
print("Gold: {}".format(gold_sent)) print("Pred : {}".format(pred_sents)) input()
This addition of code seems to print predictions during TEST or VALIDATION not TRAINING ?!
The repetition problem has not been solved.
Any comments would be highly appreciated
Yes, this prints predictions during test only.
It's the easiest way to see the predictions. If you want to see them during training/validation, you need to change more code.
It's possible to do, but cumbersome and not accurate: during training and validation, teacher forcing is used, so it does not reflect the exact behavior at inference.
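As for the repetition itself, a common mitigation (independent of this repo) is to penalize or drop beam-search hypotheses that repeat an n-gram. This is only a minimal sketch of the idea, not PreSumm's actual code; `candidate` is a hypothetical list of token ids for one beam:

```python
def repeats_trigram(candidate):
    """Return True if the last trigram of `candidate` already occurred earlier.

    `candidate` is a hypothetical list of token ids for one beam; during
    beam search, hypotheses for which this returns True can be given a very
    low score so they are effectively pruned, which suppresses loops like
    the same token or phrase being generated over and over.
    """
    if len(candidate) < 4:
        return False
    last = tuple(candidate[-3:])
    earlier = {tuple(candidate[i:i + 3]) for i in range(len(candidate) - 3)}
    return last in earlier
```

Something along these lines can be plugged into the beam-search code in predictor.py.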
I am trying to build an abstractive PreSumm model for Korean.
At the beginning, I used the bert-multilingual model, but I found its tokenizer was poor, so I decided to use a sentencepiece model trained on a large Korean corpus. Therefore, the code related to tokenizing and converting tokens to ids (and ids back to tokens) has changed.
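Roughly, the tokenizer swap looks like the sketch below (not my exact code; file names and settings are placeholders), using the standard sentencepiece Python API:

```python
import sentencepiece as spm

# Train a sentencepiece model on a large Korean corpus (placeholder paths/settings).
spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",
    model_prefix="ko_sp",
    vocab_size=32000,
    character_coverage=0.9995,  # high character coverage for Korean
)

# Then the BERT tokenizer calls are replaced with sentencepiece ones:
sp = spm.SentencePieceProcessor(model_file="ko_sp.model")
token_ids = sp.encode("요약할 한국어 기사 본문", out_type=int)  # text -> ids
text = sp.decode(token_ids)                                      # ids -> text
```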
I did not change the embedding parts, since I passed False to the `use_bert_embed` argument.
During training (with ~1,500 articles and 50,000 steps), the accuracy and ppl gradually converged to around 40 and 20 respectively. These are the values I obtained when I trained the original PreSumm model on the CNN/DM dataset, so I thought the model had trained reasonably well.
However, when it comes to the evaluation step, all the generated output sentences (three sentences) are identical, regardless of the input. This conflicts with the fact that the ppl was reasonably low, which could not have been the case if the model had been producing the same output sentence during training.
Possible reasons, I suspect, would be:
(1) The model actually trained well, but an untrained model was (somehow) loaded during evaluation. I think this is unlikely, since I didn't change this part of the code (a quick check is sketched below).
(2) Poor token embeddings from the bert-multilingual model.
(3) Not enough training data.
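For suspicion (1), a quick sanity check would be to compare a few weight statistics of the checkpoint on disk with the model used at evaluation time. This is only a sketch; the checkpoint path is a placeholder, and I believe PreSumm stores the weights under a 'model' key, but the fallback below covers the other case:

```python
import torch

# Load the checkpoint that evaluation is supposed to use (placeholder path).
ckpt = torch.load("model_step_50000.pt", map_location="cpu")
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

# Print simple statistics for a few tensors; if the same layers inside the
# evaluation script show different values, the wrong (or untrained)
# weights were loaded.
for name, tensor in list(state.items())[:5]:
    print(name, float(tensor.float().abs().mean()))
```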
Any advice would be helpful