Open ShoubhikBanerjee opened 4 years ago
You can edit this file, and replace [unused] token with your own token, then reprocess the data.
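The [unused] replacement described above can be scripted. Below is a minimal sketch (the helper name and file paths are my own, not part of this repo) that swaps [unusedN] lines in a BERT-style vocab.txt for new tokens while keeping every line position, and therefore every token id, unchanged:

```python
import re

def replace_unused_tokens(vocab_path, new_tokens, out_path):
    """Replace [unusedN] placeholder lines in a BERT-style vocab.txt
    with new tokens, keeping line positions (= token ids) unchanged."""
    with open(vocab_path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    it = iter(new_tokens)
    for i, tok in enumerate(lines):
        if re.fullmatch(r"\[unused\d+\]", tok):
            try:
                lines[i] = next(it)
            except StopIteration:
                break  # all new tokens placed
    leftover = list(it)
    if leftover:
        raise ValueError(
            f"{len(leftover)} tokens did not fit into the [unused] slots")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Because the ids of the replaced slots do not change, the pretrained embedding matrix can be reused as-is; this is exactly why the approach breaks down once the new words outnumber the [unused] slots, which is the case discussed next.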
Many thanks for your prompt reply.
I thought of doing that, but what if my set of new words does not fit into the available [unused] tokens (up to [unused993], roughly 998 slots)? @yuyan2do
Is the number of new words larger than 998? Then it needs a code change to support it.
Yes, could you please help me with this? I would be very grateful.
Sure, I can do it early next week.
Okay, thanks a lot. I will be waiting.
@ShoubhikBanerjee I have committed a change to support appending to the vocab. You can give it a try.
In the console output, check that the word embedding count increases as expected. In the example below, I added 3 new tokens.
(embed_tokens): Embedding(30522, 1024, padding_idx=0)
(embed_tokens): Embedding(30525, 1024, padding_idx=0)
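The growth from 30522 to 30525 rows corresponds to copying the old embedding matrix and adding freshly initialised rows for the 3 new tokens. A minimal PyTorch sketch of that resize (the function name is illustrative; it is not the actual committed code):

```python
import torch

def append_vocab_rows(embedding: torch.nn.Embedding,
                      n_new: int) -> torch.nn.Embedding:
    """Grow an embedding table by n_new randomly initialised rows,
    copying the pretrained weights for the existing vocab ids."""
    old_num, dim = embedding.weight.shape
    new_emb = torch.nn.Embedding(old_num + n_new, dim,
                                 padding_idx=embedding.padding_idx)
    with torch.no_grad():
        new_emb.weight[:old_num] = embedding.weight  # keep old rows intact
    return new_emb
```

The new rows start from random initialisation, so the appended tokens only become useful after fine-tuning on data that contains them.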
Thanks a lot, I will try it and let you know.
Shoubhik, have you had time to try it?
Sorry for the late reply; I was busy with some other work.
My point is: while tokenizing with the BERT tokenizer, a word like "biocompatible" is tokenized into "bio ##com ##pati ##ble", so the actual word is already lost. Will adding "biocompatible" to vocab.txt work? I think not, because the word is no longer present there as a whole word.
So is there any workaround for this?
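For context, WordPiece greedily matches the longest vocab entry at each position, so a word that is present in the vocab as a whole is never split. A toy re-implementation (this is not the real BERT tokenizer, just a sketch of the greedy longest-match rule) illustrating the "biocompatible" example:

```python
def wordpiece(word, vocab):
    """Minimal greedy longest-match WordPiece, to illustrate why a word
    splits into subwords unless it appears in the vocab as a whole."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation-piece marker
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
        start = end
    return pieces

vocab = {"bio", "##com", "##pati", "##ble"}
print(wordpiece("biocompatible", vocab))
# -> ['bio', '##com', '##pati', '##ble']
print(wordpiece("biocompatible", vocab | {"biocompatible"}))
# -> ['biocompatible']
```

So once "biocompatible" is appended to vocab.txt as a whole word, the longest match at position 0 is the full word and the subword split no longer occurs.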
Hi @ShoubhikBanerjee
I am also working on this. The workaround is to add the flag --tokenizer nltk to your fairseq-preprocess command; this should solve your problem. I am now working on adapting the vocabulary for scientific articles, where many terms need to be added to the vocabulary. Let me know if you have found any solution for this.
For convenience, I am pasting the command here:
fairseq-preprocess --user-dir ./prophetnet --task translation_prophetnet --tokenizer nltk --source-lang src --target-lang tgt --trainpref cnndm/prophetnet_tokenized/train --validpref cnndm/prophetnet_tokenized/valid --testpref cnndm/prophetnet_tokenized/test --destdir cnndm/processed --srcdict ./vocab.txt --tgtdict ./vocab.txt --workers 20
Hi @yuyan2do, I tried this and fine-tuned on the Amazon Food Review dataset, and I noticed something strange: the previous version generated output as BPE-tokenized subwords, but the latest code fails to generate any output in some cases (it produces [UNK] tokens). Moreover, the output summary skips the extra words that I added to the custom vocab.txt.
Text => This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!
Original Summary => Wonderful, tasty taffy
Predicted Summary (Previous Version) => yu ##m yu ##m
Predicted Summary (Current Version) => [UNK] [UNK]
Current vocab.txt file (appended entries):
vitality 30522
jumbo 30523
salted 30524
taffy 30525
saltwater 30526
tasty 30527
twizzlers 30528
yummy 30529
oatmeals 30530
gastronomy 30531
holistic 30532
oatmeal 30533
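Appending entries like the ones above can be automated. A hedged sketch (the helper name and its return format are my own, not part of the repo) that appends only tokens not already present and reports the new id assigned to each, which should match the ids listed above given the bert-base vocab size of 30522:

```python
def append_to_vocab(vocab_path, new_tokens):
    """Append new whole-word tokens to the end of a vocab.txt so they
    receive ids starting at the current vocab size. Assumes the file
    ends with a trailing newline, as vocab.txt normally does."""
    with open(vocab_path, encoding="utf-8") as f:
        existing = f.read().splitlines()
    seen = set(existing)
    start_id = len(existing)
    fresh = [t for t in new_tokens if t not in seen]
    with open(vocab_path, "a", encoding="utf-8") as f:
        for t in fresh:
            f.write(t + "\n")
    return {t: start_id + i for i, t in enumerate(fresh)}
```

Skipping duplicates matters here: a token that already exists in the pretrained vocab (or is appended twice) would otherwise get two ids, which silently corrupts the mapping between vocab.txt lines and embedding rows.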
It seems quite strange to me; is something going wrong?
The strangest part is that it skips the custom (extra) words that were added to vocab.txt.
Hi, first of all, thank you for the awesome piece of work that you have shared.
I have fine-tuned ProphetNet for summarization on the Amazon Food Review dataset, and it works great.
I just wanted to know how we can update or use our own vocab.txt so that no words are actually missing, for example for scientific, medical, or other topic-oriented documents.
What steps need to be taken in such cases?
I am waiting for your reply.
Thank You.