microsoft / ProphetNet

A research project for natural language generation, containing the official implementations by MSRA NLC team.
MIT License

How to use custom/append to vocab.txt? #11

Open ShoubhikBanerjee opened 4 years ago

ShoubhikBanerjee commented 4 years ago

Hi, first of all, thank you for the awesome piece of work that you have shared.

I have fine-tuned ProphetNet for summarization on AmazonFoodReview dataset, it works awesome.

Just wanted to know how we can update or use our own vocab.txt so that no words are actually missing, for example with scientific, medical, or other topic-oriented documents.

What are the steps that need to be taken in such cases?

I am waiting for your reply.

Thank You.

yuyan2do commented 4 years ago

You can edit this file and replace the [unused] tokens with your own tokens, then reprocess the data.
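A minimal sketch of that edit, assuming a BERT-style vocab.txt with one token per line (the file path and replacement tokens here are placeholders):

```python
# Hedged sketch: overwrite [unusedN] placeholder lines in vocab.txt with
# domain-specific tokens, keeping every token's index unchanged.
new_tokens = ["biocompatible", "gastronomy"]  # placeholder examples

with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

pending = iter(new_tokens)
for i, entry in enumerate(vocab):
    if entry.startswith("[unused"):
        nxt = next(pending, None)
        if nxt is None:
            break
        vocab[i] = nxt  # reuse the placeholder slot in place

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")
```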

ShoubhikBanerjee commented 4 years ago

Thanks a lot for your prompt reply.

I thought of doing that, but what if my set of new words does not fit into the 998 [unused] slots? @yuyan2do

yuyan2do commented 4 years ago

Is the number of new words larger than 998? Then it needs some code changes to support that.

ShoubhikBanerjee commented 4 years ago

Yes, can you please help me out with this? I would be very grateful.

yuyan2do commented 4 years ago

Sure, I can do it early next week.

ShoubhikBanerjee commented 4 years ago

Okay, thanks a lot, I will be waiting.

yuyan2do commented 4 years ago

@ShoubhikBanerjee I have committed a change to support appending to the vocab. You can give it a try.

  1. Add new tokens at the end of this vocab file
  2. Reprocess the data using the new vocab
  3. Start training
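A minimal sketch of step 1, assuming a vocab file with one token (optionally followed by a count) per line; the token list is a placeholder:

```python
# Hedged sketch: append new whole-word tokens at the end of vocab.txt,
# so the ids of all existing tokens stay unchanged.
new_tokens = ["vitality", "taffy", "saltwater"]  # placeholder examples

with open("vocab.txt", encoding="utf-8") as f:
    existing = {line.split()[0] for line in f if line.strip()}

with open("vocab.txt", "a", encoding="utf-8") as f:
    for tok in new_tokens:
        if tok not in existing:  # skip tokens already in the vocab
            f.write(tok + "\n")
```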

In the console output, check that the word embedding size increases as expected. In the example below, I added 3 new tokens.

Before: (embed_tokens): Embedding(30522, 1024, padding_idx=0)
After:  (embed_tokens): Embedding(30525, 1024, padding_idx=0)
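For reference, a hedged plain-PyTorch sketch (not the repo's actual loading code) of what that size change amounts to:

```python
import torch
import torch.nn as nn

# Hedged sketch: grow a 30522-row pretrained embedding to 30525 rows so
# the three appended tokens get rows of their own.
old = nn.Embedding(30522, 1024, padding_idx=0)  # pretrained size
new = nn.Embedding(30525, 1024, padding_idx=0)  # size after appending
with torch.no_grad():
    new.weight[:30522].copy_(old.weight)  # keep pretrained rows intact
# rows 30522-30524 keep their fresh initialization and are learned
# during fine-tuning
```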

ShoubhikBanerjee commented 4 years ago

Thanks a lot, I will try it and let you know.

yuyan2do commented 4 years ago

Shoubhik, have you had time to try it?

ShoubhikBanerjee commented 4 years ago

Sorry for being so late; I was engaged with some other work.

My point is: when tokenizing with the BERT tokenizer, a word like "biocompatible" is tokenized into "bio ##com ##pati ##ble", so the actual word is already lost. Will adding "biocompatible" to vocab.txt still work? I think not, because the word is no longer present there as a whole word.

So is there any workaround for this?
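To illustrate the behavior being described, a hedged sketch using the Hugging Face BertTokenizer (not part of this repo, shown only for demonstration):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
# WordPiece splits unknown whole words into subword pieces,
# e.g. bio ##com ##pati ##ble as described above
print(tok.tokenize("biocompatible"))

tok.add_tokens(["biocompatible"])  # register as a whole-word token
print(tok.tokenize("biocompatible"))  # now kept whole: ['biocompatible']
```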

farooqzaman1 commented 3 years ago

Hi @ShoubhikBanerjee, I am also working on this. The workaround is to add the flag --tokenizer nltk to your fairseq-preprocess command; this should solve your problem. I am now working on adapting the vocabulary for scientific articles, and there are many terms that need to be added to the vocabulary. Let me know if you have found any solution for this.
For convenience, I am pasting the command here:

fairseq-preprocess --user-dir ./prophetnet --task translation_prophetnet --tokenizer nltk --source-lang src --target-lang tgt --trainpref cnndm/prophetnet_tokenized/train --validpref cnndm/prophetnet_tokenized/valid --testpref cnndm/prophetnet_tokenized/test --destdir cnndm/processed --srcdict ./vocab.txt --tgtdict ./vocab.txt --workers 20
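For context, a hedged illustration (calling NLTK directly, outside fairseq) of why the nltk tokenizer keeps whole words intact:

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer model used by word_tokenize
# Whole words survive, so domain terms can be looked up in vocab.txt as-is
print(nltk.word_tokenize("A biocompatible saltwater taffy."))
# ['A', 'biocompatible', 'saltwater', 'taffy', '.']
```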

ShoubhikBanerjee commented 3 years ago

Hi @yuyan2do, I tried this and fine-tuned on the Amazon Food Review dataset, and found something strange: while the previous version generated output as BPE-tokenized subwords, the latest code fails to generate real output (in some cases giving [UNK] tokens). Moreover, the output summary is skipping the extra words that I added to the custom vocab.txt.

Text => This taffy is so good. It is very soft and chewy. The flavors are amazing. I would definitely recommend you buying it. Very satisfying!!
Original Summary => Wonderful, tasty taffy
Predicted Summary (Previous Version) => yu ##m yu ##m
Predicted Summary (Current Version) => [UNK] [UNK]

Current vocab.txt file (tail):

: 30519
? 30520
~ 30521
vitality 30522
jumbo 30523
salted 30524
taffy 30525
saltwater 30526
tasty 30527
twizzlers 30528
yummy 30529
oatmeals 30530
gastronomy 30531
holistic 30532
oatmeal 30533

It seems quite strange to me; is something going wrong?

The strangest part is that it skips the custom (extra) words that were added to vocab.txt.
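One hedged way to debug this is to check whether the appended tokens actually resolve to their own ids in the processed dictionary; the path below follows the --destdir from the command above and is an assumption:

```python
from fairseq.data import Dictionary

# Hedged debugging sketch: tokens that map to the unk id were not picked
# up during preprocessing and will surface as [UNK] at generation time.
d = Dictionary.load("cnndm/processed/dict.src.txt")
for tok in ["taffy", "tasty", "yummy"]:
    idx = d.index(tok)
    print(tok, idx, "(unk)" if idx == d.unk() else "")
```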