tagoyal / factuality-datasets


Reproduce generation results #4

Closed: zerocstaker closed this issue 3 years ago

zerocstaker commented 3 years ago

Hi, first of all thank you for the great work!

I would like to reproduce the generation results (running BART with non-factual tokens removed), but I don't think the code for this part is provided.

Here are my questions:

  1. For the output_ids that are not considered, is the masking applied before tokenization and BPE?
  2. For calculating the loss, I am assuming you are using cross-entropy? How did you implement the masking (perhaps by manually creating a mask M and doing the reduction yourself, as in the sketch below)?
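
To make question 2 concrete, here is roughly the manual approach I have in mind (a sketch only; the function and tensor names are mine, not from the repo):

```python
import torch
import torch.nn.functional as F

def masked_cross_entropy(logits, labels, mask):
    # logits: (batch, seq_len, vocab), labels: (batch, seq_len) long,
    # mask: (batch, seq_len) float, 1.0 where the token counts toward the loss
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        reduction="none",
    ).view_as(mask)
    # Zero out masked positions, then average over the kept tokens only
    return (per_token * mask).sum() / mask.sum()
```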

Of course, if you could provide the code to run all three models, it would be even better!

tagoyal commented 3 years ago

Hi, I will update the code for this in the next couple of days.

  1. The output ids are provided at the pre-tokenization (word) level. We convert this word-level masking into BPE token-level masking in the code. I will include this in the repo.
  2. I'll provide the code. Essentially, we replace those specific BPE positions in the lm_labels (used to compute the cross-entropy loss in the transformers library) with an index that the CE loss ignores, by setting ignore_index to that value. It sounds confusing, but there is a rough sketch below this list, and I will include it in the code.
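
Something like this simplified sketch of both steps (untested here; the helper and variable names are illustrative, not the repo's actual code):

```python
import torch
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
IGNORE_INDEX = -100  # CrossEntropyLoss skips positions whose label equals ignore_index

def build_lm_labels(words, word_keep_mask):
    """Expand a word-level keep/drop mask to BPE level and build lm_labels,
    replacing the BPE pieces of dropped words with IGNORE_INDEX."""
    lm_labels = []
    for i, (word, keep) in enumerate(zip(words, word_keep_mask)):
        # Prefix a space so BART's byte-level BPE tokenizes non-initial words correctly
        text = word if i == 0 else " " + word
        ids = tokenizer.encode(text, add_special_tokens=False)
        lm_labels.extend(ids if keep else [IGNORE_INDEX] * len(ids))
    return torch.tensor(lm_labels)

loss_fct = torch.nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)
```

With ignore_index set this way, the loss is averaged only over the kept tokens, so no manual reduction is needed.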

Thanks, Tanya

zerocstaker commented 3 years ago

Hi Tanya, thank you so much for your response and your explanations! Looking forward to the code!

tagoyal commented 3 years ago

I've updated the code and README in the generation folder. Let me know if you face any issues!