alextrott16 closed 1 year ago
@abhi-mosaic Thanks! And, yeah, I've manually tested this and confirmed that it behaves as expected. The dataloader tests turn this feature on for both raw and pretokenized text, and it doesn't fail any of the asserts. I just pushed another commit that improves the printout for sanity checking and my sanity is confirmed intact :)
Provides a way to get concatenation and "sequence_id" support from raw text data in denoising.py.

Note: this PR also folds in a minor feature addition to the mixture-of-denoisers collator, which controls whether an EOS tag is appended to the end of the context (i.e. "prefix"). This feature is off by default.
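For readers unfamiliar with the idea, here is a rough sketch of what concatenation with "sequence_id" tracking generally looks like. It is not the code in denoising.py: the function name `pack_with_sequence_id` and its arguments are hypothetical, and it assumes the examples are already tokenized into lists of integer IDs.

```python
from typing import Dict, Iterable, List


def pack_with_sequence_id(
    examples: Iterable[List[int]], max_seq_len: int
) -> List[Dict[str, List[int]]]:
    """Greedily concatenate tokenized examples into rows of at most max_seq_len.

    Hypothetical helper for illustration only. Each packed row carries a
    "sequence_id" list that marks which source example each token came from,
    so attention can later be restricted to tokens sharing the same ID.
    """
    packed: List[Dict[str, List[int]]] = []
    input_ids: List[int] = []
    sequence_id: List[int] = []
    next_id = 0

    for tokens in examples:
        tokens = tokens[:max_seq_len]  # truncate examples that cannot fit at all
        if not tokens:
            continue  # skip empty examples so IDs stay contiguous
        # Start a new row when the next example would overflow the current one.
        if input_ids and len(input_ids) + len(tokens) > max_seq_len:
            packed.append({"input_ids": input_ids, "sequence_id": sequence_id})
            input_ids, sequence_id, next_id = [], [], 0
        input_ids = input_ids + tokens
        sequence_id = sequence_id + [next_id] * len(tokens)
        next_id += 1

    if input_ids:
        packed.append({"input_ids": input_ids, "sequence_id": sequence_id})
    return packed


rows = pack_with_sequence_id([[1, 2], [3], [4, 5, 6]], max_seq_len=4)
for row in rows:
    print(row)
```

In this sketch the IDs restart at 0 for every packed row, which is one common convention; padding each row up to `max_seq_len` is left out to keep the example short.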