mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0

Add bin packing collator wrapper in denoising #268

Closed: alextrott16 closed this 1 year ago

alextrott16 commented 1 year ago

Provides a way to get concatenation and "sequence_id" support from raw text data in denoising.py.
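For readers unfamiliar with the idea, here is a rough sketch of what bin packing with "sequence_id" could look like. It is not the code in this PR: the function name `pack_examples`, the first-fit-decreasing strategy, and the use of `-1` as the "sequence_id" for padding are all assumptions made for illustration.

```python
from typing import Dict, List


def pack_examples(
    examples: List[List[int]],
    max_seq_len: int,
    pad_id: int = 0,
) -> Dict[str, List[List[int]]]:
    """Pack short token lists into rows of length max_seq_len.

    Hypothetical helper: uses first-fit-decreasing bin packing and marks
    padding positions with sequence_id == -1 (an assumption, not the PR's
    confirmed convention).
    """
    bins: List[List[List[int]]] = []
    # Place longer examples first so shorter ones can fill the gaps.
    for ex in sorted(examples, key=len, reverse=True):
        if len(ex) > max_seq_len:
            raise ValueError("example is longer than max_seq_len")
        for b in bins:
            if sum(len(e) for e in b) + len(ex) <= max_seq_len:
                b.append(ex)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([ex])

    input_ids: List[List[int]] = []
    sequence_id: List[List[int]] = []
    for b in bins:
        ids: List[int] = []
        seq: List[int] = []
        for i, ex in enumerate(b):
            ids.extend(ex)
            # Every token of the i-th example in this row gets id i.
            seq.extend([i] * len(ex))
        n_pad = max_seq_len - len(ids)
        ids.extend([pad_id] * n_pad)
        seq.extend([-1] * n_pad)
        input_ids.append(ids)
        sequence_id.append(seq)
    return {"input_ids": input_ids, "sequence_id": sequence_id}


packed = pack_examples([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]], max_seq_len=6)
```

A model that consumes "sequence_id" can use it to keep attention from crossing the boundary between two examples that share a row.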

Note: this PR also folds in a minor feature addition to the mixture-of-denoisers collator: an option that controls whether an EOS tag is appended to the end of the context (i.e., the "prefix"). This option is off by default.
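As a minimal illustration of the EOS option, assuming a hypothetical helper `build_context` and a hypothetical flag name `append_eos` (neither is taken from the PR):

```python
from typing import List


def build_context(
    context_ids: List[int],
    eos_id: int,
    append_eos: bool = False,  # hypothetical flag; off by default, as in the PR description
) -> List[int]:
    """Return the context ("prefix") tokens, optionally ending with EOS."""
    out = list(context_ids)  # copy so the caller's list is not mutated
    if append_eos:
        out.append(eos_id)
    return out


default_ctx = build_context([5, 6, 7], eos_id=2)
eos_ctx = build_context([5, 6, 7], eos_id=2, append_eos=True)
```

With the flag off, the context is passed through unchanged, so existing configurations keep their current behavior.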

alextrott16 commented 1 year ago

@abhi-mosaic Thanks! And, yeah, I've manually tested this and confirmed that it behaves as expected. The dataloader tests turn this feature on for both raw and pretokenized text, and it doesn't fail any of the asserts. I just pushed another commit that improves the printout for sanity checking and my sanity is confirmed intact :)