In the readme, it says:
The OpenChat training system utilizes padding-free training and the Multipack Sampler, achieving a 3~10x speedup compared to the conventional padded training.
What does "padding-free" mean here? Do all sequences in a batch need to have the same length? If there is no padding, how are sequences of different lengths batched together?
Thanks!