huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl

ConstantLengthDataset should shuffle the order of samples before packing #2030

Closed: muupan closed this issue 2 months ago

muupan commented 2 months ago

Feature request

When packing=True, SFTTrainer wraps the given dataset in a ConstantLengthDataset. ConstantLengthDataset's shuffle argument defaults to True, so the order of the constant-length tensors it yields is randomized. However, because this shuffling happens after samples are packed into tensors, the order of samples within each tensor is never randomized.

Below is the code to check the current behavior (trl==0.10.1):

from datasets import Dataset
from trl.trainer import ConstantLengthDataset

# Dataset with "000", "111", ..., "777"
def gen():
    for i in range(8):
        yield {"text": f"{i}" * 3}
dataset = Dataset.from_generator(gen)

class FakeTokenizer:
    """Fake tokenizer that converts "000" to [0, 0, 0], etc. The EOS token id is 8."""

    def __init__(self):
        self.eos_token_id = 8

    def __call__(self, texts, **kwargs):
        return {"input_ids": [[int(x) for x in text] for text in texts]}

packed_dataset = ConstantLengthDataset(
    tokenizer=FakeTokenizer(),
    dataset=dataset,
    dataset_text_field="text",
    seq_length=7,
    infinite=False,
    chars_per_token=1,
    num_of_sequences=100,
    shuffle=True,
    append_concat_token=True,
    add_special_tokens=True,
)
print("First epoch")
for x in packed_dataset:
    print(x)
print("Second epoch")
for x in packed_dataset:
    print(x)

Output:

First epoch
{'input_ids': tensor([0, 0, 0, 8, 1, 1, 1]), 'labels': tensor([0, 0, 0, 8, 1, 1, 1])}
{'input_ids': tensor([8, 2, 2, 2, 8, 3, 3]), 'labels': tensor([8, 2, 2, 2, 8, 3, 3])}
{'input_ids': tensor([5, 5, 8, 6, 6, 6, 8]), 'labels': tensor([5, 5, 8, 6, 6, 6, 8])}
{'input_ids': tensor([3, 8, 4, 4, 4, 8, 5]), 'labels': tensor([3, 8, 4, 4, 4, 8, 5])}
Second epoch
{'input_ids': tensor([3, 8, 4, 4, 4, 8, 5]), 'labels': tensor([3, 8, 4, 4, 4, 8, 5])}
{'input_ids': tensor([8, 2, 2, 2, 8, 3, 3]), 'labels': tensor([8, 2, 2, 2, 8, 3, 3])}
{'input_ids': tensor([0, 0, 0, 8, 1, 1, 1]), 'labels': tensor([0, 0, 0, 8, 1, 1, 1])}
{'input_ids': tensor([5, 5, 8, 6, 6, 6, 8]), 'labels': tensor([5, 5, 8, 6, 6, 6, 8])}

You can see that the order of samples inside each tensor is always the same, and that the last sample, "777", is always omitted because there are not enough tokens left to fill a fifth tensor.

To fix these issues, I think it is better to modify the logic inside ConstantLengthDataset so that samples are shuffled before they are packed into tensors, as sketched below.
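
For illustration, here is a minimal sketch of the proposed ordering, written as a standalone helper rather than as the actual ConstantLengthDataset internals (pack_with_shuffle is a hypothetical name used only for this example): samples are shuffled before concatenation, so both the in-tensor order and which trailing tokens get dropped vary between epochs.

import random

# Sketch of the proposed behavior: shuffle samples *before* packing them
# into constant-length chunks. Hypothetical helper, not trl code.
def pack_with_shuffle(tokenized_samples, seq_length, concat_token_id, shuffle=True):
    samples = list(tokenized_samples)
    if shuffle:
        # Shuffling here randomizes which samples end up next to each
        # other inside every constant-length tensor.
        random.shuffle(samples)
    # Concatenate all samples, separated by the concat (EOS) token.
    all_token_ids = []
    for ids in samples:
        all_token_ids.extend(ids + [concat_token_id])
    # Split the stream into constant-length chunks, dropping the
    # incomplete remainder (as ConstantLengthDataset already does).
    return [
        all_token_ids[i : i + seq_length]
        for i in range(0, len(all_token_ids) - seq_length + 1, seq_length)
    ]

# Toy data matching the repro above: [0, 0, 0], [1, 1, 1], ..., [7, 7, 7].
samples = [[i] * 3 for i in range(8)]
for epoch in range(2):
    print(f"Epoch {epoch}:", pack_with_shuffle(samples, seq_length=7, concat_token_id=8))

With this ordering, "777" is no longer deterministically the sample that gets dropped; which tokens fall into the incomplete final chunk changes with every shuffle.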

Motivation

I think these issues should be addressed because the order of samples within each packed tensor never changes across epochs, which weakens the randomization that shuffling is supposed to provide, and because the same trailing samples are dropped every epoch, so part of the dataset is never used for training.

Your contribution

I can send a pull request that modifies ConstantLengthDataset so that it shuffles the buffer of samples before packing, instead of shuffling the already-packed examples.

qgallouedec commented 2 months ago

Thank you very much @muupan. I love receiving such clearly explained issues.

I agree with you. Feel free to open a PR to implement this change.

muupan commented 2 months ago

@qgallouedec Thanks, I'll open a PR soon.