Feature request
When packing=True, SFTTrainer wraps the given dataset in ConstantLengthDataset. ConstantLengthDataset's shuffle is set to True by default, so the order of the constant-length tensors it yields is randomized. However, because the randomization happens after samples are packed into tensors, the order of samples within each tensor is not randomized.
Below is code that reproduces the current behavior (trl==0.10.1):
from datasets import Dataset
from trl.trainer import ConstantLengthDataset


# Dataset with "000", "111", ..., "777"
def gen():
    for i in range(8):
        yield {"text": f"{i}" * 3}


dataset = Dataset.from_generator(gen)


class FakeTokenizer:
    # Tokenizer that just converts "000" to [0, 0, 0], etc. EOS token is 8.
    def __init__(self):
        self.eos_token_id = 8

    def __call__(self, texts, **kwargs):
        return {"input_ids": [[int(x) for x in text] for text in texts]}


packed_dataset = ConstantLengthDataset(
    tokenizer=FakeTokenizer(),
    dataset=dataset,
    dataset_text_field="text",
    seq_length=7,
    infinite=False,
    chars_per_token=1,
    num_of_sequences=100,
    shuffle=True,
    append_concat_token=True,
    add_special_tokens=True,
)

print("First epoch")
for x in packed_dataset:
    print(x)
print("Second epoch")
for x in packed_dataset:
    print(x)
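Output from a representative run (the four tensors are always these same four; only their order differs between epochs and across runs):

First epoch
{'input_ids': tensor([8, 2, 2, 2, 8, 3, 3]), 'labels': tensor([8, 2, 2, 2, 8, 3, 3])}
{'input_ids': tensor([0, 0, 0, 8, 1, 1, 1]), 'labels': tensor([0, 0, 0, 8, 1, 1, 1])}
{'input_ids': tensor([5, 5, 8, 6, 6, 6, 8]), 'labels': tensor([5, 5, 8, 6, 6, 6, 8])}
{'input_ids': tensor([3, 8, 4, 4, 4, 8, 5]), 'labels': tensor([3, 8, 4, 4, 4, 8, 5])}
Second epoch
{'input_ids': tensor([3, 8, 4, 4, 4, 8, 5]), 'labels': tensor([3, 8, 4, 4, 4, 8, 5])}
{'input_ids': tensor([0, 0, 0, 8, 1, 1, 1]), 'labels': tensor([0, 0, 0, 8, 1, 1, 1])}
{'input_ids': tensor([8, 2, 2, 2, 8, 3, 3]), 'labels': tensor([8, 2, 2, 2, 8, 3, 3])}
{'input_ids': tensor([5, 5, 8, 6, 6, 6, 8]), 'labels': tensor([5, 5, 8, 6, 6, 6, 8])}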
You can see that the order of samples inside each tensor is always the same, and that the last sample, "777", is always omitted: each sample tokenizes to 4 tokens (3 digits plus the EOS token 8), so the 8 samples yield 32 tokens, which is enough for only 4 full sequences of length 7; the trailing [7, 7, 7, 8] is dropped every epoch.
To fix these issues, I think it is better to modify the logic inside ConstantLengthDataset so that samples are shuffled before they are packed into tensors.
Motivation
I think these issues should be addressed because:
- when the way samples are packed is fixed across epochs, the model could overfit more easily, and
- it is a waste of data if some samples are never used for training.
Your contribution
I can send a pull request that modifies ConstantLengthDataset so that it shuffles buffer (the raw samples, before packing) instead of examples (the already-packed sequences), as sketched below.
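A minimal sketch of the intended change, paraphrasing the packing loop from trl 0.10.1's ConstantLengthDataset.__iter__ (variable names follow the library source; the actual patch may differ in detail):

import random

# Inside ConstantLengthDataset.__iter__, after `buffer` has been filled
# with raw texts from the dataset:
if self.shuffle:
    random.shuffle(buffer)  # proposed: shuffle raw samples BEFORE packing

tokenized_inputs = self.tokenizer(
    buffer, add_special_tokens=self.add_special_tokens, truncation=False
)["input_ids"]

all_token_ids = []
for tokenized_input in tokenized_inputs:
    if self.append_concat_token:
        tokenized_input = tokenized_input + [self.concat_token_id]
    all_token_ids.extend(tokenized_input)

examples = []
for i in range(0, len(all_token_ids), self.seq_length):
    input_ids = all_token_ids[i : i + self.seq_length]
    if len(input_ids) == self.seq_length:
        examples.append(input_ids)
# The existing random.shuffle(examples) call would be removed: with buffer
# shuffled up front, both the in-tensor order and the tensor order vary, and
# a different sample lands in the truncated tail each epoch, so no sample is
# permanently excluded from training.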