hkust-nlp / deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]

Apache License 2.0

458 stars, 28 forks

The length of samples #26

Closed Ber666 closed 3 days ago

Ber666 commented 3 months ago

It seems each sample in the Deita dataset consists of many turns and is very long (>10k tokens). Your paper mentions that the max input length for SFT is 2048. Does that mean most of the text in each training sample is truncated and discarded?

VPeterV commented 3 months ago

Hi. Yes, for samples longer than model_max_length, we simply truncate them to 2048 tokens.
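
To make the effect concrete, here is a minimal sketch of this kind of truncation. It is not taken from the Deita training code: the helper names `truncate_sample` and `discarded_fraction` are hypothetical, the sample is a stand-in list of integer token IDs rather than real tokenizer output, and it assumes truncation keeps the beginning of the sample and drops the tail.

```python
from typing import List

MODEL_MAX_LENGTH = 2048  # value quoted in this thread


def truncate_sample(token_ids: List[int], max_length: int = MODEL_MAX_LENGTH) -> List[int]:
    """Keep the first max_length tokens and drop the rest (assumed head truncation)."""
    return token_ids[:max_length]


def discarded_fraction(token_ids: List[int], max_length: int = MODEL_MAX_LENGTH) -> float:
    """Fraction of a sample's tokens that truncation throws away."""
    if not token_ids:
        return 0.0
    return max(len(token_ids) - max_length, 0) / len(token_ids)


# Stand-in for a tokenized multi-turn sample of 10,000 tokens.
sample = list(range(10_000))
kept = truncate_sample(sample)

print(len(kept))                   # 2048
print(discarded_fraction(sample))  # 0.7952
```

Under these assumptions, a 10k-token sample loses roughly 80% of its tokens, so the later turns of a long conversation would not contribute to training.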