huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Fix overflow in nanosets with big datasets #182

Open jquesnelle opened 5 months ago

jquesnelle commented 5 months ago

When a nanoset is particularly big (>4 GB), the calculation of offset (the actual location within the memmap) can overflow. The issue is with the line

```python
offset = dataset_sample * self.sequence_length * (np.iinfo(self.token_dtype).bits / 8)
```

Here, dataset_sample is a NumPy unsigned integer, and the multiplication can overflow because NumPy integer arithmetic is fixed-width (32 bits in this case) and wraps around silently. The fix is to promote everything to a native Python int first, since Python ints have arbitrary precision and cannot overflow.