huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Fix overflow in nanosets with big datasets #182

Open jquesnelle opened 5 months ago

jquesnelle commented 5 months ago

When a nanoset is particularly big (>4 GB), the calculation of offset (the actual location within the memmap) can overflow. The issue is with the line

```python
offset = dataset_sample * self.sequence_length * (np.iinfo(self.token_dtype).bits / 8)
```

Here, dataset_sample is a NumPy unsigned integer, and the multiplication can overflow because NumPy integer arithmetic is fixed-width (32 bits in this case) and wraps around silently. The fix is to promote everything to a native Python int first, since Python ints have arbitrary precision and cannot overflow.