Open jquesnelle opened 5 months ago
When a nanoset is particularly big (>4 GB), the calculation of offset (the actual location within the memmap) can overflow. The issue is with the line
offset
offset = dataset_sample * self.sequence_length * (np.iinfo(self.token_dtype).bits / 8)
Here, dataset_sample is a numpy "uint", and the calculation of offset can overflow since numpy's "uint" type is 32 bits. The solution is to promote everything to native Python int first, which has automatic overflow.
dataset_sample
"uint"
int
When a nanoset is particularly big (>4 GB), the calculation of
offset
(the actual location within the memmap) can overflow. The issue is with the lineoffset = dataset_sample * self.sequence_length * (np.iinfo(self.token_dtype).bits / 8)
Here,
dataset_sample
is a numpy"uint"
, and the calculation ofoffset
can overflow since numpy's"uint"
type is 32 bits. The solution is to promote everything to native Pythonint
first, which has automatic overflow.