google-research / deduplicate-text-datasets

Apache License 2.0
1.12k stars, 111 forks

Question: Upper Bound #44

Closed bezir closed 6 months ago

bezir commented 6 months ago

"If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited)."

Just out of curiosity, what is the reason for this limit?

carlini commented 6 months ago

We represent pointers into the dataset as 64-bit unsigned integers, so any one suffix array can't have more than 2^64-1 bytes.
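
To make the arithmetic concrete, here is a minimal Python sketch of where the 2^64-1 figure comes from. It is only an illustration, not the repository's code: the names `POINTER_BYTES`, `MAX_OFFSET`, and `fits_in_pointer` are hypothetical, and the one assumption carried over from the answer above is that each pointer is a single unsigned 64-bit integer.

```python
# Assumption: each pointer into the dataset is one unsigned 64-bit integer.
POINTER_BYTES = 8

# Largest value an unsigned integer of that width can hold.
MAX_OFFSET = 2 ** (8 * POINTER_BYTES) - 1


def fits_in_pointer(offset: int) -> bool:
    """Return True if `offset` can be stored in POINTER_BYTES unsigned bytes."""
    try:
        # int.to_bytes raises OverflowError when the value is negative
        # or does not fit in the requested number of bytes.
        offset.to_bytes(POINTER_BYTES, "little", signed=False)
        return True
    except OverflowError:
        return False


if __name__ == "__main__":
    print(MAX_OFFSET)
    print(fits_in_pointer(MAX_OFFSET))
    print(fits_in_pointer(MAX_OFFSET + 1))
    # 2**60 bytes is one exbibyte (EiB), so this shows the limit in EiB.
    print((MAX_OFFSET + 1) // 2 ** 60)
```

The last line works out to 16 EiB, which is why the limit is described as essentially unlimited for any dataset that fits on one machine.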