Add support for large files. (> 2GB)

joshuaehill commented 1 year ago

This is an update of PR #217 that I had to recreate because git automatically closed the prior PR when I forced my testing repository to be synchronized with the NIST repository. (I'm sure that I messed it up somehow.)

These are the same "large file support" changes as in PR #217.

Add support for datasets with a large number of samples (> 2GS).
- Requires divsufsort64 (generally included in the relevant package).
- Resolved initialization error in ea_conditioning.

This is mainly useful because it allows more samples of wide data to be assessed; the limit is hit most often in the bitstring tests. For 8-bit data, the prior code would fail when the dataset was larger than about 256MB. This commonly matters when doing the statistical assessment of non-vetted conditioning functions.
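To make the 256MB figure concrete, here is a minimal sketch of the arithmetic. It assumes the limit comes from indexing the expanded bitstring with a signed 32-bit integer; that reading, and the helper names `bitstring_length` and `fits_in_int32_index`, are illustrative assumptions and are not taken from the tool's source.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: length in bits of the bitstring obtained by
// expanding `samples` samples of `bits_per_sample` bits each.
std::uint64_t bitstring_length(std::uint64_t samples, unsigned bits_per_sample) {
    assert(bits_per_sample >= 1 && bits_per_sample <= 64);
    return samples * bits_per_sample;
}

// Hypothetical helper: true when `length` can still be represented by a
// signed 32-bit index (assumed here to be the old code's index type).
bool fits_in_int32_index(std::uint64_t length) {
    return length <=
           static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}
```

Under that assumption, 256 MiB of 8-bit samples expands to 2^28 * 8 = 2^31 bits, which is one past the largest signed 32-bit value, so `fits_in_int32_index(bitstring_length(256ULL << 20, 8))` is false.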

Both the -ldivsufsort64 and -ldivsufsort libraries are required when linking. The tool opportunistically uses the 32-bit-index version of the library when it can, because that version (which requires 13n bytes of memory for an n-symbol string) uses approximately half the memory of the 64-bit-index version (which requires 25n bytes for the same task).
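The selection described above can be sketched roughly as follows. This is only an illustration: the cutoff (the largest signed 32-bit value) and the names `IndexWidth`, `choose_index_width`, and `estimated_bytes` are assumptions, and the actual code may choose differently. The 13n and 25n figures are the ones quoted above.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical tag for which index width to use.
enum class IndexWidth { Bits32, Bits64 };

// Prefer the 32-bit-index path whenever the string length n fits in a
// signed 32-bit index (assumed cutoff); otherwise fall back to 64-bit.
IndexWidth choose_index_width(std::uint64_t n) {
    const auto max32 =
        static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
    return n <= max32 ? IndexWidth::Bits32 : IndexWidth::Bits64;
}

// Rough memory estimate in bytes for an n-symbol string, using the
// 13n (32-bit index) and 25n (64-bit index) figures from the description.
std::uint64_t estimated_bytes(std::uint64_t n) {
    return choose_index_width(n) == IndexWidth::Bits32 ? 13 * n : 25 * n;
}
```

With these assumptions, a string of 10^9 symbols stays on the 32-bit path at about 13 GB, while the same string forced onto the 64-bit path would need about 25 GB, which is why falling back only when necessary is worthwhile.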

celic commented 1 year ago

The plan is to merge this in after we can do some testing locally on the changes. This changes the build process slightly. We use the GitHub code for our testing on ESVTS, though we won't be using the large file support on that platform.

joshuaehill commented 1 year ago

The changes look large, but the bulk of them are exactly the same logic as the 32-bit index code, just using different types. (I suppose using C++ templates would be logically cleaner, but C++ templates make my skin crawl.) In any case, let me know if you have any questions.

usnistgov / SP800-90B_EntropyAssessment

Add support for large files. (> 2GB) #226