usnistgov / SP800-90B_EntropyAssessment

The SP800-90B_EntropyAssessment C++ package implements the min-entropy assessment methods included in Special Publication 800-90B.

Program crashes on large datasets; uninformative error #96

Closed: mslpensec closed this issue 5 years ago

mslpensec commented 5 years ago

I tried running the non-iid tests on a 10 million byte data set and got the following error:

Number of Symbols: 10000000
Number of Binary Symbols: 80000000

Symbol alphabet consists of 256 unique symbols

Running non-IID tests...

Running Most Common Value Estimate...
    Most Common Value Estimate (bit string) = 0.999398 / 1 bit(s)
    Most Common Value Estimate = 7.959992 / 8 bit(s)

Running Entropic Statistic Estimates (bit strings only)...
    Collision Test Estimate (bit string) = 0.969831 / 1 bit(s)
    Markov Test Estimate (bit string) = 0.999602 / 1 bit(s)
    Compression Test Estimate (bit string) = 0.759095 / 1 bit(s)

Running Tuple Estimates...
    T-Tuple Test Estimate (bit string) = 0.940461 / 1 bit(s)
    T-Tuple Test Estimate = 7.766197 / 8 bit(s)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Killed
  1. This error is not very informative. Ideally it would produce a user-friendly description of what happened and a suggestion or two for resolving the problem (a sketch of this idea follows the list).
  2. After looking it up, I understand this is a memory allocation error, but that raises the question of whether I was running a larger dataset than the program was intended to handle (in which case that limitation should be better documented), whether I need more robust hardware (also a documentation issue), or whether there is a memory issue in the program itself.
  3. If there is a maximum size that the program can reasonably handle, then ideally the program should do a quick check and print a warning or an error before spending a long time running tests.
  4. It seems reasonable that people will want to assess large blocks of data. Could the truncation function be extended so that, for a large data file like this one, the assessment would break it into blocks of 1 million bytes, assess each one, and print the results for all 10?
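
To make point 1 concrete, here is a minimal sketch of the kind of guard the suite could add: a size warning up front, and a catch of std::bad_alloc that prints an actionable message. The run_all_tests() function and the 4,000,000-symbol threshold are hypothetical placeholders, not part of the actual tool.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-in for the estimator suite; the real non-IID tests would run here.
static void run_all_tests(const std::vector<uint8_t> &symbols) {
    (void)symbols;
}

int main() {
    // Hypothetical size above which the tuple/LRS estimates may need a lot of RAM.
    const std::size_t kLargeInputWarning = 4000000;

    // Stand-in for the data file loaded from disk.
    std::vector<uint8_t> symbols(10000000, 0x55);

    if (symbols.size() > kLargeInputWarning) {
        std::fprintf(stderr,
                     "Warning: %zu symbols is a large input; the tuple and LRS "
                     "estimates may need a large amount of RAM.\n",
                     symbols.size());
    }

    try {
        run_all_tests(symbols);
    } catch (const std::bad_alloc &) {
        std::fprintf(stderr,
                     "Error: ran out of memory while running the estimates. "
                     "Re-run on a machine with more RAM or assess a smaller data set.\n");
        return EXIT_FAILURE;
    }
    return EXIT_SUCCESS;
}
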
joshuaehill commented 5 years ago

You should try this with the version in the 2018UL branch of my repository.

Your test is running out of RAM while conducting the LRS test. In my modified code, I use a somewhat fancier way of conducting these two tuple tests (the t-tuple and LRS estimates; see the last three pages of the implementation comments on this page), with the result that the required memory doesn't grow exponentially with the number of symbols and scales well with respect to the input size.

In either code base, this is fundamentally a limitation of the machine on which you ran the test, not of the code itself. Given more RAM, NIST's original code would likely have finished.
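
For intuition about why these tests are memory-hungry (an illustration only, not the algorithm either code base actually uses): a naive longest-repeated-tuple search stores every length-t window in a hash table, so peak memory grows with the product of the input length and the tuple length, which is punishing for a 10 million byte file. Approaches built on structures such as suffix arrays avoid materializing every window at once.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_set>
#include <vector>

// Length of the longest tuple that appears at least twice, found the
// memory-hungry way: for each candidate length t, every length-t window is
// copied into a hash set, so peak memory grows roughly with input length * t.
static std::size_t longestRepeatNaive(const std::vector<uint8_t> &data) {
    std::size_t longest = 0;
    for (std::size_t t = 1; t <= data.size(); ++t) {
        std::unordered_set<std::string> seen;
        bool repeat = false;
        for (std::size_t i = 0; i + t <= data.size(); ++i) {
            std::string window(reinterpret_cast<const char *>(data.data() + i), t);
            if (!seen.insert(window).second) { repeat = true; break; }
        }
        if (!repeat) break;  // no repeated tuple of length t, so the answer is t - 1
        longest = t;
    }
    return longest;
}

int main() {
    std::vector<uint8_t> sample = {1, 2, 3, 1, 2, 3, 4};
    std::printf("Longest repeated tuple length: %zu\n", longestRepeatNaive(sample));
    return 0;
}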

mslpensec commented 5 years ago

@joshuaehill You were right! I ran the 10 million byte dataset using the latest version of the 90B assessment (which I understand includes this and other performance enhancements you wrote) and the assessment completed with no errors.

Thank you for your efforts to make the test suite more efficient!