usnistgov / SP800-90B_EntropyAssessment

The SP800-90B_EntropyAssessment C++ package implements the min-entropy assessment methods included in Special Publication 800-90B.

Fix error causing t-tuple underestimation with -i, -t and len > 1E6 symbols #139

Closed: dj-on-github closed this issue 4 years ago

dj-on-github commented 4 years ago

When the second call to the t-tuple estimation occurs (conditioned on initial==True), SAalgs is called with the file length. When -t is used to truncate to 1,000,000 symbols and the file is longer than 1,000,000 bytes, the length passed to SAalgs is not 1,000,000 but the full file length, which leads to an incorrect entropy estimate. This is not seen with the -c option, since data.blen is used there; in the -i case, data.len is used. Around line 139 of non_iid_main.cpp, data.blen is appropriately truncated, but data.len is not.
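For concreteness, here is a minimal sketch of the behavior described above and the one-line change being proposed. The names data.len and data.blen come from this thread; the struct layout, the MAX_SYMBOLS constant, and the surrounding function are illustrative assumptions, not the repository's actual code.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative stand-ins; the real definitions live in the tool's sources.
struct data_t {
    std::size_t len;   // number of symbols read from the input file
    std::size_t blen;  // length of the derived bitstring
};

const std::size_t MAX_SYMBOLS = 1000000;  // the 1,000,000-symbol cap from -t

void truncate_for_t_option(data_t &data) {
    // What the thread says happens today: only the bitstring is capped.
    data.blen = std::min(data.blen, MAX_SYMBOLS);

    // The reporter's proposed addition: also cap the symbol count, so the
    // length later handed to SAalgs in the -i path is at most 1,000,000.
    // data.len = std::min(data.len, MAX_SYMBOLS);
}
```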

joshuaehill commented 4 years ago

I suspect that the -t option just isn't that well described, as this is the second time that this has come up (see also issue #127).

The -t option isn't intended to truncate the input data; it's intended to allow for the truncation mentioned in the third sentence of paragraph 3 in Section 3.1.3 of SP800-90B: for multi-bit samples, you may truncate the binary representation of the multi-bit input data that is used to produce the H_bitstring assessment.
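To make that distinction concrete, here is a hedged sketch (not the repository's code) of the Section 3.1.3 style of truncation: the multi-bit symbols are expanded into a bitstring for the H_bitstring assessments, and -t caps the length of that bitstring rather than the symbol stream itself. The function name, the most-significant-bit-first ordering, and the cap parameter are all illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expand multi-bit symbols into a bitstring, stopping at cap_bits bits.
std::vector<uint8_t> to_bitstring(const std::vector<uint8_t> &symbols,
                                  int bits_per_symbol, std::size_t cap_bits) {
    std::vector<uint8_t> bits;
    bits.reserve(std::min(symbols.size() * bits_per_symbol, cap_bits));
    for (uint8_t s : symbols) {
        for (int b = bits_per_symbol - 1; b >= 0; --b) {
            if (bits.size() == cap_bits) return bits;  // the -t style cap
            bits.push_back((s >> b) & 1);
        }
    }
    return bits;
}
```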

You can do what I think you're trying to do with the -l 0,1000000 -t options together, which would read only 1,000,000 data samples from the data file (so data.len == 1000000) and truncate the produced binary string (presuming that the underlying data is not binary, whence data.blen == 1000000).
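A usage example with those flags (the file name and the trailing bits-per-symbol argument are illustrative; check the tool's help text for the exact invocation):

```sh
# Read only the first 1,000,000 samples and cap the derived bitstring.
./ea_non_iid -i -l 0,1000000 -t data.bin 8
```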

Having said all that, I'm really curious how this leads to an incorrect entropy estimation; could you describe what you were seeing?

joshuaehill commented 4 years ago

One extra point, to try to explain the (admittedly somewhat bizarre) behavior that you saw: the reason this worked the way you were expecting when you used the -c flag is that -c signals that you're doing the testing described in SP800-90B Section 3.1.5.2 (paragraphs 1 and 2), and (as per the second paragraph) all such data is treated as a binary string only, irrespective of its actual raw format. As such, this mode assesses only H_bitstring.

dj-on-github commented 4 years ago

In this case, I am testing at 1 bit per symbol, so the data length for the bitstring and the data length for the symbols are the same. The intent is to measure 1,000,000 symbols.

By 'wrong' I mean it gives a different result than a separate implementation that is a literal implementation of the spec, so we went digging into why it was different. With -c we got a match with the expected result (to be expected, since 3.1.5.2 drops you back to the level of the non-IID test minimum). But with -i and -t (to truncate to 1,000,000) we got a different, lower number. The lower number makes sense given what was at the end of the test file, but we were trying to measure the first 1,000,000 symbols. So one number was wrong, and we were obviously concerned that the number we were putting in our entropy justification document was the wrong one.

The test we were trying to do is an initial assessment over 1,000,000 1-bit symbols. After much digging, it comes down to the -t option not truncating a bitstring you pass in (1-bit-per-symbol style) when you do an initial (-i) assessment. So the cheap trick is to truncate the nob file to 1,000,000 bytes (since it's 1 bit per byte) to exactly meet the spec requirement.

Looking at the help text, it does literally say it truncates the bitstring, which is counterintuitive: we're passing it the bitstring and expect it to truncate it. There is no alternative bitstring when doing a 1-bit-per-symbol analysis. The code only truncates blen and not len, which has the air of wrongness about it.
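The cheap trick above can be done with standard tools; a hedged example, assuming the input file stores one 1-bit symbol per byte and that ea_non_iid takes the file name followed by the bits-per-symbol (check the tool's help text):

```sh
# Keep exactly the first 1,000,000 one-bit-per-byte symbols, then assess.
head -c 1000000 data.nob > data_1M.nob
./ea_non_iid -i data_1M.nob 1
```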

joshuaehill commented 4 years ago

For 1 bit per symbol, you can just use -l 0,1000000 to do what you want: that will read only the first 1000000 symbols from the file. For binary data, the testing process described by 90B never produces any H_bitstring, so it doesn't impact results if you truncate the data that would ultimately produce the H_bitstring assessment.
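For that binary, initial-assessment case, the invocation would presumably be just the -l restriction without -t (again, the file name and bits-per-symbol argument are illustrative):

```sh
./ea_non_iid -i -l 0,1000000 data.nob 1
```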

I don't think that the change you are proposing here is desirable, because it wouldn't allow the testing described in the 90B process to be performed in all cases with a single invocation of the command (e.g., if you wanted to truncate the data used for the calculation of H_bitstring, but also wanted to test more than 1,000,000 samples of the full data).

I feel somewhat conflicted here. Truncation of the data used to produce H_bitstring is allowed by Section 3.1.3 of 90B, so I suppose that the tool should support this functionality. Having said that, I think that the -t option generally shouldn't be used. :-) I suspect that this option was added at a time when this tool took a long time to run and was very resource constrained, but at this point it's fairly speedy and not as resource demanding as it once was, so there aren't many testing situations where the truncation is still desirable.

Given that multiple people have had the same misunderstanding regarding the meaning of this option, I think that there is an issue, but it isn't (so far as I know) in how the truncation actually occurs. I think that the best way of addressing this is to change the help text in the README file and the help text produced by the ea_iid and ea_non_iid tools. I'll put together such a change and link to it here.

joshuaehill commented 4 years ago

I think that the above commit should make it clearer what -t is intended to accomplish for non_iid_main.cpp. This commit contains the analogous changes for iid_main.cpp; if you have any comments on these changes, let me know.

These commits are to a branch that is the basis of pull request #137, so hopefully they will be integrated soon. :-)

If you agree that this resolves the problem, please close this issue.

Thanks!

dj-on-github commented 4 years ago

That looks ok. I'm closing it. Thank you for your help.