usnistgov / SP800-90B_EntropyAssessment

The SP800-90B_EntropyAssessment C++ package implements the min-entropy assessment methods included in Special Publication 800-90B.

ea_non_iid hangs on a big data file #214

Closed: yiatsec closed this issue 1 year ago

yiatsec commented 1 year ago

ea_non_iid hangs on the Tuple Estimates when processing the following 8MB file.

A10.bin (8MB) is uploaded to the ESV demo server (/esvp/v1/entropyAssessments/1393/dataFiles/3290)

The following command hangs: `ea_non_iid -i -a -v A10.bin 8`

The following command passes: `ea_non_iid -i -a -v -l 0,1000000 A10.bin 8`

The following command hangs: `ea_non_iid -i -a -v -l 2,1000000 A10.bin 8`

joshuaehill commented 1 year ago

If you upload the data file to a public location, I can take a look.

Absent the data, I can offer the following observations: the LRS / t-Tuple test runtimes can be particularly variable. Is the longest repeated substring very large for this data sample? If so, then this could be an instance of the test taking a very long time rather than hanging. (I've seen all sorts of failure modes for this tool, but I've never witnessed an infinite loop; it is certainly possible that something strange is going on here, though...)

celic commented 1 year ago

I obtained this file through ESV, and it is taking a long time to process the LRS, which indicates that there is probably a very large repeated substring. Again, @joshuaehill is right: the length of the LRS drastically increases the runtime. Starting from the base case of length-2 tuples, every repeated tuple must be re-checked to see whether it is still repeated at length 3, then at length 4, and so on; repeat that out to a very large LRS and it becomes an O(n^2) problem (see the naive sketch below).
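To make that scaling concrete, here is a deliberately naive sketch of the per-length re-checking (my own illustration with a made-up name, `naiveLRSLength`; it is not the tool's algorithm, which is described in the next comment):

```cpp
// Naive illustration of why a long LRS is expensive (NOT ea_non_iid's algorithm).
// For each tuple length t, count how many length-t windows repeat; stop at the
// first t with no repeats. The LRS length is the last t that had a repeat.
#include <cstdint>
#include <string_view>
#include <unordered_map>
#include <vector>

size_t naiveLRSLength(const std::vector<uint8_t>& data) {
    std::string_view s(reinterpret_cast<const char*>(data.data()), data.size());
    size_t t = 1;
    for (;; t++) {
        if (t > s.size()) break;
        std::unordered_map<std::string_view, size_t> counts;
        bool anyRepeat = false;
        // One pass over all length-t windows: O(n) windows at O(t) to hash
        // each, so O(n*t) per pass -- and a pass for every t up to the LRS.
        for (size_t i = 0; i + t <= s.size(); i++) {
            if (++counts[s.substr(i, t)] > 1) anyRepeat = true;
        }
        if (!anyRepeat) break;
    }
    return t - 1;  // last length that still had a repeated tuple
}
```

For a very large LRS this is hopeless, of course; it is only meant to show where the dependence on the LRS length comes from.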

joshuaehill commented 1 year ago

As an amusing CS-magic aside, the current implementation of these estimators uses a few clever tricks that make them run fairly quickly for most well-behaved strings. The central (quasi-magical) algorithm, which produces counts for all the repeated substrings, runs in $\mathcal{O}(n)$ (where $n$ is the length of the data sample) and produces the Suffix Array (SA) and Longest Common Prefix (LCP) arrays.
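For anyone who hasn't met this trick: once the SA and LCP arrays exist, the LRS length is just the maximum entry of the LCP array. Here is an illustrative sketch (`buildSaLcp` and `lrsLength` are my names, not the tool's) that builds the suffix array with a plain comparison sort, which can cost up to $\mathcal{O}(n^2 \log n)$, and then runs Kasai's $\mathcal{O}(n)$ LCP construction; the production-quality linear-time suffix array constructions are considerably more involved:

```cpp
// Illustration only: suffix array via comparison sort (slow), then Kasai's
// O(n) LCP construction. lcp[i] holds the length of the longest common prefix
// of the suffixes starting at sa[i] and sa[i-1] (with lcp[0] = 0).
#include <algorithm>
#include <numeric>
#include <string_view>
#include <vector>

std::vector<size_t> buildSaLcp(std::string_view s, std::vector<size_t>& sa) {
    const size_t n = s.size();
    sa.resize(n);
    std::iota(sa.begin(), sa.end(), 0);
    // Comparison sort of suffixes; linear-time constructions (e.g., SA-IS)
    // exist and are what makes the O(n) claim above possible.
    std::sort(sa.begin(), sa.end(), [&](size_t a, size_t b) {
        return s.substr(a) < s.substr(b);
    });

    std::vector<size_t> rank(n), lcp(n, 0);
    for (size_t i = 0; i < n; i++) rank[sa[i]] = i;
    // Kasai's algorithm: when moving from suffix i to suffix i+1, the match
    // length can shrink by at most 1, so the total work is O(n).
    size_t h = 0;
    for (size_t i = 0; i < n; i++) {
        if (rank[i] > 0) {
            size_t j = sa[rank[i] - 1];
            while (i + h < n && j + h < n && s[i + h] == s[j + h]) h++;
            lcp[rank[i]] = h;
            if (h > 0) h--;
        } else {
            h = 0;
        }
    }
    return lcp;
}

// The LRS length is simply the largest LCP value.
size_t lrsLength(std::string_view s) {
    std::vector<size_t> sa;
    const std::vector<size_t> lcp = buildSaLcp(s, sa);
    return lcp.empty() ? 0 : *std::max_element(lcp.begin(), lcp.end());
}
```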

When the length of the longest repeated substring (LRS), $v$, is close to $n$, the $\mathcal{O}(nv)$ algorithms in the LRS and $t$-Tuple estimators have an essentially $\mathcal{O}(n^2)$ runtime. These estimator-specific algorithms essentially interpret the LCP array to calculate the Q array for the $t$-Tuple estimator (Q[i] is the number of occurrences of the most common $i$-tuple) and to calculate the $P_W$ sums, requiring a calculation for every distinct tuple length $W$ up to the length of the LRS (the $P_W$ counting is sketched below). That this whole process happens in as little as $\mathcal{O}(nv)$ is sort of incredible, given that there are $\mathcal{O}(k^v)$ different possible strings of length at most $v$ over a $k$-symbol alphabet (of course, there are only $\mathcal{O}(n^2)$ possible substrings of a length-$n$ string, which is why this algorithm can work in the first place and doesn't have an alphabet-size term in it...)
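To unpack the "interpreting the LCP array" step: all suffixes sharing a length-$W$ prefix occupy one contiguous block of the suffix array, so a maximal run of $r$ consecutive LCP entries $\geq W$ covers $r+1$ suffixes and contributes $\binom{r+1}{2}$ pairs of positions whose length-$W$ tuples agree. A sketch of that counting (`pairCountsPerLength` is my illustrative name, not the tool's; it consumes an LCP array like the one built above):

```cpp
// For each tuple length W, count the unordered pairs of positions whose
// length-W tuples agree, i.e., sum over distinct W-tuples x of C(count(x), 2).
// One O(n) pass per W gives the O(n*v) total runtime discussed above.
#include <cstdint>
#include <vector>

std::vector<uint64_t> pairCountsPerLength(const std::vector<size_t>& lcp,
                                          size_t v /* LRS length */) {
    std::vector<uint64_t> pairs(v + 1, 0);  // pairs[W] for W = 1..v
    for (size_t W = 1; W <= v; W++) {
        uint64_t run = 0;  // length of the current run of lcp[i] >= W
        for (size_t i = 1; i < lcp.size(); i++) {
            if (lcp[i] >= W) {
                run++;
            } else if (run > 0) {
                pairs[W] += run * (run + 1) / 2;  // C(run+1, 2)
                run = 0;
            }
        }
        if (run > 0) pairs[W] += run * (run + 1) / 2;
    }
    return pairs;
}
```

The $P_W$ of SP 800-90B Section 6.3.6 is then `pairs[W]` divided by $\binom{L-W+1}{2}$; when $v$ is a substantial fraction of $n$, that outer loop over $W$ is what eats the time.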

celic commented 1 year ago

I started running this on Friday. It's still going as of Monday.

joshuaehill commented 1 year ago

@yiatsec I could try to further diagnose the problem if I had access to the file.

smuellerDD commented 1 year ago

Please obtain the file from https://www.chronox.de/A10.bin

joshuaehill commented 1 year ago

The longest repeated substring for this data is 2097152 ($2^{21}$) bytes long, so the LRS estimate is having to do a very large amount of work. This suggests a real problem in the dataset.

As it happens, we can use just the LRS to provide an upper bound on the possible assessed entropy: we know that the LRS occurs at least 2 times, so (using the notation of SP 800-90B Section 6.3.6, step 3): $$\hat{p} \geq P_v^{1/v} \geq \left( \frac{1}{\binom{L-v+1}{2}} \right)^{1/v} \approx 0.999985 $$ After the usual 99% upper-confidence-bound adjustment, we see that $p_u \geq 0.999989$, so the min-entropy estimate is $\leq 0.0000161591$. Again, this is an upper bound, and the actual result (were you to wait for it) is likely to be closer to 0.
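If anyone wants to reproduce those numbers, here is a throwaway check (mine, not part of the tool), assuming the sample is exactly $L = 8388608$ bytes with $v = 2097152$, and keeping $\binom{L-v+1}{2}$ in log space:

```cpp
// Evaluate the LRS-based bound: p_hat >= (1 / C(L-v+1, 2))^(1/v), then the
// min-entropy bound -log2(p_u) using SP 800-90B's 99% upper confidence bound.
#include <cmath>
#include <cstdio>

int main() {
    const double L = 8388608.0;  // assumed: exactly 8 MiB of data
    const double v = 2097152.0;  // reported LRS length (2^21)

    // log of C(L-v+1, 2) = (L-v+1)(L-v)/2, kept in log space to avoid overflow
    const double m = L - v + 1.0;
    const double logPairs = std::log(m) + std::log(m - 1.0) - std::log(2.0);

    const double pHat = std::exp(-logPairs / v);
    double pU = pHat + 2.576 * std::sqrt(pHat * (1.0 - pHat) / (L - 1.0));
    if (pU > 1.0) pU = 1.0;

    std::printf("p_hat >= %.6f\n", pHat);                   // ~0.999985
    std::printf("p_u   >= %.6f\n", pU);                     // ~0.999989
    std::printf("min-entropy <= %.10f\n", -std::log2(pU));  // ~0.0000161591
}
```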

In summary, this dataset suggests that the assessed entropy should be approximately 0.

celic commented 1 year ago

See #213. When the file contains large repeated regions, an unexpected scaling occurs when determining the longest repeated substring. This is not an issue with ea_non_iid, but rather with the low-entropy file provided to the system. I am going to close this issue.

joshuaehill commented 1 year ago

Aside: the literal (non-bitstream) LRS estimate finished on my system, and it reported that the value I used as a bound above was the actual result.

yiatsec commented 1 year ago

Thanks to Chris and Joshua for providing analysis and identifying the issue!