poor entropy value for a large data file but good entropy value for each segment of the data file

yiatsec commented 1 year ago

ea_non_iid will return great entropy results for the individual 1,000,000 samples in the 8MB file, but very poor entropy for the whole file.

Great entropy result for each 1,000,000 samples: ea_non_iid -i -a -v -l 0,1000000 A14.bin 8 ea_non_iid -i -a -v -l 1,1000000 A14.bin 8 ea_non_iid -i -a -v -l 2,1000000 A14.bin 8 ... ea_non_iid -i -a -v -l 7,1000000 A14.bin 8

Poor entropy result for the entire file (H_bitstring: 0.019464): ea_non_iid -i -a -v A14.bin 8

The data file A14.bin is uploaded to the ESV demo server (/esvp/v1/entropyAssessments/1398/dataFiles/3300)

joshuaehill commented 1 year ago

This sort of thing tends to be due to large-scale patterns in the data. The one that I've seen the most is a very long repeated substring. If this is the case, then this is an indication of a real problem, either in the underlying noise source or in the data handling (e.g. the same result was appended to the file more than once).

Again, if you provide the data file, then I can take a look. Absent the data file, I can only guess what is going on.

celic commented 1 year ago

It would be helpful if you ran this with -vv so we can see more output. I agree with @joshuaehill that the most likely culprit is the LRS. Do you at least have the individual test scores available?

smuellerDD commented 1 year ago

Executing with -vv and on the initial 8MB file:

Opening file: 'A14.bin' (SHA-256 hash 90689f222cd5a1a7fd8bdf2d447f9666fbcaefca0887af96599609ee2e634e1a)
Loaded 8388608 samples of 256 distinct 8-bit-wide symbols
Number of Binary Symbols: 67108864
Bitstring Most Common Value Estimate: Mode count = 33561660
Bitstring Most Common Value Estimate: p-hat = 0.50010770559310913
Bitstring Most Common Value Estimate: p_u = 0.50026492173464832
Bitstring Most Common Value Estimate: min entropy = 0.99923579988974387
Literal Most Common Value Estimate: Mode count = 34058
Literal Most Common Value Estimate: p-hat = 0.0040600299835205078
Literal Most Common Value Estimate: p_u = 0.0041165826851581851
Literal Most Common Value Estimate: min entropy = 7.9243370804936237
Bitstring Collision Estimate: v = 26843774
Bitstring Collision Estimate: Sum t_i = 67108864
Bitstring Collision Estimate: X-bar = 2.4999787287733835
Bitstring Collision Estimate: sigma-hat = 0.50000000886068141
Bitstring Collision Estimate: X-bar' = 2.4997301492764841
Bitstring Collision Estimate: Found p.
Bitstring Collision Estimate: p = 0.51161573767601376
Bitstring Collision Estimate: min entropy = 0.96686745162572785
Bitstring Markov Estimate: P_0 = 0.50010770559310913
Bitstring Markov Estimate: P_1 = 0.49989229440689087
Bitstring Markov Estimate: P_{0,0} = 0.5001752744106005
Bitstring Markov Estimate: P_{0,1} = 0.4998247255893995
Bitstring Markov Estimate: P_{1,0} = 0.50004009276004047
Bitstring Markov Estimate: P_{1,1} = 0.49995990723995953
Bitstring Markov Estimate: p-hat_max = 3.0731613947760277e-39
Bitstring Markov Estimate: min entropy = 0.99949587628858816
Bitstring Compression Estimate: X-bar = 5.2174238132654009
Bitstring Compression Estimate: sigma-hat = 1.015438338817628
Bitstring Compression Estimate: X-bar' = 5.2166416890253631
Bitstring Compression Estimate: Found p.
Bitstring Compression Estimate: p = 0.020758119653956797
Bitstring Compression Estimate: min entropy = 0.93169673745482251
Bitstring t-Tuple Estimate: t = 1053
Bitstring t-Tuple Estimate: p-hat_max = 0.98656274725037851
Bitstring t-Tuple Estimate: p_u = 0.98659895026523092
Bitstring t-Tuple Estimate: min entropy = 0.019464342572138409
Bitstring LRS Estimate: u = 1054
Bitstring LRS Estimate: v = 1086
Bitstring LRS Estimate: p-hat = 0.97284168114925829
Bitstring LRS Estimate: p_u = 0.97289279041778232
Bitstring LRS Estimate: min entropy = 0.039647261365447189
Literal t-Tuple Estimate: t = 101
Literal t-Tuple Estimate: p-hat_max = 0.88457868137893292
Literal t-Tuple Estimate: p_u = 0.88486285476711379
Literal t-Tuple Estimate: min entropy = 0.17647422622243453
Literal LRS Estimate: u = 102
Literal LRS Estimate: v = 134
Literal LRS Estimate: p-hat = 0.8046557603231751
Literal LRS Estimate: p_u = 0.80500835643773239
Literal LRS Estimate: min entropy = 0.31292433559960725
Bitstring MultiMCW Prediction Estimate: C = 33553966
Bitstring MultiMCW Prediction Estimate: r = 1088
Bitstring MultiMCW Prediction Estimate: N = 67108801
Bitstring MultiMCW Prediction Estimate: P_global = 0.49999352543938314
Bitstring MultiMCW Prediction Estimate: P_global' = 0.50015074165835194
Bitstring MultiMCW Prediction Estimate: P_local = 0.98310269845213949
Bitstring MultiMCW Prediction Estimate: min entropy = 0.024585961324006745
Literal MultiMCW Prediction Estimate: C = 33451
Literal MultiMCW Prediction Estimate: r = 136
Literal MultiMCW Prediction Estimate: N = 8388545
Literal MultiMCW Prediction Estimate: P_global = 0.0039876998931280697
Literal MultiMCW Prediction Estimate: P_global' = 0.0040437488288347445
Literal MultiMCW Prediction Estimate: P_local = 0.87294955796959184
Literal MultiMCW Prediction Estimate: min entropy = 0.1960298025098561
Bitstring Lag Prediction Estimate: C = 33550683
Bitstring Lag Prediction Estimate: r = 1065
Bitstring Lag Prediction Estimate: N = 67108863
Bitstring Lag Prediction Estimate: P_global = 0.49994414299643253
Bitstring Lag Prediction Estimate: P_global' = 0.50010135914180964
Bitstring Lag Prediction Estimate: P_local = 0.98272029623965684
Bitstring Lag Prediction Estimate: min entropy = 0.025147242562472302
Literal Lag Prediction Estimate: C = 33249
Literal Lag Prediction Estimate: r = 110
Literal Lag Prediction Estimate: N = 8388607
Literal Lag Prediction Estimate: P_global = 0.0039635901407706906
Literal Lag Prediction Estimate: P_global' = 0.0040194698526339143
Literal Lag Prediction Estimate: P_local = 0.84377153773034608
Literal Lag Prediction Estimate: min entropy = 0.24507567178593684
Bitstring MultiMMC Prediction Estimate: C = 33567632
Bitstring MultiMMC Prediction Estimate: r = 1083
Bitstring MultiMMC Prediction Estimate: N = 67108862
Bitstring MultiMMC Prediction Estimate: P_global = 0.5001967102347824
Bitstring MultiMMC Prediction Estimate: P_global' = 0.50035392637014497
Bitstring MultiMMC Prediction Estimate: P_local = 0.9830209723317318
Bitstring MultiMMC Prediction Estimate: min entropy = 0.024705898711917001
Literal MultiMMC Prediction Estimate: C = 33454
Literal MultiMMC Prediction Estimate: r = 135
Literal MultiMMC Prediction Estimate: N = 8388606
Literal MultiMMC Prediction Estimate: P_global = 0.0039880285234519296
Literal MultiMMC Prediction Estimate: P_global' = 0.0040440795555922171
Literal MultiMMC Prediction Estimate: P_local = 0.87202446449970961
Literal MultiMMC Prediction Estimate: min entropy = 0.19755948475063176
Bitstring LZ78Y Prediction Estimate: C = 33555888
Bitstring LZ78Y Prediction Estimate: r = 1088
Bitstring LZ78Y Prediction Estimate: N = 67108847
Bitstring LZ78Y Prediction Estimate: P_global = 0.50002182275609652
Bitstring LZ78Y Prediction Estimate: P_global' = 0.50017903892104654
Bitstring LZ78Y Prediction Estimate: P_local = 0.98310269779766868
Bitstring LZ78Y Prediction Estimate: min entropy = 0.02458596228443722
Literal LZ78Y Prediction Estimate: C = 33453
Literal LZ78Y Prediction Estimate: r = 135
Literal LZ78Y Prediction Estimate: N = 8388591
Literal LZ78Y Prediction Estimate: P_global = 0.0039879164450859509
Literal LZ78Y Prediction Estimate: P_global' = 0.004043966742866445
Literal LZ78Y Prediction Estimate: P_local = 0.87202447666401872
Literal LZ78Y Prediction Estimate: min entropy = 0.19755946462575105
H_bitstring = 0.019464342572138409
H_bitstring Per Symbol = 0.15571474057710727
H_original = 0.17647422622243453
Assessed min entropy: 0.15571474057710727

smuellerDD commented 1 year ago

Executing with -vv and on the first 1MB:

Opening file: 'A14.bin' (SHA-256 hash 90689f222cd5a1a7fd8bdf2d447f9666fbcaefca0887af96599609ee2e634e1a), reading block 0 of size 1000000
Loaded 1000000 samples of 256 distinct 8-bit-wide symbols
Number of Binary Symbols: 8000000
Bitstring Most Common Value Estimate: Mode count = 4000019
Bitstring Most Common Value Estimate: p-hat = 0.500002375
Bitstring Most Common Value Estimate: p_u = 0.5004577216203836
Bitstring Most Common Value Estimate: min entropy = 0.9986798987230765
Literal Most Common Value Estimate: Mode count = 4099
Literal Most Common Value Estimate: p-hat = 0.0040990000000000002
Literal Most Common Value Estimate: p_u = 0.0042635751805110058
Literal Most Common Value Estimate: min entropy = 7.8737205884746455
Bitstring Collision Estimate: v = 3200173
Bitstring Collision Estimate: Sum t_i = 7999998
Bitstring Collision Estimate: X-bar = 2.4998642260902768
Bitstring Collision Estimate: sigma-hat = 0.50000005968623762
Bitstring Collision Estimate: X-bar' = 2.4991442792874854
Bitstring Collision Estimate: Found p.
Bitstring Collision Estimate: p = 0.52068478562270593
Bitstring Collision Estimate: min entropy = 0.94151784300335928
Bitstring Markov Estimate: P_0 = 0.499997625
Bitstring Markov Estimate: P_1 = 0.500002375
Bitstring Markov Estimate: P_{0,0} = 0.50015000075000371
Bitstring Markov Estimate: P_{0,1} = 0.49984999924999629
Bitstring Markov Estimate: P_{1,0} = 0.49984512573565276
Bitstring Markov Estimate: P_{1,1} = 0.50015487426434724
Bitstring Markov Estimate: p-hat_max = 3.0566398886466571e-39
Bitstring Markov Estimate: min entropy = 0.99955663364325387
Bitstring Compression Estimate: X-bar = 5.2184201973491531
Bitstring Compression Estimate: sigma-hat = 1.014198389589603
Bitstring Compression Estimate: X-bar' = 5.2161569417485056
Bitstring Compression Estimate: Found p.
Bitstring Compression Estimate: p = 0.021879816170943034
Bitstring Compression Estimate: min entropy = 0.91904259546282197
Bitstring t-Tuple Estimate: t = 19
Bitstring t-Tuple Estimate: p-hat_max = 0.52459727566326442
Bitstring t-Tuple Estimate: p_u = 0.52505207095682271
Bitstring t-Tuple Estimate: min entropy = 0.92946758870669877
Bitstring LRS Estimate: u = 20
Bitstring LRS Estimate: v = 46
Bitstring LRS Estimate: p-hat = 0.50883294077706853
Bitstring LRS Estimate: p_u = 0.50928821633885868
Bitstring LRS Estimate: min entropy = 0.97344575763652619
Literal t-Tuple Estimate: t = 1
Literal t-Tuple Estimate: p-hat_max = 0.0040990000000000002
Literal t-Tuple Estimate: p_u = 0.0042635751805110058
Literal t-Tuple Estimate: min entropy = 7.8737205884746455
Literal LRS Estimate: u = 2
Literal LRS Estimate: v = 4
Literal LRS Estimate: p-hat = 0.0039067758815502637
Literal LRS Estimate: p_u = 0.0040674613260951939
Literal LRS Estimate: min entropy = 7.9416556560146656
Bitstring MultiMCW Prediction Estimate: C = 3999384
Bitstring MultiMCW Prediction Estimate: r = 23
Bitstring MultiMCW Prediction Estimate: N = 7999937
Bitstring MultiMCW Prediction Estimate: P_global = 0.49992693692462831
Bitstring MultiMCW Prediction Estimate: P_global' = 0.5003822853330937
Bitstring MultiMCW Prediction Estimate: P_local can't change the result.
Bitstring MultiMCW Prediction Estimate: min entropy = 0.99889737915356724
Literal MultiMCW Prediction Estimate: C = 3908
Literal MultiMCW Prediction Estimate: r = 3
Literal MultiMCW Prediction Estimate: N = 999937
Literal MultiMCW Prediction Estimate: P_global = 0.0039082462195118295
Literal MultiMCW Prediction Estimate: P_global' = 0.0040689668428336998
Literal MultiMCW Prediction Estimate: P_local can't change the result.
Literal MultiMCW Prediction Estimate: min entropy = 7.941121760425025
Bitstring Lag Prediction Estimate: C = 3998966
Bitstring Lag Prediction Estimate: r = 23
Bitstring Lag Prediction Estimate: N = 7999999
Bitstring Lag Prediction Estimate: P_global = 0.49987081248385157
Bitstring Lag Prediction Estimate: P_global' = 0.50032615911750056
Bitstring Lag Prediction Estimate: P_local can't change the result.
Bitstring Lag Prediction Estimate: min entropy = 0.999059210530996
Literal Lag Prediction Estimate: C = 3842
Literal Lag Prediction Estimate: r = 3
Literal Lag Prediction Estimate: N = 999999
Literal Lag Prediction Estimate: P_global = 0.0038420038420038422
Literal Lag Prediction Estimate: P_global' = 0.0040013569447458278
Literal Lag Prediction Estimate: P_local can't change the result.
Literal Lag Prediction Estimate: min entropy = 7.9652949532929576
Bitstring MultiMMC Prediction Estimate: C = 4002701
Bitstring MultiMMC Prediction Estimate: r = 22
Bitstring MultiMMC Prediction Estimate: N = 7999998
Bitstring MultiMMC Prediction Estimate: P_global = 0.50033775008443748
Bitstring MultiMMC Prediction Estimate: P_global' = 0.5007930966578571
Bitstring MultiMMC Prediction Estimate: P_local can't change the result.
Bitstring MultiMMC Prediction Estimate: min entropy = 0.99771341976962502
Literal MultiMMC Prediction Estimate: C = 3961
Literal MultiMMC Prediction Estimate: r = 3
Literal MultiMMC Prediction Estimate: N = 999998
Literal MultiMMC Prediction Estimate: P_global = 0.0039610079220158438
Literal MultiMMC Prediction Estimate: P_global' = 0.0041228005601766241
Literal MultiMMC Prediction Estimate: P_local can't change the result.
Literal MultiMMC Prediction Estimate: min entropy = 7.9221596118919653
Bitstring LZ78Y Prediction Estimate: C = 3999623
Bitstring LZ78Y Prediction Estimate: r = 25
Bitstring LZ78Y Prediction Estimate: N = 7999983
Bitstring LZ78Y Prediction Estimate: P_global = 0.49995393740211697
Bitstring LZ78Y Prediction Estimate: P_global' = 0.50040928450438005
Bitstring LZ78Y Prediction Estimate: P_local can't change the result.
Bitstring LZ78Y Prediction Estimate: min entropy = 0.99881953762950504
Literal LZ78Y Prediction Estimate: C = 3960
Literal LZ78Y Prediction Estimate: r = 3
Literal LZ78Y Prediction Estimate: N = 999983
Literal LZ78Y Prediction Estimate: P_global = 0.0039600673211444594
Literal LZ78Y Prediction Estimate: P_global' = 0.0041218420378209292
Literal LZ78Y Prediction Estimate: P_local can't change the result.
Literal LZ78Y Prediction Estimate: min entropy = 7.9224950674169161
H_bitstring = 0.91904259546282197
H_bitstring Per Symbol = 7.3523407637025757
H_original = 7.8737205884746455
Assessed min entropy: 7.3523407637025757

celic commented 1 year ago

It seems like this may have something to do with the "arbitrary" cutoff value as part of the design of the test itself from SP 800-90B Section 6.3.5. The tests look for tuples that occur at least 35 times within the sample to determine if the rate of their occurrences matches what is expected on a 1 million sample file. With significantly more samples there will naturally be many more tuples that occur more than 35 times. The test doesn't define any scaling for the cutoff in these cases. My best guess is that the test is poorly designed for a large number of samples.

joshuaehill commented 1 year ago

The $t$-tuple test checks to see what the average per-symbol min entropy is for the dataset based on the probability of the most common observed $i$-tuple, where $i$ varies from 1 to $t$ (where $t$ is the largest value that occurs at least 35 times). It then takes the "best" (for the attacker, so that is to say "lowest entropy") value as the basis for an entropy estimate.

This is a conceptually intuitive and reasonable looking test. The choice of "35 repeats" is arbitrary to some degree. One doesn't want this value to be less than about 10-20, because at that point it isn't clear that reasonable estimates are made for the P[i] parameters. Indeed, over time such parameters were adjusted (e.g., see the LRS $u$ cutoff in the 2016 version of 90B vs the 2018 final version of 90B.) The length of such a tuple will naturally increase as the file size gets bigger, the probability is always normalized to a per-symbol probability, and no other scaling is necessary.

This implementation works well for large file sizes sizes in my experience. I tend to do testing using sample sets with on the order of 100 million samples, and haven't encountered this problem in a way that ultimately couldn't be explained by dataset problems.

To that end, the relevant result extract is here:

Bitstring t-Tuple Estimate: t = 1053
Bitstring t-Tuple Estimate: p-hat_max = 0.98656274725037851
Bitstring t-Tuple Estimate: p_u = 0.98659895026523092
Bitstring t-Tuple Estimate: min entropy = 0.019464342572138409
Bitstring LRS Estimate: u = 1054
Bitstring LRS Estimate: v = 1086
Bitstring LRS Estimate: p-hat = 0.97284168114925829
Bitstring LRS Estimate: p_u = 0.97289279041778232
Bitstring LRS Estimate: min entropy = 0.039647261365447189
Literal t-Tuple Estimate: t = 101
Literal t-Tuple Estimate: p-hat_max = 0.88457868137893292
Literal t-Tuple Estimate: p_u = 0.88486285476711379
Literal t-Tuple Estimate: min entropy = 0.17647422622243453
Literal LRS Estimate: u = 102
Literal LRS Estimate: v = 134
Literal LRS Estimate: p-hat = 0.8046557603231751
Literal LRS Estimate: p_u = 0.80500835643773239
Literal LRS Estimate: min entropy = 0.31292433559960725

This value for $t$ is surprisingly large, as is the longest repeated substring. A repeated binary substring of size 1086 is very unlikely if this were an ideal source (the value of this parameter is expected to be approximately 50 for this data size; 1000 rounds of simulation with an ideal-looking source produced a LRS no larger than 57). More concerningly, the average per-symbol prediction probability was very high for the identified most likely tuple, so the corresponding entropy result was quite low. This (and the LRS estimate) both show that this data set suggests that the source commonly repeats large strings, so it correspondingly gets a low entropy assessment.

It looks like the tool is behaving correctly here.

celic commented 1 year ago

Thanks for the analysis. It's been too long since I've looked at this code. v represents the length of the largest repeated substring, correct? My local run of the file ended up with Bitstring LRS Estimate: v = 16777216 which is about a quarter of the entire file, causing the min-entropy to be 0 bits. I'd agree the code is operating correctly. I'm going to close this issue.

joshuaehill commented 1 year ago

Yes, v is the length of the longest repeated substring.

joshuaehill commented 1 year ago

@celic that LRS isn't consistent with the results that @smuellerDD reported above; in the run results he provided, the LRS for the bitstring version of A14.bin was v=1086. I think that your value is the LRS for A10.bin (which is being discussed in #214).

All that being said, I agree that the code seems to be operating correctly.

celic commented 1 year ago

It may be a different file. I was provided a file separately from the URL on his website.

smuellerDD commented 1 year ago

Am Dienstag, 7. März 2023, 21:50:51 CET schrieb Chris Celi:

Hi Chris,

It may be a different file. I was provided a file separately from the URL on his website.

I looked at the request to provide the A10.bin.

The data I provided in the listing was about the A14.bin. I have now also uploaded this file to https://www.chronox.de/A14.bin

However, considering the discussion here, it seems that there is a long running string issue on the data and the calculation of the tool is proper. We will investigate.

Thank you very much.

Ciao Stephan

smuellerDD commented 1 year ago

Thanks for your help, we have identified an issue in the data collection causing duplicates.

usnistgov / SP800-90B_EntropyAssessment

poor entropy value for a large data file but good entropy value for each segment of the data file #213