Closed: dj-on-github closed this issue 5 years ago
It would appear that the link http://deadhat.com/ttuple_silly.bin isn't working.
Try http://deadhat.com/ttuple_silly.binary My apache config had opinions about binary files.
The SA / LCP code seems to think that there is a repeated substring of length 6373 in ttuple_silly.nob, where the strings start at indexes 78259981 and 78259996. Extracting the relevant chunks from your data file:
dd if=ttuple_silly.nob of=string1.bin skip=78259981 count=6373 bs=1
dd if=ttuple_silly.nob of=string2.bin skip=78259996 count=6373 bs=1
if cmp string1.bin string2.bin; then
echo "Files Match"
fi
This indicates that the files match. So, there is, indeed, such a repeated substring of the noted length.
You ran a check of the t-tuple test, but just looking at the results, there's no reason to think that t should equal 16 for this full data set (given how large v is!)
It would appear that your tool is truncating to just over 1 million data elements (whereas the NIST tool results that you point out are on the full non-truncated data set). That's one possible issue.
Next, looking at the conversion between the files, it would seem that something a bit odd is going on there. If I decode your binary file bytewise, extracting from the low-order bit to the high-order bit, I get the same output for the first 59999 bits, but then there's a mismatch at bit 60000.
Looking at the corresponding (NIST formatted) files, I get:
--- ttuple_silly.nob.xxd 2019-06-03 17:05:37.164540848 -0700
+++ ttuple_silly.nob.local-lth.xxd 2019-06-03 17:06:01.052712275 -0700
@@ -3747,353 +3747,353 @@
-000ea50: XXXX XXXX XXXX XXXX 0001 0001 0000 01**01** ................
-000ea60: 0001 0101 0000 0000 0001 0100 0100 0101 ................
+000ea50: XXXX XXXX XXXX XXXX 0001 0001 0000 01**00** ................
+000ea60: 0100 0101 0101 0000 0101 0100 0101 0101 ................
That's bit 60000 (as expected), which occurs in byte 7500. Examining the binary file, we find
xxd -b ttuple_silly.binary
...
0001d46: 10000101 00010011 00000000 00110001 10000101 **01001010** ...1.J
or (in the low-to-high, one-bit-per-byte convention) 0001 0001 0000 0100. This suggests that there's something going on in your binary conversion, as well.
When I take the full binary file, convert it (using the same conventions), I get the following results:
Literal t-Tuple Estimate: t = 22
Literal t-Tuple Estimate: p-hat_max = 0.52078609105013651
Literal t-Tuple Estimate: p_u = 0.52092658788088819
Literal t-Tuple Estimate: min entropy = 0.9408480213522219
Literal LRS Estimate: u = 23
Literal LRS Estimate: v = 52
Literal LRS Estimate: p-hat = 0.50643643229500634
Literal LRS Estimate: p_u = 0.50657703903845108
Literal LRS Estimate: min entropy = 0.98114640752520621
In summary, that raw data seems basically fine, but the converted data seems to have problems. The NIST tool appears to be working correctly, so far as I can tell.
OK. I'm away from my computer right now (in Fiji doing better things than work). I'll check the converter program.
Where is the conversion code that converts the binary file to the desired NIST format (1 bit per symbol) for calculating min entropy? The ea_non_iid command with 8 bits per symbol seems to work fine, but with 1 bit the t-tuple test gave the low value described by @dj-on-github.
The NIST 90B tool takes data encoded as 1 symbol per byte. Converting your data to that format depends on how you've decided to pack your symbols (in particular, the byte and bit ordering is something to watch!). Once the packing convention is fixed, most are fairly straightforward to convert to the format used by the NIST tool using somewhat trivial C programs. As an example, some simple (but rather inefficient) code to accomplish the low-to-high bit-ordering bytewise conversion that we were discussing here is as follows:
/* Unpack a packed binary file to the NIST one-symbol-per-byte format,
 * taking bits from each input byte low-order bit first.
 * Usage: ./unpack < input.bin > output.nob */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main(void) {
    int c;

    while ((c = getc(stdin)) != EOF) {
        /* Emit one 0x00/0x01 output byte per input bit, LSB first. */
        for (int i = 0; i < 8; i++) {
            char output = (((1 << i) & c) == 0) ? 0 : 1;
            if (putc(output, stdout) != output) {
                perror("Can't write character");
                exit(-1);
            }
        }
    }

    return 0;
}
Generating a program that works for all packing schemes is rather obnoxious, so I'm not sure that's a good way to proceed. Including a menu of toy examples within the 90B tool's distribution would seem to encourage folks to avoid thinking about what their data format actually is. Such tools surely exist somewhere, but I'm not sure that this tool's repository is the "right place" for them.
I've re-downloaded the ttuple_silly.nob file I posted on deadhat and tested for the longest run, and got the answer 28.
wget http://deadhat.com/ttuple_silly.nob .
--2019-06-04 14:55:42-- http://deadhat.com/ttuple_silly.nob
Resolving deadhat.com (deadhat.com)... 50.126.79.253
Connecting to deadhat.com (deadhat.com)|50.126.79.253|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 83886080 (80M) [video/unknown]
Saving to: ‘ttuple_silly.nob.1’
ttuple_silly.nob.1 100%[============================================================================================>] 80.00M 83.7KB/s in 17m 33s
2019-06-04 15:13:16 (77.8 KB/s) - ‘ttuple_silly.nob.1’ saved [83886080/83886080]
--2019-06-04 15:13:16-- http://./
Resolving . (.)... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘.’
FINISHED --2019-06-04 15:13:16--
Total wall clock time: 17m 34s
Downloaded: 1 files, 80M in 17m 33s (77.8 KB/s)
$
$ djent -b -l 8 ttuple_silly.nob.1
opening ttuple_silly.nob.1 as binary
Symbol Size(bits) = 8
Min Entropy (by max occurrence of symbol 1) = 0.124988
Shannon IID Entropy = 1.000000 bits per symbol
Optimal compression would compress by 87.500000 percent
Chi square: symbol count=83886080, distribution=10653532207.48, randomly exceeds 0.00 percent of the time
Mean = 0.500033
Monte Carlo value for Pi is 4.000000 (error 27.32 percent).
Serial Correlation = -0.000092
Longest Run Symbol = 1. Run Length = 28
Position of Longest Run = 83886080.
Reversing the bit order doesn't change the result, except the run symbol becomes 0x80 instead of 0x01:
Longest Run Symbol = 80. Run Length = 28
Position of Longest Run = 83886080.
I also added some checks into bin2nistoddball.c and saw nothing unusual.
bin2nistoddball is available at https://github.com/dj-on-github/hexbinhex
This reads in the data byte by byte, converts it to a bitstream FIFO and pulls it back out in the chosen symbol size and writes it to a file. It uses 2048 bit blocks for efficiency. For the 1 bit analysis where the problem occurs, we are encoding each byte to a series of 8 bytes made of 0x00 or 0x01.
Reversing bit order of the input symbols surely does affect many of the tests, as the intra-symbol ordering changes. Encoding the final symbols as {0x00, 0x01} or {0x00, 0x80} shouldn't change anything in the non-iid test suite (though this style of translation may change permutation tests results).
Not all repeated substrings are runs (though a run is one way of getting a repeated substring of length 1 less than the longest run).
I'll take a look at the code you've linked to, but I encourage you to try the (trivial and easy to verify) code that I included above and check to see if the outputs match. If they don't that suggests a problem in bin2nistoddball.
Some observations:
- bin2nistoddball seems to reverse the order of the bits (e.g., if you were to tell it to output 8 bits per symbol, it would seem to reverse each byte). This doesn't affect the case where the output symbols are a single bit wide (the default case).
- outbuffer isn't large enough to hold all the data. It should be at least BITFIFO_SIZE = 20000 bytes. This causes a buffer overrun for this data, which can be confirmed using the "-fsanitize=address" option for modern versions of gcc or clang. This certainly would explain the apparently invalid (and changing) behavior of the ttuple_silly.nob file.

After fixing this second problem, the output matches my toy converter's output.
It's fixed! You are correct - the output buffer was too small.
# ea_non_iid -i -t -v rawdata_CID-BDX_PROC-noxor_1p0V_50p0C_.nob 1
Opening file: 'rawdata_CID-BDX_PROC-noxor_1p0V_50p0C_.nob'
Loaded 83886080 samples of 2 distinct 1-bit-wide symbols
Running non-IID tests...
Running Most Common Value Estimate...
Literal MCV Estimate: mode = 41945640, p-hat = 0.5000309944152832, p_u = 0.50017161280992273
Most Common Value Estimate = 0.999505 / 1 bit(s)
Running Entropic Statistic Estimates (bit strings only)...
Literal Collision Estimate: X-bar = 2.5000281635119261, sigma-hat = 0.50000000665748134, p = 0.50985329758202624
Collision Test Estimate = 0.971846 / 1 bit(s)
Literal Markov Estimate: P_0 = 0.4999690055847168, P_1 = 0.5000309944152832, P_0,0 = 0.49995807149276622, P_0,1 = 0.50004192850723372, P_1,0 = 0.49997995024035874, P_1,1 = 0.50002004975964121, p_max = 2.9539227679126115e-39
Markov Test Estimate = 0.999942 / 1 bit(s)
Literal Compression Estimate: X-bar = 5.21730324960311, sigma-hat = 1.0153869359403529, p = 0.020853351773189255
Compression Test Estimate = 0.930596 / 1 bit(s)
Running Tuple Estimates...
Literal t-Tuple Estimate: t = 22, p-hat_max = 0.52078609105013651, p_u = 0.52092658788088819
Literal LRS Estimate: u = 23, v = 52, p-hat = 0.50643643229500634, p_u = 0.50657703903845108
T-Tuple Test Estimate = 0.940848 / 1 bit(s)
LRS Test Estimate = 0.981146 / 1 bit(s)
Running Predictor Estimates...
Literal MultiMCW Prediction Estimate: N = 83886017, Pglobal' = 0.50009418043348641 (C = 41939113) Plocal can't affect result (r = 29)
Multi Most Common in Window (MultiMCW) Prediction Test Estimate = 0.999728 / 1 bit(s)
Literal Lag Prediction Estimate: N = 83886079, Pglobal' = 0.50008439333277199 (C = 41938323) Plocal can't affect result (r = 27)
Lag Prediction Test Estimate = 0.999757 / 1 bit(s)
Literal MultiMMC Prediction Estimate: N = 83886078, Pglobal' = 0.50004855305768714 (C = 41935316) Plocal can't affect result (r = 28)
Multi Markov Model with Counting (MultiMMC) Prediction Test Estimate = 0.999860 / 1 bit(s)
Literal LZ78Y Prediction Estimate: N = 83886063, Pglobal' = 0.50012254031664238 (C = 41941515) Plocal can't affect result (r = 28)
LZ78Y Prediction Test Estimate = 0.999646 / 1 bit(s)
H_original: 0.930596
Thank you.
No problem. I hope you have a great remainder of your vacation! :-)
Yes. Back to the chardonnay and tanning. Thank you for your help.
We have a 10MiB binary file from a real Intel entropy source. The file looks good and the statistics look to be as expected. Min entropy is > 0.9 by all sensible estimation metrics. A number of other files from the same source at the same time (at different temperatures) give expected results.
The files are at http://deadhat.com/ttuple_silly.bin and, converted to 1 bit per byte, at http://deadhat.com/ttuple_silly.nob using
bin2nistoddball ttuple_silly.bin -o ttuple_silly.nob
The ea_non_iid with 1 bit per symbol (it's a 1 bit output ES) test gives a min entropy of 0.003585 per bit.
Running against a python implementation that searches the tuples the inefficient way, we get t=16, which leads to a sensible result.
The ea_non_iid algorithm seems to find t = 5878, which seems to be wrong and explains the result.
I don't know if this is related to the incorrect truncation that is reported: "Loaded 83886080 samples".