Closed: exarkun closed this 9 months ago
Over the course of many trials, I managed to observe a similar problem exactly once with ZFEC's own test suite:
Test suite tests: RUNNING...
secureCombine
is the inverse of secureDivide n
+++ OK, passed 1 test.
deFEC
is the inverse of enFEC
+++ OK, passed 2000 tests.
decode
is (nearly) the inverse of encode FAILED [1]
works with total=255
+++ OK, passed 100 tests.
works with required=255
+++ OK, passed 100 tests.
encode
returns copies of the primary block for all 1 of N encodings FAILED [2]
Failures:
haskell/test/FECTest.hs:78:5:
1) decode is (nearly) the inverse of encode
Falsifiable (after 3 tests and 2 shrinks):
Params {required = 222, total = 253}
1
0
expected: ["\NUL","\SOH","\STX","\ETX","\EOT","\ENQ","\ACK","\a","\b","\t","\n","\v","\f","\r","\SO","\SI","\DLE","\DC1","\DC2","\DC3","\DC4","\NAK","\SYN","\ETB","\CAN","\EM","\SUB","\ESC","\FS","\GS","\RS","\US"," ","!","\"","#","$","%","&",
"'","(",")","*","+",",","-",".","/","0","1","2","3","4","5","6","7","8","9",":",";","<","=",">","?","@","A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z","[","\\","]","^","_","`","a","b","c","d","
e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z","{","|","}","~","\DEL","\128","\129","\130","\131","\132","\133","\134","\135","\136","\137","\138","\139","\140","\141","\142","\143","\144","\145","\146","\147",
"\148","\149","\150","\151","\152","\153","\154","\155","\156","\157","\158","\159","\160","\161","\162","\163","\164","\165","\166","\167","\168","\169","\170","\171","\172","\173","\174","\175","\176","\177","\178","\179","\180","\181","\182","\183
","\184","\185","\186","\187","\188","\189","\190","\191","\192","\193","\194","\195","\196","\197","\198","\199","\200","\201","\202","\203","\204","\205","\206","\207","\208","\209","\210","\211","\212","\213","\214","\215","\216","\217","\218","\2
19","\220","\221"]
but got: ["\236","\SOH","\STX","\ETX","\EOT","\ENQ","\ACK","\a","\b","\t","\n","\216","\f","\r","\SO","\SI","\DLE","s","\DC2","\DC3","\DC4","\NAK","\214","\ETB","\CAN","\EM","\SUB","\ESC","\FS","\GS","\RS","\248"," ","\195","\"","\135","$","%
","&","'","(","?","*","e",",","-",".","/","0","1","2","3","4","5","6","7","8","9",":",";","<","=",">","?","@","A","B","C","D","E","F","G","H","I","J","K","\DC3","M","N","\173","\244","Q","R","S","T","U","V","W","X","Y","Z","]","\\","]","^","_","`","a
","b","c","d","8","f","g","A","i","j","k","l","m","n","o","\252","q","r","s","t","u","D","w","x","y","z","{","|","}","~","\DEL","\128","\129","Y","\131","\132","\133","\134","\135","\136","\137","\138","\139","\140","\141","\142","\143","\144","\194"
,"\146","\147","\148","\149","\150","\151","\152","\153","\154","\155","\156","!","\158","\159","\160","\161","\162","\163","\164","\165","\166","\167","\168","\169","\170","\171","\172","\173","\174","\175","\176","\177","\178","\179","\180","\181",
"\182","\183","h","\185","\186","\187","\188","\189","\190","\230","\192","\193","\194","\195","\196","\197","\198","\199","\200","\201","\202","\203","\238","\205","\206","\207","\208","\209","\210","\191","\212","\213","\214","\215","\216","\217","
\218","\219","\215","\221"]
To rerun use: --match "/decode/is (nearly) the inverse of encode/"
haskell/test/FECTest.hs:131:13:
2) encode returns copies of the primary block for all 1 of N encodings
Assertion failed (after 5 tests and 3 shrinks):
Params {required = 3, total = 93}
"\175"
To rerun use: --match "/encode/returns copies of the primary block for all 1 of N encodings/"
Randomized with seed 1246612168
Ah that result was from a new test in the remove-unsafePerformIO
branch, which hopefully is not just buggy... (but it did also pass many, many times here).
Ahh nice, and I can make that test fail much more often by running 10 instances of it in parallel.
The cause seems to be that the C library initialization code is not thread-safe. If you manage to get two threads into it at once (just call fec_new from two threads at the same time) then it will corrupt the various pre-computed tables and your results will be junk.
This doesn't impact the Python bindings because they never release the GIL, so you cannot get two threads into fec_new at the same time (unless you use the Python bindings and have some other code in the same process using the C API directly...).
Specifically, this is not thread-safe:
static int fec_initialized = 0;

static void
init_fec (void) {
    generate_gf();
    _init_mul_table();
    fec_initialized = 1;
}

...

fec_t *
fec_new(unsigned short k, unsigned short n) {
    ...
    if (fec_initialized == 0)
        init_fec ();
    ...
}
This could be solved by:
- Guarding the check-and-initialize logic in fec_new with a mutex. The APIs for using a mutex differ between Windows and POSIX so that's a pain.
- Performing the initialization somewhere other than fec_new, such as at library initialization time (ie, with __attribute__((constructor))). The APIs for library initialization time differ between gcc/clang and Visual Studio so that's a pain.
- Making it the application's responsibility to call the initialization function (from a single thread).
I did this in #91
The tahoe-chk test suite exercises the Haskell bindings for encoding and decoding. Sometimes the test suite fails as a result of getting garbage from the ZFEC encoder. This manifests as CHK shares being miscomputed and is visible as CHK capabilities with the wrong fingerprints.
Apparently sometimes the encoder output contains extra copies of the primary shares instead of generated secondary shares.
The tahoe-chk test suite failures look like this:
The behavior is non-deterministic. It seems that generally either all of the tests beneath chkCap pass or fail together. Threading options seem to make a difference to the result. I don't think I've ever observed a failure with only one test runner thread.