tahoe-lafs / zfec

zfec -- an efficient, portable erasure coding tool
Other
373 stars 44 forks source link

Using multithreading, sometimes the results are wrong #89

Closed exarkun closed 9 months ago

exarkun commented 1 year ago

The tahoe-chk test suite exercises the Haskell bindings for encoding and decoding. Sometimes the test suite fails as a result of getting garbage from the ZFEC encoder. This manifests as CHK shares being miscomputed and is visible as CHK capabilities with the wrong fingerprints.

Apparently sometimes the encoder output contains extra copies of the primary shares instead of generated secondary shares.

The tahoe-chk test suite failures look like this:

❯ while cabal run tests -- --num-threads 4 --pattern 'chkCap'; do :; done                                              
Up to date                                                                                                                                                                        
CHK                                                                                                                                                                               
  Vectors                                                                                                                                                                         
    chkCap                                                                                                                                                                        
      URI:CHK:rl4bzmselnuezmapjlzssnqg2e:p7kvin2fnemochuxsmh6ot75qpbfhrscbxi5i74bhqdhzcy6i5eq:1:3:56:         FAIL (0.02s)
        test/SpecCHK.hs:367:                                                                                                                                                      
        yes                                                                                                                                                                       
        expected: Right URI:CHK:rl4b...:p7kv...:1:3:56                                                                                                                            
         but got: Right URI:CHK:rl4b...:kb2z...:1:3:56                                                                                                                            
        Use -p '/chkCap/&&/URI:CHK:rl4bzmselnuezmapjlzssnqg2e:p7kvin2fnemochuxsmh6ot75qpbfhrscbxi5i74bhqdhzcy6i5eq:1:3:56/' to rerun this test only.        
...

The behavior is non-deterministic. It seems that generally either all of the tests beneath chkCap pass or fail together. Threading options seem to make a difference to the result. I don't think I've ever observed a failure with only one test runner thread.

exarkun commented 9 months ago

Over the course of many trials, I managed to observe a similar problem exactly once with ZFEC's own test suite:

Test suite tests: RUNNING...                                                                                                                                                                                                                              

secureCombine                                                                                                                                                                                                                                             
  is the inverse of secureDivide n                                                                                                                                                                                                                        
    +++ OK, passed 1 test.                                                               
deFEC                                     
  is the inverse of enFEC                                                                                                    
    +++ OK, passed 2000 tests.
decode                                                                                                                                                                            
  is (nearly) the inverse of encode FAILED [1]                                                                               
  works with total=255                                                                                                       
    +++ OK, passed 100 tests.                                                                                                
  works with required=255                                                                
    +++ OK, passed 100 tests.                                                                                                
encode                           
  returns copies of the primary block for all 1 of N encodings FAILED [2]                

Failures:                                                                                                                    

  haskell/test/FECTest.hs:78:5:                                                                                              
  1) decode is (nearly) the inverse of encode                                                                                
       Falsifiable (after 3 tests and 2 shrinks):                                                                            
         Params {required = 222, total = 253}                                                                                
         1                                                                               
         0                                                                               
       expected: ["\NUL","\SOH","\STX","\ETX","\EOT","\ENQ","\ACK","\a","\b","\t","\n","\v","\f","\r","\SO","\SI","\DLE","\DC1","\DC2","\DC3","\DC4","\NAK","\SYN","\ETB","\CAN","\EM","\SUB","\ESC","\FS","\GS","\RS","\US"," ","!","\"","#","$","%","&",
"'","(",")","*","+",",","-",".","/","0","1","2","3","4","5","6","7","8","9",":",";","<","=",">","?","@","A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z","[","\\","]","^","_","`","a","b","c","d","
e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z","{","|","}","~","\DEL","\128","\129","\130","\131","\132","\133","\134","\135","\136","\137","\138","\139","\140","\141","\142","\143","\144","\145","\146","\147",
"\148","\149","\150","\151","\152","\153","\154","\155","\156","\157","\158","\159","\160","\161","\162","\163","\164","\165","\166","\167","\168","\169","\170","\171","\172","\173","\174","\175","\176","\177","\178","\179","\180","\181","\182","\183
","\184","\185","\186","\187","\188","\189","\190","\191","\192","\193","\194","\195","\196","\197","\198","\199","\200","\201","\202","\203","\204","\205","\206","\207","\208","\209","\210","\211","\212","\213","\214","\215","\216","\217","\218","\2
19","\220","\221"]                                                                                                                                                                
        but got: ["\236","\SOH","\STX","\ETX","\EOT","\ENQ","\ACK","\a","\b","\t","\n","\216","\f","\r","\SO","\SI","\DLE","s","\DC2","\DC3","\DC4","\NAK","\214","\ETB","\CAN","\EM","\SUB","\ESC","\FS","\GS","\RS","\248"," ","\195","\"","\135","$","%
","&","'","(","?","*","e",",","-",".","/","0","1","2","3","4","5","6","7","8","9",":",";","<","=",">","?","@","A","B","C","D","E","F","G","H","I","J","K","\DC3","M","N","\173","\244","Q","R","S","T","U","V","W","X","Y","Z","]","\\","]","^","_","`","a
","b","c","d","8","f","g","A","i","j","k","l","m","n","o","\252","q","r","s","t","u","D","w","x","y","z","{","|","}","~","\DEL","\128","\129","Y","\131","\132","\133","\134","\135","\136","\137","\138","\139","\140","\141","\142","\143","\144","\194"
,"\146","\147","\148","\149","\150","\151","\152","\153","\154","\155","\156","!","\158","\159","\160","\161","\162","\163","\164","\165","\166","\167","\168","\169","\170","\171","\172","\173","\174","\175","\176","\177","\178","\179","\180","\181",
"\182","\183","h","\185","\186","\187","\188","\189","\190","\230","\192","\193","\194","\195","\196","\197","\198","\199","\200","\201","\202","\203","\238","\205","\206","\207","\208","\209","\210","\191","\212","\213","\214","\215","\216","\217","
\218","\219","\215","\221"]                                                                                                                                                       

  To rerun use: --match "/decode/is (nearly) the inverse of encode/"                                                                                                              

  haskell/test/FECTest.hs:131:13:                                                                                                                                                 
  2) encode returns copies of the primary block for all 1 of N encodings                                                                                                          
       Assertion failed (after 5 tests and 3 shrinks):                                                                                                                            
         Params {required = 3, total = 93}                                                                                   
         "\175"                                                                                                              

  To rerun use: --match "/encode/returns copies of the primary block for all 1 of N encodings/"                                                                                                                                                           

Randomized with seed 1246612168                                                          
exarkun commented 9 months ago

Ah that result was from a new test in the remove-unsafePerformIO branch, which hopefully is not just buggy... (but it did also pass many, many times here).

exarkun commented 9 months ago

Ahh nice, and I can make that test fail much more often by running 10 instances of it in parallel.

exarkun commented 9 months ago

The cause seems to be that the C library initialization code is not thread-safe. If you manage to get two threads into it at once (just call fec_new from two threads at the same time) then it will corrupt the various pre-computed tables and your results will be junk.

This doesn't impact the Python bindings because they never release the GIL so you cannot get two threads into fec_new at the same time (unless you use the Python bindings and have some other code in the same process using the C API directly ...)

exarkun commented 9 months ago

Specifically, this is not thread-safe:

static int fec_initialized = 0;
static void
init_fec (void) {
    generate_gf();
    _init_mul_table();
    fec_initialized = 1;
}

...

fec_t *
fec_new(unsigned short k, unsigned short n) {
...
    if (fec_initialized == 0)
        init_fec ();
...
}
exarkun commented 9 months ago

This could be solved by:

The APIs for using a mutex differ between Windows and POSIX so that's a pain. The APIs for library initialization time differ between gcc/clang and Visual Studio so that's a pain.

exarkun commented 9 months ago

Make it the application's responsibility to call the initialization function (from a single thread)

I did this in #91