vernamlab / cuFHE

CUDA-accelerated Fully Homomorphic Encryption Library
MIT License

test_api_gpu FAIL every time. Build seems alright (compiles and runs) #3

Closed: ibarrond closed this issue 6 years ago

ibarrond commented 6 years ago

The whole build seems to work for me, but the simple GPU test outputs FAIL every time. No clue why this happens (I guess it shouldn't?)


System setup:

Ubuntu 16.04 LTS
NVIDIA driver 384.81
CUDA Toolkit v9.0
Tesla V100-SXM2 (Compute Capability 7.0)

Output of every test:

------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests:        2560
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests:        2560
0.178565 ms / gate
FAIL
------ Cleaning Data on GPU(s) ------

WeiDaiWD commented 6 years ago

I just noticed the same issue last night. My GPU has only 30 SMs and gives roughly 1 failure every 2000~3000 tests. There might be a read-write race caused by concurrently executed kernels or memory accesses somewhere in the code. I am working on this.
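
To illustrate the kind of race I mean (hypothetical kernels and names, not cuFHE's actual code), two kernels launched on different streams with nothing ordering them can read and write the same buffer concurrently:

```cpp
#include <cuda_runtime.h>

// Hypothetical example of a read-write race between concurrently executed kernels.
__global__ void Producer(int* buf) { buf[threadIdx.x] = threadIdx.x; }
__global__ void Consumer(const int* buf, int* out) { out[threadIdx.x] = buf[threadIdx.x]; }

void RacyLaunch(int* buf, int* out, cudaStream_t s0, cudaStream_t s1) {
  Producer<<<1, 32, 0, s0>>>(buf);
  // Nothing orders s1 after s0 here, so Consumer may read `buf` while
  // Producer is still writing it.
  Consumer<<<1, 32, 0, s1>>>(buf, out);
  // One fix: record an event on s0 and make s1 wait on it:
  //   cudaEvent_t done; cudaEventCreate(&done);
  //   cudaEventRecord(done, s0); cudaStreamWaitEvent(s1, done, 0);
}
```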

Meanwhile, do you have ECC switched on? If it isn't, could you please switch it on and run the test again? I would like to rule out any effect of ECC. I do not have access to such a GPU. Much appreciated.

ibarrond commented 6 years ago

ECC seems to be switched on. When I run nvidia-smi -e 1 I get the message "ECC support is already Enabled for GPU ..."

WeiDaiWD commented 6 years ago

@ibarrond Could you please verify this fix? It works on my device.

ibarrond commented 6 years ago

I recompiled the library and the tests, reran test_api_gpu, and still obtained the same output:

------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests:        2560
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests:        2560
0.172046 ms / gate
FAIL
------ Cleaning Data on GPU(s) ------

WeiDaiWD commented 6 years ago

Hi @ibarrond, I launched 7000+ gates per test and tested multiple times. They all passed. I now added a counter for the failures. Could you please download the new commit and send me the output? Make sure that you executed make clean before make. Thanks a ton.
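
For reference, the counter is just host-side bookkeeping, along these lines (a sketch with placeholder names, not the exact test code):

```cpp
#include <cstdio>

// Sketch only: count mismatches between decrypted gate outputs and the
// expected plaintext NAND results, then print PASS or FAIL: n/total.
void ReportNandFailures(const bool* expected, const bool* decrypted, int num_tests) {
  int num_failures = 0;
  for (int i = 0; i < num_tests; i++)
    if (decrypted[i] != expected[i])
      num_failures++;
  if (num_failures == 0)
    printf("PASS\n");
  else
    printf("FAIL:\t%d/%d\n", num_failures, num_tests);
}
```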

fjarri commented 6 years ago

Perhaps that second synchronize at the end of the NTT/NTTInv functions was necessary after all. Whoever still sees the failures, could you check?

Namely, I mean another __syncthreads() anywhere between the lines in NTT1024Core()

for (int i = 0; i < 8; i++)
    r[i] = ptr[i << 2];

(the last access of shared memory) and the end of the function. Plus the similar change in NTTInv1024Core, and a corresponding additional __syncthreads() in Accumulate().
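
A simplified sketch of where the extra barrier would go (not the real NTT1024Core, just the shape of the problem: the shared buffer gets reused after the function returns, so every thread must finish its last read before any thread lets the caller overwrite it):

```cpp
// Simplified illustration of the proposed fix, not cuFHE's actual code.
__device__ void Core(unsigned int* r, unsigned int* sh) {  // `sh` points into shared memory
  unsigned int* ptr = &sh[threadIdx.x];
  // ... earlier butterfly stages read and write `sh`, each followed by __syncthreads() ...
  for (int i = 0; i < 8; i++)
    r[i] = ptr[i << 2];  // last read of the shared buffer
  __syncthreads();       // proposed barrier: don't let the caller overwrite `sh`
                         // until every thread has finished reading it
}
```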

ibarrond commented 6 years ago

I did just as you said: make clean --> make --> ./bin/test_api_gpu. It still gives errors. It looks like it fails in only 9 of the 2560 cases; here is the output:

------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests:        2560
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests:        2560
0.180913 ms / gate
FAIL:   9/2560
------ Cleaning Data on GPU(s) ------

WeiDaiWD commented 6 years ago

@ibarrond I think that 1/200 error rate may be caused by hidden memory copies before and after each kernel. I now have some changes at this branch. It runs the same number of tests but executes 8 sequential NAND gates in each test. The code now relies on explicit memory copies between host and device rather than the unified memory it used previously. Please test it again on your mighty V100. I wish I had a compute capability 7.0 device so that I wouldn't have to bother you for testing. Much appreciated.
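
Roughly, the change in the data path looks like this (a simplified sketch with my own placeholder names, not the actual cuFHE code):

```cpp
#include <cuda_runtime.h>

__global__ void Gate(int* data) { data[threadIdx.x] ^= 1; }  // stand-in for the real kernel

// Before (simplified): unified memory, migrated between host and device on demand.
void RunManaged(cudaStream_t stream) {
  int* data;
  cudaMallocManaged(&data, 32 * sizeof(int));
  Gate<<<1, 32, 0, stream>>>(data);
  cudaStreamSynchronize(stream);  // host reads `data` directly afterwards
  cudaFree(data);
}

// After (simplified): explicit device buffer plus host<->device copies.
void RunExplicit(cudaStream_t stream) {
  int *h_data, *d_data;
  cudaMallocHost(&h_data, 32 * sizeof(int));  // pinned host memory for async copies
  cudaMalloc(&d_data, 32 * sizeof(int));
  cudaMemcpyAsync(d_data, h_data, 32 * sizeof(int), cudaMemcpyHostToDevice, stream);
  Gate<<<1, 32, 0, stream>>>(d_data);
  cudaMemcpyAsync(h_data, d_data, 32 * sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // wait for the copy before the host touches h_data
  cudaFree(d_data);
  cudaFreeHost(h_data);
}
```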

WeiDaiWD commented 6 years ago

@fjarri I will try that later. But a race on shared memory accesses should be detected on my GPU as well, and with an error rate much lower than 9/2560. I suspect that there is some issue with unified memory and streams on devices with compute capability 7.0+. If my current fix still fails, I will thoroughly check the data path inside the kernels. Thanks again.

ibarrond commented 6 years ago

I tried again with the changes from the fix-old-gpu branch, but I still get errors. Just to be sure, I suppressed the other output and ran the tests 30 times. All of them failed:

FAIL: 11/2560 FAIL: 8/2560 FAIL: 15/2560 FAIL: 7/2560 FAIL: 4/2560 FAIL: 10/2560 FAIL: 8/2560 FAIL: 16/2560 FAIL: 15/2560 FAIL: 17/2560 FAIL: 2/2560 FAIL: 8/2560 FAIL: 14/2560 FAIL: 4/2560 FAIL: 7/2560 FAIL: 7/2560 FAIL: 6/2560 FAIL: 7/2560 FAIL: 6/2560 FAIL: 2/2560 FAIL: 11/2560 FAIL: 2/2560 FAIL: 11/2560 FAIL: 11/2560 FAIL: 5/2560 FAIL: 12/2560 FAIL: 1/2560 FAIL: 9/2560 FAIL: 5/2560 FAIL: 12/2560

WeiDaiWD commented 6 years ago

I have just examined potential race conditions in the CUDA device code again and added another synchronization. Please recompile and test the same branch again. If this still fails, I believe the error is caused by the synchronization between host and device. Thanks a lot.
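
For context, this is the kind of host-device synchronization I mean (illustrative only, with placeholder names):

```cpp
#include <cuda_runtime.h>

__global__ void NandGate(int* out, const int* in0, const int* in1) {  // placeholder kernel
  out[threadIdx.x] = !(in0[threadIdx.x] && in1[threadIdx.x]);
}

// Sketch: if the host inspects the results before the stream has drained,
// the check races with the device work even when the kernels are correct.
void RunAndCheck(int* h_out, int* d_out, const int* d_in0, const int* d_in1,
                 size_t bytes, cudaStream_t stream) {
  NandGate<<<1, 32, 0, stream>>>(d_out, d_in0, d_in1);
  cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // without this, reading h_out below is a race
  // ... host-side verification of h_out goes here ...
}
```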

ibarrond commented 6 years ago

With the latest changes applied I get an even worse error rate:

FAIL: 519/20480 FAIL: 479/20480 FAIL: 484/20480 FAIL: 494/20480 FAIL: 428/20480 FAIL: 502/20480 FAIL: 486/20480 FAIL: 534/20480 FAIL: 470/20480 FAIL: 557/20480 FAIL: 517/20480 FAIL: 498/20480 FAIL: 527/20480 FAIL: 445/20480 FAIL: 447/20480 FAIL: 500/20480 FAIL: 475/20480 FAIL: 648/20480 FAIL: 493/20480 FAIL: 555/20480 FAIL: 521/20480 FAIL: 541/20480 FAIL: 566/20480 FAIL: 494/20480 FAIL: 498/20480 FAIL: 454/20480 FAIL: 392/20480 FAIL: 493/20480 FAIL: 432/20480 FAIL: 573/20480

WeiDaiWD commented 6 years ago

@ibarrond Sorry for the delayed reply. I was pulled away from this project for a while. We tested on a V100 recently and could not reproduce the error. Do you still see it now?

ibarrond commented 6 years ago

Not anymore! Closing the issue.