Closed: ibarrond closed this issue 6 years ago.
I just noticed the same issue last night. My GPU has only 30 SMs and gives roughly 1 failure every 2000-3000 tests. There might be a read-write race caused by concurrently executed kernels or memory accesses somewhere in the code. I am working on this.
Meanwhile, do you have ECC switched on? If it isn't, could you please switch it on and run the test again? I would like to rule out the effect of ECC, and I do not have access to such a GPU myself. Much appreciated.
ECC seems to be switched on. When I run nvidia-smi -e 1, I get the message "ECC support is already Enabled for GPU ...".
@ibarrond Could you please verify this fix? It works on my device.
I recompiled the library and the tests, reran test_api_gpu, and still obtained the same output:
------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests: 2560
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests: 2560
0.172046 ms / gate
FAIL
------ Cleaning Data on GPU(s) ------
Hi @ibarrond, I launched 7000+ gates per test and tested multiple times. They all passed. I have now added a counter for the failures. Could you please download the new commit and send me the output? Make sure that you executed make clean before make. Thanks a ton.
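For reference, the failure counter is along the lines of the following minimal sketch; the loop body and names here are placeholders for illustration, not the exact test code in the repository:

    #include <cstdio>

    // Count every mismatching NAND result instead of stopping at the first one,
    // then report "FAIL: <failures>/<total>" as in the outputs below.
    int RunNandTests(int num_tests)
    {
        int failures = 0;
        for (int i = 0; i < num_tests; i++) {
            bool in0 = (i & 1) != 0;
            bool in1 = (i & 2) != 0;
            bool expected = !(in0 && in1);
            // Placeholder for the real steps: encrypt the inputs, run the NAND
            // gate on the GPU, copy back and decrypt the result.
            bool actual = expected;
            if (actual != expected) failures++;
        }
        if (failures == 0) printf("PASS\n");
        else               printf("FAIL: %d/%d\n", failures, num_tests);
        return failures;
    }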
Perhaps that second synchronization at the end of the NTT/NTTInv functions was necessary after all. Whoever still sees the failures, could you check?
Namely, I mean another __syncthreads() anywhere in NTT1024Core() between the lines

    for (int i = 0; i < 8; i ++)
        r[i] = ptr[i << 2];

(the last access of the shared memory) and the end of the function, plus the similar change in NTTInv1024Core, and a corresponding additional __syncthreads() in Accumulate().
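To make the placement concrete, a simplified sketch follows; the function signature and shared-memory layout here are assumptions, only the position of the extra barrier matters:

    // Simplified sketch, not the actual NTT1024Core(): the proposed extra
    // __syncthreads() sits after the last read of shared memory, so that no
    // thread can start overwriting the shared buffer (e.g. on the next call
    // from Accumulate()) while slower threads are still reading it.
    __device__ void NTT1024CoreSketch(unsigned int* r,
                                      unsigned int* shared_buf, // points into __shared__ memory
                                      int lane)
    {
        // ... preceding butterfly stages that write shared_buf ...
        unsigned int* ptr = &shared_buf[lane];

        // the last access of the shared memory in this function
        for (int i = 0; i < 8; i++)
            r[i] = ptr[i << 2];

        __syncthreads(); // proposed additional barrier before returning
    }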
I did just as you said: make clean --> make --> ./bin/test_api_gpu. It still gives an error, but it looks like it fails in only 9 of the 2560 cases. Here is the output:
------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests: 2560
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests: 2560
0.180913 ms / gate
FAIL: 9/2560
------ Cleaning Data on GPU(s) ------
@ibarrond I think that 1/200 error rate may be caused by hidden memory copies before and after each kernel. I now have some changes at this branch. It runs the same number of tests but executes 8 sequential NAND gates in each test, and the code now relies on explicit memory copies between host and device rather than the unified memory used previously. Please test it again on your mighty V100. I wish I had a compute capability 7.0 device so that I didn't have to bother you for testing. Much appreciated.
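Roughly, the difference between the two data paths is captured by the following sketch; the buffer names and the dummy kernel are placeholders, and error checking is omitted:

    #include <cuda_runtime.h>

    __global__ void DummyGate(int* data) { /* stand-in for the real gate kernel */ }

    // Unified memory: pages migrate between host and device implicitly around the launch.
    void UnifiedMemoryPath(size_t n, cudaStream_t stream)
    {
        int* buf = nullptr;
        cudaMallocManaged(&buf, n * sizeof(int));
        DummyGate<<<1, 256, 0, stream>>>(buf);
        cudaStreamSynchronize(stream); // the host may touch buf only after this
        cudaFree(buf);
    }

    // Explicit copies: every host/device transfer is visible and ordered on the stream.
    void ExplicitCopyPath(const int* h_in, int* h_out, size_t n, cudaStream_t stream)
    {
        int* d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(int));
        cudaMemcpyAsync(d_buf, h_in, n * sizeof(int), cudaMemcpyHostToDevice, stream);
        DummyGate<<<1, 256, 0, stream>>>(d_buf);
        cudaMemcpyAsync(h_out, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream); // results are valid on the host only after this
        cudaFree(d_buf);
    }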
@fjarri I will try that later. But a race between shared memory accesses should show up on my GPU as well, and my error rate is much lower than 9/2560. I suspect that there is some issue with unified memory and streams on devices with compute capability 7.0+. If my current fix still fails, I will thoroughly check the data path inside the kernels. Thanks again.
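As an aside, the compute capability and SM count in question can be checked at runtime with a few standard CUDA runtime calls (this snippet is not part of the library):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print each device's name, compute capability and SM count.
    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("GPU %d: %s, compute capability %d.%d, %d SMs\n",
                   i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        }
        return 0;
    }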
I tried again with the changes from the fix-old-gpu branch, but I still get errors. Just to be sure, I suppressed the other output and ran the tests 30 times. All of them failed:
FAIL: 11/2560 FAIL: 8/2560 FAIL: 15/2560 FAIL: 7/2560 FAIL: 4/2560 FAIL: 10/2560 FAIL: 8/2560 FAIL: 16/2560 FAIL: 15/2560 FAIL: 17/2560 FAIL: 2/2560 FAIL: 8/2560 FAIL: 14/2560 FAIL: 4/2560 FAIL: 7/2560 FAIL: 7/2560 FAIL: 6/2560 FAIL: 7/2560 FAIL: 6/2560 FAIL: 2/2560 FAIL: 11/2560 FAIL: 2/2560 FAIL: 11/2560 FAIL: 11/2560 FAIL: 5/2560 FAIL: 12/2560 FAIL: 1/2560 FAIL: 9/2560 FAIL: 5/2560 FAIL: 12/2560
I have just examined potential race conditions in the CUDA device code again and added another synchronization. Please recompile and test the same branch again. If this still fails, I believe the error is caused by the synchronization between host and device. Thanks a lot.
With the latest changes applied I get an even worse error rate: FAIL: 519/20480 FAIL: 479/20480 FAIL: 484/20480 FAIL: 494/20480 FAIL: 428/20480 FAIL: 502/20480 FAIL: 486/20480 FAIL: 534/20480 FAIL: 470/20480 FAIL: 557/20480 FAIL: 517/20480 FAIL: 498/20480 FAIL: 527/20480 FAIL: 445/20480 FAIL: 447/20480 FAIL: 500/20480 FAIL: 475/20480 FAIL: 648/20480 FAIL: 493/20480 FAIL: 555/20480 FAIL: 521/20480 FAIL: 541/20480 FAIL: 566/20480 FAIL: 494/20480 FAIL: 498/20480 FAIL: 454/20480 FAIL: 392/20480 FAIL: 493/20480 FAIL: 432/20480 FAIL: 573/20480
@ibarrond Sorry for the delayed reply; I was pulled away from this project for a while. We tested on a V100 recently and could not reproduce the error. Do you still see it now?
Not anymore! Closing the issue.
The whole build seems to work for me, but the simple GPU test prints FAIL every time. No clue why this happens (I guess it shouldn't?).
System setup:
Output of every test:
------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests: 2560
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests: 2560
0.178565 ms / gate
FAIL
------ Cleaning Data on GPU(s) ------