vernamlab / cuFHE

CUDA-accelerated Fully Homomorphic Encryption Library
MIT License

test_api_gpu fails every time with CUDA_EXCEPTION_15 #2

Closed: sdorminey closed this issue 6 years ago

sdorminey commented 6 years ago

test_api_gpu dies for me every time with Invalid Managed Memory Access while evaluating the NAND gate (before bootstrapping occurs). It looks like this code is running on the host thread, but the underlying data (in unified memory) is currently mapped to the GPU, causing the error.

Would love a workaround, since this project looks really neat! Let me know if you need more info.


System setup:

Output:

------ Key Generation ------
------ Test Encryption/Decryption ------
Number of tests: 96
PASS
------ Initilizating Data on GPU(s) ------
------ Test NAND Gate ------
Number of tests: 96
(crashes here)

Stack trace:

Thread [1] 14501 [core: 2] (Suspended : Signal : CUDA_EXCEPTION_15: Invalid Managed Memory Access)
cufhe::Nand() at cufhe_gates_gpu.cu:50 0x7ffff7b18223
main() at test_api_gpu.cu:116 0x4048c1

WeiDaiWD commented 6 years ago

Thank you very much for your report. I think I have found the reason for this crash.

We launch several NAND gates concurrently on a single device, so while one NAND gate is running a kernel that accesses some unified memory, another NAND gate may access other unified memory from the host. This is not allowed on devices with compute capability < 6.x (see the CUDA documentation on unified memory coherency and concurrency).
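
For reference, this is the pattern being described, sketched outside of cuFHE (all names below are illustrative): on a device with compute capability < 6.x, once any kernel is in flight, the host may not touch any managed allocation, even one the kernel never uses, and cuda-gdb reports such an access as CUDA_EXCEPTION_15, matching the trace above.

```cpp
// Minimal sketch (not cuFHE code) of the forbidden access pattern on
// pre-Pascal GPUs: the host touches one managed buffer while a kernel
// that uses a *different* managed buffer may still be running.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void TouchA(int* a) { a[0] = 1; }

int main() {
  int *a = nullptr, *b = nullptr;
  cudaMallocManaged(&a, sizeof(int));
  cudaMallocManaged(&b, sizeof(int));

  TouchA<<<1, 1>>>(a);  // kernel may still be in flight after this returns

  // On compute capability < 6.x this host write can fault even though the
  // kernel never uses b (cuda-gdb: CUDA_EXCEPTION_15, Invalid Managed
  // Memory Access). On compute capability >= 6.x it is legal.
  b[0] = 2;

  cudaDeviceSynchronize();
  printf("a = %d, b = %d\n", a[0], b[0]);
  cudaFree(a);
  cudaFree(b);
  return 0;
}
```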

I am working on a workaround that allocates both host and device memory and transfers data when needed. Hopefully I will post it tomorrow, once the new fix passes on a Titan X (compute capability 5.2).
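
Conceptually, the workaround replaces managed allocations with separate host and device buffers plus explicit copies. A rough sketch of that idea (hypothetical helper, not the actual cuFHE API):

```cpp
// Sketch of the workaround idea (illustrative names only): keep explicit
// host and device buffers and copy data per gate instead of relying on
// unified memory.
#include <cuda_runtime.h>

void RunGateExplicit(const char* h_in, char* h_out, size_t bytes,
                     cudaStream_t stream) {
  char *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_out, bytes);

  cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
  // ... launch the gate kernel reading d_in and writing d_out on `stream` ...
  cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  cudaFree(d_in);
  cudaFree(d_out);
}
```

For the copies to overlap with kernels, the host buffers would additionally need to be pinned (e.g. allocated with cudaMallocHost); with pageable host memory the async copies fall back to synchronous behavior.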

If that still does not work on your device, I will need more info from your side. Thanks again.

WeiDaiWD commented 6 years ago

OK, this is not a perfect fix. Please try to compile/run the code in the new branch. This new fix does not use unified memory. I see no crash on a Titan X. Let me know if it still does not work on your system.

Ironically, I now see a new issue, which is why I didn't merge it into master: after the fix, fewer than 0.5% of the gates give a wrong result. I do not have much of a clue here. It could be a problem with the use of pinned memory; I will have to test with regular pageable memory instead and see. If you have any idea about this, please shine some light here. I would very much appreciate that.

WeiDaiWD commented 6 years ago

I have temporarily created another branch for pre-Pascal GPUs. Performance is much slower since I had to disable concurrent kernel launches for now. The results are correct and safe to play with. I am working on a proper fix now.
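
In effect, the fallback serializes the gates: each gate's kernel is launched and the device is synchronized before the next gate is issued, so no kernel is ever in flight while other work touches memory. A sketch of that pattern (illustrative names, not the branch's actual code):

```cpp
// Sketch of the no-concurrency fallback (illustrative only): run gates one
// at a time and wait for each to finish before issuing the next.
#include <functional>
#include <vector>
#include <cuda_runtime.h>

void EvalGatesSerially(const std::vector<std::function<void()>>& gate_launches) {
  for (const auto& launch : gate_launches) {
    launch();                 // enqueue this gate's kernel(s)
    cudaDeviceSynchronize();  // wait so nothing overlaps with the next gate
  }
}
```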

sdorminey commented 6 years ago

Awesome! test_api_gpu now succeeds, and I get ~22 ms per gate when testing with both the hot-fix and older_than_6.0_no_concurrency branches. Thank you for the speedy workaround!

I'm going to play with the Python bindings next; I'll let you know if I run into any issues.