spacemeshos / post

Spacemesh POST protocol implementation
MIT License

CL_MEM_OBJECT_ALLOCATION_FAILURE on GTX 980 #159

Closed: pigmej closed this issue 1 year ago

pigmej commented 1 year ago
2023-06-16T06:13:53.748+1000    DEBUG    5c0d8.post    Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce GTX 980    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 314}
2023-06-16T06:13:53.748+1000    DEBUG    5c0d8.post    device memory: 4036 MB, max_mem_alloc_size: 1009 MB, max_compute_units: 16, max_wg_size: 1024    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 134}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    preferred_wg_size_multiple: 32, kernel_wg_size: 256    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 168}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Using: global_work_size: 2016, local_work_size: 32    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 181}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Allocating buffer for input: 32 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 185}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Allocating buffer for output: 64512 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 193}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Allocating buffer for lookup: 1056964608 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 201}
Also this, from a GTX 960:
2023-06-16T17:40:17.305+1000    DEBUG    5c0d8.post    Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce GTX 960    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 314}
2023-06-16T17:40:17.305+1000    DEBUG    5c0d8.post    device memory: 1996 MB, max_mem_alloc_size: 499 MB, max_compute_units: 8, max_wg_size: 1024    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 134}
2023-06-19T06:03:59.717+1000    DEBUG    5c0d8.post    Allocating buffer for input: 32 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 185}
2023-06-19T06:03:59.717+1000    DEBUG    5c0d8.post    Allocating buffer for output: 31744 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 193}
2023-06-19T06:03:59.717+1000    DEBUG    5c0d8.post    Allocating buffer for lookup: 520093696 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 201}
2023-06-19T06:03:59.735+1000    DEBUG    5c0d8.post    initializing 1 -> 993 (992 labels, GWS: 992)    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 253}
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(

################################ OPENCL ERROR ############################### 

Error executing function: clEnqueueNDRangeKernel("scrypt")  

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors  

############################################################################# 
)))', ffi/src/initialization.rs:135:10
poszu commented 1 year ago

I cannot reproduce this locally, and I need more logs to debug it (the attached ones are missing the actual OCL error and the stack trace). That said, I looked over the code and didn't find an unwrap() on OCL results, so I think the failing unwrap() comes from inside the ocl crate.

poszu commented 1 year ago

Thanks to @pigmej for the additional logs, I was able to identify where unwrap() is called. The code should definitely not unwrap() (which causes a panic) but handle the error instead; in this case, it should probably bubble the error up to the caller so that the caller can retry.
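As a rough illustration of what "bubble it up and let the caller retry" could look like, here is a minimal, self-contained sketch. It is not the actual code from this repository: `GpuError`, `enqueue_kernel`, `initialize_labels`, and `initialize_with_retry` are hypothetical names, and the kernel launch is simulated with a counter rather than a real OpenCL call. The point is only the shape: `?` in place of `unwrap()`, and a retry decision made by the caller.

```rust
use std::fmt;

/// Hypothetical stand-in for the OpenCL error seen in the log above.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum GpuError {
    /// Mirrors CL_MEM_OBJECT_ALLOCATION_FAILURE (-4).
    MemObjectAllocationFailure,
    /// Any other status code; treated as non-retryable in this sketch.
    Other(i32),
}

impl fmt::Display for GpuError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GpuError::MemObjectAllocationFailure => {
                write!(f, "CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)")
            }
            GpuError::Other(code) => write!(f, "OpenCL error {code}"),
        }
    }
}

impl std::error::Error for GpuError {}

/// Simulated kernel launch (assumption: stands in for the real enqueue call).
/// Fails with an allocation error while `failures_left` is non-zero.
fn enqueue_kernel(failures_left: &mut u32) -> Result<(), GpuError> {
    if *failures_left > 0 {
        *failures_left -= 1;
        return Err(GpuError::MemObjectAllocationFailure);
    }
    Ok(())
}

/// Returns the error to the caller with `?` instead of panicking via `unwrap()`.
fn initialize_labels(failures_left: &mut u32) -> Result<u64, GpuError> {
    enqueue_kernel(failures_left)?;
    Ok(992) // label count borrowed from the GTX 960 log, for illustration only
}

/// The caller decides what to do: retry allocation failures up to
/// `max_attempts` times and return any other result immediately.
fn initialize_with_retry(failures_left: &mut u32, max_attempts: u32) -> Result<u64, GpuError> {
    let mut attempt = 1;
    loop {
        match initialize_labels(failures_left) {
            Err(GpuError::MemObjectAllocationFailure) if attempt < max_attempts => attempt += 1,
            other => return other,
        }
    }
}
```

With this shape, a call such as `initialize_with_retry(&mut failures, 3)` returns an `Err` to the caller once the attempts are exhausted, where the original code panicked. Whether retrying is the right policy for this particular error (as opposed to, say, reducing the buffer size) is a separate question that this sketch does not answer.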