tromp / cuckoo

a memory-bound graph-theoretic proof-of-work system

cuckatoo/cuda31 doesn't work with Nvidia Tesla P100 and V100 GPUs #74

Closed: mably closed this issue 5 years ago

mably commented 5 years ago

Here are the logs we get when running the cuda31 utility program with -E 2 -r 100 parameters:

Tesla P100: Tesla V100:

tromp commented 5 years ago

What is the status of the latest C31 plugins on your cards?

mably commented 5 years ago

It's OK now; I'm closing the issue.

BravoIndia commented 5 years ago

@tromp Sorry to reopen the issue, I just updated my C31 GPS for the V100 here:

Does 0.18 GPS on C31 seem right for the Tesla V100? That is 1/10th of the RTX 2080, even though the two cards get the same GPS on C29.

Is it possible my plugins are configured incorrectly? I couldn't get any C31 plugin to work except the ocl and lean-cuda ones (the rtx and gtx plugins errored).

tromp commented 5 years ago

0.18 GPS on C31 seems reasonable for cuckatoo_lean_cuda_31.

But why couldn't you get cuckatoo_mean_cuda_gtx_31 to work? What was the error?

BravoIndia commented 5 years ago

Thank you for following up @tromp!

The error when running the cuckatoo_mean_cuda_gtx_31 plugin is:

ERRO Plugin cuckatoo_mean_cuda_gtx_31 has errored, device: Tesla V100-SXM2-16GB. Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429

Here's the rest of the grin-miner.log:

Jan 09 10:39:49.313 DEBG sending request: {"id":"0","jsonrpc":"2.0","method":"getjobtemplate","params":null}
Jan 09 10:39:50.001 DEBG Received message: {"id":"0","jsonrpc":"2.0","method":"login","result":"ok"}

Jan 09 10:39:50.001 DEBG Received response with id: 0
Jan 09 10:39:50.001 DEBG Received message: {"id":"0","jsonrpc":"2.0","method":"getjobtemplate","result":{"difficulty":1,"height":16654,"job_id":1,"pre_pow":"0001000000000000410e000000005c35cf70006e16d49719bc759cfd53c46fb7e796700308048712ed8fabc374d64ec39552c3ddcf62df7b0f6f5c81d84df278cb9498b0250e54fb1ebce5f3ecb281b95a874e9ed224a7ac4a96520a0e6a20548e37b4b63604d23cc128e75371a3f0f5e8cacceac3e537bc45674fe19969ace0b2c8f7e807987da18393b5e3d6cd6dceff7b2b07553f087aef10ccae7b5515e48ca7a6c32b814daa4a4dfffe30830df50662ac2486491e1629e70feba2aa17f73573f4b1b3f7f4c85931c7620cfacf16386f0000000000017dbf000000000001008800000000620174db0000049a"}}

Jan 09 10:39:50.001 DEBG Received response with id: 0
Jan 09 10:39:50.001 INFO Got a job at height 16654 and difficulty 1
Jan 09 10:39:50.043 DEBG Miner received message: ReceivedJob(16654, 1, 1, "0001000000000000410e000000005c35cf70006e16d49719bc759cfd53c46fb7e796700308048712ed8fabc374d64ec39552c3ddcf62df7b0f6f5c81d84df278cb9498b0250e54fb1ebce5f3ecb281b95a874e9ed224a7ac4a96520a0e6a20548e37b4b63604d23cc128e75371a3f0f5e8cacceac3e537bc45674fe19969ace0b2c8f7e807987da18393b5e3d6cd6dceff7b2b07553f087aef10ccae7b5515e48ca7a6c32b814daa4a4dfffe30830df50662ac2486491e1629e70feba2aa17f73573f4b1b3f7f4c85931c7620cfacf16386f0000000000017dbf000000000001008800000000620174db0000049a")
Jan 09 10:39:50.043 DEBG Pause message sent
Jan 09 10:39:50.043 DEBG Resume message sent
Jan 09 10:39:50.043 DEBG solver_thread - solver_loop_rx got msg: Pause
Jan 09 10:39:50.044 DEBG solver_thread - solver_loop_rx got msg: Resume
Jan 09 10:39:50.044 ERRO Plugin cuckatoo_mean_cuda_gtx_31 has errored, device: Tesla V100-SXM2-16GB. Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:39:51.045 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:39:51.045 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:39:54.048 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:39:54.048 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:39:57.052 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:39:57.052 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:40:00.000 DEBG Received message: {"id":"Stratum","jsonrpc":"2.0","method":"job","params":{"difficulty":1,"height":16654,"job_id":2,"pre_pow":"0001000000000000410e000000005c35cf7f006e16d49719bc759cfd53c46fb7e796700308048712ed8fabc374d64ec39552c3ddcf62df7b0f6f5c81d84df278cb9498b0250e54fb1ebce5f3ecb281b95a87cee302884f7a6b397320439ea0ed3eaa2904dfbca41e2ae3ef855a227b764a8dedcbd08f1076c3cb82c1270629019c7fad17aeda3acdb6eb146b3e4c10b6bf3b95e1c2e479f4ff8a3f2a25a7f2bdfbc7b286dbcdf2b3b75965152e79ddf0762db03f47977eeaf1c80d992cd316bc04b6e56943abad4d649c725a5b715faea25f0000000000017dc2000000000001008900000000620174db0000049a"}}

Jan 09 10:40:00.000 DEBG Received request type: job
Jan 09 10:40:00.000 INFO Got a new job: JobTemplate { height: 16654, job_id: 2, difficulty: 1, pre_pow: "0001000000000000410e000000005c35cf7f006e16d49719bc759cfd53c46fb7e796700308048712ed8fabc374d64ec39552c3ddcf62df7b0f6f5c81d84df278cb9498b0250e54fb1ebce5f3ecb281b95a87cee302884f7a6b397320439ea0ed3eaa2904dfbca41e2ae3ef855a227b764a8dedcbd08f1076c3cb82c1270629019c7fad17aeda3acdb6eb146b3e4c10b6bf3b95e1c2e479f4ff8a3f2a25a7f2bdfbc7b286dbcdf2b3b75965152e79ddf0762db03f47977eeaf1c80d992cd316bc04b6e56943abad4d649c725a5b715faea25f0000000000017dc2000000000001008900000000620174db0000049a" }
Jan 09 10:40:00.056 DEBG Miner received message: ReceivedJob(16654, 2, 1, "0001000000000000410e000000005c35cf7f006e16d49719bc759cfd53c46fb7e796700308048712ed8fabc374d64ec39552c3ddcf62df7b0f6f5c81d84df278cb9498b0250e54fb1ebce5f3ecb281b95a87cee302884f7a6b397320439ea0ed3eaa2904dfbca41e2ae3ef855a227b764a8dedcbd08f1076c3cb82c1270629019c7fad17aeda3acdb6eb146b3e4c10b6bf3b95e1c2e479f4ff8a3f2a25a7f2bdfbc7b286dbcdf2b3b75965152e79ddf0762db03f47977eeaf1c80d992cd316bc04b6e56943abad4d649c725a5b715faea25f0000000000017dc2000000000001008900000000620174db0000049a")
Jan 09 10:40:00.056 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:40:00.056 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:40:03.059 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:40:03.059 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:40:06.063 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:40:06.063 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:40:09.067 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:40:09.067 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:40:12.070 DEBG Mining: Plugin 0 - Device 0 (Tesla V100-SXM2-16GB) Has ERRORED! Reason: Device 0 GPUassert: out of memory /home/travis/build/mimblewimble/grin-miner/cuckoo-miner/src/cuckoo_sys/plugins/cuckoo/src/cuckatoo/ 429
Jan 09 10:40:12.070 INFO Mining: Cuck(at)oo at 0 gps (graphs per second)
Jan 09 10:40:14.025 DEBG Client received message: Shutdown
Jan 09 10:40:14.025 DEBG Shutting down client controller
Jan 09 10:40:14.073 DEBG Miner received message: Shutdown
Jan 09 10:40:14.073 DEBG Stopping jobs and Shutting down mining controller
Jan 09 10:40:14.073 DEBG Stop message sent
Jan 09 10:40:14.189 DEBG Solver stopped: 0
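Most of the log above repeats one error and two job templates. As a rough aid for reading logs like this, here is a minimal Python sketch that collapses the repeats. The line layout and both regular expressions are assumptions inferred from this excerpt only, not a documented grin-miner format, and `summarize` is a hypothetical helper name.

```python
import json
import re

# Assumed layout, inferred from the excerpt: "<Mon DD HH:MM:SS.mmm> <LEVEL> <message>"
LINE_RE = re.compile(r"^(\w{3} \d{2} [\d:.]+) (\w{4}) (.*)$")
# Assumed shape of the plugin error message, also inferred from the excerpt.
ERROR_RE = re.compile(r"Plugin (\S+) has errored, device: (.+?)\. Reason: (.*)$")
JSON_PREFIX = "Received message: "


def summarize(lines):
    """Return (distinct plugin errors, job templates) found in log lines."""
    errors = set()  # {(plugin, device, reason)}
    jobs = []       # [(job_id, height, difficulty)] in order of arrival
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip blank or unrecognized lines
        _, level, message = m.groups()
        if level == "ERRO":
            e = ERROR_RE.search(message)
            if e:
                errors.add(e.groups())
        elif message.startswith(JSON_PREFIX):
            payload = json.loads(message[len(JSON_PREFIX):])
            # Job data sits under "result" (reply) or "params" (pushed job).
            body = payload.get("result") or payload.get("params")
            if isinstance(body, dict) and "job_id" in body:
                jobs.append((body["job_id"], body["height"], body["difficulty"]))
    return errors, jobs
```

Calling `summarize(open("grin-miner.log"))` on a log like the one above should leave a single out-of-memory error for the gtx_31 plugin and the two jobs at height 16654.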

Not sure why it runs out of memory. nvidia-smi shows 16 GB, none of which is in use after the miner is shut down.

| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   31C    P0    36W / 300W |      0MiB / 16130MiB |      0%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No running processes found                                                 |
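The table above is an idle reading, so it shows what the card has available, not what the plugin tried to allocate while it was running. A minimal Python sketch for pulling the used and total figures out of such table text follows; the `used MiB / total MiB` pattern is an assumption based on this excerpt, and `gpu_memory` is a hypothetical helper name.

```python
import re

# Assumed cell format, taken from the table above: "<used>MiB / <total>MiB"
MEM_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def gpu_memory(smi_text):
    """Return a list of (used, total, free) MiB tuples, one per memory cell found."""
    rows = []
    for used, total in MEM_RE.findall(smi_text):
        used, total = int(used), int(total)
        rows.append((used, total, total - used))
    return rows


row = "| N/A   31C    P0    36W / 300W |      0MiB / 16130MiB |      0%      Default |"
print(gpu_memory(row))  # [(0, 16130, 16130)]
```

Sampling this while the miner is starting up, rather than after shutdown, would show how close the plugin gets to the 16130 MiB limit before the allocation fails.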

Please let me know if you need me to run anything else, and thanks again.

tromp commented 5 years ago

My repo has a little utility in src/cuckatoo. Can you run make cumal and then run it on your device to see the maximum memory you can allocate?

BravoIndia commented 5 years ago

I'm on it.

Just to clarify (sorry, inexperienced):

git clone
cd cuckoo/src/cuckatoo
make cumal

And then run ./cumal?

tromp commented 5 years ago

> And then run ./cumal?

That's what you normally do with executables :-)

BravoIndia commented 5 years ago

Roger :) Is this right then? I think it's likely I'm screwing something up:

ubuntu@ip-172-31-4-191:~$ git clone
Cloning into 'cuckoo'...
remote: Enumerating objects: 120, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (66/66), done.
remote: Total 4083 (delta 86), reused 83 (delta 54), pack-reused 3963
Receiving objects: 100% (4083/4083), 12.79 MiB | 32.81 MiB/s, done.
Resolving deltas: 100% (2846/2846), done.
ubuntu@ip-172-31-4-191:~$ cd cuckoo/src/cuckatoo
ubuntu@ip-172-31-4-191:~/cuckoo/src/cuckatoo$ make cumal
nvcc -std=c++11  -o cumal
In function ‘int main(int, char**)’:
warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
     if (ret) printf("cudaMalloc(%d MB) returned %d\n", bufferMB, ret);
                                                                   ^
warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat=]
   printf("cudaMalloc(%d MB) succeeded %d\n", bufferMB);
                                                    ^
warning: format ‘%d’ expects a matching ‘int’ argument [-Wformat=]
ubuntu@ip-172-31-4-191:~/cuckoo/src/cuckatoo$ ./cumal
cumal: int main(int, char**): Assertion `device < nDevices' failed.
Aborted (core dumped)

BravoIndia commented 5 years ago

Okay! The issue was caused by using the cuckatoo_mean_cuda_gtx_31 plugin with expand = 2 uncommented. Thank you sincerely @tromp for spending the past few hours looking at this.

You can also edit the NEPS_A and NEPS_B values to 133 and 88 respectively in cuckoo-miner/src/cuckoo_sys/plugins/CMakeLists.txt for a significant GPS increase. This eliminates the slight loss in solutions needed to fit in 11 GB, as follows:

build_cuda_target("${AT_MEAN_CUDA_SRC}" cuckatoo_mean_cuda_gtx_31 "-DNEPS_A=133 -DNEPS_B=88 -DPART_BITS=1 -DEDGEBITS=31")
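Before rebuilding, it can help to confirm which values a build line actually sets. The following minimal Python sketch splits a flag string like the one above into name/value pairs. `parse_defines` is a hypothetical helper, and it assumes every definition has the plain `-DNAME=VALUE` form.

```python
import shlex


def parse_defines(flags):
    """Map -DNAME=VALUE compiler flags to a {NAME: VALUE} dict of strings."""
    defines = {}
    for token in shlex.split(flags):
        if token.startswith("-D"):
            name, _, value = token[2:].partition("=")
            defines[name] = value
    return defines


flags = "-DNEPS_A=133 -DNEPS_B=88 -DPART_BITS=1 -DEDGEBITS=31"
print(parse_defines(flags))
# {'NEPS_A': '133', 'NEPS_B': '88', 'PART_BITS': '1', 'EDGEBITS': '31'}
```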

See here: