pigmej opened this issue 9 months ago
7900xtx reports:
2024-02-13T13:37:42.980+0100 INFO selecting 0 provider from 2 available {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 321}
2024-02-13T13:37:42.980+0100 INFO Using provider: [GPU] AMD Accelerated Parallel Processing/gfx1100 {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 334}
2024-02-13T13:37:42.980+0100 INFO device memory: 24560 MB, max_mem_alloc_size: 20876 MB, max_compute_units: 48, max_wg_size: 256 {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 152}
2024-02-13T13:37:43.763+0100 INFO preferred_wg_size_multiple: 32, kernel_wg_size: 256 {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 186}
2024-02-13T13:37:43.763+0100 INFO Using: global_work_size: 41728, local_work_size: 32 {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 199}
2024-02-13T13:37:43.763+0100 INFO Allocating buffer for input: 32 bytes {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 203}
2024-02-13T13:37:43.763+0100 INFO Allocating buffer for output: 1335296 bytes {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 211}
2024-02-13T13:37:43.763+0100 INFO Allocating buffer for lookup: 21877489664 bytes {"module": "scrypt_ocl", "file": "scrypt-ocl\\src\\lib.rs", "line": 219}
So it reports 48 CUs while it actually has 96.
Please note that preferred_wg_size_multiple: 32 also suggests that the actual number is 96.
Please note that the 48 CU figure is only used here in a log message. The important parts are max_mem_alloc_size and global/local_work_size (see the sketch below for how those numbers relate).
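For reference, the logged sizes line up with each other. This is not taken from lib.rs; it is just arithmetic that reproduces the numbers above, and the scrypt parameters (N = 8192, r = 1), the lookup gap of 2, the 32 bytes of output per work item, and the rounding rule are my guesses about what the code does:

```rust
fn main() {
    let max_mem_alloc: u64 = 20876 * 1024 * 1024; // "max_mem_alloc_size: 20876 MB" from the log
    let local_work_size: u64 = 32;

    // scratchpad bytes kept in the lookup buffer per work item: 128 * r * N / gap
    let per_item: u64 = 128 * 8192 / 2; // 512 KiB

    // max_mem_alloc / per_item, rounded down to a multiple of local_work_size
    let global_work_size = max_mem_alloc / per_item / local_work_size * local_work_size;
    assert_eq!(global_work_size, 41_728); // matches "global_work_size: 41728"

    assert_eq!(global_work_size * per_item, 21_877_489_664); // "buffer for lookup" size
    assert_eq!(global_work_size * 32, 1_335_296); // "buffer for output" size
}
```

In other words, the work size here appears to be driven entirely by max_mem_alloc_size (and the lookup gap), not by the reported CU count.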
On the 7900 XTX you can get more performance by setting LOOKUP_GAP to 4; that yields about a 25% performance improvement.
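For context on why a larger gap can help (the standard scrypt lookup-gap trade-off, not something I have verified in the kernel): storing only every gap-th scratchpad entry shrinks the per-item footprint, so more work items fit into the same max_mem_alloc_size, at the cost of recomputing the skipped entries. Under the same assumptions as the sketch above:

```rust
fn main() {
    let max_mem_alloc: u64 = 20876 * 1024 * 1024;
    for gap in [2u64, 4] {
        let per_item = 128 * 8192 / gap; // scratchpad bytes kept per work item
        let work_items = max_mem_alloc / per_item / 32 * 32; // multiple of local_work_size
        println!(
            "gap {gap}: {} KiB kept per item, ~{work_items} work items",
            per_item / 1024
        );
    }
    // gap 2 -> 41_728 items, gap 4 -> 83_488 items: roughly twice the parallelism,
    // at the cost of recomputing the scratchpad entries that are no longer stored.
}
```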
I think that for RDNA2 cards, where we have a high RAM-to-CU ratio, we're packing too much RAM. For comparison, an NVIDIA 4090 lets us use about 6 GiB of RAM (max_mem_alloc_size), while the 7900 XTX allows about 21 GiB.
That's because right now we use a single scrypter (as prepared in https://github.com/spacemeshos/post-rs/blob/168ee31c606ba9514810bf41347d83b695642753/scrypt-ocl/src/lib.rs#L373 ), using only the first provider/device, while it looks like OpenCL for this card will most probably expose two provider/device pairs, each with 48 compute units. If that's true (it has to be tested on real HW), maybe we can create a round-robin list of scrypter instances and then use them in parallel (async) later in https://github.com/spacemeshos/post-rs/blob/168ee31c606ba9514810bf41347d83b695642753/scrypt-ocl/src/lib.rs#L409, splitting the labels into len(devices) parts? A rough sketch of that split is below.
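A minimal sketch of the idea, assuming the two-device situation turns out to be real. DeviceScrypter and init_range are placeholders I made up for illustration, not the actual scrypt-ocl API; the point is only the shape of the split: one worker per provider/device, each getting a contiguous slice of the label range, results stitched back in order.

```rust
use std::thread;

// Placeholder for "a scrypter bound to one OpenCL provider/device".
// NOT the real scrypt-ocl API, just a stand-in to show the shape of the split.
struct DeviceScrypter {
    name: String,
}

impl DeviceScrypter {
    // Pretend "initialization": produce one 16-byte label per index.
    // In the real code this would drive the OpenCL kernel on this device.
    fn init_range(&self, labels: std::ops::Range<u64>) -> Vec<[u8; 16]> {
        labels
            .map(|i| {
                let mut label = [0u8; 16];
                label[..8].copy_from_slice(&i.to_le_bytes());
                label
            })
            .collect()
    }
}

// Split `labels` into `devices.len()` contiguous chunks, run each chunk on its own
// device in parallel, and stitch the results back together in order.
fn init_on_all_devices(devices: &[DeviceScrypter], labels: std::ops::Range<u64>) -> Vec<[u8; 16]> {
    let total = labels.end - labels.start;
    let n = devices.len() as u64;
    let chunk = (total + n - 1) / n; // ceil(total / n)

    thread::scope(|s| {
        let handles: Vec<_> = devices
            .iter()
            .enumerate()
            .map(|(i, dev)| {
                let start = labels.start + i as u64 * chunk;
                let end = (start + chunk).min(labels.end);
                s.spawn(move || {
                    println!("{}: labels {start}..{end}", dev.name);
                    dev.init_range(start..end)
                })
            })
            .collect();

        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    // e.g. the two 48-CU provider/device pairs this card may expose
    let devices = vec![
        DeviceScrypter { name: "gfx1100 #0".into() },
        DeviceScrypter { name: "gfx1100 #1".into() },
    ];
    let labels = init_on_all_devices(&devices, 0..100_000);
    assert_eq!(labels.len(), 100_000);
}
```

A contiguous split keeps each device's labels in order, so the per-device results can simply be concatenated; a real version would presumably stream chunks to disk rather than collect them in memory.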
This could be solved at a higher level by allowing parallel initialization on multiple GPUs (for example, if the user has two cards attached to the PC): the code could distribute initialization tasks across all available devices in parallel.
It's not entirely like that. OpenCL on Windows returns one device with just half of the CUs, as stated above. But regardless of that, RDNA2 cards are currently only partially correctly supported (the work is split in the wrong fashion).
Splitting the work in two while keeping the rest of the code as-is yields even worse results.
From my understanding of the RDNA2 architecture (per the whitepaper, CUs are grouped in pairs into workgroup processors), every RDNA2 (and newer) GPU reports only half of its CUs to OpenCL. The result is that the GPUs appear to be underutilized.