mbevand / silentarmy

Zcash miner optimized for AMD & Nvidia GPUs

Poor performance on pre-Maxwell Nvidia GPUs (local memory atomics) #70

Open computerlyrik opened 7 years ago

computerlyrik commented 7 years ago

The slowdown started with this commit: https://github.com/mbevand/silentarmy/commit/b879b795141b95d0878e93bd9ae5cab120149891

Before this commit, hashrates were ok.

I use 3 threads on a 4 GB GTX 760. Before: ~15 H/s. After: ~7-9 H/s.

I am running Arch Linux with the Nvidia 375.10 driver and CUDA 8.0 installed.

ddobreff commented 7 years ago

Confirmed. K2200 with v4 plus extremal's patches: 22-25 S/s; latest v5: 9-14 S/s.

Kubuxu commented 7 years ago

Atomic ops are in play here, as they are not optimized for older architectures.

mbevand commented 7 years ago

Thanks for the report. I did not test on older Nvidia gear before committing. I think it's probably the use of local memory, not atomics, that hampers performance.

Can you guys try this: edit param.h, find this line:

#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 5)

And change "5" to a value between 1 and 5. Recompile and test. See if it improves performance in any way.
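As a rough illustration of the tuning step described above, the multiplier can be pulled out into its own macro so it is easy to vary between builds. This is a standalone sketch, not the project's param.h: `COLL_MULT`, the placeholder `NR_SLOTS` value of 32, and the helper `coll_data_size_per_th()` are all assumptions made for the example.

```c
/* Standalone illustration only. NR_SLOTS = 32 and the COLL_MULT macro
 * are hypothetical; the real param.h defines NR_SLOTS itself. */
#ifndef NR_SLOTS
#define NR_SLOTS 32            /* placeholder, not the value from param.h */
#endif

#ifndef COLL_MULT
#define COLL_MULT 5            /* the "5" the comment suggests lowering */
#endif

/* Reject values outside the 1..5 range suggested in the thread. */
#if COLL_MULT < 1 || COLL_MULT > 5
#error "COLL_MULT must be between 1 and 5"
#endif

#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * COLL_MULT)

/* Mirrors the macro for arbitrary inputs, so values can be compared at run time. */
static unsigned int coll_data_size_per_th(unsigned int nr_slots, unsigned int mult)
{
    return nr_slots * mult;
}
```

With the placeholder values, the default gives 160 and a multiplier of 2 gives 64. Whether a compiler flag such as `-DCOLL_MULT=2` would reach the OpenCL kernel in the actual build is not something this thread establishes, so editing param.h directly, as suggested, remains the safe route.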

tupieurods commented 7 years ago

Before Maxwell there were no shared-memory atomics (shared memory is the equivalent of local memory in OpenCL) in hardware. They were emulated in software, so they are dead slow on architectures older than Maxwell. Proof: https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/
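To make the cost of software emulation concrete, here is a CPU-side analogy in C11 of a lock/update/unlock sequence standing in for a single atomic add. It is not GPU code and not what the driver actually emits; the names `emulated_counter` and `emulated_atomic_add` are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical names; a CPU-side analogy, not GPU code. */
typedef struct {
    atomic_flag lock;    /* stands in for the lock guarding the address */
    unsigned int value;  /* the counter being updated */
} emulated_counter;

static void emulated_counter_init(emulated_counter *c)
{
    atomic_flag_clear(&c->lock);
    c->value = 0;
}

/* Lock, update, unlock. Returns the value before the add, like atomic_add(). */
static unsigned int emulated_atomic_add(emulated_counter *c, unsigned int n)
{
    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire)) {
        /* spin: every contending thread retries here */
    }
    unsigned int old = c->value;
    c->value = old + n;
    atomic_flag_clear_explicit(&c->lock, memory_order_release);
    return old;
}
```

One add becomes a lock acquisition, a read, a write, and a release. When many threads target the same address they serialize on the spin loop, which is the kind of contention that would hurt a kernel making heavy use of such atomics.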

blackjec69 commented 7 years ago

On a GTX 750 (1 GB, CUDA 7.5) I get the best result (around 18 sol/s) with the following values:

#define NR_ROWS_LOG 19
#define OPTIM_SIMPLIFY_ROUND 1
#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 2)

montvid commented 7 years ago

Seems to be the best spec... interesting that tromp's cuda gives a more stable sol count overall. Silentarmy is jumping up and down. ID 0: GeForce GT 740M

#define NR_ROWS_LOG 20
#define OPTIM_SIMPLIFY_ROUND 1
#define COLL_DATA_SIZE_PER_TH (NR_SLOTS * 3)

Kubuxu commented 7 years ago

@tupieurods I think we are using global atomics.

// I was wrong.

mbevand commented 7 years ago

@tupieurods You are right. I didn't know shared atomics were not implemented in hardware pre-Maxwell. That's definitely the cause of the slowdown then, because this commit makes heavy use of shared atomics. I see no solution other than maintaining a second, separate version of input.cl specifically for pre-Maxwell Nvidia GPUs.

In the meantime, the workaround is for pre-Maxwell users to revert to SA v4, or more specifically to the last revision not using local atomics. After a git clone:

$ cd silentarmy && git checkout 243ed569bac5e17305825645023296ccf09c6eeb
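The idea of a separate kernel source for pre-Maxwell Nvidia GPUs could be sketched as a small selection helper keyed on CUDA compute capability (Maxwell reports 5.x; the Kepler-based GTX 760 reports 3.0). This is an assumption-laden sketch: `input_premaxwell.cl` is a hypothetical file name that does not exist in the repository, both function names are made up for the example, and the check only makes sense for Nvidia devices.

```c
#include <stdbool.h>

/* Maxwell and newer Nvidia GPUs report compute capability major >= 5. */
static bool has_native_shared_atomics(int cc_major)
{
    return cc_major >= 5;
}

/* "input_premaxwell.cl" is a hypothetical file name, not part of the repo. */
static const char *select_kernel_source(int cc_major)
{
    return has_native_shared_atomics(cc_major) ? "input.cl"
                                               : "input_premaxwell.cl";
}
```

A host program would need to obtain the compute capability from the device first, and AMD devices would bypass this check entirely.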

Singman33 commented 7 years ago

Nope, that command gives us a slow version. I get 25-26 Sol/s with a custom version I forked 2 or 3 patches earlier. Not sure why. I will try to publish it on GitHub this weekend.