maztheman / nheqminer

Equihash miner for NiceHash
https://www.nicehash.com
MIT License

TODO: Test CUDA Silentarmy v4 Implementation on a wide variety of devices #1

Closed maztheman closed 7 years ago

maztheman commented 7 years ago

Right now I have only tested on a GTX 650, and it "works", but it could possibly use a couple of tweaks.

maztheman commented 7 years ago

Looks like GTX 1080s are getting super low sols; will have to investigate.

maztheman commented 7 years ago

GTX 1070: -cb 256 -ct 32 seems to work, though the block count should be a multiple of the SM count, so 255 might work better.
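The multiple-of-SM-count idea comes down to simple arithmetic. Below is a minimal sketch, assuming the GTX 1070 has 15 SMs (its published spec; on a real system the count would be read from the CUDA device properties). `round_to_sm_multiple` is a hypothetical helper for illustration, not part of nheqminer.

```python
def round_to_sm_multiple(blocks: int, sm_count: int) -> int:
    """Round a requested block count down to the nearest multiple of sm_count.

    Never returns fewer than sm_count blocks, so every SM gets at least one.
    """
    if blocks <= 0 or sm_count <= 0:
        raise ValueError("blocks and sm_count must be positive")
    return max(sm_count, (blocks // sm_count) * sm_count)


# Assumption: 15 SMs on a GTX 1070.
GTX_1070_SM_COUNT = 15

# Rounding the -cb 256 setting gives the 255 suggested above (17 blocks per SM).
print(round_to_sm_multiple(256, GTX_1070_SM_COUNT))
```

Whether an exact multiple actually yields more sol/s still has to be measured on the card; this only shows where the 255 comes from.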

chronosek commented 7 years ago

Aren't cb and ct for the CUDA part? I think OpenCL silentarmy doesn't use those values. Edit: oh, I didn't look at what you did in the commits, so you're trying to port it to CUDA? I will test it.

maztheman commented 7 years ago

Yes, I ported silentarmy to CUDA!

chronosek commented 7 years ago

Nice!!! It compiles without any problems.

On a GTX 980 I got 38 sol/s (cb 128, ct 64) with 100% GPU load, 28% memory load, 63% power, and 0% CPU!!! (Btw, something is wrong, because I could not increase ct above 64, and the GTX 980 supports 1024 threads per block.)

For comparison:
- nicehash/nheqminer cuda: 25 sol/s (cb 64, ct 64)
- 2x nicehash/nheqminer opencl silentarmy: 33 sol/s together (but each instance uses a full core)
- zcminer: 40 sol/s (but 2.5% fee, and it uses one full core)

I'll try compiling for my 650M, one sec..
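To put the GTX 980 numbers above on an equal footing, the fee can be subtracted before comparing. This is a minimal sketch using only the figures quoted in this comment; `net_rate` and `speedup` are hypothetical helper names, and the calculation ignores CPU usage entirely.

```python
def net_rate(sols_per_sec: float, fee_percent: float = 0.0) -> float:
    """Solution rate left over after a percentage fee is taken."""
    return sols_per_sec * (1.0 - fee_percent / 100.0)


def speedup(new_rate: float, old_rate: float) -> float:
    """How many times faster new_rate is than old_rate."""
    return new_rate / old_rate


# Figures reported above for a GTX 980: (sol/s, fee in percent).
reported = {
    "cuda silentarmy port": (38.0, 0.0),
    "nheqminer cuda": (25.0, 0.0),
    "2x nheqminer opencl silentarmy": (33.0, 0.0),
    "zcminer": (40.0, 2.5),
}

for name, (rate, fee) in reported.items():
    print(f"{name}: {net_rate(rate, fee):.1f} sol/s after fee")

# Ratio of the CUDA port to the stock nheqminer CUDA solver.
print(f"{speedup(38.0, 25.0):.2f}x")
```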

maztheman commented 7 years ago

64/64 seems to be great on a GTX 970, as reported by another user.

chronosek commented 7 years ago

For me 128/64 was better (tried many values).

Results on GT 650M: 6.2 sol/s (cb 128, ct 64)

For comparison:
- nicehash/nheqminer cuda: 3.1 sol/s
- 1x nicehash/nheqminer opencl silentarmy: 6.1 sol/s (but it uses one full core; too little GPU memory to run 2 instances)
- zcminer: not working

maztheman commented 7 years ago

http://stackoverflow.com/questions/4391162/cuda-determining-threads-per-block-blocks-per-grid

chronosek commented 7 years ago

Looks like it is a success. I hope you will keep updating and optimizing, because I like the CUDA version better than OpenCL.

maztheman commented 7 years ago

Yes, I will keep this up to date with the silentarmy builds. Technically I have no real idea what this code is doing; I just ported it over. All the hard work was done by someone else. :-)

maztheman commented 7 years ago

Also, I can't post any more on the Zcash forum. I'd like to get more people testing.

chronosek commented 7 years ago

I don't even know basic programming... I can only use tools and compile... Btw, you should set more optimization options in the VS project, use specific code generation, and remove debug from release (that's why I always compile myself).

bigchauncey commented 7 years ago

Mine is Win10 64-bit with five 1070 cards. How do I set the parameters?

maztheman commented 7 years ago

-cs -cb 64 -ct 64 -cd 0 1 2 3 4

maztheman commented 7 years ago

Try that for now

krnlx commented 7 years ago

Linux: there is no cmake file, but I created one... and AVX was needed to launch SA.. fixed that too.

maztheman commented 7 years ago

Thanks!

krnlx commented 7 years ago

Best results I got were with -ct 32 -cb 90, and only 38 s/s on a 1070 :-(

Can you take a look at my optimized SA OpenCL version for NVIDIA? sa-nv.tar.gz

It gets 50 s/s on a 1070 with 1 thread.

krnlx commented 7 years ago

NR_ROWS_LOG must be 19 on NVIDIA, and OVERHEAD 8. My source contains other improvements in input.cl by eXtremal and others: https://bitcointalk.org/index.php?topic=1666489.360

bigchauncey commented 7 years ago

"GTX 1070: -cb 256 -ct 32 seems to work, though the block count should be a multiple of the SM count, so 255 might work better."

I tried these parameters and got 148 H/s total on Win10 64-bit with five 1070 cards. Now trying other parameters.

krnlx commented 7 years ago

-ct 64 -cb 256 40 s/s on 1070

mendoza1468 commented 7 years ago

GTX 1080 best speed: nheqminer -cs -cb 8192 -ct 8 = 33 sol/s. Not fast enough, but keep up the good work, you're doing well :)
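The cb/ct pairs posted in this thread launch very different numbers of threads. The sketch below compares them, assuming only that cb is blocks per grid, ct is threads per block, and that NVIDIA GPUs schedule threads in warps of 32. `total_threads` and `lane_utilization` are hypothetical helper names, and nothing here predicts sol/s.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def total_threads(cb: int, ct: int) -> int:
    """Total threads launched: blocks per grid times threads per block."""
    return cb * ct


def lane_utilization(ct: int, warp_size: int = WARP_SIZE) -> float:
    """Fraction of warp lanes doing work for a block of ct threads.

    A block always occupies whole warps, so ct below 32 leaves lanes idle.
    """
    warps = math.ceil(ct / warp_size)
    return ct / (warps * warp_size)


# cb/ct pairs reported in this thread.
configs = [(64, 64), (128, 64), (256, 32), (8192, 8)]
for cb, ct in configs:
    print(f"-cb {cb} -ct {ct}: {total_threads(cb, ct)} threads, "
          f"{lane_utilization(ct):.0%} of warp lanes used")
```

With -ct 8 each block fills only a quarter of a warp, which may be one reason the 1080 result needs so many blocks, though that is a guess and not something measured here.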

maztheman commented 7 years ago

I might try to convert it to use 2D allocations, as that might be more efficient.

bigchauncey commented 7 years ago

-cb 256 -ct 32 seems to change the settings only for the first card.

bigchauncey commented 7 years ago

GTX 1070 #0: blocks=256, threads=32. GTX 1070 #1: blocks=480, threads=64 (the default). The others are the same as GTX 1070 #1.

maztheman commented 7 years ago

Oh okay, I'll make some changes so it is forced for all cards.

chronosek commented 7 years ago

Btw, there should be a tool for benchmarking the cb and ct options (e.g. run a 2-second test in a loop, record sol/s for a range of cb and ct values, and sort for the best results). Then everyone could run it and check what is best for them.
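The benchmark loop described above could look roughly like this. It is a minimal sketch, not an existing nheqminer feature: `sweep` is a hypothetical helper, and the `measure` callback (which would run the miner briefly with the given cb and ct and return sol/s) is left to the caller because the miner's benchmark output format is not described in this thread.

```python
from itertools import product
from typing import Callable, Iterable, List, Tuple


def sweep(
    measure: Callable[[int, int], float],
    cb_values: Iterable[int],
    ct_values: Iterable[int],
) -> List[Tuple[float, int, int]]:
    """Measure every (cb, ct) pair and return (sol/s, cb, ct), best first."""
    results = [(measure(cb, ct), cb, ct)
               for cb, ct in product(cb_values, ct_values)]
    results.sort(key=lambda r: r[0], reverse=True)
    return results


def fake_measure(cb: int, ct: int) -> float:
    """Synthetic stand-in for a real 2-second run. NOT real benchmark data."""
    return 1000.0 - abs(cb - 128) - abs(ct - 64)


for rate, cb, ct in sweep(fake_measure, [64, 128, 256], [32, 64])[:3]:
    print(f"-cb {cb} -ct {ct}: score {rate:.1f} (synthetic)")
```

Keeping the measurement behind a callback means the same loop works whether the rate comes from parsing miner output or from a built-in timer.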

chronosek commented 7 years ago

Looks like people found 2 small bugs: one where, with -cs but without -cv 0, it ends up in cuda_tromp (nheqminer defaulted to the old cuda_tromp), and a second where, without AVX, even with -cv 0 it ends up in cuda_tromp (maybe some code checks for AVX).

dtawom commented 7 years ago

Does not work on a GTX 580 (CUDA compute capability 2.0). Oddly enough, though, silentarmy seems to work fine on an R7 APU GPU at about 3 sol/s, and on Intel integrated graphics using the -od switch. I only got about 3 sol/s on the Intel graphics, but every little bit helps, and it didn't seem to decrease my CPU sol rate.

maztheman commented 7 years ago

Hey guys, I think I may have figured out what was going on. I had some launch bounds which I think caused CUDA to force only 64 blocks or something... I'm going to make that small change, check it in, and build. Then I will be looking at some other major changes, which will take longer. I will be posting a new build.
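For context, CUDA refuses to launch a kernel with more threads per block than the maximum declared in its `__launch_bounds__`, which would be consistent with -ct not going above 64 earlier in this thread. Below is a minimal Python model of that rule. The bound of 64 is an assumption inferred from the symptom, not read from the source code, and the helper names are hypothetical.

```python
from typing import Optional


def max_threads_per_block(device_limit: int,
                          launch_bound: Optional[int] = None) -> int:
    """Largest usable block size: the device limit, capped by any launch bound."""
    if launch_bound is None:
        return device_limit
    return min(device_limit, launch_bound)


def launch_ok(ct: int, device_limit: int,
              launch_bound: Optional[int] = None) -> bool:
    """True if a block of ct threads would be accepted."""
    return 1 <= ct <= max_threads_per_block(device_limit, launch_bound)


GTX_980_LIMIT = 1024  # threads per block, as quoted earlier in the thread
ASSUMED_BOUND = 64    # assumption: kernel declared a 64-thread launch bound

for ct in (32, 64, 128):
    print(ct,
          launch_ok(ct, GTX_980_LIMIT, ASSUMED_BOUND),  # with the bound
          launch_ok(ct, GTX_980_LIMIT))                 # bound removed
```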

krnlx commented 7 years ago

Please check the latest updates: HUGE improvements. 1070 = 80 s/s https://bitcointalk.org/index.php?topic=1666489.400

kruisdraad commented 7 years ago

The silentarmy option is not working; it is ignored and the miner still reports CUDA.

maztheman commented 7 years ago

@kruisdraad I think in the new version 0.4g this is addressed.

kruisdraad commented 7 years ago

@maztheman how come people are reporting 80 sol/s per card, when in all my tests I don't get past 45? Is that on Linux, @krnlx? What's the power / GPU usage?

kruisdraad commented 7 years ago

@krnlx changing input.cl does not work on Windows; even deleting it along with kernel.cl changes nothing (I would have assumed the miner wouldn't start at all). Perhaps it's hardcoded in the exe?

maztheman commented 7 years ago

I don't know where the 80 sol/s are coming from. Probably not the CUDA version I have made here. I'm trying to update the kernel based on what @krnlx posted. I have yet to have it pass any tests. There seems to be an indexing problem that is causing the kernel to fail. It's just a technical issue that I need to debug. In 0.4g I removed some threading restrictions that were not required in the old kernel I made, but with the new kernel it looks like the limit has to be there. I'll keep you guys posted.

tpruvot commented 7 years ago

consider some cleanup.. see https://github.com/tpruvot/nheqminer/commits/cuda-silentarmy

tromp avx and the double cuda sdk trick are not useful, but thanks for your work.. seems a good base to begin the CUDA work ;)

montvid commented 7 years ago

How do you compile CUDA silentarmy on Ubuntu 16.04? There is no cmake file in the dirs.

krnlx commented 7 years ago

All tweaks are in git now; I only fixed the CPU load. https://github.com/mbevand/silentarmy

chronosek commented 7 years ago

Tested tpruvot's branch with the SA5 CUDA port from krnlx; it gave my GTX 980 58 sol/s. I know it is not a final product, but I wanted to share...

For comparison:

maztheman commented 7 years ago

new build, discuss it here

https://github.com/maztheman/nheqminer/issues/5


auroracoin commented 7 years ago

Is it possible you could compile it for Windows 8.1? Thanks.