TODO: Test Silent army v5 on a wide variety of devices

maztheman commented 7 years ago

Discuss your results here

kruisdraad commented 7 years ago

Windows build please

maztheman commented 7 years ago

It is failing, I have to fix it up...

maztheman commented 7 years ago

If you have a 1070 or 1080 please test it:

https://github.com/maztheman/nheqminer/releases/tag/v0.4h

drigger commented 7 years ago

maztheman commented 7 years ago

try: https://github.com/maztheman/nheqminer/releases/download/v0.4h/v0.4h_1070_plus.7z

I removed all the thread waiting.

maztheman commented 7 years ago

The v5 silentarmy doesn't seem to be too great when converted to CUDA :( However, it might just be the way it's looping... I might be able to fix it.

drigger commented 7 years ago

Hmm, w/ 0.4h_1070_plus this happens: ;) Just for the sake of comparison - this is what I get w/ the same 1080 on one of the original SA5 python ports for windows:

maztheman commented 7 years ago

Yeah, there is some major issue with CUDA and the code that was posted for OpenCL. CUDA keeps "timing out" when I try to make the same kind of calls... I guess it'll have to be a work in progress for now.

I'll post again when I see improvement with my GTX 650, which should then indicate some progress with 1080s, etc.

maztheman commented 7 years ago

I see this from the author of silentarmy:

mbevand commented 2 days ago • edited @tupieurods You are right. Didn't know shared atomics were not hardware implemented pre-Maxwell. That's definitively the cause of the slowdown then, because this commit makes heavy use of shared atomics. I see no solution other than maintaining a 2nd separate version of input.cl specifically for pre-Maxwell Nvidia GPUs then.
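To make the pre-Maxwell split concrete: Maxwell corresponds to CUDA compute capability 5.x, so a miner could choose between two kernel versions from the device's major version. The following is only a minimal sketch under that assumption. The function and the variant names are hypothetical and do not come from silentarmy or nheqminer; the compute capabilities are the published ones for cards mentioned in this thread.

```python
# Hypothetical helper: choose a kernel variant from the CUDA compute capability.
# Shared-memory atomics are implemented in hardware from Maxwell (compute 5.0)
# onward; older architectures emulate them in software, which is much slower.

MAXWELL_MAJOR = 5


def pick_kernel_variant(major: int, minor: int = 0) -> str:
    """Return an illustrative variant name for the given compute capability."""
    if major >= MAXWELL_MAJOR:
        return "shared-atomics"  # v5-style kernel that relies on shared atomics
    return "pre-maxwell"         # separate version that avoids shared atomics


# Compute capabilities of some cards that appear in this thread.
CARDS = {
    "GTX 1080": (6, 1),
    "GTX 980": (5, 2),
    "GTX 750 Ti": (5, 0),
    "GTX 650": (3, 0),
    "GTX 580": (2, 0),
}

for name, cc in CARDS.items():
    print(f"{name}: compute {cc[0]}.{cc[1]} -> {pick_kernel_variant(*cc)}")
```

In a real CUDA build the major and minor numbers would come from the device properties reported by the runtime rather than from a hardcoded table.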

maztheman commented 7 years ago

Well, based on that guy's message I reverted to a more direct conversion, maybe it'll work better:

https://github.com/maztheman/nheqminer/releases/download/v0.4h/v0.4h_MAXWEL_PLUS.7z

It was super slow on my GTX 650, but it looks like that is to be expected because of the shared atomics.

drigger commented 7 years ago

It gives me much higher I/s numbers + full GPU load, but no Sols, unfortunately =)

maztheman commented 7 years ago

Hmm, weird. I get 78 sols/s with my R9 290 with the v5 code in OpenCL. The 0 sols are probably because some timeout is causing a crash.

kruisdraad commented 7 years ago

I ran both versions.

Setup: 6x GTX1070 with Windows 10 latest drivers, etc (anno vers)

0.4h plain: 241 sol/s, power usage 51%

0.4h maxplus: 2.4 sol/s and nearly 400 I/s; power usage is above 70%

The previous version got about 250, so it's a little less; power usage is much more stable though.

maztheman commented 7 years ago

Hmm, 241 is really not that competitive with the SA Linux version, is it?

kruisdraad commented 7 years ago

Honestly, I haven't tried the Linux version yet. The zcminer-dev Windows version gets about 300, so it's not that bad.

Do you have an example of actual speeds on Linux?

chronosek commented 7 years ago

Compiled. For me it crashes after 20 sec (I see GPU memory use increase constantly until all 4GB is filled up) and then the NVIDIA card is locked in the P5 state.

[23:00:05][0x000011dc] Using SSE2: YES
[23:00:05][0x000011dc] Using AVX: YES
[23:00:05][0x000011dc] Using AVX2: YES
[23:00:05][0x000011a8] stratum | Starting miner
[23:00:05][0x000011a8] stratum | Connecting to stratum server eu1-zcash.flypool.org:3333
[23:00:05][0x00001194] miner#0 | Starting thread #0 (CUDA-SILENTARMY) GeForce GTX 980 (#0) BLOCKS=64, THREADS=512
[23:00:05][0x000011a8] stratum | Connected!
[23:00:05][0x000011a8] stratum | Subscribed to stratum server
[23:00:05][0x000011a8] miner | Extranonce is fa613f0f55
[23:00:05][0x000011a8] stratum | Target set to 004189374bc6a7ef9db22d0e5604189374bc6a7ef9db22d0e5604189374bc6a7
[23:00:06][0x000011a8] stratum | Received new job #11ee30962342110c9785
[23:00:20][0x000011dc] Speed [300 sec]: 8.72074 I/s, 15.646 Sols/s
[23:00:36][0x000011dc] Speed [300 sec]: 8.94374 I/s, 16.6693 Sols/s

CUDA error 'out of memory' in func 'sa_cuda_context::solve' line 1029
CUDA error 'out of memory' in func 'sa_cuda_context::solve' line 1030
CUDA error 'unspecified launch failure' in func 'sa_cuda_context::solve' line 1073
CUDA error 'unspecified launch failure' in func 'sa_cuda_context::solve' line 1024
CUDA error 'unspecified launch failure' in func 'sa_cuda_context::solve' line 1025
CUDA error 'unspecified launch failure' in func 'sa_cuda_context::solve' line 1029
CUDA error 'unspecified launch failure' in func 'sa_cuda_context::solve' line 1030
CUDA error 'unspecified launch failure' in func 'sa_cuda_context::solve' line 1073
CUDA error 'unspecified launch failure' in func 'sa_cuda_context::solve' line 1024
[ ... ]

maztheman commented 7 years ago

Yes, sorry, there is a memory "leak" that I have already fixed but not pushed yet. It is only supposed to create the 2 buffers once and reuse them; in my old code I was creating them on every call and never deleting them....:P
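The difference between the two behaviours can be sketched without a GPU. The following is a toy model, not the nheqminer code: FakeDevice is a made-up stand-in for device allocation, the class names are hypothetical, and the sizes are abstract units rather than real buffer sizes.

```python
class FakeDevice:
    """Counts units in use; a stand-in for device memory allocation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def alloc(self, size):
        if self.used + size > self.capacity:
            raise MemoryError("out of memory")
        self.used += size
        return size  # the "handle" is just the size in this toy model

    def free(self, handle):
        self.used -= handle


class LeakySolver:
    """Old behaviour: two new buffers on every solve, never freed."""

    def __init__(self, dev, buf_size):
        self.dev = dev
        self.buf_size = buf_size

    def solve(self):
        self.dev.alloc(self.buf_size)
        self.dev.alloc(self.buf_size)


class ReusingSolver:
    """Fixed behaviour: two buffers created once and reused by every solve."""

    def __init__(self, dev, buf_size):
        self.dev = dev
        self.buffers = [dev.alloc(buf_size), dev.alloc(buf_size)]

    def solve(self):
        pass  # kernels would read and write self.buffers here

    def close(self):
        for handle in self.buffers:
            self.dev.free(handle)
        self.buffers = []


def iterations_until_oom(solver, limit=1000):
    """Count successful solve() calls before MemoryError; None if none occurs."""
    for i in range(limit):
        try:
            solver.solve()
        except MemoryError:
            return i
    return None


print(iterations_until_oom(LeakySolver(FakeDevice(100), 10)))
print(iterations_until_oom(ReusingSolver(FakeDevice(100), 10)))
```

The leaky variant runs out of memory after a fixed number of iterations, which matches the steadily growing memory use reported above, while the reusing variant keeps a constant footprint for as long as it runs.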

auroracoin commented 7 years ago

I tried it on a 750 Ti (compute 5.0) on Win 8.1 and I get this message: Missing file MSVCP140.dll

maztheman commented 7 years ago

Oh, you need to install the redist file: https://www.microsoft.com/en-ca/download/details.aspx?id=48145

maztheman commented 7 years ago

@chronosek I've checked in the fixes

auroracoin commented 7 years ago

That worked...thx :)

chronosek commented 7 years ago

@maztheman thx, it's not crashing now, but I always get 0 sol/s no matter what -cb and -ct I set

maztheman commented 7 years ago

Yeah, some internal issue. I'll have to fully debug it again...

dtawom commented 7 years ago

Getting about 17-18 sol/s with a 1060 3GB card on 0.4h. I found the best rate for this card is obtained with -cb 128 -ct 32 (autodetect puts it at -cb 63 -ct 64, which only yields 15 to 16 sol/s).

krnlx commented 7 years ago

Removing the packed atomic counters is a bad idea: I tested them in OpenCL and they are the fastest option. I tested packed 64-bit atomics too, and they were 1-2% slower.

chronosek commented 7 years ago

tpruvot did a good job porting to CUDA, especially the atomic part (it runs stable at 58 sol/s, though eqm still does 66 sol/s). I think maztheman's problems came from some hardcoded values or from other code that was not in silentarmy.

maztheman commented 7 years ago

I added a test program that will help me debug this cuda issue. Please post your log files.

drigger commented 7 years ago

log.txt

maztheman commented 7 years ago

I created a new build that should "work" on 1080s and 1070s again. It probably won't break any records though...

drigger commented 7 years ago

Confirmed, it works and even gives a bit better results on my GTX 1080 than all the previous builds. Thank you!

chronosek commented 7 years ago

working for gtx980 - got 41 sol/s

dtawom commented 7 years ago

Still not working with GTX 580 1.5GB

maztheman commented 7 years ago

@dtawom, can you run the debug tool and post the log?

dtawom commented 7 years ago

@maztheman yeah i'll do that tomorrow afternoon when I get into work.

auroracoin commented 7 years ago

On CUDA, it doesn't display what cards are doing what hash. :)

drigger commented 7 years ago

arg 1 = -r
Settings:
NR_ROWS_LOG = 20
OVERHEAD = 6
COLL_DATA_SIZE_PER_TH = 60
Run Count = 100
Device #0 | GeForce GTX 1080 | Running Tests....
Silentarmy V5 test
35.878 sols/s
Silentarmy V4/V5/maztheman mix test
29.643 sols/s
maztheman test
0 sols/s

arg 1 = -r
Settings:
NR_ROWS_LOG = 20
OVERHEAD = 6
COLL_DATA_SIZE_PER_TH = 60
Run Count = 100
Device #0 | GeForce GTX 580 | Running Tests....
Silentarmy V5 test
18.6791 sols/s
Silentarmy V4/V5/maztheman mix test
30.8821 sols/s
maztheman test
12.5392 sols/s

arg 1 = -r
Settings:
NR_ROWS_LOG = 20
OVERHEAD = 6
COLL_DATA_SIZE_PER_TH = 60
Run Count = 100
Device #0 | GeForce GTX 550 Ti | Running Tests....
Silentarmy V5 test
<error> cuda error 2, out of memory in function 'context::init', line 1313
context intialization failed
...
0 sols/s
Silentarmy V4/V5/maztheman mix test
<error> cuda error 2, out of memory in function 'context::init', line 1313
context intialization failed
...
0 sols/s
maztheman test
<error> cuda error 2, out of memory in function 'context::init', line 1313
context intialization failed
...
0 sols/s
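The test tool output above is regular enough to tabulate, which makes it easier to compare the three kernels across cards. Here is a rough sketch of a parser. It is written only against the lines pasted in this thread, the function name is made up, and the real tool may print other lines that it does not handle.

```python
import re


def parse_test_output(text):
    """Return {device name: {test name: sols/s}} from pasted test tool output."""
    results = {}
    device = None
    test = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"Device #\d+ \| (.+?) \| Running Tests", line)
        if header:
            device = header.group(1)
            results[device] = {}
            test = None
            continue
        if device is None:
            continue  # still in the settings block before the device line
        if line.endswith(" test"):
            test = line[: -len(" test")]
            continue
        rate = re.fullmatch(r"([0-9.]+) sols/s", line)
        if rate and test is not None:
            results[device][test] = float(rate.group(1))
            test = None
    return results


# The GTX 580 run, copied from the output above.
SAMPLE = """\
arg 1 = -r
Settings:
NR_ROWS_LOG = 20
OVERHEAD = 6
COLL_DATA_SIZE_PER_TH = 60
Run Count = 100
Device #0 | GeForce GTX 580 | Running Tests....
Silentarmy V5 test
18.6791 sols/s
Silentarmy V4/V5/maztheman mix test
30.8821 sols/s
maztheman test
12.5392 sols/s
"""

for dev, tests in parse_test_output(SAMPLE).items():
    best = max(tests, key=tests.get)
    print(f"{dev}: best = {best} ({tests[best]} sols/s)")
```

Error lines such as the out of memory messages are skipped, so a failed test simply records the 0 sols/s line that follows it.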

maztheman commented 7 years ago

hmm interesting

oddin98 commented 7 years ago

On my GeForce GTX 650, running nheqminer -t 0 -cd 0 -cs -b:

[15:25:51][0x00000a7c] Using SSE2: YES
[15:25:51][0x00000a7c] Using AVX: NO
[15:25:51][0x00000a7c] Using AVX2: NO
[15:25:51][0x00000a7c] Benchmarking CUDA worker (CUDA-SILENTARMY) GeForce GTX 650 (#0) BLOCKS=14, THREADS=64
[15:25:51][0x00000a7c] Benchmark starting... this may take several minutes, please wait...
[15:26:34][0x00000a7c] Benchmark done!
[15:26:34][0x00000a7c] Total time : 42382 ms
[15:26:34][0x00000a7c] Total iterations: 200
[15:26:34][0x00000a7c] Total solutions found: 374
[15:26:34][0x00000a7c] Speed: 4.71898 I/s
[15:26:34][0x00000a7c] Speed: 8.8245 Sols/s

Is this good?

UPDATE: tinkering around

650
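For anyone comparing benchmark runs: the two Speed lines in the log above follow directly from the totals printed just before them (iterations and solutions divided by the elapsed time). A small sketch of that arithmetic, using the numbers from this log; the function name is made up for illustration.

```python
def benchmark_rates(total_ms, iterations, solutions):
    """Return (iterations per second, solutions per second)."""
    seconds = total_ms / 1000.0
    return iterations / seconds, solutions / seconds


# Totals from the GTX 650 benchmark log above.
its_per_s, sols_per_s = benchmark_rates(42382, 200, 374)
print(f"{its_per_s:.5f} I/s, {sols_per_s:.4f} Sols/s")
print(f"{374 / 200:.2f} solutions per iteration")
```

This agrees with the logged 4.71898 I/s and 8.8245 Sols/s, and the solutions-per-iteration ratio is a quick sanity check when a build reports a high I/s but almost no Sols/s.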

maztheman commented 7 years ago

For a GTX 650, that is what I get too. I haven't seen anything higher.

maztheman commented 7 years ago

Were you able to overclock the card?

Edit: oh, I see you also have 2 CPU cores

oddin98 commented 7 years ago

Noticed that after 8 hrs or so of running, the speed drops to 1 sol/s. Then it has to be restarted... any info as to why?

On Nov 16, 2016 9:06 PM, "maztheman" wrote:

Were you able to overclock the card?


maztheman commented 7 years ago

Not really, is it possible that your GPU overheated or something?

dtawom commented 7 years ago

@maztheman I ran the new debug tool, but where is the log file saved? Is there a switch I need to use to write the log to a text file? I did notice that it gave me about 18 sol/sec on the v4 and 24 sol/sec on the v5 with the 580. But nheqminer with the -cd 0 switch detects the 580 and still gives me 0 sol/s on the newest 0.4i.

maztheman commented 7 years ago

Yeah, the debug tool has slightly different code that I was playing around with. To write the log, add 1>run.txt to the end of your command line. That should write all the text to a file.

dtawom commented 7 years ago

@maztheman Here ya go, I also included a screenshot of it in regular mode. run.txt untitled

chronosek commented 7 years ago

I see you are using cuda_tromp on the old GTX 580, which has compute capability 2.0, so I suspect it was not compiled for your card (in the sources I see the cuda_tromp project set for 2.0+, but the restriction in equi_miner.cu set for 5.0+). Try the silentarmy option or wait for maztheman's new files.

dtawom commented 7 years ago

Thanks @chronosek, maybe that will help @maztheman, but I have no idea how to compile or how to change the restriction in equi_miner.cu from 5.0+ to 2.0+

chronosek commented 7 years ago

If you want to try cuda_tromp, you could try the DLL I compiled: http://chronx.pl/cuda/cuda_tromp_75.dll (it was built for CUDA 8.0 but should still work with new drivers). Even without atomic support, though, the silentarmy port should be faster: http://chronx.pl/cuda/cuda_silentarmy.dll (just replace the files and use good options).

dtawom commented 7 years ago

@chronosek that worked like a champ, getting about 13 sol/sec, which is better than the nothing I was getting before. Thanks man!

dtawom commented 7 years ago

I only get about 45 sol/sec with an R9 Fury. According to the silentarmy readme, I expected to get at least the same 115 sol/s as an R9 Nano. https://github.com/mbevand/silentarmy/commit/503697cd03d68eef9f6824aef920b50999bef4f1?short_path=04c6e90#diff-04c6e90faac2675aa89e2176d2eec7d8

maztheman / nheqminer

TODO: Test Silent army v5 on a wide variety of devices #5