sgminer-dev / sgminer

Scrypt GPU miner

GNU General Public License v3.0


custom kernel #202

Open sterlingpickens opened 10 years ago

sterlingpickens commented 10 years ago

http://sterlingdesktops.com/pub/test/lsoc.cl

I made some fairly extensive changes in this kernel. The binary/code size is smaller than a stock ck/zuikkis. Apparently my 7870/7850 cards really like this, and other cards might too. It works with lookup-gap 1 to 8. You'll likely have to experiment with a different TC value to get peak performance, but I was able to get +10KH/s on my cards.

veox commented 10 years ago

Which kernel did you use as a base?

sterlingpickens commented 10 years ago

It's closer to zuikkis, as it doesn't have any of the pragma unrolls. The main functional difference from zuikkis is that it's not tied to lookup-gap 2, although support for lookup-gap values other than 2 might not help anyone. TBH I'm starting to have second thoughts about any performance benefit; everything seems to be basically within the margin of error. Maybe I should put this on the back burner for now. If you want to add it, I can submit a pull request and potentially tweak it a bit later on.

sterlingpickens commented 10 years ago

http://sterlingdesktops.com/pub/test/lsoc3.cl

troky commented 10 years ago

I've tested lsoc.cl on an R9 290 for more than 24h and I don't see any improvement. What have you changed in lsoc3.cl? Please post some more information instead of just a link...

sterlingpickens commented 10 years ago

This last revision has several changes. The most notable is the discovery that calling bitselect several times in quick succession is actually slower. Toying with the ideas from Issue #71, I found that some cards were faster and some slower with 4 bitselect-based swaps in a row, but if only the first of the 4 uses bitselect, both my cards are faster by 1-2% overall.

#define EndianSwapa(n) (Ch(ES[0], rotl(n, 8U), rotl(n, 24U)))

#define EndianSwapb(n) (rotl(n & ES[0], 24U)|rotl(n & ES[1], 8U))

Four of either in a row is slower than a, b, b, b:

X[0] = EndianSwapa(tmp[0]);
X[1] = EndianSwapb(tmp[1]);
X[2] = EndianSwapb(tmp[2]);
X[3] = EndianSwapb(tmp[3]);
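
For readers who want to check that the two swap variants really compute the same thing, here is a minimal host-side C sketch. It is not taken from lsoc3.cl: the `ES` mask values and the definition of `Ch` as a bitwise select are assumptions (the comment above only names them), `rotl` is a plain-C stand-in for the OpenCL `rotate()` built-in, and the helper names `endian_swap_a`, `endian_swap_b`, `swap4_abbb` and `swaps_agree` are hypothetical. It says nothing about GPU speed; it only illustrates that the a, b, b, b ordering is a scheduling choice, not a functional change.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: mask values; the thread only refers to ES[0] and ES[1]. */
static const uint32_t ES[2] = { 0x00FF00FFu, 0xFF00FF00u };

/* Plain-C stand-in for OpenCL rotate(); valid here for y in 1..31. */
static uint32_t rotl(uint32_t x, unsigned y) {
    return (x << y) | (x >> (32u - y));
}

/* ASSUMPTION: Ch is a bitwise select, taking y where x has a 1 bit and z elsewhere. */
static uint32_t Ch(uint32_t x, uint32_t y, uint32_t z) {
    return (x & y) | (~x & z);
}

/* Variant "a": one select over two rotations of the whole word. */
static uint32_t endian_swap_a(uint32_t n) {
    return Ch(ES[0], rotl(n, 8u), rotl(n, 24u));
}

/* Variant "b": mask first, then rotate each half and OR them together. */
static uint32_t endian_swap_b(uint32_t n) {
    return rotl(n & ES[0], 24u) | rotl(n & ES[1], 8u);
}

/* The a, b, b, b ordering applied to four words. */
static void swap4_abbb(uint32_t X[4], const uint32_t tmp[4]) {
    X[0] = endian_swap_a(tmp[0]);
    X[1] = endian_swap_b(tmp[1]);
    X[2] = endian_swap_b(tmp[2]);
    X[3] = endian_swap_b(tmp[3]);
}

/* Returns 1 if both variants agree on a few sample words, 0 otherwise. */
static int swaps_agree(void) {
    const uint32_t samples[] = { 0u, 0xFFFFFFFFu, 0xAABBCCDDu, 0x01020304u, 0x80000001u };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        if (endian_swap_a(samples[i]) != endian_swap_b(samples[i])) {
            return 0;
        }
    }
    return 1;
}
```

Calling `swaps_agree()` from a small `main` (or wrapping it in `assert`) is enough to confirm that, under these assumptions, both macros reverse the byte order of a 32-bit word, so any speed difference comes from how the instructions get scheduled on the card.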

I also cleaned up SHA256. I played with the idea of turning it into a for/while loop, but that actually dropped performance. The endian-swap trick got me thinking about Ch and Maj, but the ordering of the instructions is so interleaved that I wasn't able to get any more performance.

484KH/s from 474KH/s on my 7870
358KH/s from 353KH/s on my 7850

luke-jr commented 10 years ago

Wait, why are we doing endian swapping in OpenCL code??

OhGodAPet commented 10 years ago

You know, Luke, I wondered the same thing when I looked at it.

luke-jr commented 10 years ago

I looked into it more about a week ago. Looks unavoidable, a design flaw in the scrypt PoW algorithm :(

poiuty commented 10 years ago

First card: R9 270X, second card: HD 7770

"xintensity" : "4,4",
"gpu-threads" : "2,1",
"algorithm" : "adaptive-n-factor",
"gpu-engine" : "1150,1100",
"gpu-memclock" : "1250,1250",
"thread-concurrency" : "5121,4121",

alexkarold: R9 270X => 252.4Kh/s, HD 7770 => 85.34Kh/s
http://poiuty.com/img/6a5710719f94c06b257d82e5076b.png

lsoc: R9 270X => 252.4Kh/s, HD 7770 => 86.13Kh/s
http://poiuty.com/img/879516844048dc1a8cbc7b9bc03b.png
For the R9 270X, raising the clocks to E = 1170 and M = 1500 gives 259Kh/s:
http://poiuty.com/img/50f4717ab8e8c78f75de3a76dd3b.png

lsoc3: R9 270X => 243.6Kh/s, HD 7770 => 87.73Kh/s
http://poiuty.com/img/2d941d1932ad6c345a266893eeb1.png