Open sterlingpickens opened 10 years ago
Which kernel did you use as a base?
It's closer to zuikkis, as it doesn't have any of the pragma unrolls. The main functional difference from zuikkis is that it's not tied to lg2 (lookup-gap 2), although support for lookup gaps other than 2 might not help anyone. TBH I'm starting to have second thoughts about any performance benefit; everything seems to be basically within the margin of error. Maybe I should put this on the back burner for now. If you want to add it, I can submit a pull request and potentially tweak it a bit later on.
I've tested lsoc.cl on an R9 290 for more than 24h and I don't see any improvement. What have you changed in lsoc3.cl? Please post some more information instead of just a link...
This last revision has several changes. The most notable is the discovery that calling bitselect several times in quick succession is actually slower. Toying with the ideas from Issue #71, I found that some cards were faster and others slower with 4 in a row, but if only the first of the 4 is a bitselect, both of my cards are faster by 1-2% overall.
Four of either in a row is slower than a, b, b, b:

    X[0] = EndianSwapa(tmp[0]);
    X[1] = EndianSwapb(tmp[1]);
    X[2] = EndianSwapb(tmp[2]);
    X[3] = EndianSwapb(tmp[3]);
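To make the a, b, b, b pattern concrete, here is a minimal host-side C sketch of two equivalent 32-bit byte swaps. The bodies are assumptions, not the code from lsoc3.cl: `endian_swap_a` is guessed to be the bitselect form and `endian_swap_b` a mask-and-rotate form, and `rotl32`/`bitselect32` are hypothetical stand-ins for the OpenCL built-ins `rotate()` and `bitselect()`. It only shows that the two variants produce the same result, so mixing them is safe; the speed difference depends on how the GPU compiler schedules the instructions and will not show up on a CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for OpenCL rotate(x, n) on a 32-bit value. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* Host-side stand-in for OpenCL bitselect(a, b, c):
 * each result bit comes from b where c is 1, otherwise from a. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* Variant "a" (assumed form): one bitselect merging two rotations. */
static uint32_t endian_swap_a(uint32_t x) {
    return bitselect32(rotl32(x, 24), rotl32(x, 8), 0x00FF00FFu);
}

/* Variant "b" (assumed form): plain mask, rotate and OR. */
static uint32_t endian_swap_b(uint32_t x) {
    return rotl32(x & 0x00FF00FFu, 24) | (rotl32(x, 8) & 0x00FF00FFu);
}

/* The a, b, b, b ordering from the comment above, applied to four words. */
static void swap_block_abbb(const uint32_t tmp[4], uint32_t X[4]) {
    X[0] = endian_swap_a(tmp[0]);
    X[1] = endian_swap_b(tmp[1]);
    X[2] = endian_swap_b(tmp[2]);
    X[3] = endian_swap_b(tmp[3]);
}

/* Sanity check: both variants must agree on every input tried. */
static void check_swaps_agree(void) {
    const uint32_t samples[] = {0u, 0x11223344u, 0xFFFFFFFFu, 0x80000001u};
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(endian_swap_a(samples[i]) == endian_swap_b(samples[i]));
}
```

Because both variants return identical values, which one is used in each slot is purely a scheduling choice, and it has to be benchmarked per card.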
I also cleaned up SHA256. I played with the idea of turning it into a for/while loop, but that actually dropped performance. The endianswap trick got me thinking about Ch and Maj, but the instructions are so interleaved that I wasn't able to get any more performance out of them.
484KH/s from 474KH/s on my 7870
358KH/s from 353KH/s on my 7850
Wait, why are we doing endian swapping in OpenCL code??
You know, Luke, I wondered the same thing when I looked at it.
I looked into it more about a week ago. Looks unavoidable, a design flaw in the scrypt PoW algorithm :(
First card: R9 270X, second: HD 7770
"xintensity" : "4,4",
"gpu-threads" : "2,1",
"algorithm" : "adaptive-n-factor",
"gpu-engine" : "1150,1100",
"gpu-memclock" : "1250,1250",
"thread-concurrency" : "5121,4121",
alexkarold: R9 270X => 252.4Kh/s, HD7770 => 85.34Kh/s http://poiuty.com/img/6a5710719f94c06b257d82e5076b.png
lsoc: R9 270X => 252.4Kh/s, HD7770 => 86.13Kh/s http://poiuty.com/img/879516844048dc1a8cbc7b9bc03b.png
For the R9 270X, raising the clocks to E = 1170 and M = 1500 gets 259Kh/s http://poiuty.com/img/50f4717ab8e8c78f75de3a76dd3b.png
lsoc3: R9 270X => 243.6Kh/s, HD7770 => 87.73Kh/s http://poiuty.com/img/2d941d1932ad6c345a266893eeb1.png
http://sterlingdesktops.com/pub/test/lsoc.cl
I made some fairly extensive changes in this kernel. The binary/code size is smaller than a stock ck/zuikkis. Apparently my 7870/7850 cards really like this, and other cards might too. It works with lookup-gap 1 to 8. You'll likely have to experiment with a different TC value to get peak performance, but I was able to get +10KH/s on my cards.
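One reason the best TC value can shift with lookup-gap is scratchpad memory. The sketch below assumes a commonly used sizing rule rather than anything taken from this kernel: each stored scrypt entry is 128 bytes (r = 1), a thread keeps ceil(N / lookup_gap) entries, and the buffer holds that many entries for each of the TC threads. The function names are hypothetical, and N = 1024 in the usage note is only an example, since adaptive-n-factor changes N over time.

```c
#include <stddef.h>

/* Scratchpad entries kept per thread: ceil(n / lookup_gap).
 * Assumes lookup_gap >= 1. */
static size_t entries_per_thread(size_t n, size_t lookup_gap) {
    return n / lookup_gap + (n % lookup_gap != 0);
}

/* Total scratchpad size in bytes for a given thread concurrency.
 * The 128-byte entry size is an assumption (scrypt with r = 1). */
static size_t scratchpad_bytes(size_t n, size_t lookup_gap,
                               size_t thread_concurrency) {
    return (size_t)128 * entries_per_thread(n, lookup_gap) * thread_concurrency;
}
```

Under these assumptions, `scratchpad_bytes(1024, 2, 8192)` is 512 MiB, while the same TC at lookup-gap 4 needs half of that, so a higher lookup-gap leaves room for a larger TC at the cost of recomputing more entries.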