Have OpenCL auto-tune find optimal LWS that isn't a power of 2

solardiz commented 3 years ago

Experimenting with Bitcoin-opencl, I see that its optimal LWS for the old Titan Kepler is 384 (edit: or actually 768). However, auto-tune goes by powers of 2 and finds 512, resulting in worse speeds and higher GWS. While these specific GPUs don't matter much anymore, perhaps the issue could also be relevant to other/future devices. Perhaps we can do better by parsing device info? We do have 192 right here:

    Device #2 (6) name:     GeForce GTX TITAN
[...]
    Parallel compute cores: 14
    CUDA cores:             2688  (14 x 192)

Auto-tune:

LWS=512 GWS=21504 (42 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    1137 c/s real, 1137 c/s virtual, Dev#6 util: 100%

Manual LWS=384:

LWS=384 GWS=10752 (28 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    1259 c/s real, 1256 c/s virtual, Dev#6 util: 100%

Edit: for comparison, hashcat -b -m 11300 -w4:

Hashmode: 11300 - Bitcoin/Litecoin wallet.dat (Iterations: 200459)

Speed.#3.........:     1397 H/s (418.09ms) @ Accel:16 Loops:512 Thr:1024 Vec:1

This made me try further, and I got this:

LWS=768 GWS=10752 (14 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    1369 c/s real, 1371 c/s virtual, Dev#6 util: 100%

claudioandre-br commented 3 years ago

The culprit is https://github.com/openwall/john/commit/33f1af44dc2fec3f47a962090879d29801537074, Line 1641.

solardiz commented 3 years ago

The culprit is 33f1af4, Line 1641.

Not only that. get_kernel_preferred_multiple() returns 32 for this device, so even without explicit jumping to a power of two our doublings only go over powers of two anyway.

solardiz commented 3 years ago

our doublings only go over powers of two anyway.

And not only that. We also have this:

                if (gws % my_work_group != 0) {

                        if (GET_EXACT_MULTIPLE(gws, my_work_group) > global_work_size)
                            continue;

which I think makes it skip to next power of two for LWS when the GWS is a power of two.

(BTW, also the empty line in there should be deleted. Edit: and continue; indented consistently.)

claudioandre-br commented 3 years ago

I don't know exactly what problem you are facing (I'm testing other format, probably using another device). Anyway, if you remove the offending code [1] and tune using a proper GWS [2], it must work as expected. Or revert to tune using GWS=NULL (the way it used to work).

[1]

diff --git a/src/opencl_common.c b/src/opencl_common.c
index 17ff138..52f2f17 100644
--- a/src/opencl_common.c
+++ b/src/opencl_common.c
@@ -1873,19 +1873,7 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
                        kernelExecTimeNs = sumRunTime;
                        optimal_work_group = my_work_group;
                } else {
-                       if (my_work_group >= 256 ||
-                           (my_work_group >= 8 && wg_multiple < 8)) {
-                               /* Jump to next power of 2 */
-                               size_t x, y;
-                               x = my_work_group;
-                               while ((y = x & (x - 1)))
-                                       x = y;
-                               x *= 2;
-                               my_work_group =
-                                   GET_NEXT_MULTIPLE(x, wg_multiple);
-                               /* The loop logic will re-add wg_multiple */
-                               my_work_group -= wg_multiple;
-                       }
+
                }
        }
        BLOB_FREE(self, binary);

[2]

Calculating best LWS for GWS=10240
Tuning to length 7
Testing LWS=16 GWS=10240 ... 312.716 ms+
Testing LWS=32 GWS=10240 ... 313.250 ms
Testing LWS=64 GWS=10240 ... 313.366 ms
Testing LWS=80 GWS=10240 ... 313.354 ms
Testing LWS=128 GWS=10240 ... 312.818 ms
Testing LWS=160 GWS=10240 ... 313.012 ms
Testing LWS=256 GWS=10240 ... 312.383 ms
Testing LWS=320 GWS=10240 ... 313.390 ms
Testing LWS=512 GWS=10240 ... 312.526 ms
TestDB LWS=16 GWS=10240 (640 blocks) DONE
Speed for cost 1 (iteration count) of 5000
Raw:    1087 c/s real, 2048K c/s virtual

solardiz commented 3 years ago

Well, if I not only remove that code above, but also explicitly set GWS that is a multiple of my desired LWS, then yes, it does find that LWS. Otherwise, it does not. We need to somehow make it find both optimal LWS and optimal GWS by default.

That other logic I complained about above is from ad329dd39032469d998e4ef5e676e87794f26c9f, which keeps GWS from increasing beyond the allocated buffer sizes. Maybe a better fix would have been to lower GWS to the largest-that-still-fits multiple of the LWS being tested. Edit: lower it just for this one loop iteration, then revert, so that such lowers don't unnecessarily persist/accumulate in case we end up choosing an LWS for which they wouldn't be needed.

solardiz commented 3 years ago

Maybe a better fix would have been to lower GWS to the largest-that-still-fits multiple of the LWS being tested. Edit: lower it just for this one loop iteration, then revert

This is achieved by:

+++ b/src/opencl_common.c
@@ -1793,10 +1793,9 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,

                global_work_size = gws;
                if (gws % my_work_group != 0) {
-
-                       if (GET_EXACT_MULTIPLE(gws, my_work_group) > global_work_size)
-                           continue;
                        global_work_size = GET_EXACT_MULTIPLE(gws, my_work_group);
+                       while (global_work_size > gws)
+                               global_work_size -= my_work_group;
                }

                if (options.verbosity > VERB_LEGACY)
@@ -1873,6 +1872,7 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
                        kernelExecTimeNs = sumRunTime;
                        optimal_work_group = my_work_group;
                } else {
+#if 0
                        if (my_work_group >= 256 ||
                            (my_work_group >= 8 && wg_multiple < 8)) {
                                /* Jump to next power of 2 */
@@ -1886,6 +1886,7 @@ void opencl_find_best_lws(size_t group_size_limit, int sequential_id,
                                /* The loop logic will re-add wg_multiple */
                                my_work_group -= wg_multiple;
                        }
+#endif
                }
        }
        BLOB_FREE(self, binary);

However, the result is not pretty - the lower GWS affects speeds, too, and it's not the final GWS that would be used yet, so we end up with an LWS that's affected by those temporary GWS adjustments:

Calculating best LWS for GWS=4096
Testing LWS=32 GWS=4096 ... 255.532 ms+
Testing LWS=64 GWS=4096 ... 255.138 ms
Testing LWS=96 GWS=4032 ... 254.437 ms+
Testing LWS=128 GWS=4096 ... 254.840 ms
Testing LWS=160 GWS=4000 ... 257.221 ms
Testing LWS=192 GWS=4032 ... 256.718 ms
Testing LWS=224 GWS=4032 ... 280.646 ms
Testing LWS=256 GWS=4096 ... 256.307 ms
Testing LWS=288 GWS=4032 ... 255.334 ms
Testing LWS=320 GWS=3840 ... 254.026 ms
Testing LWS=352 GWS=3872 ... 252.773 ms+
Testing LWS=384 GWS=3840 ... 251.785 ms+
Testing LWS=416 GWS=3744 ... 250.571 ms+
Testing LWS=448 GWS=4032 ... 250.235 ms
Testing LWS=480 GWS=3840 ... 246.306 ms+
Testing LWS=512 GWS=4096 ... 246.618 ms
Testing LWS=544 GWS=3808 ... 280.583 ms
Testing LWS=576 GWS=4032 ... 274.732 ms
Testing LWS=608 GWS=3648 ... 277.180 ms
Testing LWS=640 GWS=3840 ... 275.170 ms
Testing LWS=672 GWS=4032 ... 319.777 ms
Testing LWS=704 GWS=3520 ... 316.054 ms
Testing LWS=736 GWS=3680 ... 314.608 ms
Testing LWS=768 GWS=3840 ... 314.644 ms
Calculating best GWS for LWS=480; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
xfer: 327.072 us, init: 215.136 us, loop: 100x125.172 ms, final: 172.256 us, xfer: 8.416 us
gws:     13440     1071 c/s   214692660 rounds/s    12.546 s per crypt_all()!
xfer: 814.400 us, init: 422.784 us, loop: 100x250.318 ms (exceeds 200 ms)
xfer: 145.120 us, init: 110.208 us, loop: 100x62.922 ms, final: 109.536 us, xfer: 4.288 us
gws:      6720     1065 c/s   213489900 rounds/s     6.307 s per crypt_all()-
LWS=480 GWS=13440 (28 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    1068 c/s real, 1068 c/s virtual, Dev#6 util: 100%

I am puzzled that lowering GWS from 4096 to 3840 hurts speed at LWS=768 so much, but apparently it does, so a much lower and weirder LWS ends up being chosen, then GWS is tuned for that one.

solardiz commented 3 years ago

I am puzzled that lowering GWS from 4096 to 3840 hurts speed at LWS=768 so much

This is probably not the correct interpretation of what happened. Maybe speeds at LWS=768 were always worse at that low and non-optimal GWS. If so, the issue isn't so much that we skipped testing this LWS, but also that we wouldn't have detected it as optimal when testing with a lower and far from final GWS.

Maybe the very first run finding initial best GWS should have already used a larger LWS. We know this device has 192 "CUDA cores" per SMX. Can't we just query this and use 192 for the initial LWS for finding the initial best GWS? Then the run finding best LWS can go from a smaller LWS like 32, but will use a more optimal GWS. Probably using that larger GWS in there would make the auto-tuning substantially slower, though.

solardiz commented 3 years ago

Maybe the very first run finding initial best GWS should have already used a larger LWS.

As an experiment, I tried hard-coding it as 192. Got initial GWS=6144 for finding best LWS. Ended up with LWS=480 again, with 768 being slower at that GWS. So no, that's not the solution.

solardiz commented 3 years ago

What puzzles me is that even if I set initial LWS to 768, the initial GWS finding ends up with GWS=6144 and not higher. This then leads to LWS=768 appearing non-optimal, and LWS=480 being chosen. Is the initial GWS finding run also limited in some other way causing this? Looks so, as the below hack of setting have_lws = 1; got around that issue:

 void opencl_find_best_gws(int step, int max_duration,
                           int sequential_id, unsigned int rounds, int have_lws)
 {
+       if (!have_lws)
+               local_work_size = 768;
+       have_lws = 1;

However, 736 ended up looking very slightly more optimal than 768, ultimately resulting in slightly lower speed at final GWS (and excessive final GWS):

Testing LWS=736 GWS=10304 ... 156.638 ms+
Testing LWS=768 GWS=10752 ... 156.622 ms
Calculating best GWS for LWS=736; max. 200 ms single kernel invocation.
Raw speed figures including buffer transfers:
xfer: 2.095 ms, init: 724.896 us, loop: 100x469.748 ms (exceeds 200 ms)
xfer: 1.043 ms, init: 366.304 us, loop: 100x234.908 ms (exceeds 200 ms)
LWS=736 GWS=61824 (84 blocks) DONE
Speed for cost 1 (iteration count) of 200460
Raw:    1312 c/s real, 1312 c/s virtual, Dev#6 util: 100%

Weird stuff, and feels like a waste of time trying to fix it. However, I wonder how hashcat manages to hit better speeds right away? Maybe its better speeds are of a different cause, and can be made better yet if it also used better tuned LWS?

magnumripper commented 3 years ago

We do have 192 right here:

    Device #2 (6) name:     GeForce GTX TITAN
[...]
    Parallel compute cores: 14
    CUDA cores:             2688  (14 x 192)

I too have experimented with this in the past, thinking it should be of some value, but I always ended up with inconclusive results when trying to generalize it. I still think "it should" be of value, I just don't know how.

As for hashcat I'm not sure right now but I think they have presets for common cards, perhaps with some run-time tune (or maybe not?). I'm really fond of the concept of unprejudiced auto-tune so I think we should keep it and try to make it better (just as you do here).

openwall / john

Have OpenCL auto-tune find optimal LWS that isn't a power of 2 #4696