Poor OpenMP scalability of MultiBit format

kost123 commented 6 years ago

When cracking multibit my cpu usage is only 25%

Fork command is not recognised, thought that might help.

Any way to get it to use 100%?

ethicalhack3r commented 6 years ago

Same issue on macOS High Sierra for me.

The documentation, and several blog posts online, mention uncommenting OMPFLAGS in the Makefile before compilation. But this setting is no longer in the Makefile.

magnumripper commented 6 years ago

$ OMP_NUM_THREADS=1 ../run/john -test -form:multibit 
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: multibit, MultiBit Wallet [MD5 AES 32/64]... DONE
Raw:    1006K c/s real, 996625 c/s virtual

$ GOMP_CPU_AFFINITY=0-3 OMP_NUM_THREADS=4 ../run/john -test -form:multibit 
Will run 4 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5 AES 32/64]... (4xOMP) DONE
Raw:    2267K c/s real, 771134 c/s virtual

$ GOMP_CPU_AFFINITY=0-7 ../run/john -test -form:multibit 
Will run 8 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5 AES 32/64]... (8xOMP) DONE
Raw:    1455K c/s real, 410032 c/s virtual

Yeah that format seems to scale poorly - and HT make it worse. I'll see if I can optimize some stuff tomorrow.

solardiz commented 6 years ago

@kost123 @ethicalhack3r For builds/settings/workarounds you could use right now (basically, use --fork in a build that supports it), you'd need to ask on the john-users mailing list. On this GitHub issue tracker, we're focused solely on improving bleeding-jumbo, not on providing user support. That said, maybe we'll have a revision of bleeding-jumbo where this format will scale well very soon, as per magnum's comment, and you could use that. Thanks!

magnumripper commented 6 years ago

I did what I could. I think HT now adds a little bit (it will depend on CPU though) instead of lowering performance but scaling is still poor.

solardiz commented 6 years ago

This format supports two very different versions - one is fast, the other is slow (uses scrypt). We should report them as a "tunable cost", and benchmark them separately. Also, we have several formats that use the relatively slow crypto_scrypt() API, which (de)allocates memory each time it's called:

$ fgrep 'crypto_scrypt(' *.c
django_scrypt_fmt_plug.c:       if (crypto_scrypt((unsigned char*)saved_key[index], strlen((char*)saved_key[index]),
ethereum_fmt_plug.c:                crypto_scrypt((unsigned char *)saved_key[index+i],
multibit_fmt_plug.c:            crypto_scrypt((const unsigned char*)password, (len + 1) * 2, salt_hardcoded, 8, 16384, 8, 1, key, 32);

scrypt_fmt.c gets this right by using per-thread "local" struct's with the escrypt_r() API.

But I digress. We should track these issues separately. I guess the poor scaling is mostly for the fast version?

solardiz commented 6 years ago

This improves scalability further:

diff --git a/src/multibit_fmt_plug.c b/src/multibit_fmt_plug.c
index 2774c0b..96e01a4 100644
--- a/src/multibit_fmt_plug.c
+++ b/src/multibit_fmt_plug.c
@@ -20,7 +20,7 @@ john_register_one(&fmt_multibit);
 #ifdef _OPENMP
 #include <omp.h>
 #ifndef OMP_SCALE
-#define OMP_SCALE               128 // just 8 is better for v2 salts
+#define OMP_SCALE               2
 #endif
 #endif

@@ -50,7 +50,7 @@ john_register_one(&fmt_multibit);
 #define SALT_ALIGN              sizeof(uint32_t)
 #define PLAINTEXT_LENGTH        125
 #define MIN_KEYS_PER_CRYPT      1
-#define MAX_KEYS_PER_CRYPT      1
+#define MAX_KEYS_PER_CRYPT      64 // just 4 is better for v2 salts

 static struct fmt_tests multibit_tests[] = {
        // Wallets created by MultiBit Classic 0.5.18
@@ -75,11 +75,13 @@ static struct custom_salt {
 static void init(struct fmt_main *self)
 {
 #ifdef _OPENMP
-       int omp_t = omp_get_num_threads();
+       int omp_t = omp_get_max_threads();

-       self->params.min_keys_per_crypt *= omp_t;
-       omp_t *= OMP_SCALE;
-       self->params.max_keys_per_crypt *= omp_t;
+       if (omp_t > 1) {
+               self->params.min_keys_per_crypt *= omp_t;
+               omp_t *= OMP_SCALE;
+               self->params.max_keys_per_crypt *= omp_t;
+       }
 #endif
        saved_key = mem_calloc(sizeof(*saved_key), self->params.max_keys_per_crypt);
        cracked = mem_calloc(sizeof(*cracked), self->params.max_keys_per_crypt);
@@ -245,7 +247,10 @@ static int crypt_all(int *pcount, struct db_salt *salt)
                AES_KEY aes_decrypt_key;
                int len = strlen(saved_key[index]);

-               cracked[index] = 0;
+#ifdef _OPENMP
+               if (cracked[index]) /* avoid false sharing of nearby elements */
+#endif
+                       cracked[index] = 0;

                if (cur_salt->type == 1) {
                        unsigned char c;

Results:

[solar@super src]$ OMP_NUM_THREADS=1 ../run/john -test -form=multibit
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... DONE
Raw:    907904 c/s real, 907904 c/s virtual

[solar@super src]$ OMP_NUM_THREADS=2 ../run/john -test -form=multibit
Will run 2 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (2xOMP) DONE
Raw:    1825K c/s real, 912896 c/s virtual

[solar@super src]$ OMP_NUM_THREADS=4 ../run/john -test -form=multibit
Will run 4 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (4xOMP) DONE
Raw:    3540K c/s real, 885248 c/s virtual

[solar@super src]$ OMP_NUM_THREADS=8 ../run/john -test -form=multibit
Will run 8 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (8xOMP) DONE
Raw:    6902K c/s real, 862848 c/s virtual

[solar@super src]$ OMP_NUM_THREADS=16 ../run/john -test -form=multibit
Will run 16 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (16xOMP) DONE
Raw:    13000K c/s real, 812544 c/s virtual

[solar@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=multibit
Will run 32 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (32xOMP) DONE
Raw:    15495K c/s real, 484678 c/s virtual

It is possible to do slightly better with another doubling of OMP_SCALE (on top of what's in the diff above), but interactivity suffers and speeds at fewer threads appear to be come less stable (we might be crossing into a next level cache):

[solar@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=multibit
Will run 32 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (32xOMP) DONE
Raw:    16269K c/s real, 509371 c/s virtual

solardiz commented 6 years ago

There were several issues (and many still remain, but I fixed the most pressing ones with the patch above):

Use of omp_get_num_threads() (which just returns 1 in this place) instead of omp_get_max_threads() - lots of formats currently have this bug (probably indicating no one ever bothered tuning their scaling)
The false sharing of cracked[] elements in this fast format (v1 hashes) was really bad (but relatively minor compared to the above bug). My workaround should be sufficient to avoid false write/read sharing in the typical case (no cracks), but we could do better by introducing an any_cracked variable (which we'd only zeroize before the parallel loop), like we have in some other "salt-only" formats.
There's no point in applying OMP_SCALE when we run 1 thread (this applies to many other formats as well).
max_keys_per_crypt = 1 is way too low for a fast format even without OpenMP or when running 1 thread. So the scaling factor should be optimally split (I tuned this) between the initial MAX_KEYS_PER_CRYPT and OMP_SCALE (this applies to many other formats as well).

solardiz commented 6 years ago

The v1 and v2 hashes are so different, and require different tuning, that I suggest we split this format into two, or at least into two struct fmt_main in the same source file.

magnumripper commented 6 years ago

Use of omp_get_num_threads() (which just returns 1 in this place) instead of omp_get_max_threads() - lots of formats currently have this bug (probably indicating no one ever bothered tuning their scaling)

Wow. Good you found that one. We need to review this, should be easy using a grep oneliner.

There's no point in applying OMP_SCALE when we run 1 thread (this applies to many other formats as well).

I'm not so sure about that. For semi fast formats, I recall applying OMP_SCALE can compensate for OpenMP overhead - so OMP_NUM_THREADS=1 will perform close to non-OMP speed.

magnumripper commented 6 years ago

I applied that patch in your name in 00655ad63.

magnumripper commented 6 years ago

Remaining formats with same bug (omp_get_num_threads) fixed in 112d95414.

magnumripper commented 6 years ago

Hmm, some bots now have this

Testing: multibit, MultiBit Wallet [MD5/scrypt AES 32/32]... FAILED (cmp_all(64))

magnumripper commented 6 years ago

Ah, I see the problem, will fix.

magnumripper commented 6 years ago

Fixed in 790d23b24

kholia commented 6 years ago

Thanks @magnumripper. I think this is a regular mistake I make in my code.

solardiz commented 6 years ago

For semi fast formats, I recall applying OMP_SCALE can compensate for OpenMP overhead - so OMP_NUM_THREADS=1 will perform close to non-OMP speed.

For formats this fast, the non-OpenMP max_keys_per_crypt should be high enough to amortize the usual function calling, etc. overhead, and then it'd also amortize the overhead of dummy OpenMP (the extra code prior to the for loops). Maybe optimal max_keys_per_crypt is slightly higher for dummy OpenMP than for non-OpenMP, but that's a minor difference - could be a factor of 2, which is much smaller than we typically set OMP_SCALE to.

So I think we should prefer to have higher default/initial MAX_KEYS_PER_CRYPT over applying OMP_SCALE even in the 1 thread case.

BTW, maybe we move this essentially duplicate code from the many init() methods into some helper function, perhaps in formats.c and perhaps accepting the format's OMP_SCALE as parameter? Right now, to skip the multiplying by OMP_SCALE for the 1 thread case would involve changes to lots of formats, whereas then it would be a change in one shared function.

solardiz commented 6 years ago

I think this specific issue is fixed. We could want to split this format in two (fast vs. slow) under a separate issue. The need to revise/re-tune other formats identified due to this issue should also be worked on separately.

openwall / john

Poor OpenMP scalability of MultiBit format #2846