Closed kost123 closed 6 years ago
Same issue on macOS High Sierra for me.
The documentation, and several blog posts online, mention uncommenting OMPFLAGS
in the Makefile before compilation. But this setting is no longer in the Makefile.
$ OMP_NUM_THREADS=1 ../run/john -test -form:multibit
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: multibit, MultiBit Wallet [MD5 AES 32/64]... DONE
Raw: 1006K c/s real, 996625 c/s virtual
$ GOMP_CPU_AFFINITY=0-3 OMP_NUM_THREADS=4 ../run/john -test -form:multibit
Will run 4 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5 AES 32/64]... (4xOMP) DONE
Raw: 2267K c/s real, 771134 c/s virtual
$ GOMP_CPU_AFFINITY=0-7 ../run/john -test -form:multibit
Will run 8 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5 AES 32/64]... (8xOMP) DONE
Raw: 1455K c/s real, 410032 c/s virtual
Yeah that format seems to scale poorly - and HT make it worse. I'll see if I can optimize some stuff tomorrow.
@kost123 @ethicalhack3r For builds/settings/workarounds you could use right now (basically, use --fork
in a build that supports it), you'd need to ask on the john-users mailing list. On this GitHub issue tracker, we're focused solely on improving bleeding-jumbo, not on providing user support. That said, maybe we'll have a revision of bleeding-jumbo where this format will scale well very soon, as per magnum's comment, and you could use that. Thanks!
I did what I could. I think HT now adds a little bit (it will depend on CPU though) instead of lowering performance but scaling is still poor.
This format supports two very different versions - one is fast, the other is slow (uses scrypt). We should report them as a "tunable cost", and benchmark them separately. Also, we have several formats that use the relatively slow crypto_scrypt()
API, which (de)allocates memory each time it's called:
$ fgrep 'crypto_scrypt(' *.c
django_scrypt_fmt_plug.c: if (crypto_scrypt((unsigned char*)saved_key[index], strlen((char*)saved_key[index]),
ethereum_fmt_plug.c: crypto_scrypt((unsigned char *)saved_key[index+i],
multibit_fmt_plug.c: crypto_scrypt((const unsigned char*)password, (len + 1) * 2, salt_hardcoded, 8, 16384, 8, 1, key, 32);
scrypt_fmt.c
gets this right by using per-thread "local" struct's with the escrypt_r()
API.
But I digress. We should track these issues separately. I guess the poor scaling is mostly for the fast version?
This improves scalability further:
diff --git a/src/multibit_fmt_plug.c b/src/multibit_fmt_plug.c
index 2774c0b..96e01a4 100644
--- a/src/multibit_fmt_plug.c
+++ b/src/multibit_fmt_plug.c
@@ -20,7 +20,7 @@ john_register_one(&fmt_multibit);
#ifdef _OPENMP
#include <omp.h>
#ifndef OMP_SCALE
-#define OMP_SCALE 128 // just 8 is better for v2 salts
+#define OMP_SCALE 2
#endif
#endif
@@ -50,7 +50,7 @@ john_register_one(&fmt_multibit);
#define SALT_ALIGN sizeof(uint32_t)
#define PLAINTEXT_LENGTH 125
#define MIN_KEYS_PER_CRYPT 1
-#define MAX_KEYS_PER_CRYPT 1
+#define MAX_KEYS_PER_CRYPT 64 // just 4 is better for v2 salts
static struct fmt_tests multibit_tests[] = {
// Wallets created by MultiBit Classic 0.5.18
@@ -75,11 +75,13 @@ static struct custom_salt {
static void init(struct fmt_main *self)
{
#ifdef _OPENMP
- int omp_t = omp_get_num_threads();
+ int omp_t = omp_get_max_threads();
- self->params.min_keys_per_crypt *= omp_t;
- omp_t *= OMP_SCALE;
- self->params.max_keys_per_crypt *= omp_t;
+ if (omp_t > 1) {
+ self->params.min_keys_per_crypt *= omp_t;
+ omp_t *= OMP_SCALE;
+ self->params.max_keys_per_crypt *= omp_t;
+ }
#endif
saved_key = mem_calloc(sizeof(*saved_key), self->params.max_keys_per_crypt);
cracked = mem_calloc(sizeof(*cracked), self->params.max_keys_per_crypt);
@@ -245,7 +247,10 @@ static int crypt_all(int *pcount, struct db_salt *salt)
AES_KEY aes_decrypt_key;
int len = strlen(saved_key[index]);
- cracked[index] = 0;
+#ifdef _OPENMP
+ if (cracked[index]) /* avoid false sharing of nearby elements */
+#endif
+ cracked[index] = 0;
if (cur_salt->type == 1) {
unsigned char c;
Results:
[solar@super src]$ OMP_NUM_THREADS=1 ../run/john -test -form=multibit
Warning: OpenMP is disabled; a non-OpenMP build may be faster
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... DONE
Raw: 907904 c/s real, 907904 c/s virtual
[solar@super src]$ OMP_NUM_THREADS=2 ../run/john -test -form=multibit
Will run 2 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (2xOMP) DONE
Raw: 1825K c/s real, 912896 c/s virtual
[solar@super src]$ OMP_NUM_THREADS=4 ../run/john -test -form=multibit
Will run 4 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (4xOMP) DONE
Raw: 3540K c/s real, 885248 c/s virtual
[solar@super src]$ OMP_NUM_THREADS=8 ../run/john -test -form=multibit
Will run 8 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (8xOMP) DONE
Raw: 6902K c/s real, 862848 c/s virtual
[solar@super src]$ OMP_NUM_THREADS=16 ../run/john -test -form=multibit
Will run 16 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (16xOMP) DONE
Raw: 13000K c/s real, 812544 c/s virtual
[solar@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=multibit
Will run 32 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (32xOMP) DONE
Raw: 15495K c/s real, 484678 c/s virtual
It is possible to do slightly better with another doubling of OMP_SCALE
(on top of what's in the diff above), but interactivity suffers and speeds at fewer threads appear to be come less stable (we might be crossing into a next level cache):
[solar@super src]$ GOMP_CPU_AFFINITY=0-31 ../run/john -test -form=multibit
Will run 32 OpenMP threads
Benchmarking: multibit, MultiBit Wallet [MD5/scrypt AES 32/64]... (32xOMP) DONE
Raw: 16269K c/s real, 509371 c/s virtual
There were several issues (and many still remain, but I fixed the most pressing ones with the patch above):
Use of omp_get_num_threads()
(which just returns 1 in this place) instead of omp_get_max_threads()
- lots of formats currently have this bug (probably indicating no one ever bothered tuning their scaling)
The false sharing of cracked[]
elements in this fast format (v1 hashes) was really bad (but relatively minor compared to the above bug). My workaround should be sufficient to avoid false write/read sharing in the typical case (no cracks), but we could do better by introducing an any_cracked
variable (which we'd only zeroize before the parallel loop), like we have in some other "salt-only" formats.
There's no point in applying OMP_SCALE
when we run 1 thread (this applies to many other formats as well).
max_keys_per_crypt = 1
is way too low for a fast format even without OpenMP or when running 1 thread. So the scaling factor should be optimally split (I tuned this) between the initial MAX_KEYS_PER_CRYPT
and OMP_SCALE
(this applies to many other formats as well).
The v1 and v2 hashes are so different, and require different tuning, that I suggest we split this format into two, or at least into two struct fmt_main
in the same source file.
- Use of
omp_get_num_threads()
(which just returns 1 in this place) instead ofomp_get_max_threads()
- lots of formats currently have this bug (probably indicating no one ever bothered tuning their scaling)
Wow. Good you found that one. We need to review this, should be easy using a grep oneliner.
- There's no point in applying
OMP_SCALE
when we run 1 thread (this applies to many other formats as well).
I'm not so sure about that. For semi fast formats, I recall applying OMP_SCALE can compensate for OpenMP overhead - so OMP_NUM_THREADS=1 will perform close to non-OMP speed.
I applied that patch in your name in 00655ad63.
Remaining formats with same bug (omp_get_num_threads) fixed in 112d95414.
Hmm, some bots now have this
Testing: multibit, MultiBit Wallet [MD5/scrypt AES 32/32]... FAILED (cmp_all(64))
Ah, I see the problem, will fix.
Fixed in 790d23b24
Thanks @magnumripper. I think this is a regular mistake I make in my code.
For semi fast formats, I recall applying OMP_SCALE can compensate for OpenMP overhead - so OMP_NUM_THREADS=1 will perform close to non-OMP speed.
For formats this fast, the non-OpenMP max_keys_per_crypt should be high enough to amortize the usual function calling, etc. overhead, and then it'd also amortize the overhead of dummy OpenMP (the extra code prior to the for
loops). Maybe optimal max_keys_per_crypt is slightly higher for dummy OpenMP than for non-OpenMP, but that's a minor difference - could be a factor of 2, which is much smaller than we typically set OMP_SCALE to.
So I think we should prefer to have higher default/initial MAX_KEYS_PER_CRYPT over applying OMP_SCALE even in the 1 thread case.
BTW, maybe we move this essentially duplicate code from the many init()
methods into some helper function, perhaps in formats.c
and perhaps accepting the format's OMP_SCALE as parameter? Right now, to skip the multiplying by OMP_SCALE for the 1 thread case would involve changes to lots of formats, whereas then it would be a change in one shared function.
I think this specific issue is fixed. We could want to split this format in two (fast vs. slow) under a separate issue. The need to revise/re-tune other formats identified due to this issue should also be worked on separately.
When cracking multibit my cpu usage is only 25%
Fork command is not recognised, thought that might help.
Any way to get it to use 100%?