LM problem w/ -fork - Githubissues

frank-dittrich commented 10 years ago

./jtrts.pl -noprelims -passthru '--fork=3'
...
form=pwdump_lm                    guesses: 1122 -show=2757 0:00:00:00 DONE : Expected count(s) (986)(-show2760)  [!!!FAILED!!!]
.pot CHK:pwdump_lm                guesses:  983 -show=2757 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]
...
Some tests had Errors. Performed 280 tests.1 errors  1 errors reprocessing the .POT files
Time used was 465 seconds

This is with latest jtrts commit (19ca304106ebebb5d0d9b717adc2cb4626cc9808) and latest bleeding-jumbo commit (445440621046cfb3f0d7efe3fb1e062385f6ffb4)

$ ./john --list=build-info 
Version: 1.8.0.2-bleeding-jumbo
Build: linux-gnu 64-bit AVX-autoconf
Arch: 64-bit LE
$JOHN is ./
Format interface version: 12
Max. number of reported tunable costs: 3
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
Max. Markov mode level: 400
Max. Markov mode password length: 30
Compiler version: 4.2.1 Compatible Clang 3.4 (tags/RELEASE_34/final)
gcc version: 4.2.1
clang version: 3.4 (tags/RELEASE_34/final)
OpenSSL library version: 01000105f
OpenSSL 1.0.1e-fips 11 Feb 2013
GMP library version: 5.1.2
NSS library version: 3.17 Basic ECC (loaded: 3.17 Extended ECC)
NSPR library version: 4.10.7
Kerberos version 5 support enabled
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's

frank-dittrich commented 10 years ago

I get

form=pwdump_lm                    guesses: 1125 0:00:00:00 DONE  [PASSED]
.pot CHK:pwdump_lm                guesses:  981 -show=2734 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]

with a GCC build (OMP enabled) when I pass --fork=3.

 $ ./john --list=build-info 
Version: 1.8.0.2-bleeding-jumbo OMP
Build: linux-gnu 64-bit AVX-autoconf
Arch: 64-bit LE
$JOHN is ./
Format interface version: 12
Max. number of reported tunable costs: 3
Rec file version: REC4
Charset file version: CHR3
CHARSET_MIN: 1 (0x01)
CHARSET_MAX: 255 (0xff)
CHARSET_LENGTH: 24
Max. Markov mode level: 400
Max. Markov mode password length: 30
Compiler version: 4.8.3 20140911 (Red Hat 4.8.3-7)
gcc version: 4.8.3
OpenSSL library version: 01000105f
OpenSSL 1.0.1e-fips 11 Feb 2013
GMP library version: 5.1.2
NSS library version: 3.17 Basic ECC (loaded: 3.17 Extended ECC)
NSPR library version: 4.10.7
Kerberos version 5 support enabled
fseek(): fseek
ftell(): ftell
fopen(): fopen
memmem(): System's

jfoug commented 10 years ago

I get lm fails on GCC build, OMP -fork=3

JtR/bleed/test$ ./jtrts.pl -noprelims -passthru '--fork=3' -q ....... form=lm guesses: 1493 -show=2992 0:00:00:00 DONE : Expected count(s) (1500)(-show3000) [!!!FAILED!!!] .pot CHK:lm guesses: 1487 -show=2987 0:00:00:00 DONE : Expected count(s) (1500) [!!!FAILED!!!]

NOTE, if I just run against lm, I get LM failures, then LM working, then lm_pwdump working sometimes and failing others.

The runs are inconsistant. I do not think this is a problem with TS, it is something between LM and fork mode.

magnumripper commented 10 years ago

I can't reproduce (on x86-64 OSX). Note that the exact number for "guesses" is supposed to vary "randomly" depending on how many dupe cracks you happened to get, but the -show figure should not vary at all.

In general, issues like this "can't" really be due to -fork. Basically a forked process is no different than a -node process - there is virtually no fork specific code anywhere in john after the cracking has started. But -fork may reveal a bug that happened to hide without it.

jfoug commented 10 years ago

Run this over and over again, and see if you get failures at times:

./jtrts.pl -noprelims -passthru '--fork=3' lm

On my ubuntu-64 (OMP build), sometimes it works properly, sometimes not. Yes, I understand that guesses will vary, but show should not, but I am seeing that at times jtr is NOT cracking all the hashes. I have not looked any deeper than showing that I see the exact same thing Frank is seeing.

magnumripper commented 10 years ago

Nope, but I made a for loop with increasing fork count and found I can trigger it with sufficiently high number of forks.

$ ./jtrts.pl -noprelims -passthru '--fork=9' lm --stoponerror
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.16, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [darwin13.4.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=lm                           guesses: 1503 -show=2997 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]
Exiting on error.  The pot file ./tst.pot contains the found data
The command used to run this test was:
../run/john -ses=./tst --fork=9  -pot=./tst.pot LM_tst.in --wordlist=pw.dic -form=lm

$ ../run/john -pot=./tst.pot LM_tst.in -form:lm -show=left
u540-LM:2:$LM$8756e789c3e08314
u441-LM:2:$LM$34acd11f14f1a3ec
u108-LM:2:$LM$11028e91b87b43a5

1497 password hashes cracked, 3 left

Indeed, three hashes are not cracked. So this is some problem in JtR.

I can't seem to trigger the problem when using -node=1/9 instead of -fork=9... and I did try 2/9 and so on too.

magnumripper commented 10 years ago

I have verified that this problem exist in versions prior to our recently added self-tests. So they are not causing it.

magnumripper commented 10 years ago

What scares me the most is that this is not likely a problem with LM at all. It might be wordlist mode vs. node/fork, or perhaps pot sync.

frank-dittrich commented 10 years ago

Yes, I also get a varying number of guesses when I repeatedly run

for i in `seq 2 40`; do echo $i; ./jtrts.pl -noprelims -q -passthru "--fork=$i" lm --stoponerror; done

1st run:

2
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
form=pwdump_lm                    guesses:  982 -show=2756 0:00:00:00 DONE : Expected count(s) (986)(-show2760)  [!!!FAILED!!!]
...
40
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
form=lm                           guesses: 1481 -show=2980 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]

2nd run:

2
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
All tests passed without error.  Performed 2 tests.  Time used was 1 seconds
...
40
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.15, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [linux-gnu 64-bit AVX-autoconf]
--------------------------------------------------------------------------------
form=lm                           guesses: 1511 -show=2995 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]

magnumripper commented 10 years ago

This is too hard to debug using gdb or the like. We should probably use git-bisect.

magnumripper commented 10 years ago

Core John does not seem to have the issue (but TS has an issue https://github.com/magnumripper/jtrTestSuite/issues/17 when trying)

magnumripper commented 10 years ago

git-bisect is a piece of work. Here's the offending commit: 1f827c7a

Ouch! This doesn't make it easy.

magnumripper commented 10 years ago

OK so it happens during -fork with a shared memory mapped buffer but per-node indexes (of each node's words in that shared buffer). But how can it be intermittent!?

frank-dittrich commented 10 years ago

And what causes only LM tests to fail? For --fork=2 ... --fork=14, I only see lm and pwdump_lm tests fail.

magnumripper commented 10 years ago

Not sure if that's worth contemplating yet.

I tried disabling mmap() and the bug is still there. So it's not about the shared memory map but the per-node index buffer. That's a relief.

magnumripper commented 10 years ago

I was wrong - we don't use any index buffer unless rules are involved. So what we have is a shared memory buffer, and mgetl() instead of fgetl(). But I tried disabling mmap() and bug still shows, which means we are just using fgetl() and no magic. We practically do not use anything that was touched in that commit. So what the heck is wrong!? And how come it was introduced with that patch?

magnumripper commented 10 years ago

OK, it's not that commit! If I check out the commit before it, and add -mem=1 to disable memory buffering, the bug shows up!

./jtrts.pl -noprelims -passthru '--fork=9 -mem=1' lm
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.16, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [macosx-x86-64]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=lm                           guesses: 1506 -show=2334 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]
.pot CHK:lm                       guesses:  834 -show=2334 0:00:00:00 DONE : Expected count(s) (1500)  [!!!FAILED!!!]

form=pwdump_lm                    guesses: 1276 -show=1571 0:00:00:00 DONE : Expected count(s) (986)(-show2760)  [!!!FAILED!!!]
.pot CHK:pwdump_lm                guesses:  318 -show=1571 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]

Some tests had Errors. Performed 2 tests.2 errors  2 errors reprocessing the .POT files
Time used was 2 seconds

Back to bisecting. But this was a relief!!

magnumripper commented 10 years ago

0b870fa is bad. The problem is I do not have any good commit because 0b870fa is the very merge from core that implemented -fork. So we can't bisect this. I need to re-test core.

magnumripper commented 10 years ago

I really can't trigger the bug in core.

magnumripper commented 10 years ago

OK I'm puzzled. I now tested bleeding-jumbo but with a wordlist.c from core John, just modified a tad for autoconf so it builds.

$ ./jtrts.pl -noprelims -passthru '--fork=9' lm 
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.16, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo_omp [darwin13.4.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=lm                           guesses: 1504 -show=2998 0:00:00:00 DONE : Expected count(s) (1500)(-show3000)  [!!!FAILED!!!]
.pot CHK:lm                       guesses: 1498 -show=2998 0:00:00:00 DONE : Expected count(s) (1500)  [!!!FAILED!!!]

form=pwdump_lm                    guesses: 1372 0:00:00:00 DONE  [PASSED]
.pot CHK:pwdump_lm                guesses:  984 -show=2758 0:00:00:00 DONE : Expected count(s) (986)  [!!!FAILED!!!]

Some tests had Errors. Performed 2 tests.1 errors  2 errors reprocessing the .POT files
Time used was 2 seconds

Whatever the bug is, it's not in wordlist.c

magnumripper commented 10 years ago

bd10920 is the latest merge from core. That's the version to focus on - what differences do we have that can have this result?

LM_fmt.c has no differences against core
wordlist.c has differences, but using a wordlist.c from core still show the bug

magnumripper commented 10 years ago

wordlist.c has differences, but using a wordlist.c from core still show the bug

Not true! If I use wordlist.c from core, in a bleeding-jumbo of bd10920, bug goes away. To use it I had to insert this line:

diff --git a/src/wordlist.c b/src/wordlist.c
index 064b33d..9a4b855 100644
--- a/src/wordlist.c
+++ b/src/wordlist.c
@@ -19,6 +19,7 @@
 #include "params.h"
 #include "common.h"
 #include "path.h"
+#include "os.h"
 #include "signals.h"
 #include "loader.h"
 #include "logger.h"

Weirdest thing, the same operation does not make the bug go away in HEAD of bleeding-jumbo.

magnumripper commented 10 years ago

I officially give up here, for now.

magnumripper commented 10 years ago

~~Heureka, I found a way to trigger it without using fork.~~

No, that just showed I'm too tired. I will give up now.

frank-dittrich commented 10 years ago

IMHO its not an issue with locking the pot file. I ran

$ ./john --format=lm ../test/LM_tst.in --wordlist=../test/pw.dic --fork=11 > john.stdout ; sort -u john.pot |wc -l
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-11 of 11 (fork)
2 138g 0:00:00:00 DONE (2014-09-26 13:40) 6900g/s 26650p/s 26650c/s 39975KC/s SKIPPIN..R1234B5
Press 'q' or Ctrl-C to abort, almost any other key for status
10 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s B123456..Z123456
7 135g 0:00:00:00 DONE (2014-09-26 13:40) 4500g/s 17766p/s 17766c/s 24198KC/s Z123456..W123456
8 134g 0:00:00:00 DONE (2014-09-26 13:40) 4466g/s 17766p/s 17766c/s 24198KC/s LATE*NI..X123456
5 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s 1234567..U123456
9 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s ..Y123456
6 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s BERT*ER..V123456
3 136g 0:00:00:00 DONE (2014-09-26 13:40) 4533g/s 17766p/s 17766c/s 24198KC/s BERT$ER..S12345C
4 135g 0:00:00:00 DONE (2014-09-26 13:40) 4500g/s 17766p/s 17766c/s 24198KC/s BERT$ER..T123456
11 137g 0:00:00:00 DONE (2014-09-26 13:40) 4566g/s 17766p/s 17766c/s 24198KC/s  ..0A23456
1 127g 0:00:00:00 DONE (2014-09-26 13:40) 3175g/s 13325p/s 13325c/s 2052KC/s SKIPPIN..Q123A45
Waiting for 10 children to terminate
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1486

and

$ ./john --format=lm ../test/LM_tst.in --wordlist=../test/pw.dic > john.stdout2 ; sort -u john.pot |wc -l
Warning: poor OpenMP scalability for this hash type, consider --fork=8
Will run 8 OpenMP threads
Press 'q' or Ctrl-C to abort, almost any other key for status
14g 0:00:00:00 DONE (2014-09-26 13:41) 350.0g/s 146575p/s 146575c/s 2052KC/s SKIPPIN..0A23456
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1500

Let's see what has not been cracked in the --fork=11 run:

$ cat john.stdout2 
Loaded 1500 password hashes with no different salts (LM [DES 128/128 AVX-16])
Remaining 14 password hashes with no different salts
AB1266           (u1266-LM:2)
AB1244           (u1244-LM:2)
AB1086           (u1086-LM:2)
AB936            (u936-LM:2)
AB760            (u760-LM:2)
AB755            (u755-LM:2)
AB723            (u723-LM:2)
AB716            (u716-LM:2)
AB540            (u540-LM:2)
AB441            (u441-LM:2)
AB268            (u268-LM:2)
AB265            (u265-LM:2)
AB232            (u232-LM:2)
AB78             (u78-LM:2)

$ grep AB1266 john.stdout*
john.stdout2:AB1266           (u1266-LM:2)

$ grep u1244-LM:2 john.stdout*
john.stdout2:AB1244           (u1244-LM:2)

frank-dittrich commented 10 years ago

I can reproduce the bug with Markov mode.

$ cut -d: -f 2- john.pot > lm.txt
$ ./calc_stat lm.txt lm-stats
zero -10*log proba1[65] (1499) / 1500 converted to 1
zero -10*log proba2[65*256+66] (1499) / 1499, converted to 1 to prevent infinite length candidates
zero -10*log proba2[75*256+73] (1) / 1, converted to 1 to prevent infinite length candidates
zero -10*log proba2[83*256+75] (1) / 1, converted to 1 to prevent infinite length candidates
$ ./mkvcalcproba lm-stats lm.txt |cut -f 3|sort -n|uniq -c
reading stats ... [0x1ac5010]
119 lines parsed [0x1ac5010]
scanning password file ...
freeing stuff ...
charsetsize = 17
      1 10
     13 28
      5 41
      5 46
     27 50
     93 51
      5 59
      9 63
     46 64
     24 68
     76 69
     27 72
    402 73
    650 74
      1 75
      5 77
     24 81
     86 82
      1 99

Then I created a Markov mode that should crack 1498 out of those 1500 passwords:

 [Markov:lm]
Statsfile = $JOHN/lm-stats
MkvLvl = 82
MkvMaxLen = 7
MkvMinLvl = 28
; MkvMinLen = 0

I verified that this Markov mode works as expected:

$ ./john --format=lm ../test/LM_tst.in --markov=lm > /dev/null ; sort -u john.pot |wc -l;     
Warning: poor OpenMP scalability for this hash type, consider --fork=8
Will run 8 OpenMP threads
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538)
1498g 0:00:00:00 DONE (2014-09-26 15:51) 49933g/s 51133p/s 51133c/s 76700KC/s AB1000..S
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1498

Then I tried --fork:

$ for i in `seq 2 40`; do echo $i; rm john.pot ; ./john --format=lm ../test/LM_tst.in --markov=lm --fork=$i > /dev/null ; sort -u john.pot |wc -l; done
...
4
Will run 2 OpenMP threads per process (8 total across 4 processes)
Node numbers 1-4 of 4 (fork)
2 367g 0:00:00:00 DONE (2014-09-26 15:53) 12233g/s 13100p/s 13100c/s 19650KC/s AB1341..AB322
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538 split over nodes)
3 383g 0:00:00:00 DONE (2014-09-26 15:53) 12766g/s 13133p/s 13133c/s 14906KC/s AB314..AB669
1 375g 0:00:00:00 DONE (2014-09-26 15:53) 9375g/s 9850p/s 9850c/s 3851KC/s AB1000..AB1350
Waiting for 3 children to terminate
4 373g 0:00:00:00 DONE (2014-09-26 15:53) 9325g/s 9600p/s 9600c/s 10896KC/s AB660..S
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1486
5
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-5 of 5 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538 split over nodes)
4 317g 0:00:00:00 DONE (2014-09-26 15:53) 15850g/s 15850p/s 15850c/s 23775KC/s AB452..AB737
2 296g 0:00:00:00 DONE (2014-09-26 15:53) 14800g/s 15850p/s 15850c/s 23775KC/s AB1272..AB189
3 310g 0:00:00:00 DONE (2014-09-26 15:53) 15500g/s 15800p/s 15800c/s 23700KC/s AB1814..AB460
5 303g 0:00:00:00 DONE (2014-09-26 15:53) 15150g/s 15400p/s 15400c/s 23100KC/s AB729..S
1 312g 0:00:00:00 DONE (2014-09-26 15:53) 15600g/s 15850p/s 15850c/s 23775KC/s AB1000..AB1281
Waiting for 4 children to terminate
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1498
6
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-6 of 6 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
MKV start (stats=$JOHN/lm-stats, lvl=28-82 len=7 pwd=1538 split over nodes)
2 266g 0:00:00:00 DONE (2014-09-26 15:53) 13300g/s 13300p/s 13300c/s 19950KC/s AB1226..AB1465
3 239g 0:00:00:00 DONE (2014-09-26 15:53) 11950g/s 13250p/s 13250c/s 19875KC/s AB1457..AB322
4 266g 0:00:00:00 DONE (2014-09-26 15:53) 13300g/s 13300p/s 13300c/s 19950KC/s AB314..AB553
1 248g 0:00:00:00 DONE (2014-09-26 15:53) 12400g/s 13300p/s 13300c/s 3511KC/s AB1000..AB1235
Waiting for 5 children to terminate
6 251g 0:00:00:00 DONE (2014-09-26 15:53) 25100g/s 25600p/s 25600c/s 38400KC/s AB776..S
5 266g 0:00:00:00 DONE (2014-09-26 15:53) 13300g/s 13300p/s 13300c/s 19950KC/s AB545..AB784
Warning: passwords printed above might be partial and not be all those cracked
Use the "--show" option to display all of the cracked passwords reliably
Session completed
1496
...

As with wordlist mode, the number of cracked passwords varies when I run the same command repeatedly.

jfoug commented 10 years ago

Should we focus on the obvious? LM built with OMP has some issues (somehow) with -fork

frank-dittrich commented 10 years ago

With --single=none and user name = password,

AB251:$LM$0004c62411e4e2a4
...

I am not able to reproduce the error. I guess that's due to the fact that in this case it's not the list of password candidates that gets split across "nodes".

When I change 1499 user names into AB and use this as single mode

[List.Rules:lm]
:
$[0-9]
Az"[1-90][0-9]"
Az"[1-90][0-9][0-9]"
Az"1[0-4][0-9][0-9]"

I can reproduce the error using single mode:

$ for i in `seq 2 40`; do echo -n $i; echo -n " "; rm john.pot ; ./john --format=lm lm.hashes --single:lm --fork=$i > /dev/null 2> /dev/null ; sort -u john.pot |wc -l; done
2 1496
3 1494
4 1488
5 1484
6 1491
7 1495
8 1493
9 1491
10 1486
11 1474
12 1492
13 1479
14 1488
15 1478
16 1492
17 1490
18 1495
19 1491
20 1494
21 1496
22 1490
23 1496
24 1495
25 1489
26 1498
27 1491
28 1493
29 1492
30 1497
31 1480
32 1493
33 1477
34 1494
35 1489
36 1495
37 1485
38 1479
39 1479
40 1497

frank-dittrich commented 10 years ago

On 09/26/2014 04:07 PM, JimF wrote:

Should we focus on the obvious? LM built with OMP has some issues (somehow) with -fork?

Since I reproduced this with clang (which doesn't support OMP), it is not related to OMP.

And so far we couldn't reproduce it in core. So, LM triggers it either because it is extremely fast (which helps to trigger race conditions) or may be because hashes are split into 2 halves. If it is because hashes are split, we should be able do reproduce this with bigcrypt (descrypt) hashes. But I doubt that is the issue.

frank-dittrich commented 10 years ago

BTW: I can reproduce the issue not just with the pw.dic from jtrts, but also with this file (containing almost 10 million password candidates of length up to 7 and converted to upper case:

$ wc -l LM-shuf.txt 
9384287 LM-shuf.txt

I created it like this:

cut -b 1-7 rockyou.txt ../test/pw.dic |tr a-z A-Z |LC_ALL=C sort -u | shuf -o LM-shuf.txt

So the error does not depend on the high frequency of valid password candidates in the worlist file.

magnumripper commented 10 years ago

Hmm so we have indications we only miss right-side hash halves? That is very interesting. We could make a file with all the same hashes but in raw single-hash-per-line format instead of pwdump and see what happens.

magnumripper commented 10 years ago

Nope, that was not it. Same problem. Also, using an infile completely lacking dupe halves does not help.

magnumripper commented 10 years ago

Maybe we should make a much bigger input file, and run it with lots of threads many times with core John so we really rule it out.

magnumripper commented 10 years ago

I ran core with 500,000 hashes and 2-30 forks, it didn't miss a beat.

magnumripper commented 10 years ago

Right now I have this without even using fork

$ ./jtrts.pl -noprelims lm
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.12.16, Sept 24, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-bleeding-jumbo [darwin13.4.0 64-bit AVX-autoconf]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=lm                           guesses: 1500 0:00:00:00 DONE  [PASSED]
.pot CHK:lm                       guesses: 1500 0:00:00:00 DONE  [PASSED]

form=pwdump_lm                    guesses:  646 -show=2385 0:00:00:00 DONE : Expected count(s) (986)(-show2760)  [!!!FAILED!!!]
.pot CHK:pwdump_lm                guesses:  646 0:00:00:00 DONE  [PASSED]

Some tests had Errors. Performed 2 tests.1 errors
Time used was 3 seconds

This is with JtR 0ffc0c68d and TS magnumripper/jtrTestSuite@beadda6f. I'm probably just being dumbforked now? Better not even try to do anything.

EDIT: The above problem was due to a john.local.conf enabling UTF-8 and other things. Another footgun!

frank-dittrich commented 10 years ago

Tests with non-OMP-builds (as well as with OMP_NUM_THREADS=1) suggest that the number of forked processes must be greater than the number of physical CPUs (or at least some of the CPUs must be busy with some other work) to cause failed tests.

frank-dittrich commented 10 years ago

Funny enough, the change that "fixes" issue #798 helps to avoid this bug as well.

$ git diff
diff --git a/run/john.conf b/run/john.conf
index fc8e3f7..4b4799c 100644
--- a/run/john.conf
+++ b/run/john.conf
@@ -156,7 +156,7 @@ SecureMode = N
 # If set to Y, a session using --fork or MPI will signal to other nodes when
 # it has written cracks to the pot file (note that this writing is delayed
 # by buffers and the "Save" timer above), so they will re-sync.
-ReloadAtCrack = Y
+ReloadAtCrack = N

 # If set to Y, resync pot file when saving session.
 ReloadAtSave = Y

magnumripper commented 10 years ago

Yeah but the bug can be seen long before pot reload was implemented at all. So this only changes the playground a little.

frank-dittrich commented 10 years ago

Could other bugs that meanwhile have been fixed have caused this problem before pot reload has been implemented?

magnumripper commented 10 years ago

I think not. I'm pretty sure I already tested the very commit before pot reload was introduced (because it is a "usual suspect") and the bug was there.

frank-dittrich commented 10 years ago

Then may be you did something wrong. Even with ReloadAtCrack = Y, your latest commit 3b4c29f471d05bfa48beda8249d7dfef4c4e2746 fixes that problem for me. But when I checkout 479a6ea7b2b60d28aad1321426a0ad3304a221f0, the bug still exists.

frank-dittrich commented 10 years ago

This bug seems to be related to #798. With latest bleeding-jumbo, I (rarely) get the problem with the number of forked processes close to the number when bug #798 occurs:

 $ (for i in `seq 650 999`; do echo -n "$i "; rm john.pot john.log; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} > /dev/null 2> /dev/null; LC_ALL=C sort -u john.pot | wc -l; done); echo $?
650 1500
651 1500
652 1500
653 1500
654 1500
655 1500
656 1500
657 1500
658 1500
659 1500
660 1500
661 1500
662 1500
663 1499
664 1500
665 1500
666 1500
667 1500
668 1500
669 User defined signal 2
140

In this case, --fork=663 cracked just 1499 hashes.

frank-dittrich commented 10 years ago

At least, john now behaves in a consistent way:

$ (for i in `seq 1 99`; do rm john.pot john.log; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=663 > /dev/null 2> /dev/null; echo -n " `LC_ALL=C sort -u john.pot | wc -l`"; done); echo $?
 1499 1499 1499 1499 1499 1499 1499 1499 1499^C^C

Which reminds me of http://www.despair.com/consistency.html

frank-dittrich commented 10 years ago

With this john.local.conf

[Local:Options]
ReloadAtCrack = N
ReloadAtSave = N

I manage to get less than 1500 unique john.pot lines only if I hit the ulimit of (in my case currently 1024) processes per user. That looks like this:

[fd@f20x32 run]$ (for i in `seq 1020 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt; LC_ALL=C sort -u john.pot | wc -l; done |LC_ALL=C grep -v 1500 ); echo $?
/bin/rm: cannot remove ‘john.pot’: No such file or directory
bash: fork: retry: Resource temporarily unavailable
1020    174
bash: fork: retry: Resource temporarily unavailable
1021    32
1022    53
1023    15
1024    16
0
[fd@f20x32 run]$ grep error t-102*.log
t-1020.log:1 0:00:00:00 Terminating on error, john.c:496
t-1020.log:1 0:00:00:00 Terminating on error, john.c:496
t-1021.log:1 0:00:00:00 Terminating on error, john.c:496
t-1021.log:1 0:00:00:00 Terminating on error, john.c:496
t-1022.log:1 0:00:00:00 Terminating on error, john.c:496
t-1022.log:1 0:00:00:00 Terminating on error, john.c:496
t-1023.log:1 0:00:00:00 Terminating on error, john.c:496
t-1023.log:1 0:00:00:00 Terminating on error, john.c:496
t-1024.log:1 0:00:00:00 Terminating on error, john.c:496
t-1024.log:1 0:00:00:00 Terminating on error, john.c:496

So I guess I'll kep that john.local.conf ans issue a

killall -s USR2 john

Whenever I want a pot file sync.

magnumripper commented 10 years ago

I think we should fix (or at least consider) https://github.com/magnumripper/jtrTestSuite/issues/20 before going on with this.

magnumripper commented 10 years ago

Anyway, I can not reproduce this bug on OSX anymore, even with default john.conf. Except when I go past ulimits.

magnumripper commented 10 years ago

However, I can reproduce it using 9d9c271 (and -mem=1) which was before there was any pot reload or USR2 signalling at all.

magnumripper commented 10 years ago

Can you reproduce this with current code?

frank-dittrich commented 10 years ago

Before Monday I'll not have access to the system where I want to do the re-test.

frank-dittrich commented 10 years ago

$ (for i in `seq 2 1024`; do echo -n -e "$i\t"; echo $i > i.txt; /bin/rm john.pot; ./john --wordlist=../test/pw.dic ../test/LM_tst.in --format=lm --fork=${i} --session=t-$i > stdout-$i.txt 2> stderr-$i.txt ; echo -n -e "$?\t"; LC_ALL=C sort -u john.pot | wc -l; done | LC_ALL=C grep -v 1500 ); echo $?
323 0   1499
325 0   1499
331 0   1499
342 0   1499
346 0   1499
350 0   1499
352 0   1499
356 0   1499
358 0   1499
373 0   1499
374 0   1499
376 0   1499
377 0   1499
379 
0
$ sort -u john.pot | wc -l
1499
$ grep -i error t-3??.log
$ grep -c Cracked t-377.log 
1518
$ grep Cracked t-377.log |cut -d+ -f 2- |sort -u|wc -l 
1499
$ wc -l stdout-377.txt 
1519 stdout-377.txt
$ head stdout-377.txt 
Loaded 1500 password hashes with no different salts (LM [DES 128/128 SSE2])
AB208            (u208-LM:2)
AB585            (u585-LM:2)
AB962            (u962-LM:2)
AB1339           (u1339-LM:2)
AB207            (u207-LM:2)
AB584            (u584-LM:2)
AB961            (u961-LM:2)
AB1338           (u1338-LM:2)
AB216            (u216-LM:2)
$ head -n 4 stderr-377.txt 
Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
Node numbers 1-377 of 377 (fork)
6 4g 0:00:00:00 DONE (2014-11-09 22:42) 12.12g/s 48.48p/s 48.48c/s 72727C/s BERT*ER..IZE__12
3 4g 0:00:00:00 DONE (2014-11-09 22:42) 6.779g/s 27.11p/s 27.11c/s 40677C/s BERT$ER..L3__12
$ grep -c " DONE " stderr-377.txt 
377
$ grep -v -n " DONE " stderr-377.txt 
1:Warning: OpenMP was disabled due to --fork; a non-OpenMP build may be faster
2:Node numbers 1-377 of 377 (fork)
30:Press 'q' or Ctrl-C to abort, almost any other key for status
195:Waiting for 376 children to terminate
382:Warning: passwords printed above might be partial and not be all those cracked
383:Use the "--show" option to display all of the cracked passwords reliably
384:Session completed

So, I need a much higher number of forked processes to reproduce this. But the number of forked processes (> 300, but < 400 on an otherwise idle system)is still way lower than the ulimit -u (1024). I get no indication of errors in log files, in stderr or stdout output or exit code.

magnumripper commented 10 years ago

Trying the same on OSX, it works fine up to 572, and from that point the problem can be seen in the stderr file: "fork: Resource temporarily unavailable". So I can't reproduce any problem with John. That's with an ulimit of 709 max user processes.

openwall / john

LM problem w/ -fork #758