openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

-fork may be borking .pot file (and other FILE io) on jumbo. #1043

Closed. jfoug closed this issue 9 years ago.

jfoug commented 9 years ago

Issue found with the TS (test suite). It started as a stdout-only problem, but the .pot file also appears to be getting truncated when -fork is being used.

An fflush() was needed to fix the stdout problem. That may be the fix needed on all FILE I/O when in fork mode. Time will tell.
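
For reference, the stdout fix boiled down to flushing right after the guess is printed. A rough sketch of that pattern (the function and names here are illustrative, not the actual john code):

#include <stdio.h>

/* Illustrative only: print a cracked guess and flush right away, so forked
 * children do not leave partial lines sitting in stdio buffers. */
static void report_guess(const char *ciphertext, const char *plaintext)
{
    printf("%s:%s\n", ciphertext, plaintext);
    fflush(stdout); /* without this, -fork children lose/interleave output */
}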

This is an example showing the symptom:

$ ./jtrts.pl -b=/JtR/bleed/run -passthru=-fork=2
-------------------------------------------------------------------------------
- JtR-TestSuite (jtrts). Version 1.13, Dec 21, 2014.  By, Jim Fougeron & others
- Testing:  John the Ripper password cracker, version 1.8.0.2-jumbo-1-bleeding_omp_memdbg_asan [linux-gnu 64-bit SSE4.1-autoconf]
--------------------------------------------------------------------------------

John Jumbo build detected.

form=dynamic_0                    guesses: 1502 0:00:00:00 DONE  [PASSED]
.pot CHK:dynamic_0                guesses: 1500 0:00:00:00 DONE  [PASSED] (1500 val-pwd)

.....
form=dynamic_15                   guesses: 1500 -show= 752 0:00:00:03 DONE : Expected count(s) (1500)  [!!!FAILED!!!]
.pot CHK:dynamic_15               guesses:  752 -show= 752 0:00:00:00 DONE : Expected count(s) (1500)  [!!!FAILED!!!]  (752 val-pwd)

In this case, -show only gave us 752, which was what was in the .pot file. The .pot check also shows that only 752 lines were in the .pot, and on re-run john worked with them, so this does not appear to be any 'bug' in the test suite. It only appears to happen in -fork mode.

jfoug commented 9 years ago

Ok, I ran again with -stop on the command line. The .pot was truncated to 1049 entries; a wc on the .pot file shows 1050 lines, and I found this in the .pot file (showing why only 1049):

$dynamic_9$6f09fa563ac911fdea73cda5318c7b79$zIOqOeIk:}Ry4ppppppppppYP; 
$dynamic_9$b1ccabc84972f4295be16a978fef0f21$xshNSrtz:*eX8ppppppppppk|E@
$dynamic_9$c37fdaf2cd8c82f9bd208a28f0dd186c$8ut3rtKa:(_Fxpppppppppp@CQK
$dynamic_9$6b4822f6e523545b2becb4fe11c8f72d$cWtOo79x:#ww2mlkn
Hrgnu:I(e|ppppppppppppppppppppppppppppppppppppppppppppppv$@.
$dynamic_9$26a4c805473dd391c68ec905430e61f4$ClCS6YXL:1_@=pppppppppppppppppppppppA60R
$dynamic_9$38fc889fb3fc56d00890ea62d3974dcb$NTGdWyLr:jZX&pppppppppppppppppppppIs.8

That is the same 'classic' sign of a corrupted file we were seeing with the stdout problem, which was fixed by an fflush() call after printing the found guess.

jfoug commented 9 years ago
A process may only hold one type of lock (shared or exclusive) on a
file. Subsequent flock() calls on an already locked file will convert
an existing lock to the new lock mode.

We do have flock(pot_fd, LOCK_SH) within cracker, but that is ONLY during pot reload.

COULD that be the problem? In the TS, things happen fast. It could be that one process has already written into the .pot file, then the other starts up and blows away the lock type by trying to load the .pot file. Just a theory, since flock() is a new function to me; it makes sense, since it is simply a mutex-like function. @magnumripper, is there any reason why we are using LOCK_SH in the flock within the .pot file reload? I would think exclusive would be just fine there. It would mean john blocks, allowing only one process to access the .pot file at a time during startup, BUT that is really the way it should be on a multi-processing system, especially given the man page's statement that a shared-lock request will pretty much clobber all of the exclusive locks being waited upon.

jfoug commented 9 years ago

Changing the flock in cracker to LOCK_EX did not change the problem. I have kept it at LOCK_SH for now.

jfoug commented 9 years ago

Ok, here is a decent looking example:

Execute john: /JtR/bleed/run/john -ses=./tst  -fork=2 -pot=./tst.pot dynamic_15_new_tst.in --wordlist=pw-new.dic 2>&1 >/dev/null

Cmd_data = 
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
1 754g 0:00:00:02 DONE (2015-02-05 00:22) 375.1g/s 966.6p/s 1450Kc/s 1450KC/s aab2b..DSr]pppppppppO(C\
Waiting for 1 child to terminate
2 746g 0:00:00:02 DONE (2015-02-05 00:22) 365.6g/s 952.4p/s 1428Kc/s 1428KC/s ;C>5ppt^q^..ko}GpppVeLj
At Program Exit
MemDbg_Validate level 0 checking Passed
Use the "--show" option to display all of the cracked passwords reliably
Session completed
At Program Exit
MemDbg_Validate level 0 checking Passed

Execute john: /JtR/bleed/run/john -show    -pot=./tst.pot dynamic_15_new_tst.in -form=dynamic_15

d_show_line = 
747 password hashes cracked, 753 left

form=dynamic_15                   guesses: 1500 -show= 747 0:00:00:02 DONE : Expected count(s) (1500)  [!!!FAILED!!!]
Exiting on error. The .pot file ./tst.pot contains the found data
The command used to run this test was:

/JtR/bleed/run/john -ses=./tst  -fork=2 -pot=./tst.pot dynamic_15_new_tst.in --wordlist=pw-new.dic
$ wc tst.pot
  748   806 97257 tst.pot

Then at very bottom of tst.pot file:

$dynamic_15$c2f97f7d9423b3eb8938caad5e044519$HEX$4a3438494752306f2424555477656c76655f6368617273:!\zMppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp#Af0
$dynamic_15$af99574f198b6b4a5374d7ec932912f1$HEX$325278467a6176672424557531:es>YpppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppgI`X
$dynamic_15$69dd3eb59597ffa1996b9bd4a314cf9e$HEX$454f6c50786b455324245575736572:d(`fpppppY.zm
pcZH<E7>
$dynamic_15$fced19c087c1b8b27ffd192da415d77c$HEX$4438643544336a3124245548616e6b:Y7LCpppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp&<wg
(END)

So what appears to have happened here is that the 754-guess process wrote its data first. Then the 746-guess fork wrote its data right over the top, starting at the top of the file. A bit of the first write (1 full line and a partial) extended past the end of the second, so it still shows up.

I am also surprised about 754/746 counts. Usually it is 750/750 for fork.

jfoug commented 9 years ago

For this one it is not so easy to determine the reason (the .pot file is 100 lines 'short'):

Execute john: /JtR/bleed/run/john -ses=./tst  -fork=2 -pot=./tst.pot dynamic_1_new_tst.in --wordlist=pw-new.dic 2>&1 >/dev/null

Cmd_data = 
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
2 460g 0:00:00:01 DONE (2015-02-05 00:38) 310.8g/s 1312p/s 1969Kc/s 1969KC/s ;C>5ppt^q^..ko}GpppVeLj
At Program Exit
MemDbg_Validate level 0 checking Passed
1 463g 0:00:00:01 DONE (2015-02-05 00:38) 300.6g/s 1261p/s 1892Kc/s 1892KC/s aab2b..DSr]pppppppppO(C\
Waiting for 1 child to terminate
Use the "--show" option to display all of the cracked passwords reliably
Session completed
At Program Exit
MemDbg_Validate level 0 checking Passed

Execute john: /JtR/bleed/run/john -show    -pot=./tst.pot dynamic_1_new_tst.in -form=dynamic_1

d_show_line = 
823 password hashes cracked, 677 left

form=dynamic_1                    guesses:  923 -show= 823 0:00:00:01 DONE : Expected count(s) (1500)(923)  [!!!FAILED!!!]
Exiting on error. The .pot file ./tst.pot contains the found data
The command used to run this test was:

/JtR/bleed/run/john -ses=./tst  -fork=2 -pot=./tst.pot dynamic_1_new_tst.in --wordlist=pw-new.dic

$ wc tst.pot
  824   874 74861 tst.pot
magnumripper commented 9 years ago

I set LOCK_SH for reading, LOCK_EX for writing.

What good would a LOCK_EX do in pot sync? Pot sync only reads so it uses LOCK_SH. Other processes can set a LOCK_SH simultaneously and sync too, without waiting. That's the idea of a shared lock.

Any process wanting LOCK_EX for writing new pot entries will obviously wait for those LOCK_SH before it gets its lock.
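
In other words, the intended pattern is roughly this (a simplified sketch, not the actual cracker/logger code; pot_fd stands in for the real descriptor and pexit() is john's print-error-and-exit helper, as used in the snippets below):

/* Reader (pot sync): shared lock, so several readers can sync at once. */
static void pot_read_lock(int pot_fd)
{
    while (flock(pot_fd, LOCK_SH))
        if (errno != EINTR)
            pexit("flock(LOCK_SH)");
}

/* Writer (new pot entries): exclusive lock, waits for all shared locks. */
static void pot_write_lock(int pot_fd)
{
    while (flock(pot_fd, LOCK_EX))
        if (errno != EINTR)
            pexit("flock(LOCK_EX)");
}

/* Both sides release the same way when done. */
static void pot_unlock(int pot_fd)
{
    if (flock(pot_fd, LOCK_UN))
        pexit("flock(LOCK_UN)");
}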

magnumripper commented 9 years ago
A process may only hold one type of lock (shared or exclusive) on a
file. Subsequent flock() calls on an already locked file will convert
an existing lock to the new lock mode.

Only process A (or perhaps a subthread of process A, but we don't have threads doing file IO) can blow away its own lock the way you describe. Otherwise we could give up on locking completely.

It could be that one process has already written into the .pot file, the other starts up, and blows the lock type by trying to load the .pot file?

If process A has a lock, process B can not blow it.

jfoug commented 9 years ago

Yes, but a LOCK_SH will blow away (change) any lock waiting in LOCK_EX. (Edit: so that only applies to threads within the same process? If so, then I would 'assume' this would be ok. It is outside of the OMP loop, correct?)

I have a printf in the reload_pot, and it is not being called anyway.

Here is some new information. I added some logging to the pot flush code. It looks like all data in the flush made it out, and to the 'proper' location. Could this bug be something totally different, in that the last process sometimes does not write out its final data???

Warning: detected hash type "dynamic_4", but the string is also recognized as "osc"
Use the "--format=osc" option to force loading these as that type instead
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
 ***In log_file_flush, writing to offset 0, count=28536 total after write=28536
 ***In log_file_flush, reset crk_pot_pos to 0
 ***In log_file_flush, writing to offset 28536, count=28559 total after write=57095
2 458g 0:00:00:00 DONE (2015-02-05 01:06) 1832g/s 1832p/s 1483Kc/s 1678KC/s )|jMp�Gge..s7g9y)
1 458g 0:00:00:00 DONE (2015-02-05 01:06) 1761g/s 1761p/s 1426Kc/s 1613KC/s \y^>pppppppppppppp~!OV..w&K4pppp{A#b
Waiting for 1 child to terminate
At Program Exit
MemDbg_Validate level 0 checking Passed
Use the "--show" option to display all of the cracked passwords reliably
Session completed
At Program Exit
MemDbg_Validate level 0 checking Passed

FAILED line = guesses: 458  0:00:00:00 DONE (2015-02-05 01:06) 1832g/s 1832p/s 1483Kc/s 1678KC/s )|jMp�Gge..s7g9y)
Execute john: /JtR/bleed/run/john -show    -pot=./tst.pot dynamic_4_new_tst.in -form=dynamic_4 2>&1

d_show_line2 = 
916 password hashes cracked, 584 left
1 invalid passwords
916 valid passwords

.pot CHK:dynamic_4                guesses:  916 -show= 916 0:00:00:00 DONE : Expected count(s) (1500)  [!!!FAILED!!!]  (916 val-pwd)
Exiting on error. The .pot file ./tst.pot contains the found data
The command used to run this test was:

/JtR/bleed/run/john -ses=./tst  -fork=2 -pot=./tst.pot dynamic_4_new_tst.in --wordlist=pw-new.dic

$ ls -l tst.pot
-rwxrwx--- 1 root vboxsf 57095 Feb  5 01:06 tst.pot
jfoug commented 9 years ago

This was what I was hoping to see (at some time). This shows that for whatever reason, the locking is not always working:

Execute john: /JtR/bleed/run/john -ses=./tst  -fork=2 -pot=./tst.pot dynamic_13_tst.in --wordlist=pw.dic 2>&1 >/dev/null

Cmd_data = 
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
 ***In log_file_flush, writing to offset 0, count=32834 total after write=32834
 ***In log_file_flush, reset crk_pot_pos to 0
 ***In log_file_flush, writing to offset 0, count=32803 total after write=32803
 ***In log_file_flush, reset crk_pot_pos to 0
 ***In log_file_flush, writing to offset 32834, count=16950 total after write=49784
 ***In log_file_flush, reset crk_pot_pos to 32834
1 752g 0:00:00:04 DONE (2015-02-05 01:17) 170.9g/s 667.0p/s 1000Kc/s 1000KC/s Skippin� an�*..0A234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
Waiting for 1 child to terminate
 ***In log_file_flush, writing to offset 32803, count=16843 total after write=49646
 ***In log_file_flush, reset crk_pot_pos to 32803
2 748g 0:00:00:04 DONE (2015-02-05 01:17) 163.6g/s 642.0p/s 963019c/s 963019C/s Skipping and*..Z1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
At Program Exit
MemDbg_Validate level 0 checking Passed
Use the "--show" option to display all of the cracked passwords reliably
Session completed
At Program Exit
MemDbg_Validate level 0 checking Passed

Execute john: /JtR/bleed/run/john -show    -pot=./tst.pot dynamic_13_tst.in -form=dynamic_13

d_show_line = 
750 password hashes cracked, 750 left

form=dynamic_13                   guesses: 1500 -show= 750 0:00:00:04 DONE : Expected count(s) (1500)  [!!!FAILED!!!]
Exiting on error. The .pot file ./tst.pot contains the found data
The command used to run this test was:

/JtR/bleed/run/john -ses=./tst  -fork=2 -pot=./tst.pot dynamic_13_tst.in --wordlist=pw.dic
magnumripper commented 9 years ago

Could this bug be something totally different, in that the last process sometimes does not write out its final data???

That sounds more plausible to me. So why would that happen? Are all processes really completing? I recently changed a bunch of exit() calls into error() because that's a much better way to abort a job.

jfoug commented 9 years ago

The above looks ALMOST like 2 totally separate file descriptors acting on the same file.

jfoug commented 9 years ago

This is the code I am using

#if OS_FLOCK
    while (flock(f->fd, LOCK_EX)) {
        if (errno != EINTR)
            pexit("flock(LOCK_EX)");
    }
#endif
    if (f == &pot) {
        pos_b4 = (long int)lseek(f->fd, 0, SEEK_END);
+       if (options.fork) {
+           fprintf (stderr, " ***In log_file_flush, writing to offset %ld, count=%d total after write=%ld\n", pos_b4, count, pos_b4+count);
+       }
    }
    if (write_loop(f->fd, f->buffer, count) < 0) pexit("write");
    f->ptr = f->buffer;

    if (f == &pot && pos_b4 == crk_pot_pos) {
        crk_pot_pos = (long int)lseek(f->fd, 0, SEEK_CUR);
+       fprintf (stderr, " ***In log_file_flush, reset crk_pot_pos to %ld\n", pos_b4);
    }
#if OS_FLOCK
    if (flock(f->fd, LOCK_UN))
        pexit("flock(LOCK_UN)");
#endif
magnumripper commented 9 years ago

After a fork() the file descriptors are effectively duplicated. In order for locking to work we need to (re)open the pot and log files after forking. But I'm pretty sure we do that.
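
The point being: flock() locks belong to the open file description, and after fork() both processes initially share that description, so locks taken through the inherited descriptor would not exclude each other. A hypothetical sketch of what the reopen amounts to (not the actual fork path in john; pexit() as in the snippets above):

#include <fcntl.h>
#include <unistd.h>

/* Close the descriptor inherited from the parent and open the same file
 * again, so this process gets its own file description to lock. */
static int reopen_after_fork(int inherited_fd, const char *name)
{
    int fd;

    close(inherited_fd);
    if ((fd = open(name, O_WRONLY | O_APPEND)) < 0)
        pexit("open: %s", name);
    return fd;
}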

magnumripper commented 9 years ago

Add a debug print showing you indeed execute code within that #if OS_FLOCK block. That goes for pot sync too.

 #if OS_FLOCK
     while (flock(f->fd, LOCK_EX)) {
         if (errno != EINTR)
             pexit("flock(LOCK_EX)");
     }
+    fprintf(stderr, "successful LOCK_EX\n");
 #endif
jfoug commented 9 years ago

I agree. Looking at the "***In log_file_flush..." messages, it looks like that is almost always the case.

But it is HARD to explain away the output I showed a few messages up. There are 2 distinct instances, as shown by the lseek(end) and the counts printed, BUT they are writing to the same file.

I am going to make a simple script that takes the TS out of the picture and simply hammers away on a known fast input until the .pot file comes up short. All data will be logged to a file, and the temp stuff cleaned up between each run (if things are good). It should be pretty easy to do, and once done it will be a real quick way to get all the 'testing' we need.

jfoug commented 9 years ago

Add a debug print showing

Will do.

NOTE, I have debug output for pot reload, and that code is never executed. Is that the -reload-pot stuff (where you use the .pot as the input file)? If so, it is nowhere near the execution path we are using anyway.

magnumripper commented 9 years ago

That pot reload is executed on signals and on timers. For short jobs, it's often called very late if at all.

jfoug commented 9 years ago

I have not seen it called 'yet'. I certainly have not seen it called on the ones that have been failing, so for now, it is not on my radar.

magnumripper commented 9 years ago

It only reads so it shouldn't be able to do much harm. Actually it could probably do fine without locking at all.

jfoug commented 9 years ago

Ok, new script:

cat flock-tst 

#!/bin/sh
rm bug.*
/JtR/bleed/run/john -fork=2 dynamic_1_tst.in -w=pw.dic -pot=./bug.pot -ses=./bug -form=dynamic_1 >./bug.stdout 2>./bug.stderr
CNT=$(wc bug.pot | awk '{print $1;}')
if [ x$CNT != x1500 ] ; 
then
    echo failed
fi

I ran this a couple of times, until I got a failure.

Here are the relevant lines:

$ wc bug.pot
  1131   1202 100803 bug.pot
$ wc bug.stdout
 1195  2461 41377 bug.stdout
$ cat bug.stderr
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
successful LOCK_EX
successful LOCK_EX
successful LOCK_EX
successful LOCK_EX
Press 'q' or Ctrl-C to abort, almost any other key for status
successful LOCK_EX
 ***In log_file_flush, writing to offset 0, count=32844 total after write=32844
 ***In log_file_flush, reset crk_pot_pos to 0
successful LOCK_EX
successful LOCK_EX
 ***In log_file_flush, writing to offset 0, count=32800 total after write=32800
 ***In log_file_flush, reset crk_pot_pos to 0
successful LOCK_EX
successful LOCK_EX
 ***In log_file_flush, writing to offset 32800, count=32830 total after write=65630
 ***In log_file_flush, reset crk_pot_pos to 32800
successful LOCK_EX
successful LOCK_EX
successful LOCK_EX
 ***In log_file_flush, writing to offset 65630, count=1607 total after write=67237
 ***In log_file_flush, reset crk_pot_pos to 65630
1 755g 0:00:00:02 DONE (2015-02-05 01:46) 280.6g/s 1091p/s 1636Kc/s 1636KC/s Skippin� an�*..0A234567890123456789012
Waiting for 1 child to terminate
successful LOCK_EX
successful LOCK_EX
 ***In log_file_flush, writing to offset 32844, count=32774 total after write=65618
 ***In log_file_flush, reset crk_pot_pos to 32844
successful LOCK_EX
successful LOCK_EX
successful LOCK_EX
 ***In log_file_flush, writing to offset 65618, count=792 total after write=66410
 ***In log_file_flush, reset crk_pot_pos to 65618
2 745g 0:00:00:02 DONE (2015-02-05 01:46) 272.8g/s 1074p/s 1612Kc/s 1612KC/s Skipping and*..Z1234567890123456789012
successful LOCK_EX
At Program Exit
MemDbg_Validate level 0 checking Passed
Use the "--show" option to display all of the cracked passwords reliably
Session completed
successful LOCK_EX
At Program Exit
MemDbg_Validate level 0 checking Passed

I am done for the night. I have to sleep sometime.

jfoug commented 9 years ago

Btw, here is a 'good' test of the flock-tst script:

$ ./flock-tst
$ wc bug.*
  1520   7653  58346 bug.log
  1500   1595 133647 bug.pot
    38    194   1535 bug.stderr
  1156   2357  39695 bug.stdout
  4214  11799 233223 total

NOTE, I am not sure why bug.stdout is short. I would think it would be 1500+ lines also.

NOTE, the stdout is getting overwrites??? I thought that was fixed.

jfoug commented 9 years ago

Here is a bug.stderr for a run that worked:

$ cat bug.stderr
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
successful LOCK_EX
 ***In log_file_flush, writing to offset 0, count=32771 total after write=32771
 ***In log_file_flush, reset crk_pot_pos to 0
successful LOCK_EX
 ***In log_file_flush, writing to offset 32771, count=32835 total after write=65606
successful LOCK_EX
 ***In log_file_flush, writing to offset 65606, count=32788 total after write=98394
successful LOCK_EX
 ***In log_file_flush, writing to offset 98394, count=1614 total after write=100008
successful LOCK_EX
 ***In log_file_flush, writing to offset 100008, count=32844 total after write=132852
1 755g 0:00:00:02 DONE (2015-02-05 02:01) 275.5g/s 1071p/s 1606Kc/s 1606KC/s Skippin� an�*..0A234567890123456789012
Waiting for 1 child to terminate
successful LOCK_EX
 ***In log_file_flush, writing to offset 132852, count=795 total after write=133647
2 745g 0:00:00:02 DONE (2015-02-05 02:01) 267.9g/s 1055p/s 1583Kc/s 1583KC/s Skipping and*..Z1234567890123456789012
At Program Exit

Updated (I only output 'successful LOCK_EX' if it is the pot file).

jfoug commented 9 years ago

Here is a bug.stderr from a failure ('successful LOCK_EX' only printed for pot file locks):

$ cat bug.stderr
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
successful LOCK_EX
 ***In log_file_flush, writing to offset 0, count=32841 total after write=32841
 ***In log_file_flush, reset crk_pot_pos to 0
successful LOCK_EX
 ***In log_file_flush, writing to offset 0, count=32794 total after write=32794
 ***In log_file_flush, reset crk_pot_pos to 0
successful LOCK_EX
 ***In log_file_flush, writing to offset 32794, count=32841 total after write=65635
 ***In log_file_flush, reset crk_pot_pos to 32794
successful LOCK_EX
 ***In log_file_flush, writing to offset 65635, count=1602 total after write=67237
 ***In log_file_flush, reset crk_pot_pos to 65635
successful LOCK_EX
 ***In log_file_flush, writing to offset 32841, count=32841 total after write=65682
 ***In log_file_flush, reset crk_pot_pos to 32841
1 755g 0:00:00:03 DONE (2015-02-05 02:05) 251.6g/s 978.3p/s 1467Kc/s 1467KC/s Skippin� an�*..0A234567890123456789012
Waiting for 1 child to terminate
successful LOCK_EX
 ***In log_file_flush, writing to offset 65682, count=728 total after write=66410
 ***In log_file_flush, reset crk_pot_pos to 65682
2 745g 0:00:00:03 DONE (2015-02-05 02:05) 245.0g/s 965.1p/s 1447Kc/s 1447KC/s Skipping and*..Z1234567890123456789012
At Program Exit

And some info about the .pot file:

$ ls -l bug.pot
-rwxrwx--- 1 root vboxsf 100078 Feb  5 02:05 bug.pot
$ wc bug.pot
  1124   1192 100078 bug.pot
frank-dittrich commented 9 years ago

I guess it is related to these john.conf settings:

# If set to Y, a session using --fork or MPI will signal to other nodes when
# it has written cracks to the pot file (note that this writing is delayed
# by buffers and the "Save" timer above), so they will re-sync.
ReloadAtCrack = Y

# If set to Y, resync pot file when saving session.
ReloadAtSave = Y

This also explains why it doesn't happen in core.

magnumripper commented 9 years ago

I ran these tests over and over again on Linux as well as OSX and can't reproduce. Apparently your system's flock() returns OK but does not lock the file. What system is this? What filesystem are you on?

With 6a749a162 and -DLOCK_DEBUG and running a 1000x larger test I can see the processes getting in line for locking, and waiting for each other.

(Here's running without that debug)

$ perl -ne 'print "\$0\$" . $_ if m/^[a-z0-9]{1,125}$/' < rockyou.lst >rockyou.plaintext

$ wc -l rockyou.plaintext
12147723 rockyou.plaintext

$ rm -f bug.* && ../run/john -fork=2 rockyou.plaintext -w:rockyou.lst -pot=bug.pot -ses=bug -form=plaintext -verb=1 ; wc -l bug.pot
Loaded 12147723 password hashes with no different salts (plaintext, $0$ [n/a])
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
1 6072671g 0:00:00:20 DONE (2015-02-05 17:15) 295650g/s 349181p/s 349181c/s 2454GC/s bicuspids1..Vk122503
Waiting for 1 child to terminate
2 6075052g 0:00:00:20 DONE (2015-02-05 17:15) 295766g/s 349181p/s 349181c/s 2383GC/s 651476..1450500134167
Session completed
12147723 bug.pot

Also tried fork=4 and higher. Never any problem. I also tried the above with "Save = 1" in john.conf, which means pot sync will interfere a whole lot more with the process. It still did not miss a beat.

magnumripper commented 9 years ago

The latest commit adds a little test script, unused/locktest.sh. It uses flock(1), which is a command-line interface to flock(2). flock(1) is available on Linux, OpenBSD and some others. For OSX I used this one. Try it!

Here's "good" output:

$ unused/locktest.sh 
File initially not locked.
Locking file exclusively, using flock(2) for 5 seconds.
Confirmed, file is locked. Waiting for shared lock...
Got it, now releasing.
magnumripper commented 9 years ago
successful LOCK_EX
 ***In log_file_flush, writing to offset 0, count=32841 total after write=32841
 ***In log_file_flush, reset crk_pot_pos to 0
successful LOCK_EX
 ***In log_file_flush, writing to offset 0, count=32794 total after write=32794
 ***In log_file_flush, reset crk_pot_pos to 0

Try the latest commit with -DLOCK_DEBUG and re-run until you get the same situation. Does the first process release the lock before the 2nd process gets a "successful" one?

jfoug commented 9 years ago

What system is this? What filesystem are you on?

The system is a win8.1 laptop running a VirtualBox VM with 64-bit Ubuntu on it. The actual file system is:

C_DRIVE on /media/sf_C_DRIVE type vboxsf (gid=999,rw)

This is an NTFS drive mounted by VirtualBox. Now, I can easily move the TS to this mount (on the VirtualBox VM):

/dev/sda1 on / type ext4 (rw,errors=remount-ro)

but I have no idea what an ext4 type means.

I will rebuild with the latest locking and see if it behaves better or still shows the problem.

jfoug commented 9 years ago

Here is the 'first' failure.

Cmd_data = 
Will run 2 OpenMP threads per process (4 total across 2 processes)
Node numbers 1-2 of 2 (fork)
Press 'q' or Ctrl-C to abort, almost any other key for status
log_file_flush(2): Successfully locked ./tst.pot LOCK_EX
log_file_flush(2): writing 32769 at 0, ending at 32769
log_file_flush(2): set crk_pot_pos to 32769
log_file_flush(2): Unlocked ./tst.pot
log_file_flush(1): Successfully locked ./tst.pot LOCK_EX
log_file_flush(1): writing 32794 at 0, ending at 32794
log_file_flush(1): set crk_pot_pos to 32794
log_file_flush(1): Unlocked ./tst.pot
log_file_flush(2): Successfully locked ./tst.pot LOCK_EX
log_file_flush(2): writing 16129 at 32769, ending at 48898
log_file_flush(2): set crk_pot_pos to 48898
log_file_flush(2): Unlocked ./tst.pot
2 748g 0:00:00:02 DONE (2015-02-05 12:41) 283.3g/s 1111p/s 1667Kc/s 1667KC/s Skipping and*..Z123456789012345678901234567890
At Program Exit
MemDbg_Validate level 0 checking Passed
log_file_flush(1): Successfully locked ./tst.pot LOCK_EX
log_file_flush(1): writing 16238 at 32794, ending at 49032
log_file_flush(1): set crk_pot_pos to 49032
log_file_flush(1): Unlocked ./tst.pot
1 752g 0:00:00:02 DONE (2015-02-05 12:41) 280.5g/s 1095p/s 1642Kc/s 1642KC/s Skippin� an�*..0A23456789012345678901234567890
Waiting for 1 child to terminate
Use the "--show" option to display all of the cracked passwords reliably
Session completed
At Program Exit
MemDbg_Validate level 0 checking Passed
$ ls -l tst.pot
-rwxrwx--- 1 root vboxsf 49032 Feb  5 12:41 tst.pot
$ wc tst.pot
  752   816 49032 tst.pot
magnumripper commented 9 years ago
log_file_flush(2): Successfully locked ./tst.pot LOCK_EX
log_file_flush(2): writing 32769 at 0, ending at 32769
log_file_flush(2): set crk_pot_pos to 32769
log_file_flush(2): Unlocked ./tst.pot
log_file_flush(1): Successfully locked ./tst.pot LOCK_EX
log_file_flush(1): writing 32794 at 0, ending at 32794
log_file_flush(1): set crk_pot_pos to 32794
log_file_flush(1): Unlocked ./tst.pot

OK, so unless these fprintf's to unbuffered stderr happen to end up in some other order than they REALLY happened in, everything happens in the correct order and YET process 1 gets the wrong figure!? I have absolutely no idea what is going on, but I'll write it off as your virtual environment having serious problems until you can show this happening on real gear.

ext4 is the current standard Linux filesystem; it does support locking.

magnumripper commented 9 years ago

Do try that locktest.sh script!

jfoug commented 9 years ago

I did try locktest. It was happy to work on my vboxsf fs. I ran it multiple times, and it reported success every time.

I will copy all the TS stuff to the ext4 system and test there.

magnumripper commented 9 years ago

On a side note, I did heavy testing of these things all day today and I found some interesting (frightening) facts. For example, on a bog-standard NFS share with a Linux server and Linux clients: a client could lock, and other clients would see the lock. But the server's native filesystem (at the "same" directory) would NOT see the clients' locks nor share its locks with the clients. This means you can run over NFS (required for MPI) just fine, but you must NOT run any JtR process on the NFS server itself.

magnumripper commented 9 years ago

The fact that locktest.sh worked fine indicates we have some other problem than we thought. I just can't fathom what it could be. It's kind of impossible... well, unless it's a file caching inconsistency!?

magnumripper commented 9 years ago

You could try this just for... trying it :-P

diff --git a/src/logger.c b/src/logger.c
index 719106d..a528257 100644
--- a/src/logger.c
+++ b/src/logger.c
@@ -133,6 +133,10 @@ static void log_file_flush(struct log_file *f)
        fprintf(stderr, "%s(%u): set crk_pot_pos to %ld\n", __FUNCTION__, options.node_min, crk_pot_pos);
 #endif
    }
+   if (f == &pot) {
+       fprintf(stderr, "%s(%u) fsync(%s)\n", __FUNCTION__, options.node_min, f->name);
+       if (fsync(f->fd)) pexit("fsync");
+   }
 #if OS_FLOCK
    if (flock(f->fd, LOCK_UN))
        pexit("flock(LOCK_UN)");
jfoug commented 9 years ago

You could try this just for... trying it :-P

Well, using -v -v -stop on the vboxsf fs, it made it past dyna_16 (2 files or more for each dyna). Before, it was not making it nearly that far before dying. I may get rid of the -v -v and simply run and see if the problem is gone. (Note, I restarted, since I was not 100% sure I used -stop on that run.)

Damn, I took -v -v out, and it stopped almost instantly. But it does seem to vary per run. Some runs go with almost no errors for a while, then 'get into' them. Others simply start having every 3rd to 5th run be an error.

I will also test more on the ext4 fs.

Running -v -v 'may' make it just a bit harder to trigger, since the screen IS being updated more. However, I thought the screen was only updated when jtr was not running. It may be that giving it a few extra 100 ms between each run was helping to 'calm' things down. But now I have another run without -v -v going, and it is up past dynamic_12 (doubled). So I had 2 bad runs in a row, and now it is working 'better'. I will let it go and see.

Nope, died

form=dynamic_13                   guesses: 1500 0:00:00:03 DONE  [PASSED]
.pot CHK:dynamic_13               guesses: 1500 0:00:00:01 DONE  [PASSED] (1500 val-pwd)

form=dynamic_14                   guesses: 1500 0:00:00:02 DONE  [PASSED]
.pot CHK:dynamic_14               guesses: 1500 0:00:00:00 DONE  [PASSED] (1500 val-pwd)

form=dynamic_14                   guesses: 1237 0:00:00:01 DONE  [PASSED]
.pot CHK:dynamic_14               guesses: 1237 -show=1237 0:00:00:00 DONE : Expected count(s) (1500)  [!!!FAILED!!!]  (1237 val-pwd)
Exiting on error. The .pot file ./tst.pot contains the found data
The command used to run this test was:
magnumripper commented 9 years ago

That fsync() is not going to be committed though, unless you can come up with a sensible #ifdef to surround it with. We are chasing issues with a virtual environment as far as I can tell.

jfoug commented 9 years ago

Yes, I think so also, BUT it would certainly benefit us to know 'why'. Finding out why may help things like cygwin (which has a much larger audience than someone using a VM like this). Cygwin has fork, and it 'claims' to be usable, but things go to shit, and I think this may be the reason.

NOTE, on this VM, I am running on that ext4 main fs, and things are running fine. There was an error on my 'new' dyna_25 run, but the numbers were right. I am not sure what the error was; it could be some line 'looking' like a pw but not being one (i.e. ending with ')' and having a " (" string in it). I hope that dyna_25, if run by itself, will show the problem (in -v -v mode).

But as far as I can see now, the normal non-NFS-type fs is at least working much better right now. It is up to dyna-77 (doubled).

magnumripper commented 9 years ago

Did you try running on the ext4 fs without the extra fsync()? I would guess it works fine without it.

On another note you should try compiling that flock(1) utility and see how locktest.sh manages under Cygwin, on a normal "cygwin drive". I guess it will run just fine but it would be good to know it does.

jfoug commented 9 years ago

flock is a command already built into cygwin (at least cygwin64).

$ ./locktest.sh
File initially not locked.
Locking file exclusively, using flock(2) for 5 seconds.
Confirmed, file is locked. Waiting for shared lock...
Got it, now releasing.

It has no problem. I will run it multiple times. NOTE, that example does not actually write anything, so just because a file 'gets' locked does not seem to be enough to KNOW that multiple writes between processes will be done in a synchronous manner. We are seeing that with the vboxsf file system (and possibly other NFS-type mounts).

It would be nice to have a stress-test tool that would tell someone quickly (10-20 minutes at most) that some setup is NOT suitable for running john --fork=xx on, vs. having them lose a shit load of work. That is why in cygwin we FORCED fork to fail: it was (is, or appears to be) a known system where it simply fails.
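
For what it's worth, a rough standalone sketch of such a stress test (hypothetical, not part of john or the TS): fork a couple of writers that append lines under flock(LOCK_EX), then check that the final line count matches what was written.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILDREN 2
#define LINES    10000

int main(void)
{
    const char *name = "flock-stress.out";
    int c;

    unlink(name);
    for (c = 0; c < CHILDREN; c++) {
        if (fork() == 0) {
            /* Opened after fork, so each child has its own file description. */
            int fd = open(name, O_WRONLY | O_CREAT | O_APPEND, 0600);
            char line[64];
            int i;

            if (fd < 0)
                _exit(1);
            for (i = 0; i < LINES; i++) {
                int len = snprintf(line, sizeof(line),
                                   "child %d line %d\n", c, i);

                while (flock(fd, LOCK_EX))
                    if (errno != EINTR)
                        _exit(1);
                if (write(fd, line, len) != len)
                    _exit(1);
                if (flock(fd, LOCK_UN))
                    _exit(1);
            }
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    /* A healthy setup ends up with exactly CHILDREN * LINES lines. */
    printf("expected %d lines; verify with: wc -l %s\n", CHILDREN * LINES, name);
    return 0;
}

On a filesystem with the problem described above, the count should come up short the same way the .pot file does.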

magnumripper commented 9 years ago

NOTE, that example does not actually write anything

True, it only checks locking. Testing cache coherence is much harder and probably can't be done at all with a simple script like that. OTOH I would not expect Cygwin to have any file cache problems that NTFS in general doesn't have - and I don't know of any.

jfoug commented 9 years ago

I do not either, BUT keep this in mind: vboxsf has an NTFS file system as its real parent. vboxsf is simply virtualization to allow unix to 'think' it can use the drive natively, but I would bet 90% of it passes through to NTFS under the hood. That, along with Cygwin having known-bad forking behavior under JtR, leads me to believe that possibly NTFS does not cache/lock the way we 'think' it should.

In many of the bad runs we are seeing now, it almost appears that the OS is NOT associating the file handles of the 2 processes with the same file (so that locking happens). If you look at the 'bad' runs, each process locks its own file, starts at 0, writes bytes, then writes more at the end of what it wrote the first time, etc. BUT there is only 1 file. So for -fork=2, I am seeing a lot of 748/751/753-type runs (just about half, depending upon who wrote first). Each is seeing its own file handle, and NOT associating it with a file shared between processes. So I am not sold that caching is the problem. I really do not know enough about how this is implemented to know how it is supposed to work under the hood.

jfoug commented 9 years ago

Question:

What if the lock was obtained by the 2nd process PRIOR to the file itself actually being created by the first? Created might mean created and the inode flushed to disk. In that case, both processes 'might' think they are the sole owner. Again, just a theory; I do not know shit about the under-the-hood plumbing of this crap. But if that is the case, I wonder if we could lock, create if it does not exist, close the lock, and regain a lock? I think that would ONLY have to happen if the file does not yet exist, so it should have zero impact on speed.

One bad thing is I have NOT been able to get core to fail against the vboxsf fs (when run on the Ubuntu VM), BUT there are not many formats there either. I did put an fflush in, and core works like a champ for the new (1500 val-pwd) testing (i.e. the stdout bug).

magnumripper commented 9 years ago

The file is created in log_file_init(), way before this happens. I'm pretty sure this is about cache coherence and I'm not very surprised - unless I'm confused, that file system is really not meant for things like this. It's mostly meant for easy file transfer between host and vm, no?

On the other hand - if it's NOT about cache coherence, it looks like a really bad bug in vbox.

magnumripper commented 9 years ago

The fact you can't reproduce this with core John bugs me. I think you should concentrate on that. Try thousands of dummy hashes and verify that you see this problem with Jumbo. Then do the exact same using core.

jfoug commented 9 years ago

I made a simple change in log_file_init() and it 'may' have fixed things.

This is the change:

    if ((f->fd = open(path_expand(name),
        O_WRONLY | O_CREAT | O_APPEND, S_IRUSR | S_IWUSR)) < 0)
        pexit("open: %s", path_expand(name));

+   close(f->fd);
+   if ((f->fd = open(path_expand(name),
+       O_WRONLY | O_CREAT | O_APPEND, S_IRUSR | S_IWUSR)) < 0)
+       pexit("open: %s", path_expand(name));

    f->ptr = f->buffer = mem_alloc(size + LINE_BUFFER_SIZE);

This simply means that the file descriptor being used is NEVER the one that actually created the file. I think there is caching going on, in that we are NOT locked and the actual flush of the inode by the OS is being delayed.

NOPE, at dyna-29 I got a failure (614 vs 1229, so this is almost exactly half, and is 2 processes both thinking they are the only ones).

magnumripper commented 9 years ago

The fact you can't reproduce this with core John bugs me. I think you should concentrate on that. Try thousands of dummy hashes and verify that you see this problem with Jumbo. Then do the exact same using core.

Oh, and leave TS out of it, it just adds confusion. Do something similar to https://github.com/magnumripper/JohnTheRipper/issues/1043#issuecomment-73090668 but with dummy instead of plaintext.

jfoug commented 9 years ago

The fact you can't reproduce this with core John bugs me. I think you should concentrate on that.

One huge thing is that in jumbo we fdopen() (get a STDIO buffered handle) the .pot file. In core, that file is WRITE ONLY, and never gets opened with anything other than a raw file descriptor.
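
To spell out the difference (an illustrative sketch only, not the actual jumbo code paths): a raw descriptor hands data to the kernel on every write(), while an fdopen()'d FILE * adds a userspace stdio buffer, so nothing it holds becomes visible to other processes until it gets flushed.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void sketch(const char *name)
{
    int fd = open(name, O_RDWR | O_CREAT | O_APPEND, 0600);
    FILE *fp;

    if (fd < 0)
        return;
    if (!(fp = fdopen(fd, "r+"))) {  /* buffered view of the same fd */
        close(fd);
        return;
    }
    write(fd, "raw line\n", 9);      /* reaches the kernel immediately */
    fputs("buffered line\n", fp);    /* sits in the stdio buffer ...   */
    fflush(fp);                      /* ... until flushed here         */
    fclose(fp);                      /* also closes fd */
}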

But as far as I could tell, that pot reload code (where the FILE * handle is created/used) was not being called. I might have been wrong on that, but that was what I thought.

jfoug commented 9 years ago

With the open/close/open in there, it is much harder to get a failure. I am still working on jumbo (1000 dummies). I think more is not going to show this problem any better. I think fewer is 'better'.

magnumripper commented 9 years ago

What if you fsync() right after open instead of close-reopen?
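
For concreteness, against the snippet quoted above that would look something like this (untested, just the idea):

    if ((f->fd = open(path_expand(name),
        O_WRONLY | O_CREAT | O_APPEND, S_IRUSR | S_IWUSR)) < 0)
        pexit("open: %s", path_expand(name));

+   /* push the freshly created inode out so a second process opening the
+    * same name sees the same file */
+   if (fsync(f->fd))
+       pexit("fsync");

    f->ptr = f->buffer = mem_alloc(size + LINE_BUFFER_SIZE);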