openwall / john

John the Ripper jumbo - advanced offline password cracker, which supports hundreds of hash and cipher types, and runs on many operating systems, CPUs, GPUs, and even some FPGAs
https://www.openwall.com/john/

Support reading gzipped wordlists (and perhaps other things) #5019

Open magnumripper opened 2 years ago

magnumripper commented 2 years ago

Whenever we do a (much needed) overhaul of wordlist.c, we should consider adding support for reading files compressed with gzip/xz (perhaps others - bzip?) on the fly. Perhaps using dlopen rather than requiring anything at build time.

If we do it cleverly enough (perhaps squeeze it into fgetl() if possible?) we could even support compressed input hash files.

solardiz commented 2 years ago

I think first we need to support DAWG (and include (un)compression programs for it), probably in the same way Crack 5 did (keeping the compressed files as plain text):

https://alecmuffett.com/article/9829

although there are also more extensive binary formats:

http://www.wutka.com/dawg.html

DAWG-compressed wordlists can then be optionally further compressed with gzip, etc. for much better cumulative compression ratio than either method alone.

magnumripper commented 2 years ago

DAWG could be a good way to set the framework for later algorithms. Perhaps we could even support memory-mapping in this thing...

A very early picture of this in my head: john_fopen would detect the compression algorithm (if any), do any dlopen needed and set things up. We could use an extended file struct that contains the algorithm's information and state. It would also keep flags such as no_seek (for stdin or fifos) and limited_seek (for compressed files - we can seek to start of file but not to a restore point). BTW it could also keep track of info for john_ftell to work on a compressed stream.
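To make the detection step concrete, here is a minimal sketch of how an open routine might classify a stream from its first bytes. The enum and the function name detect_compression are hypothetical and not taken from John's source. The gzip, bzip2 and xz signatures are the standard magic bytes of those formats; the "#!xdawg" header is simply the first line of the sample file shown in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not from John's source. */
enum comp_type { COMP_NONE, COMP_GZIP, COMP_BZIP2, COMP_XZ, COMP_XDAWG };

/*
 * Guess the container format from the first len bytes of a file.
 * Returns COMP_NONE when no known signature matches.
 */
enum comp_type detect_compression(const unsigned char *buf, size_t len)
{
    static const unsigned char xz_magic[6] = { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

    if (len >= 2 && buf[0] == 0x1F && buf[1] == 0x8B)
        return COMP_GZIP;               /* gzip member header */
    if (len >= 3 && !memcmp(buf, "BZh", 3))
        return COMP_BZIP2;              /* bzip2 stream header */
    if (len >= 6 && !memcmp(buf, xz_magic, 6))
        return COMP_XZ;                 /* xz stream header */
    if (len >= 7 && !memcmp(buf, "#!xdawg", 7))
        return COMP_XDAWG;              /* header line of the sample file */
    return COMP_NONE;
}
```

A file that is DAWG-compressed and then gzipped would report gzip first; the check would have to be repeated on the decompressed bytes to find the inner layer.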

john_fgetl would transparently get next line.

john_ftell, john_fseek and john_fclose would use our extended file struct and do the right things. Oh, and john_feof.
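As a rough illustration of "doing the right things", the sketch below shows one way the seek rules could work: no_seek refuses every seek, limited_seek only allows a rewind to offset 0, and the tell function reports a logical position that the line reader maintains. The sk_ names and the struct are hypothetical stand-ins, not the john_* functions themselves, and the stream here is a plain FILE so that the example stays self-contained.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified wrapper; not John's actual struct. */
typedef struct {
    FILE *stream;               /* underlying stream */
    uint64_t position;          /* logical (decompressed) offset */
    unsigned no_seek : 1;       /* stdin or FIFO: cannot seek at all */
    unsigned limited_seek : 1;  /* compressed: can only rewind to start */
} sk_file;

/* Read one line, strip the trailing newline, advance the logical position. */
char *sk_fgetl(char *line, int size, sk_file *f)
{
    size_t len;

    if (!fgets(line, size, f->stream))
        return NULL;
    len = strlen(line);
    f->position += len;         /* count the newline as consumed input */
    if (len && line[len - 1] == '\n')
        line[len - 1] = '\0';
    return line;
}

/* Report the logical position, which works even where ftell() would not. */
uint64_t sk_ftell(const sk_file *f)
{
    return f->position;
}

/* Seek to an absolute logical offset. Returns 0 on success, -1 if refused. */
int sk_fseek(sk_file *f, uint64_t offset)
{
    if (f->no_seek)
        return -1;
    if (f->limited_seek && offset != 0)
        return -1;              /* cannot jump to a restore point */
    /* A real decoder would also have to reset its own state here. */
    if (fseek(f->stream, (long)offset, SEEK_SET))
        return -1;
    f->position = offset;
    return 0;
}
```

With limited_seek set, restoring a session would mean rewinding and re-reading lines until the saved logical position is reached.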

I guess we should support a file that is DAWG compressed and then compressed with some second algorithm.

magnumripper commented 2 years ago

A very early picture of this in my head (...)

I now have very early code that does the above, with only DAWG implemented. There's no DAWG logic at all in wordlist.c; only the calls were changed, e.g. from fopen to john_fopen, and so on. All of the new functions are in misc.[hc] as of now.

The mmap logic is left in wordlist.c. I'm not sure if/how to move it to this abstraction layer, but the concept is tempting.

If this experiment is brought to some kind of completion, the next question would be whether to hold off on merging until after the next release, but most statistics say we're not going to release any time soon anyway. In any case, I'll keep playing with this.

typedef struct {
    FILE *stream;           /* Underlying stream */
    uint64_t position;      /* If stream is compressed, this is decompressed position */
    char *description;      /* "file", "FIFO", "compressed file" ... */
    uint8_t no_seek : 1;    /* Cannot seek at all (stdin or FIFO) */
    uint8_t some_seek : 1;  /* Can seek to start of file */
    uint8_t plain_file : 1;
    uint8_t stdin : 1;      /* NB: C defines stdin as a macro, so this name may need changing */
    uint8_t fifo : 1;
    uint8_t compressed : 1;
    uint8_t dawg : 1;
    char last_dawg[LINE_BUFFER_SIZE];   /* Previous word, needed to expand the next DAWG line */
    int last_dawg_len;
} john_FILE;

$ cat dawg.lst
#!xdawg
0foo
3t
4le
1ubar
3
0grunt
5

$ ../run/john -w:dawg.lst -stdout 
Using default input encoding: UTF-8
foo
foot
footle
fubar
fub
grunt
6p 0:00:00:00 100.00% (2022-01-26 17:27) 150.0p/s grunt
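
For readers unfamiliar with the format, the sample above can be decoded by hand: the first character of each line says how many leading characters to keep from the previous word, and the rest of the line is appended. The function below is a minimal sketch of that rule, not the code from the experimental branch. The name xdawg_decode_line is hypothetical, and it only handles the digit counts '0' to '9' that appear in the sample; how larger counts are encoded is not shown in this thread.

```c
#include <stddef.h>
#include <string.h>

/*
 * Decode one xdawg line in place. On entry, word holds the previously
 * decoded word ("" before the first line); on success it holds the new one.
 * Returns 0 on success, -1 on malformed input or if the result would not fit.
 * Hypothetical helper: only single-digit prefix counts are handled.
 */
int xdawg_decode_line(const char *line, char *word, size_t word_size)
{
    size_t keep, suffix_len;
    const char *suffix;

    if (line[0] < '0' || line[0] > '9')
        return -1;
    keep = (size_t)(line[0] - '0');     /* shared prefix length */
    suffix = line + 1;
    suffix_len = strlen(suffix);
    if (keep > strlen(word) || keep + suffix_len >= word_size)
        return -1;
    memcpy(word + keep, suffix, suffix_len + 1);    /* copies the NUL too */
    return 0;
}
```

Fed the sample lines in order, this yields foo, foot, footle, fubar, fub, grunt, and then grunt again for the final "5" line. The run above reports six candidates, presumably because the repeated word is skipped somewhere downstream.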

magnumripper commented 2 years ago

I have an idea of a new function in this context: size_t john_filesize(john_FILE *j_file). For normal files, it'd just do the good old "fseek end; ftell" to know the size (but could cache it). For compressed files, it'd return "unknown" until known. Once we have run all words through the first rule, the size is known and cached - and ETA/progress would start showing correct figures.
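
A minimal sketch of that idea follows, under a few assumptions: the struct is a cut-down stand-in for john_FILE, "unknown" is represented by a sentinel value, and the reader calls a hook when it first hits end of file. The names fs_file, fs_filesize, fs_note_eof and SIZE_UNKNOWN are hypothetical, and uint64_t is used here rather than the size_t of the proposed prototype.

```c
#include <stdint.h>
#include <stdio.h>

#define SIZE_UNKNOWN UINT64_MAX     /* hypothetical "not known yet" value */

/* Hypothetical, cut-down stand-in for the extended file struct. */
typedef struct {
    FILE *stream;
    uint64_t position;              /* logical (decompressed) offset */
    uint64_t cached_size;           /* SIZE_UNKNOWN until learned */
    unsigned compressed : 1;
} fs_file;

/* Return the logical size, or SIZE_UNKNOWN if it cannot be known yet. */
uint64_t fs_filesize(fs_file *f)
{
    long here, end;

    if (f->cached_size != SIZE_UNKNOWN)
        return f->cached_size;
    if (f->compressed)
        return SIZE_UNKNOWN;        /* only known after a full first pass */

    /* Plain file: the classic "fseek end; ftell", then restore position. */
    here = ftell(f->stream);
    if (here < 0 || fseek(f->stream, 0, SEEK_END))
        return SIZE_UNKNOWN;
    end = ftell(f->stream);
    fseek(f->stream, here, SEEK_SET);
    if (end < 0)
        return SIZE_UNKNOWN;
    f->cached_size = (uint64_t)end;
    return f->cached_size;
}

/* To be called when the reader reaches EOF for the first time. */
void fs_note_eof(fs_file *f)
{
    if (f->cached_size == SIZE_UNKNOWN)
        f->cached_size = f->position;
}
```

Progress code could treat SIZE_UNKNOWN as "no ETA yet" during the first pass and switch to real figures once fs_note_eof has run.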