nil0x42 / duplicut

Remove duplicates from MASSIVE wordlist, without sorting it (for dictionary-based password cracking)
GNU General Public License v3.0
870 stars · 90 forks

Improve `MEDIUM_LINE_BYTES` guessing with heuristic #26

Open nil0x42 opened 4 years ago

nil0x42 commented 4 years ago

`MEDIUM_LINE_BYTES` is currently hardcoded in const.h, to a value of 8. The hashmap & chunks are then sized in such a way that if the real average line length equals `MEDIUM_LINE_BYTES`, the hashmap fill ratio matches `HMAP_LOAD_FACTOR` (currently set to 0.5, for 50% hashmap filling).

Therefore, we could read a few pages at random positions in the file (e.g. start/middle/end of file), and get a better guess of `MEDIUM_LINE_BYTES` from there.

It would greatly improve performance on wordlists with many very long lines (for example, a list of md5 hashes): if lines are 32 bytes long, the hashmap is only filled to 12.5% (50% / 2 / 2), and many more chunks are needed.

nil0x42 commented 4 years ago

Count occurrences of newline in buffer (stackoverflow):

Here's the way I'd do it (minimal number of variables needed), adapted here to count `'\n'` instead of `'.'`:

```c
for (i = 0; s[i]; s[i] == '\n' ? i++ : *s++);
```