http403 / pyrit

Automatically exported from code.google.com/p/pyrit

import_passwords Slow with large word lists #13

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. pyrit -f wordlist.txt import_passwords

What version of the product are you using? On what operating system?
svn 115 with Fedora 11 x86_64

Please provide any additional information below.
The word list I am using is 24 GB with approximately 2,400,000,000 lines in it. It has taken at least 20 hours to import 2,283,200,000 lines. At that rate it has done 31,711 lines a second. The code seems to be single-threaded, in that I only see 99% CPU usage in top instead of 199%.

It is running on a Core2Duo at 1.86 GHz with 4 GB of DDR2 memory. Not the fastest thing in the world. The storage is a Linux software RAID5 array of four 250 GB hard drives. The load is 1.04, so nothing else is eating a significant amount of CPU.

I know this only has to be done once, but 20+ hours of overhead to use batchprocess sucks. It could easily be a lot worse if I used an even bigger wordlist. If I added all US telephone numbers, 10-digit numbers, it would add another 40 GB.

The code seems like a candidate for a C library.

Original issue reported on code.google.com by starhe...@gmail.com on 12 Jul 2009 at 7:15

GoogleCodeExporter commented 9 years ago
Please try again with r120. The PasswordStore code was borked in previous SVN versions.

Some of the storage code is indeed going to get C'ed...

Original comment by lukas.l...@gmail.com on 12 Jul 2009 at 9:00

GoogleCodeExporter commented 9 years ago
It is better, at least in the beginning. I am seeing rates of "178,082", "154,887", and "189,570". I am going to leave it running, and recalculate the rate when it is done. It seems to pause for a few seconds at a time every so often, and may slow down further into the list.

I noticed after I let the previous run finish that the passwords directory was only 12 GB instead of 24 GB. Does the method of storage effectively compress the data?

Original comment by starhe...@gmail.com on 12 Jul 2009 at 11:04

GoogleCodeExporter commented 9 years ago
It looks like once it gets deeper into the list this is actually worse. Based on the numbers I see right now it would take about 31 hours to complete. The new rate is something like 26,210.

It starts off fast, and then gets slow.

Original comment by starhe...@gmail.com on 13 Jul 2009 at 4:33

GoogleCodeExporter commented 9 years ago
The problem comes from the fact that the storage code guarantees uniqueness for every single password throughout the entire database. That means (in theory) that the 2,283,200,000th password has to be compared against the 2,283,199,999 that were already imported to prevent duplicates.
This is also the reason why importing gets slower over time: the amount of data Pyrit has to check against increases... The numbers you see are actually still *much* better than what you can expect e.g. from a SQL database :-)
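The bucketing idea behind the partial index mentioned below can be sketched roughly like this. This is a simplified, in-memory illustration of the technique, not Pyrit's actual PasswordStore code; the class and method names are invented:

```python
# Simplified sketch of a hash-bucketed password store: each password is
# assigned to one of 256 buckets by the low bits of its hash, so checking
# uniqueness only means scanning one small bucket instead of the entire
# database. (Illustrative only -- not Pyrit's real PasswordStore.)
class BucketedStore:
    def __init__(self, num_buckets=256):
        # num_buckets must be a power of two for the mask trick to work.
        self.mask = num_buckets - 1
        self.buckets = [set() for _ in range(num_buckets)]

    def add(self, passwd):
        """Store passwd; return False if it was already present."""
        bucket = self.buckets[hash(passwd) & self.mask]
        if passwd in bucket:
            return False
        bucket.add(passwd)
        return True

store = BucketedStore()
store.add("secret1")
store.add("secret1")  # duplicate, rejected by the bucket check
```

With more buckets, each individual bucket stays smaller as the database grows, which is why widening the index (as suggested below) can help with very large imports.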

Yes, the internal file format for passwords uses zlib compression. The time advantage of having to read less data from the (slow) disk and the increased CPU demand level out at around the same performance numbers, keeping the advantage of lower storage requirements.
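The compression effect is easy to demonstrate with Python's zlib module. This standalone snippet is only an illustration of why a wordlist shrinks on disk, not Pyrit's actual file format:

```python
import zlib

# A bucket of newline-separated passwords, roughly how a wordlist
# fragment looks before it is written to disk.
bucket = "\n".join("password%08d" % i for i in range(20000)).encode()

# Repetitive wordlist data compresses very well with zlib.
compressed = zlib.compress(bucket)
print(len(bucket), len(compressed))

# Decompression restores the exact original data, so no passwords
# are lost by storing the compressed form.
assert zlib.decompress(compressed) == bucket
```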

You can experiment with Pyrit's storage code to see if you get better performance when dealing with these huge wordlists. Please notice that turning up those knobs will increase memory consumption while importing. Modify cpyrit_util.py (r120) in the following ways.

1. Increase the maximum number of passwords that get stored in a single file by modifying 'if len(pw_bucket) >= 20000:' in line 449. A good number would be 50,000.

2. Pyrit uses a partial index to speed up the comparison against already-stored passwords. You may try increasing its complexity from 2^8 to 2^12. Notice however that you need to start with a fresh ~/.pyrit if you do that.
Modify 'h1_list = ["%02.2X" % i for i in xrange(256)]' and change 256 to 4096.
Modify 'pw_h1 = PasswordStore.h1_list[hash(passwd) & 0xFF]' and change 0xFF to 0xFFF.
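Put together, the widened index from step 2 would look roughly like this. Pyrit itself is Python 2 (hence xrange in the original), so this is a Python 3 standalone check; widening the format string to three hex digits ("%03.3X") is my extrapolation, since the original "%02.2X" can only name 256 buckets:

```python
# Sketch of the widened partial index: 4096 buckets with zero-padded
# hex names, keyed by the low 12 bits of the password's hash.
# (Assumption: bucket names grow from two hex digits to three.)
h1_list = ["%03.3X" % i for i in range(4096)]

def bucket_name(passwd):
    # 0xFFF masks off the low 12 bits, selecting one of 2^12 buckets.
    return h1_list[hash(passwd) & 0xFFF]

print(len(h1_list), h1_list[0], h1_list[-1])  # 4096 buckets, "000" .. "FFF"
```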

Original comment by lukas.l...@gmail.com on 13 Jul 2009 at 7:20

GoogleCodeExporter commented 9 years ago
I've rewritten the important storage code in C, which allows better multithreading, and added more code to make use of multiple threads reading/writing to the PasswordStore.

After extensive testing I've decided that this adds too much complexity to the codebase and is not worth the effort. The performance increase does not justify more than ~500 lines of new code at the current stage of the project. So I've thrown days of work into the trashbin :-)

We'll have to live with what we got for now.

Original comment by lukas.l...@gmail.com on 18 Jul 2009 at 3:29