enseitankado opened this issue 4 years ago
Thanks! Glad you like it!
I've been reluctant to give away too much information about the process and the analysis code itself, because if I can confirm it doesn't violate GDPR (and KVKK in my country), I'm planning to deploy a search engine allowing users to search by email, or to crunch password lists with their preferred parameters.
If it violates GDPR, I'll just release the same service after hashing the emails, or drop the email addresses table altogether. (This would still allow networking/grouping based on the email id and statistical analysis, but the email address itself would be lost.)
That being said:
I used a Python script to crawl a directory and process all the text dumps in that directory into an SQL table. It ran on a single thread in the background for around a week or more.
I processed the text dumps mainly by pattern matching the lines to figure out the dump format (most of them are formatted username:password, but not all).
If the script failed to recognize any pattern, it would still mark that line as completed, write nothing to the database, and instead write the line to a potfile. I'd later normalize the format of all the lines in this potfile and copy it into the crawled directory to add it back into the mix.
I also set up a small environment allowing me to pause/continue the processing and save the state: tracking which files were processed up to which line, and committing changes to the SQL connection.
At first, it was a simple table storing username:password pairs and the source dump each credential was found in.
I decided to do this as an intermediate step to limit the number of write cycles on my SSD. (Doing only insert after insert lets me commit the database changes in chunks.)
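A minimal sketch of what such an ingestion script could look like, assuming a flat staging table; the paths, table/column names, regex, and commit interval below are my placeholders, and the pause/resume bookkeeping is omitted:

```python
import os
import re
import sqlite3

DUMP_DIR = "dumps"          # directory being crawled (placeholder)
POTFILE = "unparsed.pot"    # lines matching no known pattern go here
COMMIT_EVERY = 100_000      # commit in chunks to limit SSD write cycles

# the dominant dump format: username/email:password
LINE_RE = re.compile(r"^([^:]+):(.*)$")

conn = sqlite3.connect("creds.db")
conn.execute("""CREATE TABLE IF NOT EXISTS raw_credential
                (email TEXT, password TEXT, source TEXT)""")

pending = 0
with open(POTFILE, "a", encoding="utf-8", errors="replace") as pot:
    for root, _dirs, files in os.walk(DUMP_DIR):
        for name in files:
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as dump:
                for line in dump:
                    line = line.rstrip("\n")
                    match = LINE_RE.match(line)
                    if match:
                        conn.execute(
                            "INSERT INTO raw_credential VALUES (?, ?, ?)",
                            (match.group(1), match.group(2), name))
                    else:
                        # unrecognized format: park the line in the potfile
                        # to normalize and re-ingest later
                        pot.write(line + "\n")
                    pending += 1
                    if pending >= COMMIT_EVERY:
                        conn.commit()
                        pending = 0
conn.commit()
conn.close()
```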
After this step was completed, I grouped up repetitive credentials and added an occurrence column to the table instead.
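That aggregation step might look roughly like this against the staging table sketched above (again, the names are my assumptions, and the source column is ignored for brevity):

```python
import sqlite3

conn = sqlite3.connect("creds.db")
conn.executescript("""
-- collapse repeated email:password pairs into one row + an occurrence count
CREATE TABLE credential_counted AS
    SELECT email, password, COUNT(*) AS occurrence
    FROM raw_credential
    GROUP BY email, password;

DROP TABLE raw_credential;
""")
conn.commit()
conn.close()
```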
The next step was splitting this table into Boyce-Codd Normal Form. This removed all the redundancy from the database, allowing for much faster queries and dropping the database size by a lot.
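The actual schema wasn't posted, but a BCNF-style split presumably ends up with separate value tables plus an id-only link table. The password(value, occurrence) shape below matches the queries later in the thread; everything else is guesswork:

```python
import sqlite3

conn = sqlite3.connect("creds.db")
conn.executescript("""
-- one row per distinct address and per distinct password
CREATE TABLE email    (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
CREATE TABLE password (id INTEGER PRIMARY KEY, value TEXT UNIQUE,
                       occurrence INTEGER);

-- the link table holds only integer ids, which is where most of the
-- size reduction (and query speed) comes from
CREATE TABLE credential (
    email_id    INTEGER REFERENCES email(id),
    password_id INTEGER REFERENCES password(id)
);

INSERT INTO email (value)
    SELECT DISTINCT email FROM credential_counted;

INSERT INTO password (value, occurrence)
    SELECT password, SUM(occurrence)
    FROM credential_counted GROUP BY password;

INSERT INTO credential (email_id, password_id)
    SELECT e.id, p.id
    FROM credential_counted c
    JOIN email    e ON e.value = c.email
    JOIN password p ON p.value = c.password;
""")
conn.commit()
conn.close()
```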
For the DBMS choice, I used SQLite3 because it's much faster to work with locally: SQLite is pretty much an fopen(), while MySQL/PostgreSQL go through socket operations, which are much slower than direct file operations.
Since all the usernames, passwords, and credentials are in their own tidy tables, the rest of the analysis boils down to simple SQL queries.
For example, to generate the top 10,000,000 passwords, all I had to do was:
.output 10m.txt
SELECT value FROM password ORDER BY occurrence DESC LIMIT 10000000;
Much of the statistical data boils down to simple uses of COUNT(*) and SUM(occurrence).
Example: the average length of passwords that were found only once:
SELECT AVG(LENGTH(value)) FROM password WHERE occurrence=1;
The worst part of this project was one of the redundancy-reducing intermediate steps, because I constantly copy-paste lines from table definitions when constructing new ones, and I left an extra UNIQUE keyword in somewhere.
I was inserting the result of another SELECT query into this table (something like INSERT INTO new(...) SELECT ...).
That SELECT query took 21 hours, and it threw an error at the end because the INSERT was violating the constraint. I had to repeat it.
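To illustrate the pattern (all table and column names here are invented, not the actual ones): if the copy-pasted target table accidentally carries a stray UNIQUE constraint, SQLite aborts and rolls back the entire statement on the first duplicate, so nothing from the long run survives.

```python
import sqlite3

conn = sqlite3.connect("creds.db")
try:
    # hypothetical example of the pattern: a long-running SELECT feeding an
    # INSERT into a freshly copy-pasted table; if that table carries a stray
    # UNIQUE constraint, a single duplicate aborts the whole statement
    conn.execute("""
        INSERT INTO credential_new (email_id, password_id)
        SELECT email_id, password_id
        FROM credential
        WHERE password_id IN (SELECT id FROM password WHERE occurrence > 1)
    """)
    conn.commit()
except sqlite3.IntegrityError as err:
    conn.rollback()  # nothing was written; the entire run has to be repeated
    print("constraint violated:", err)
```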
I hope this answers the parts you were curious about. If not, please feel free to ask!
Awesome, a really fast answer. I guess the pandemic is going on over there too. ;)
First of all, thank you for the answer.
Yes, basically, in order to extract something understandable from this amount of data, it must be transformed into a queryable structure. As far as I understand, in the next steps you will aim for different kinds of reporting. I think waiting for the results of the query statements will be the most annoying part of the job. There are time-series DBMSs that give good results in IoT applications; perhaps there are more suitable DBMSs for processing this kind of data. I don't know much about this.
According to the breach list (https://monitor.firefox.com/breaches) and the compilation list here (https://www.pcmag.com/news/collection-1-breach-is-huge-but-should-you-be-worried), I think there is around 4 TB of data to analyze. At this point, I feel that it is necessary to work with distributed computing power.
I would like you to clarify one subject: what do you plan to do differently from https://haveibeenpwned.com/?
Hello! Sorry for the delay, just woke up!
> I think waiting for the results of the query statements will be the most annoying part of the job.
Only for the intermediate steps, though. Pushing the text dumps into the database is fast, and searching for passwords/emails at the end is fast. But the steps in between that decrease the DB size and increase query performance take anywhere from a few hours to a day.
> According to the breach list (https://monitor.firefox.com/breaches) and the compilation list here (https://www.pcmag.com/news/collection-1-breach-is-huge-but-should-you-be-worried), I think there is around 4 TB of data to analyze.
Hmmm, 4 TB is quite a bit larger than I anticipated. https://haveibeenpwned.com/ has 10 billion accounts. Considering I've indexed 1 billion (down to 750M when filtered), I think the data to process is around 1-1.5 TB.
I'm confident the structure of my database won't scale linearly, and at the end I'll have a DB file of around 150-200 GB (this number is only speculation though; I have no calculation to back it up).
> At this point, I feel that it is necessary to work with distributed computing power.
It would definitely help, but that means investment. I'm currently a university student with pretty much no direct and stable income, and since this is not research I'm doing with the university, I'm pretty sure I'm locked out of any resources/funds it could provide.
From the looks of it, I'll have to accept stretching the project out over a few months for now.
> I would like you to clarify one subject: what do you plan to do differently from https://haveibeenpwned.com/?
Well, the primary difference between the services the HIBP project provides and the ones I want to provide is that I don't want to just return a "pwned/not pwned" result. I want to serve passwords as well.
I feel like searching an email address and getting its password from old leaks is a direct GDPR violation, and apart from that, kind of a dick move.
There are websites providing that (one reliable one that I know of); however, I don't want to go down that road. Probably the most viable services I'm going to be able to provide are email address networking and custom wordlist generation.
I want email address networking to be a service where you enter an email address and it returns a list of email addresses that are probably owned by the same person (and I think there is enough metadata to track a portion of that).
I want custom wordlist generation to be able to provide quick and reliable wordlists based on people's needs.
For example, I want it to be able to provide wordlists such as:
From what I see, GDPR is okay with the wordlist generation part, but the email networking part and the first service I mentioned (returning passwords for an email address) are going to be a problem.
If I can get everything up and running, I don't want to put it behind a paywall, but I'll probably put generating wordlists larger than a certain number of lines behind a paywall, because that can take a few seconds or more and I'll have a queue system. I might also add a few more premium services behind a paywall, but I'm not certain yet. I think that's justified.
@FlameOfIgnis, to scale the DB for both size and speed I would strongly recommend uploading the data to S3 and querying/deduping with AWS Athena. FWIW, in personal testing I've scanned ~5b creds in under 2 mins.
Hello! I'm considering moving the project to its own server soon, because I'm about to exhaust my resources on the local setup.
> I've scanned ~5b creds in under 2 mins.
I've structured it in a way that I'm getting pretty much instant results for single-result queries.
However, ordering/scanning entire tables takes a while (as expected).
And running JOIN operations on a large subset is just catastrophic. For localization I need to filter on the email addresses, then join that with the passwords to sort them.
This causes a double JOIN operation on millions of rows, which sucks.
Here is the time from the localization query I use (I forgot a sort operation; it shouldn't exceed 15 seconds):
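For reference, a minimal sketch of what such a localization query might look like against the schema guessed at earlier; the table names, the '.tr' address filter, and the output path are my placeholders, not the actual query:

```python
import sqlite3

conn = sqlite3.connect("creds.db")

# hypothetical localization query: every password seen for .tr addresses,
# most common first -- an email -> credential -> password double JOIN
# over millions of rows, which is exactly the slow part
rows = conn.execute("""
    SELECT DISTINCT p.value, p.occurrence
    FROM email e
    JOIN credential c ON c.email_id = e.id
    JOIN password   p ON p.id = c.password_id
    WHERE e.value LIKE '%.tr'
    ORDER BY p.occurrence DESC
""")
with open("localized.txt", "w", encoding="utf-8") as out:
    for value, _occurrence in rows:
        out.write(value + "\n")
conn.close()
```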
Yes, the 2 min reference I gave is against an unsorted dataset of ~5b records. You can certainly use Athena, BigQuery, or similar to make the sort/join/preprocessing on the data faster. Considering your cost problem, those services would probably be cheaper than a dedicated server as well.
Hi @FlameOfIgnis @pooki3bear, thanks for your great work @enseitankado. But I have some questions about your discussion:
Hello @kivik92!
For local processing purposes, I really recommend using SQLite instead of MySQL or any other DBMS, because it's a lot easier to take snapshots or backups of an SQLite database, and when configured correctly it performs much faster than the others (since you don't need query isolation and no parallel queries need to be processed).
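For what it's worth, "configured correctly" for a throwaway local analysis database usually comes down to a handful of standard SQLite pragmas like these (whether this matches the exact setup used here is my assumption):

```python
import sqlite3

conn = sqlite3.connect("wordlists.db")
# trade durability for raw speed -- acceptable for a local analysis database,
# since the source wordlists can always be re-imported from disk
conn.execute("PRAGMA journal_mode = OFF;")    # or WAL; skip rollback journaling
conn.execute("PRAGMA synchronous = OFF;")     # don't fsync after every write
conn.execute("PRAGMA temp_store = MEMORY;")   # keep temp tables/sorts in RAM
conn.execute("PRAGMA cache_size = -1000000;") # ~1 GB page cache
```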
If all you want to do is pull all the password wordlists together and remove duplicates, how are you going to do the ordering? If ordering is not important for you, you can just append all the wordlists to a single file and do:
cat wordlists/* >> compilation.txt
sort -u compilation.txt > unique-compilation.txt
to get a unique list.
However, I always prefer smaller wordlists combined with hashcat rules for cracking, instead of relying on huge wordlists. I really recommend giving it a shot if you haven't! Good luck and cheers!
Hello @FlameOfIgnis, thanks for your fast answer. I used the commands you wrote, but I had many problems and errors with them; that's why I decided to use a DBMS such as MySQL, and I have the resources for this. About sorting the wordlists: yes, I want to sort the wordlists, for example by length, 8 characters for WiFi cracking, 12 characters for some services, etc. Maybe we can continue our talk via email? Cheers
Hi again @kivik92, sorry about the delayed response. If you really want to use a DBMS for this, then all you'll need is a simple script (preferably Python) to read all the wordlists in a directory and insert the contents into an SQL table.
Getting all the wordlists should be a trivial task with os.walk; then you'll need to read the wordlists one line at a time and send an INSERT query to insert the contents into your table.
Consider keeping a counter and committing the DB changes every few hundred or so queries (and not after every query) to speed up the process.
Then you can sort/filter the results however you wish and export them to a wordlist.
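A minimal sketch of that script, with placeholder paths, table name, and batch size; the length filter at the end is just an example based on the 8-character WiFi case mentioned above:

```python
import os
import sqlite3

WORDLIST_DIR = "wordlists"   # placeholder path
BATCH = 50_000               # commit every few thousand inserts, not each one

conn = sqlite3.connect("wordlists.db")
conn.execute("CREATE TABLE IF NOT EXISTS word (value TEXT)")

batch = []
for root, _dirs, files in os.walk(WORDLIST_DIR):
    for name in files:
        path = os.path.join(root, name)
        with open(path, encoding="utf-8", errors="replace") as wordlist:
            for line in wordlist:
                batch.append((line.rstrip("\n"),))
                if len(batch) >= BATCH:
                    conn.executemany("INSERT INTO word VALUES (?)", batch)
                    conn.commit()
                    batch.clear()
if batch:
    conn.executemany("INSERT INTO word VALUES (?)", batch)
conn.commit()

# example filter: WPA2 passphrases are at least 8 characters long
with open("wifi-candidates.txt", "w", encoding="utf-8") as out:
    for (value,) in conn.execute(
            "SELECT DISTINCT value FROM word WHERE LENGTH(value) >= 8"):
        out.write(value + "\n")
conn.close()
```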
Good luck!
First of all, thank you for your work. It would be useful if you could share the methods or tools you used to process the 1B records. We are also very interested in this part of the study.