Optimization to speed up WhatsApp parser

sepinf-inc / IPED

IPED Digital Forensic Tool. It is an open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in a corporate investigation by private examiners.

Other

940 stars 218 forks source link

Optimization to speed up WhatsApp parser #1889

Closed wladimirleite closed 11 months ago

wladimirleite commented 11 months ago

Another user contacted me about some large WhatsApp databases (inside a UFDR) that were not being parsed. We found out that the databases were causing parsing timeouts. He sent me one of the databases and I was able to process it, but it took a long time (~20 minutes only on its parsing). It is an Android database with 450 MB (not that big) but with a lot of chats (~7,000) and messages (~700,000). For now, I suggested that he increases the parsing timeout parameters.

Taking a closer look at the code, I was able to find a somewhat simple change in the parser that reduced the parsing time from 20 minutes to ~30 seconds. I am testing with a larger sample of databases and will submit a PR if everything works as expected.

wladimirleite commented 11 months ago

Just to emphasize, the database the user sent me has an unusual number of chats (~7,000). Typically there are a few hundred.

lfcnassif commented 11 months ago

That's great @tc-wleite! Thank you for looking into this!

lfcnassif commented 11 months ago

Hi @tc-wleite! Just to confirm, do you think this optimization may significantly increase memory usage or not?

wladimirleite commented 11 months ago

No, exactly the same amount of memory is used.

wladimirleite commented 11 months ago

Just to elaborate a bit about the memory question and the proposed change idea, a pseudocode of how it is today...

for (Chat chat : chats) {
    ResultSet rs = queryMessages(chat);
    for(Message m : rs) {
        chat.addMessage(m);
    }
}
doSomething(chats);

...and how it would be after the change:

ResultSet rs = queryMessages();
for(Message m : rs) {
    Chat chat = idToChat.get(m.chatId);
    chat.addMessage(m);
}
doSomething(chats);

So instead of one query per chat, a single query to read all messages. In the current code, all messages (from all chats) are read in this point of the code, before doing anything with them, so no extra memory to hold messages is requeired. To be precise, there is an additional map idToChat, to get a Chat object from its id, but it holds one entry per chat, so even for that triggering database, it would consume only ~100 kbytes.

wladimirleite commented 11 months ago

I am testing with other databases, and for 90% of the cases the gain is minimal (almost none). This change only makes a difference for databases with a lot of chats (a few thousand). The triggering database is a very specific case. In the samples I collected, the second worst case I found took ~4 minutes in this step of reading messages from the database (using the current version). And it went down to ~20 seconds. I am still collecting more samples and running a few more tests before submitting the PR.

wladimirleite commented 11 months ago

For now, I suggested that he increases the parsing timeout parameters.

Just a quick update, increasing the timeout was enough to process the large WhatsApp databases. By the way, thanks @g-togni for reporting the issue and sending me a sample database.

g-togni commented 11 months ago

Thank you @tc-wleite! I appreciate the time you spent helping me to deal with this issue. In fact, that is the first time I got a database with a lot of chats (thousands) like this

wladimirleite commented 11 months ago

I ran a large performance test with ~1,600 WhatsApp databases from different cases (85% Android and 15% iOS). Many tasks were disabled, and I also disabled WhatsApp parser options to recover deleted records, merge databases and expand individual messages. Parsing timeout was also increased to avoid timeouts.

I measured only the time spent in extractChatList() function, where the proposed changed were made (line 34 below). https://github.com/sepinf-inc/IPED/blob/a34331f667a691ce1c269105f4ddbc4a64828be2/iped-parsers/iped-parsers-impl/src/main/java/iped/parsers/whatsapp/Extractor.java#L32-L37

All times are in milliseconds:	Measure	Before
Median	1,630	947
Average	11,647	3,293
Average across the slowest 100 databases	134,400	23,460
Average across the other 1,500 databases	3,528	1,959

The newer version seems faster in general, but more noticeable speed up came from the slowest (usually largest) databases.

I will run some additional tests with the regular options (merge database, recover messages etc.) to check if chats/messages produced by the newer version matches the previous one.

lfcnassif commented 11 months ago

Hi @tc-wleite, thank you for all tests! I think there is no need for additional performance tests, it is clear the change can benefit many cases. If there is no regression about the number of extracted messages in your previous tests, I think it is safe to apply the optimization.

wladimirleite commented 11 months ago

Hi @tc-wleite, thank you for all tests! I think there is no need for additional performance tests, it is clear the change can benefit many cases. If there is no regression about the number of extracted messages in your previous tests, I think it is safe to apply the optimization.

Thanks @lfcnassif! I think the performance is clearly better, but I will run a more realistic test (with less databases, but with all common processing options enabled) to check for regressions.

wladimirleite commented 11 months ago

Testing with 4 UFDRs and all common processing options, both the previous version and with the proposed changes generated the same messages.