sepinf-inc / IPED

IPED Digital Forensic Tool. It is open-source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in corporate investigations by private examiners.

Telegram iOS DB parser enhancements and bug fixes (#1998, #2000, #2003 and #2005) #1999

Closed wladimirleite closed 8 months ago

wladimirleite commented 10 months ago

Closes #1998, #2000, #2003 and #2005.

lfcnassif commented 10 months ago

Thank you @wladimirleite! I'll collect samples to help testing this a bit more.

wladimirleite commented 10 months ago

Processing 12 Telegram iOS databases I collected here, I got:

         | Master | This PR
Chats    | 841    | 1,354
Messages | 23,170 | 81,987
lfcnassif commented 10 months ago

Hi @wladimirleite! Here are the testing results on 19 Telegram iOS databases I collected here so far:

         | Master | This PR
Chats    | 1,633  | 2,762
Messages | 6,938  | 61,999

After the commit I pushed, we are recognizing more Telegram databases. On the other hand, 13 new exceptions are being thrown because of "no such table t0" (iOS user account extract query) and "no such table t2" (iOS contacts extract query). We should handle the tables from new versions properly...
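
Something along these lines might be enough to avoid those exceptions (just a sketch; tableExists() is a hypothetical helper, not existing parser code):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch only: guard each optional extract query with a table-existence check, so
// databases from Telegram versions without t0/t2 don't abort parsing with
// "no such table" errors. Not part of the current parser code.
class TelegramTableCheckSketch {

    static boolean tableExists(Connection conn, String table) throws SQLException {
        String sql = "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, table);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    // Usage idea: skip the optional extractions instead of failing the whole parse.
    // if (tableExists(conn, "t0")) { /* run the iOS user account extract query */ }
    // if (tableExists(conn, "t2")) { /* run the iOS contacts extract query */ }
}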

wladimirleite commented 10 months ago

After the commit I pushed, we are recognizing more Telegram databases. On the other hand, 13 new exceptions are being thrown because of "no such table t0" (iOS user account extract query) and "no such table t2" (iOS contacts extract query). We should handle the tables from new versions properly...

Great! I can take a look at this, unless you already started. If not, please send me some sample databases that caused these exceptions.

lfcnassif commented 10 months ago

I can take a look at this, unless you already started. If not, please send me some sample databases that caused these exceptions.

I didn't; I'm in meetings all week... I'll send them to you by Teams, thank you very much @wladimirleite!

lfcnassif commented 10 months ago

please send me some sample databases that caused these exceptions.

Just sent.

wladimirleite commented 10 months ago

@lfcnassif, taking a closer look at the samples you sent me and others I have here, I think that the databases with a different set of tables, like (t0, t2 and t3 only) and (t15, t16, t17, t19, t20 and t21), should not be identified as "application/x-telegram-db-ios".

Although they are named "db_sqlite" and used by Telegram, they contain different types of information, not related to contacts or chat messages. And they are not a newer version of the Telegram database we support.

Below is an example of a folder structure present in a very recent iOS extraction. Although all files are named "db_sqlite", only the first one is actually the kind of database that contains contacts and chats. [Screenshot of the folder structure omitted.]

So I am not sure if the changes you made for #2000 are the best approach. The main issue with the previous detector code (before your changes) was that it expected a "ft41" table, which is not present in all "valid" databases.

The tables actually used are t0, t2, t6, t7 and t9. My suggestion is to check only for these tables, as all the samples I have, more recent and older ones, including the ones you sent me, have them. I would also remove detection by name, as it is producing false positives. Do you see any drawback in making these changes? Anyway, after running more tests I will commit the changes I made, so you can test with your set of databases when you have some time. We can revert them later, if necessary.
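
Just to make the suggestion concrete, the detection rule I have in mind is basically this (a rough sketch; the actual detector code is structured differently):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the proposed rule: treat a db_sqlite file as a Telegram iOS database
// only if all tables actually used by the parser are present (illustrative only).
class TelegramIosDetectionSketch {

    private static final Set<String> REQUIRED_TABLES =
            new HashSet<>(Arrays.asList("t0", "t2", "t6", "t7", "t9"));

    static boolean looksLikeTelegramIosDb(Connection conn) throws SQLException {
        Set<String> found = new HashSet<>();
        try (Statement st = conn.createStatement();
                ResultSet rs = st.executeQuery(
                        "SELECT name FROM sqlite_master WHERE type='table'")) {
            while (rs.next()) {
                found.add(rs.getString(1));
            }
        }
        return found.containsAll(REQUIRED_TABLES);
    }
}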

wladimirleite commented 10 months ago

@lfcnassif, I found another (not directly related) issue: some databases I have here are not being parsed because of an exception in contacts parsing. Looking at the code and the raw bytes, I found that the decoding process uses the function findOfset(key, tString), which may find incorrect offsets in rare situations.

For example, when extracting the "title" property of a contact, it calls that function to find a "t". It then searches for 3 bytes: the key length (1), 't' (116) and the string type marker (4). The problem is that the sequence [1, 116, 4] may be something else entirely, not this "title" tag. So when it tries to read a string after these 3 bytes, the string length (the next 4 bytes) may be garbage (like a huge or negative number).

This is rare, but it happens a couple of times in the DB samples I have. There is no simple bug fix, as this is more of a conceptual problem. I will try to evaluate how complex it would be to rewrite it.
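
To illustrate the ambiguity with a contrived, self-contained example (the byte values below are made up; only the [1, 116, 4] marker pattern reflects the real format):

// Contrived illustration of the problem described above: a naive scan for the
// [keyLength, key, type] marker can match unrelated bytes that happen to contain
// the same values, so the "string length" read right after it is garbage.
public class FindOffsetAmbiguitySketch {

    // Naive scan, similar in spirit to findOfset(): returns the index right after
    // the first occurrence of [1, key, type], or -1 if not found.
    static int findMarker(byte[] data, byte key, byte type) {
        for (int i = 0; i + 2 < data.length; i++) {
            if (data[i] == 1 && data[i + 1] == key && data[i + 2] == type) {
                return i + 3;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The bytes 1, 116 ('t'), 4 appear here by coincidence inside unrelated data,
        // before any real "title" entry, so the 4 bytes read afterwards as the string
        // length (0xFFFE3344) decode to a negative number.
        byte[] data = { 0x10, 1, 116, 4, (byte) 0xFF, (byte) 0xFE, 0x33, 0x44 };
        int offset = findMarker(data, (byte) 't', (byte) 4);
        System.out.println("Offset found: " + offset); // 4, pointing at unrelated bytes
    }
}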

lfcnassif commented 10 months ago

@lfcnassif, taking a closer look at the samples you sent me and others I have here, I think that the databases with a different set of tables, like (t0, t2 and t3 only) and (t15, t16, t17, t19, t20 and t21), should not be identified as "application/x-telegram-db-ios".

Sorry @wladimirleite, I didn't test my change for false positive detections, my fault.

The tables actually used are t0, t2, t6, t7 and t9. My suggestion is to check only for these tables, as all the samples I have, more recent and older ones, including the ones you sent me, have them. I would also remove detection by name, as it is producing false positives. Do you see any drawback in making these changes?

I totally agree with both recommendations, checking just for those tables and removing detection by name. I didn't know that other, different databases with the db_sqlite name existed.

For example, when extracting the "title" property of a contact, it calls that function to find a "t". It then searches for 3 bytes: the key length (1), 't' (116) and the string type marker (4). The problem is that the sequence [1, 116, 4] may be something else entirely, not this "title" tag. So when it tries to read a string after these 3 bytes, the string length (the next 4 bytes) may be garbage (like a huge or negative number).

I see... If it is a heuristic to read serialized data (I didn't check the code), maybe using a proper deserialization function could help...

wladimirleite commented 10 months ago

For example, when extracting the "title" property of a contact, it calls that function to find a "t". It then searches for 3 bytes: the key length (1), 't' (116) and the string type marker (4). The problem is that the sequence [1, 116, 4] may be something else entirely, not this "title" tag. So when it tries to read a string after these 3 bytes, the string length (the next 4 bytes) may be garbage (like a huge or negative number).

I see... If it is a heuristic to read serialized data (I didn't check the code), maybe using a proper deserialization function could help...

As this is not related to the original issues, and this is clearly a bug (although rare), I am creating another issue.

wladimirleite commented 10 months ago

After the last changes all my Telegram iOS databases are being processed fine. I will test with more samples that @lfcnassif sent me.

Nevertheless, when I tried to process an iOS UFDR (the one that I am working with, and which motivated me to investigate the Telegram parser), it didn't parse anything. If I export all "db_sqlite" files from the UFDR and process them in a folder, it works fine. It took me quite a while to trace the issue... It seems that the isEnabledForIOSUfdr() function always returns false, as there is no such parameter (enabledForIOSUfdr) in ParserConfig.xml:

<parser class="iped.parsers.telegram.TelegramParser">
    <params>
        <param name="enabledForUfdr" type="bool">true</param>
        <param name="minChatSplitSize" type="int">6000000</param>
    </params>
</parser>

Not sure if this is intentional, but it seems a bit misleading to me (at least we should have the parameter set to false in the configuration file, so the user can enable/disable it).

Anyway, it seems to me that the existing parameter (enabledForUfdr) is enough, as there is already some "intersection" with the behavior of phoneParsersToUse in ParsingTaskConfig.txt. Another point is that other similar parsers (e.g. WhatsApp) do not have such a parameter.

The parameter extractMessages is also used, but it is missing from ParserConfig.xml. However, its default value is set to true (in the code).

So my idea is to keep only the enabledForUfdr configuration and add extractMessages to ParserConfig.xml. What do you think, @lfcnassif?
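
For reference, the resulting ParserConfig.xml entry could look roughly like this (just a sketch of the proposal; extractMessages is the only new parameter, using the default currently hard-coded in the parser):

<parser class="iped.parsers.telegram.TelegramParser">
    <params>
        <param name="enabledForUfdr" type="bool">true</param>
        <param name="extractMessages" type="bool">true</param>
        <param name="minChatSplitSize" type="int">6000000</param>
    </params>
</parser>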

lfcnassif commented 10 months ago

Hi @wladimirleite! The iOS Telegram parser was experimental and not very well tested initially; that's why I created isEnabledForIOSUfdr with a default value of false. Today chats/messages from PA should be imported from the UFDR, right?

Yes, after this PR with fixes, enhancements and many tests, I think we can remove that option, running the internal parser by default on iOS databases, and also expose extractMessages.

wladimirleite commented 10 months ago

Today chats/messages from PA should be imported from the UFDR, right?

Yes. The problem in my case was that PA didn't parse all chats present in the phone, because some database files are in a different folder structure (although the internal SQLite structure is the expected one), so I wanted to use the IPED parser.

hauck-jvsh commented 10 months ago

Amazing work, @wladimirleite! I didn't test the iOS parser very well at the time I started writing the parser, and I haven't updated the code since then, so it is impressive that it still worked. Telegram has changed a lot since that time.

wladimirleite commented 10 months ago

Not directly related to the issues addressed by this PR, but one of the "db_sqlite" files @lfcnassif sent me (with 265,424,896 bytes) has a group with a lot of members (~16,000) and many messages (~100,000). It is taking 10 hours (!) to be processed because of the loop below, which sets "Communication:To" properties for each message and each group member:

for (Message m : messages) {
    Metadata meta = new Metadata();
    (...)
    for (Long id : groupChat.getMembers()) {
        if (id != m.getFrom().getId())
            meta.add(org.apache.tika.metadata.Message.MESSAGE_TO, e.getContact(id).toString());
    }
    (...)
}

Tika's Metadata.add() (and set()) does not seem to be designed to handle multivalued properties with thousands of values, as it creates a new String[] after each new value is inserted (and there is no way of setting a String array in a single step). It would be possible to overcome this slowness with a subclass, but its main internal storage is private (a Map<String, String[]>), so almost all code would have to be copied. If the Metadata class were Cloneable, it would also be possible to overcome the problem, as all messages of a chat share most properties, so a base Metadata object could be built just once per chat, not per message.

Just as a test, I found out that if a single long String value is set (group members are concatenated using comma as separator), the processing takes 20 minutes. And if only the group name is set as Message.MESSAGE_TO, it takes only 3 minutes.
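
Roughly, the single-value variant of that test looks like this (a sketch based on the loop quoted above; groupChat, messages and e are the same objects from that snippet):

// Sketch of the single-value test: one set() call per message replaces thousands of
// add() calls, avoiding the repeated String[] copies inside Tika's Metadata.
for (Message m : messages) {
    Metadata meta = new Metadata();
    // (...)
    StringBuilder to = new StringBuilder();
    for (Long id : groupChat.getMembers()) {
        if (!id.equals(m.getFrom().getId())) {
            if (to.length() > 0)
                to.append(", ");
            to.append(e.getContact(id).toString());
        }
    }
    meta.set(org.apache.tika.metadata.Message.MESSAGE_TO, to.toString());
    // (...)
}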

Large groups are somewhat common in Telegram... So the options I considered so far were:

  • Write a subclass of Metadata (bad because of code duplication and conflict with future Tika's versions);
  • Use the group name as Message.MESSAGE_TO, if the group has more than N members (a parser parameter, with a default value relatively small, like 256);
  • Always use the group name as Message.MESSAGE_TO for extracted individual messages.

@lfcnassif, do you see any other solution for this issue? Or should we just leave it as it is?

lfcnassif commented 10 months ago

@lfcnassif, do you see any other solution for this issue? Or should we just leave it as it is?

Hi @wladimirleite, great catch! I think we should try to optimize this code for sure; taking 10 hours to process a 250 MB file is not reasonable.

  • Write a subclass of Metadata (bad because of code duplication and conflict with future Tika's versions);

We already have an iped.engine.tika.SyncMetadata class subclassing Tika's Metadata to work around concurrency issues: when a parsing thread triggers a timeout, it is disconnected but continues running (Thread.stop() usage is unsafe), and that occasionally caused ConcurrentModificationException in the past. I understand it may conflict with future versions of Tika's Metadata, but it was the only solution I found at the time, and it seems to be working fine. We can move SyncMetadata to the iped-parsers module.
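
Just to illustrate the general idea (a minimal sketch; not the actual SyncMetadata source, which overrides more methods):

import org.apache.tika.metadata.Metadata;

// Minimal sketch of the SyncMetadata idea: serialize access to the underlying
// Metadata so a disconnected (timed-out) parsing thread and the main pipeline
// cannot modify it concurrently. Not the actual iped.engine.tika.SyncMetadata code.
public class SyncMetadataSketch extends Metadata {

    @Override
    public synchronized void add(String name, String value) {
        super.add(name, value);
    }

    @Override
    public synchronized void set(String name, String value) {
        super.set(name, value);
    }

    @Override
    public synchronized String get(String name) {
        return super.get(name);
    }

    @Override
    public synchronized String[] getValues(String name) {
        return super.getValues(name);
    }
}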

  • Use the group name as Message.MESSAGE_TO, if the group has more than N members (a parser parameter, with a default value relatively small, like 256);
  • Always use the group name as Message.MESSAGE_TO for extracted individual messages.

This is a very interesting approach! Actually, @abdalla-mar was working on a dark web forums crawler/decoder for IPED and used this approach of putting the forum/topic/thread name in Message.TO: since groups don't exist there, users send messages to forums/threads that can be read by any user with permission. I like the idea; it would be clear in the Graph that it wasn't a direct communication, but a group entity would show up as an intermediary node. I would vote to always use the same behavior, i.e. either always put the group name as Message.MESSAGE_TO or never do it. The only issue is that it would be a behavior change; not sure if that is a real problem.

wladimirleite commented 10 months ago

We already have an iped.engine.tika.SyncMetadata class subclassing Tika's Metadata to work around concurrency issues: when a parsing thread triggers a timeout, it is disconnected but continues running (Thread.stop() usage is unsafe), and that occasionally caused ConcurrentModificationException in the past. I understand it may conflict with future versions of Tika's Metadata, but it was the only solution I found at the time, and it seems to be working fine. We can move SyncMetadata to the iped-parsers module.

I wasn't aware of the SyncMetadata class. This could be a possible solution...

I would vote to always use the same behavior, i.e. either always put the group name as Message.MESSAGE_TO or never do it.

I agree.

The only issue is that it would be a behavior change; not sure if that is a real problem.

Well, I don't think it is critical, but as it would change the current behavior, I would try to avoid it, if possible.

I will take a closer look at SyncMetadata and see if it would be possible to overcome the slowness without adding too much complexity.

patrickdalla commented 10 months ago

Hi @wladimirleite,

I came across the same performance issue with the addition of values to multivalued metadata when implementing the #1337 SQLite split. I had to add multiple locations from tables that had columns with latitude and longitude names, and for these locations to be plotted on the map they had to be put in the correct metadata.

To solve it, in the SqliteSplit branch, I implemented an intermediary metadata subclass, IpedMetadata, with a "set" method that accepts an ArrayList.

Maybe it can be used here too. Or, if you find a better solution, I can update that branch.

wladimirleite commented 10 months ago

I came across the same performance issue with the addition of values to multivalued metadata when implementing the #1337 SQLite split. I had to add multiple locations from tables that had columns with latitude and longitude names, and for these locations to be plotted on the map they had to be put in the correct metadata.

To solve it, in the SqliteSplit branch, I implemented an intermediary metadata subclass, IpedMetadata, with a "set" method that accepts an ArrayList.

Maybe it can be used here too. Or, if you find a better solution, I can update that branch.

Hi @patrickdalla! Checking the IpedMetadata code now, I am not sure if it would help (at least in the case I am dealing with). https://github.com/sepinf-inc/IPED/blob/48e9e43dcba51521ecf3c71ac58a38ead2d38264/iped-utils/src/main/java/iped/utils/tika/IpedMetadata.java

It calls Tika's Metadata.set(String key, String[] values), right? But if you look inside Tika's implementation of that method, it still has a loop adding one value at a time, which is what I was trying to avoid. https://github.com/apache/tika/blob/7fcec6448727f939dd59b7bc375169ef02be99ef/tika-core/src/main/java/org/apache/tika/metadata/Metadata.java#L373-L384 I created a subclass of IpedMetadata that sets an array of Strings in a single step, but that required a bit more code.

lfcnassif commented 10 months ago

Another advantage of using the group name as Message.MESSAGE_TO would be a faster and less disk-consuming GraphTask, since for a group with N members and M messages it will need to create CSVs with just M lines (edges) instead of the current N x M lines. Those relationship CSVs would also be imported faster by neo4j, and maybe the Graph analysis UI would become a bit faster.
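
For the group mentioned earlier (~16,000 members and ~100,000 messages), that is roughly 16,000 x 100,000 ≈ 1.6 billion edge lines versus about 100,000.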

lfcnassif commented 10 months ago

And if only the group name is set as Message.MESSAGE_TO, it takes only 3 minutes.

I just read the messages again and realized it is a 200x speed-up. Did the whole processing or just the ParsingTask take 3 minutes? Anyway, it is a huge performance impact, so I vote to use group names (maybe plus group IDs) as MESSAGE_TO, although it changes behavior. But maybe it should be implemented in another PR, since it should be applied to all chat parsers.

wladimirleite commented 10 months ago

I just read the messages again and realized it is a 200x speed-up. Did the whole processing or just the ParsingTask take 3 minutes? Anyway, it is a huge performance impact, so I vote to use group names (maybe plus group IDs) as MESSAGE_TO, although it changes behavior. But maybe it should be implemented in another PR, since it should be applied to all chat parsers.

The whole processing took ~3 minutes.

Thinking a bit here: conceptually speaking, for large groups, assuming that a message sent to the group is a direct communication between the sender and each group member may not be the most realistic interpretation. In that sense, using the group as an intermediate entity seems more logical.

Definitely another PR would be better. I will open an issue tomorrow to keep track of this.

One last comment: I managed to write a subclass of Tika's Metadata that handles multivalued properties with many values much faster. However, I found other bottlenecks, so using the group name (+ ID) makes even more sense. I am keeping the solution saved, as it may be useful in the future (perhaps in other situations), but I reverted the changes for now. For one of the bottlenecks, which is in IPED's code, not Tika's, and is completely isolated, I will submit a separate PR tomorrow.

lfcnassif commented 10 months ago

Thinking a bit here: conceptually speaking, for large groups, assuming that a message sent to the group is a direct communication between the sender and each group member may not be the most realistic interpretation. In that sense, using the group as an intermediate entity seems more logical.

Out of curiosity, what was the total case disk usage reduction? How was it specifically for iped/neo4j/csv, iped/neo4j/data and iped/index folders?

wladimirleite commented 10 months ago

Out of curiosity, what was the total case disk usage reduction? How was it specifically for iped/neo4j/csv, iped/neo4j/data and iped/index folders?

That is a great point! The previous times I mentioned were with the GraphTask disabled. I am running again with it enabled and will post disk usage numbers. By the way, with a FastMetadata class and a couple of other bottlenecks solved, processing time went down from ~10 hours to ~2 hours, which is better but still too much (GraphTask disabled; IndexTask dominated the processing time).

wladimirleite commented 10 months ago

Out of curiosity, what was the total case disk usage reduction? How was it specifically for iped/neo4j/csv, iped/neo4j/data and iped/index folders?

I am running again with it enabled and will post disk usage numbers.

With the GraphTask enabled, processing time went from ~2h to ~7h.

Test | Processing Time | Index Space | Neo4j\csv Space | Neo4j\data Space
Current (each group member in "Communication:To") | ~7 hours | 59 GB | 30 GB | 154 GB
Group name and ID as "Communication:To" | 4 minutes | 320 MB | 6 MB | 41 MB
lfcnassif commented 10 months ago

With the GraphTask enabled, processing time went from ~2h to ~7h.

Test | Processing Time | Index Space | Neo4j\csv Space | Neo4j\data Space
Current (each group member in "Communication:To") | ~7 hours | 59 GB | 30 GB | 154 GB
Group name and ID as "Communication:To" | 4 minutes | 320 MB | 6 MB | 41 MB

Awesome! With those numbers, I think using Group name as Message.TO is definitely the way to go, although users may need some time to get used to the new group rendering in the Graph.

Maybe a new icon for Group entities in the Graph could help to identify groups. I think we already have an unused icon for organizations, rendered as 2/3 people, that could be used for group nodes.

wladimirleite commented 9 months ago

There are other things that could be improved in the internal Telegram parser, but this PR already includes too many changes. I am stopping here so it can be reviewed.

lfcnassif commented 9 months ago

Thank you very much @wladimirleite!

@hauck-jvsh, since you are the original author of this parser, could you review this PR when you have available time?

hauck-jvsh commented 9 months ago

I can review this when I return from my vacation, if time isn't a problem.

hauck-jvsh commented 8 months ago

I will start reviewing this tomorrow. Is there any chance of getting some iOS databases? @lfcnassif, @wladimirleite

wladimirleite commented 8 months ago

I will start reviewing this tomorrow. Is there any chance of getting some iOS databases? @lfcnassif, @wladimirleite

I am sending you (through Teams) the samples I have (the whole folder, which also includes a few Android databases). It also includes the iOS samples @lfcnassif sent me recently.

lfcnassif commented 8 months ago

I will start reviewing this tomorrow. Is there any chance of getting some iOS databases? @lfcnassif, @wladimirleite

Thank you @hauck-jvsh! I think all my samples are included in @wladimirleite's data set.

hauck-jvsh commented 8 months ago

Excellent work! You have rewritten the PostBoxCoding from scratch; now you are decoding the entire object instead of the previous approach of finding specific keys inside the bytes. I think it is now much more general, preventing decoding data with the wrong type, or even data from other places.

hauck-jvsh commented 8 months ago

@wladimirleite, I'm running some tests here and it is taking a lot of time to process. I think this is the same problem you mentioned before, that messages have thousands of numbers in the messageTo metadata. Did you forget to push the changes?

wladimirleite commented 8 months ago

@wladimirleite, I'm running some tests here and it is taking a lot of time to process. I think this is the same problem you mentioned before, that messages have thousands of numbers in the messageTo metadata. Did you forget to push the changes?

Yes, it takes way too long for some DBs. After discussing with @lfcnassif, I created a new issue (#2011), as it is not directly related to the issues addressed by this PR, and ideally any solution (probably changing the current behavior) would have to be applied to the other chat parsers. To test this PR, I commented out the loop that sets the "Communication:To" property for each message (for group chats).

hauck-jvsh commented 8 months ago

After digging a little, I managed to extract some thumbs from the database.

hauck-jvsh commented 8 months ago

Just one thing: previously the default behavior was to use the UFED result for iOS and the internal parser for Android, but now the default behavior will change. It looks good to me; I think we can merge. @wladimirleite, can you check the extraction of thumbs, please?

wladimirleite commented 8 months ago

@wladimirleite, can you check the extraction of thumbs, please?

Sure! I will take a look as soon as possible.

wladimirleite commented 8 months ago

@wladimirleite, can you check the extraction of thumbs, please?

@hauck-jvsh, the thumbnail extraction worked great! That is a really useful enhancement! Thank you!!!

I ran a few more tests, checking with an iPhone that I am working on, and everything looks fine.

I just pushed two minor commits (code formatting and preserving line breaks in messages). The line breaks make some messages more readable (I made a similar change in WhatsApp parser recently).

hauck-jvsh commented 8 months ago

I think this is ready for merging. I had to reinstall my Eclipse and didn't apply the project's formatting style.

lfcnassif commented 8 months ago

Sorry guys, I had to force-push to fix a wrong conflict resolution pushed before; it seems fine now.

Thank you very much @wladimirleite and @hauck-jvsh for this great work!