sepinf-inc / IPED

IPED Digital Forensic Tool. It is an open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in a corporate investigation by private examiners.

APK Parser #1953

Closed wladimirleite closed 8 months ago

wladimirleite commented 8 months ago

Currently, APK files (Android Packages) are parsed by iped.parsers.compress.PackageParser. When processing UFDRs with many APKs, parsing these files usually takes quite some time, and the output is usually not very useful. Depending on the UFDR content and processing options, the time spent parsing APKs is usually between 10% and 40% of the whole processing time.

I ran some tests with a few libraries, and found https://github.com/hsiafan/apk-parser to be the best option. Although the project is not active anymore, it is pretty straightforward to use, has no native code (Java only), adds only one very small JAR (208 kB), is popular (1.2K stars), and handled all the sample APKs that I tried.

aberenguel commented 8 months ago

I'm also being impacted by this issue in a case.

image

aberenguel commented 8 months ago

It would be interesting if the ApkParser showed the signer certificates, so that we can check whether APK files are authentic or not. This could be done (externally, not in IPED) by checking whether the certificates are the same as the certificates in publicly known APKs.

For example, the SHA-1 of the Signal app certificate (in DER encoding) is 45989DC9AD8728C2AA9A82FA55503E34A8879374. It can be checked externally by the analyst at the link https://www.apkmirror.com/?post_type=app_release&searchtype=app&s=45989DC9AD8728C2AA9A82FA55503E34A8879374
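As an illustration of how such a fingerprint could be computed, here is a minimal sketch using only the JDK. The class and method names (`CertFingerprint`, `sha1Hex`) are hypothetical and not part of IPED or any APK library; the input is assumed to be the certificate's DER bytes, which for a `java.security.cert.X509Certificate` would come from `getEncoded()`. The `main` method uses placeholder bytes, not a real certificate.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class CertFingerprint {

    // Returns the uppercase hex SHA-1 digest of the given bytes.
    // For a certificate, pass its DER encoding (e.g. X509Certificate.getEncoded()).
    static String sha1Hex(byte[] der) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(der);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02X", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder input; a real caller would pass the certificate's DER bytes.
        byte[] sample = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sha1Hex(sample));
    }
}
```

The resulting 40-character string is what an analyst could paste into an external search such as the one linked above.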

wladimirleite commented 8 months ago

> For example, the SHA-1 of Signal App Certificate (in DER encoding) is 45989DC9AD8728C2AA9A82FA55503E34A8879374. It can be checked externally by the analyst in the link https://www.apkmirror.com/?post_type=app_release&searchtype=app&s=45989DC9AD8728C2AA9A82FA55503E34A8879374

The parser I am implementing should show these certificates. The library I am using has some limitations, but I believe it is fine for a first approach. In the future, we can use something more up to date.

wladimirleite commented 8 months ago

The parser output looks like this: (removed some lines to fit the screen) image

My first tests used a folder with ~50 APKs. When I tried to process a full UFDR, I detected a few issues/limitations:

- Although there was a major speedup in parsing time, the library requires a file as input (not a stream), so part of the performance gain was lost because the time used by TempFileTask increased. The sum (Parse + Temp) was still much better than before.
- Of the ~900 APKs present in the UFDR, ~20 (~2%) couldn't be parsed. I tested a few of them with command line tools, and they seem fine, so it was a library issue.
- The library only supports Signers V1 and V2. Currently there are V3, V3.1 and V4. This does not seem critical as, at least in the samples I used, the APKs still have V1 or V2 sections with their certificates (maybe to keep some backward compatibility).
- The main issue was that there seems to be a leak in the library, so some temporary files can't be deleted because they are still in use (open).

As the library project is read-only (since 2020), I considered making a fork and evaluating whether it is possible to use streams instead of files and to fix the leak. Later, based on the public format specification, I could try to implement support for newer signers and fix the issues that are preventing some APKs from being processed. @lfcnassif, what are your thoughts about that?
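For context on the signer versions mentioned in this comment: V1 is the JAR signing scheme, where the certificates live in signature block files under META-INF/ (with .RSA, .DSA or .EC extensions), while V2 and later store their data in the APK Signing Block, outside the ZIP entries. The sketch below, which uses only the JDK and works on a stream rather than a file, lists candidate V1 signature block files. It is not the apk-parser library's code or the parser being implemented here, it does not verify any signature, and the names `V1SignatureFiles`, `list` and `sampleZip` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class V1SignatureFiles {

    private static final String META_INF = "META-INF/";

    // Lists entries directly under META-INF/ that look like V1 signature block files.
    static List<String> list(InputStream apk) {
        List<String> found = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(apk)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                String name = entry.getName().toUpperCase(Locale.ROOT);
                boolean directChild = name.startsWith(META_INF)
                        && name.indexOf('/', META_INF.length()) < 0;
                boolean blockFile = name.endsWith(".RSA")
                        || name.endsWith(".DSA") || name.endsWith(".EC");
                if (directChild && blockFile) {
                    found.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    // Builds a tiny in-memory ZIP with APK-like entry names (dummy contents) for the demo.
    static byte[] sampleZip() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            String[] names = { "AndroidManifest.xml", "META-INF/MANIFEST.MF",
                    "META-INF/CERT.SF", "META-INF/CERT.RSA" };
            for (String n : names) {
                zos.putNextEntry(new ZipEntry(n));
                zos.write(new byte[] { 0 });
                zos.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(list(new ByteArrayInputStream(sampleZip())));
    }
}
```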

wladimirleite commented 8 months ago

Forgot to say, I am still searching for and evaluating other libraries. Forking the library that I am using would be a last resort, if no better option is found.

lfcnassif commented 8 months ago

> @lfcnassif, what are your thoughts about that?

The parser output looks very nice! I would just suggest extracting timestamps as metadata Dates, so they would populate the timeline, if that is not being done yet.

On the other hand, using a library that is no longer maintained is generally not good, since it would put the maintenance effort on our side... The library could still be a very stable one, but that does not seem to be the case given the APKs that were not parsed and the file leak issue, which is not good...

For file handle leaks, I successfully used the file leak detector project in the past. It is very nice and can give you the exact point in the code where the file handle was opened: https://file-leak-detector.kohsuke.org/

About TempFileTask time, it shouldn't increase, since it creates temp files for those < 1GB if tempOnSSD = true, regardless of the parser to be used ahead in the pipeline. Spooling streams to disk into the parser using TikaInputStream.getFile() should increase the parsing time itself for files greater than 1GB, since temp files weren't created for them by TempFileTask. We also have #1224 to optimize temp file creation for UFDR evidences.
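To make the cost of spooling concrete: when a parser needs a File but only a stream is available, the whole stream has to be copied to disk first. The sketch below shows that step with plain JDK calls. It is not the TikaInputStream or TempFileTask implementation, and the names `StreamSpooler` and `spoolToTempFile` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class StreamSpooler {

    // Copies the entire stream to a new temporary file and returns its path.
    // The caller is responsible for deleting the file when done.
    static Path spoolToTempFile(InputStream in) {
        try {
            Path tmp = Files.createTempFile("spool-", ".tmp");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[4096];
        Path tmp = spoolToTempFile(new ByteArrayInputStream(data));
        System.out.println(tmp + " size=" + Files.size(tmp));
        Files.delete(tmp);
    }
}
```

The copy is proportional to the item size, which is why it matters mostly for large files that did not already get a temp file earlier in the pipeline.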

wladimirleite commented 8 months ago

> The parser output looks very nice! I would just suggest extracting timestamps as metadata Dates, so they would populate the timeline, if not being done yet.

I will try to add that.

> On the other side, generally, using a library not maintained anymore is not good, it would put maintenance efforts on our side... Maybe the library could be a very stable one, but it does not seem the case given the APKs not parsed and the file leak issue, which is not good...

I agree.

> For file handle leaks, I successfully used the file leak detector project in the past, it is very nice and can give you the exact point into the code where the file handle was opened: https://file-leak-detector.kohsuke.org/

Thanks! I used it in the past, but kind of forgot it...

> About TempFileTask time, it shouldn't increase, since it creates temp files for those < 1GB if tempOnSSD = true, regardless of the parser to be used ahead in the pipeline. Spooling streams to disk into the parser using TikaInputStream.getFile() should increase the parsing time itself for files greater than 1GB, since temp files weren't created for them by TempFileTask. We also have #1224 to optimize temp file creation for UFDR evidences.

I see... All APKs in the sample UFDR are smaller than 1 GB.

Thanks @lfcnassif! For now, I will focus on the leak, which seems to be the most critical issue, and on searching for other libraries.

wladimirleite commented 8 months ago

I managed to overcome the issue that was preventing the temporary file from being deleted.

The problem was related to the usage of MappedByteBuffer. There are several online discussions about this (old) Java behavior. When you create a MappedByteBuffer from a FileChannel, it keeps the file open until the buffer is garbage collected. There doesn't seem to be a clean way to release the file. And GC is unpredictable, so sometimes the file deletion works and sometimes it doesn't.

wladimirleite commented 8 months ago

> I would just suggest extracting timestamps as metadata Dates, so they would populate the timeline, if not being done yet.

@lfcnassif, are you sure about that? These dates are just certificate dates, not related to user activities. And the "End Date" is usually a future date (I saw years like 2110). Wouldn't it "mess" with the timeline's time scale?

lfcnassif commented 8 months ago

> I managed to overcome the issue that was preventing the temporary file from being deleted.

> The problem was related to the usage of MappedByteBuffer. There are several online discussions about this (old) Java behavior. When you create a MappedByteBuffer from a FileChannel, it keeps the file opened until the buffer is garbage collected. There doesn't seem to be a clean way to release the file. And GC is unpredictable, so sometimes the file deletion works but sometimes it doesn't.

This used to be a headache... I'm curious about your solution. I think I used a utility method from Lucene to fix this in the past, since they deal a lot with mmap files.

> @lfcnassif, are you sure about that? These dates are just certificates dates, not related to user activities. And the "End Date" is usually a future date (I saw years like 2110). Wouldn't it "mess" with timeline time scale?

Maybe we could keep just the sign date. But I think the CertificateParser already extracts expiration dates, right @patrickdalla? Non-user system events are extracted from evt(x) logs. And some future events also populate the timeline today, such as cookie expiration dates...

wladimirleite commented 8 months ago

> This used to be a headache... I'm curious about your solution, I think I used an utility method from Lucene to fix this in the past, since they deal a lot with mmap files.

MappedByteBuffer was used in just one specific part of the code, which reads V2 certificates, and it was not really necessary (only a small part of the file is accessed, so it can be read entirely into a byte array). I created a custom class to rewrite that part of the code, which was also useful to fix other minor issues (such as some V2 certificates being lost).
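A minimal sketch of this kind of replacement, assuming the region to read is small enough to fit in memory: instead of calling FileChannel.map(), the bytes are copied into a heap array and the channel is closed right away, so nothing keeps the file open afterwards. This is not the actual custom class written for the parser; `RegionReader`, `readRegion` and `writeTemp` are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class RegionReader {

    // Reads `length` bytes starting at `offset` into a heap array.
    // The channel is closed by try-with-resources before returning.
    static byte[] readRegion(Path file, long offset, int length) {
        ByteBuffer buf = ByteBuffer.allocate(length);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    throw new EOFException("File ended before region was fully read");
                }
                pos += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.array();
    }

    // Demo helper: writes the given bytes to a new temporary file.
    static Path writeTemp(byte[] data) {
        try {
            Path tmp = Files.createTempFile("region-", ".bin");
            Files.write(tmp, data);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        Path tmp = writeTemp(data);
        System.out.println(Arrays.toString(readRegion(tmp, 10, 5)));
        // No mapping is held, so the temporary file can be removed immediately.
        Files.delete(tmp);
    }
}
```

The trade-off is an extra copy into the heap, which is negligible when only a small part of the file is needed.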

> Maybe we could keep just the sign date. But I think the CertificateParser already extracts expiration dates, right @patrickdalla? Non user system events are extracted from evt(x) logs. And some future events also populate the timeline today, like cookie expiration dates for example...

Ok! For now I will add the "start date". It should be trivial to add the "end date" later.