APK Parser (#1953) - Githubissues

wladimirleite commented 8 months ago

Closes #1953. I managed to overcome the main limitations of the library used to parse the APKs. There are a couple of TODO's related to certificates dates metadata that can be solved during review.

Processing 1K APKs (~12 GB) now took about 1 minute, while it was taking ~25 minutes before (with the default PackageParser). Index had 2.5 GB before, and now 20 MB.

A sample output of the parser:

lfcnassif commented 8 months ago

Thank you very much @tc-wleite!

wladimirleite commented 8 months ago

I ran a performance test with the 4 UDFRs which were extracted from Android phones used in #1912 tests, comparing total processing time between master branch and this PR (which includes an APK parser):

#	Type	Size (GB)	Total Items	Master Processing Time (s)	Processing Time (s) using the APK Parser	Processing Time Reduction
1	UFDR	2.4	57,495	555	500	10%
2	UFDR	18	118,297	1,494	649	57%
5	UFDR	33	697,193	3,553	3,008	15%
6	UFDR	37	1,596,925	4,342	3,546	18%

So, using the APK parser would reduce a bit the processing time of Android UFDRs.

patrickdalla commented 8 months ago

Hi wladmir

I will review tomorrow your code

could you share your test cases with me?

Em seg., 13 de nov. de 2023 16:34, Wladimir Leite @.***> escreveu:

I ran a performance test with the 4 UDFRs which were extracted from Android phones used in #1912 https://github.com/sepinf-inc/IPED/issues/1912 tests, comparing total processing time between master branch and this PR (which includes an APK parser):

Since there was a lot of performance changes recently, I repeated the tests:

Type Size (GB) Total Items Master Processing Time (s) Processing Time

(s) using the APK Parser Processing Time Reduction 1 UFDR 2.4 57,495 555 500 10% 2 UFDR 18 118,297 1,494 649 57% 5 UFDR 33 697,193 3,553 3,008 15% 6 UFDR 37 1,596,925 4,342 3,546 18%

So, using the APK parser would reduce a bit the processing time of Android UFDRs.

— Reply to this email directly, view it on GitHub https://github.com/sepinf-inc/IPED/pull/1956#issuecomment-1809045273, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG247S7HFD6VE6U76ZBCAO3YEJ74VAVCNFSM6AAAAAA6UGUY7CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBZGA2DKMRXGM . You are receiving this because your review was requested.Message ID: @.***>

wladimirleite commented 8 months ago

I will review tomorrow your code. Could you share your test cases with me?

Sure! I just used APKs exported from some Android UFDRs I am working with (and for some test the UFDRs themselves). I will export some of these APKs and send them to ~~you through Teams~~ our internal WebDAV server.

wladimirleite commented 8 months ago

I will export some of these APKs and send them to ~~you through Teams~~ our internal WebDAV server.

I just copied a 7-Zip (~14 GB) with some samples to \Temp\Wladimir.

lfcnassif commented 8 months ago

could you share your test cases with me?

I'll start a crawl in BSB cases database through the night.

lfcnassif commented 8 months ago

could you share your test cases with me?

@patrickdalla I just shared 37K APKs from 87 cases with you.

patrickdalla commented 8 months ago

Hi @tc-wleite , IPED already have an certificate parser. I extracted the certificate data and enabled CertificateParser in ParsersConfig.xml (not enabled by default) and resulted in something like the image below. This could be a reasonable approach?

@lfcnassif I don't remember well why CertificateParser is not enabled by default. What I remember is that the certificate carver carves a lot of certificates, many of them duplicated many times. Can we reenable it?

wladimirleite commented 8 months ago

Hi @tc-wleite , IPED already have an certificate parser. I extracted the certificate data and enabled CertificateParser in ParsersConfig.xml (not enabled by default) and resulted in something like the image below. This could be a reasonable approach?

Hi @patrickdalla! Do you mean extract certificate data as subitems of APK? It seems a good idea. In such case, metadata set in the APK could be removed, as it would be set for its children.

Just a couple of minor details:

I think that setting the type of the properties "certificate:notafter" and "certificate:notbefore" as dates would avoid the extra properties with ":date" suffix.
Most property names use camel case, so perhaps "notAfter" and "notBefore" would be more readable.

lfcnassif commented 8 months ago

@lfcnassif I don't remember well why CertificateParser is not enabled by default. What I remember is that the certificate carver carves a lot of certificates, many of them duplicated many times. Can we reenable it?

I don't remember either, maybe it caused a performance impact, and since most cases don't need certificate parsing information... Or maybe the impact was caused by the Certificate carver, since it recovers a lot of items. One option would be enabling the parser and keeping the carver disabled by default. By the way, Tika has a Pkcs7Parser, also disabled by default intentionally a long ago, which parses a subset of mimes handled by your parser, and ideally a comparison between them would be good, not sure if it was done.

PS: To enable CertificateParser, we should create an issue to report it in the 4.2.0 release notes.

patrickdalla commented 8 months ago

I have just reviewed Pkcs7Parser code from tika.

Pkcs7 is a container spec to hold content and its signature info in same file/stream. Pkcs7Parser of tika only strips/ignores the signature and delegate the content parsing to the corresponding parser. Pkcs7Parser doesn't parse any signature and respectives certification information.

Pkcs7 is most used to save certification revogation list and certification files itself (when included with entire certificates of certification path). The CertificateParser uses java.security.cert.CertificateFactory that can extract the certificates these files PKCS7 formatted content.

PKCS7 is not the format of the certificate used to sign the APK.

It seems from https://issues.apache.org/jira/browse/TIKA-3205, code done after the implementation of CertificateParser, that TIKA didn't classified PEM and DER files as "x-x509-ca-cert". But now it do.

I have created in CertificateParser "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but now it seems it can use the new "application/x-x509-ca-cert" identified by Tika.

I will create an issue with this same info.

patrickdalla commented 8 months ago

Anyway, I will finish reviewing this PR. If CategoriesToExpand.txt is configured to expand "Android Applications" category, it will create a new subitem with the certificate data. Tika is classifying this new subitem as application/x-x509-cert. When CertificateParser is reviewed later, it will parse this accordingly.

lfcnassif commented 8 months ago

I have just reviewed Pkcs7Parser code from tika.

Pkcs7 is a container spec to hold content and its signature info in same file/stream. Pkcs7Parser of tika only strips/ignores the signature and delegate the content parsing to the corresponding parser. Pkcs7Parser doesn't parse any signature and respectives certification information.

Pkcs7 is most used to save certification revogation list and certification files itself (when included with entire certificates of certification path). The CertificateParser uses java.security.cert.CertificateFactory that can extract the certificates these files PKCS7 formatted content.

PKCS7 is not the format of the certificate used to sign the APK.

It seems from https://issues.apache.org/jira/browse/TIKA-3205, code done after the implementation of CertificateParser, that TIKA didn't classified PEM and DER files as "x-x509-ca-cert". But now it do.

I have created in CertificateParser "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but now it seems it can use the new "application/x-x509-ca-cert" identified by Tika.

I will create an issue with this same info.

Thank you @patrickdalla for your analysis!

lfcnassif commented 8 months ago

Thanks @patrickdalla for your review. @tc-wleite is @patrickdalla approach fine to you?

wladimirleite commented 8 months ago

Thanks @patrickdalla for your review. @tc-wleite is @patrickdalla approach fine to you?

Sure, it seems fine! I can run a last test before merging, but I will be able to do that in the afternoon.

lfcnassif commented 8 months ago

Sure, it seems fine! I can run a last test before merging, but I will be able to do that in the afternoon.

That would be great! But please enjoy the holiday, this is not urgent.

wladimirleite commented 8 months ago

Hi @lfcnassif! I ran a large test and everything seems fine. Just made a few minor (mostly visual) adjustments.

lfcnassif commented 8 months ago

Thank you @tc-wleite for your adjustments and for testing again! Before merging, I would like to do 2 comments:

Maybe changes in CertificateParser would fit better in #1981, but they look fine;
@patrickdalla is the addition of application/x-pem-file to MakePreviewConfig.txt intentional here?

patrickdalla commented 8 months ago

It was. but I reverted to put changes related to CertificateParser in the other PR, and forgot to revert this inclusion on MakePreview. You can revert if you want to merge this PR still today, or tomorrow I will do.

Em qua., 15 de nov. de 2023 20:35, Luis Filipe Nassif < @.***> escreveu:

Thank you @tc-wleite https://github.com/tc-wleite for your adjustments and for testing again! Before merging, I would like to do 2 comments:

Maybe changes in CertificateParser would fit better in #1981 https://github.com/sepinf-inc/IPED/pull/1981, but they look fine;

@patrickdalla https://github.com/patrickdalla is the addition of application/x-pem-file to MakePreviewConfig.txt intentional here?

— Reply to this email directly, view it on GitHub https://github.com/sepinf-inc/IPED/pull/1956#issuecomment-1813518382, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG247S7LO3RWSSMA556RVW3YEVNVVAVCNFSM6AAAAAA6UGUY7CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJTGUYTQMZYGI . You are receiving this because you were mentioned.Message ID: @.***>

lfcnassif commented 8 months ago

Thanks @patrickdalla. I also removed the changes in CertificateParser, since most of them are included in #1981 to avoid possible merge conflicts.

sepinf-inc / IPED

APK Parser (#1953) #1956

Type Size (GB) Total Items Master Processing Time (s) Processing Time