Closed wladimirleite closed 8 months ago
Thank you very much @tc-wleite!
I ran a performance test with the 4 UDFRs which were extracted from Android phones used in #1912 tests, comparing total processing time between master branch and this PR (which includes an APK parser):
# | Type | Size (GB) | Total Items | Master Processing Time (s) | Processing Time (s) using the APK Parser | Processing Time Reduction |
---|---|---|---|---|---|---|
1 | UFDR | 2.4 | 57,495 | 555 | 500 | 10% |
2 | UFDR | 18 | 118,297 | 1,494 | 649 | 57% |
5 | UFDR | 33 | 697,193 | 3,553 | 3,008 | 15% |
6 | UFDR | 37 | 1,596,925 | 4,342 | 3,546 | 18% |
So, using the APK parser would reduce a bit the processing time of Android UFDRs.
Hi wladmir
I will review tomorrow your code
could you share your test cases with me?
Em seg., 13 de nov. de 2023 16:34, Wladimir Leite @.***> escreveu:
I ran a performance test with the 4 UDFRs which were extracted from Android phones used in #1912 https://github.com/sepinf-inc/IPED/issues/1912 tests, comparing total processing time between master branch and this PR (which includes an APK parser):
Since there was a lot of performance changes recently, I repeated the tests:
Type Size (GB) Total Items Master Processing Time (s) Processing Time
(s) using the APK Parser Processing Time Reduction 1 UFDR 2.4 57,495 555 500 10% 2 UFDR 18 118,297 1,494 649 57% 5 UFDR 33 697,193 3,553 3,008 15% 6 UFDR 37 1,596,925 4,342 3,546 18%
So, using the APK parser would reduce a bit the processing time of Android UFDRs.
— Reply to this email directly, view it on GitHub https://github.com/sepinf-inc/IPED/pull/1956#issuecomment-1809045273, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG247S7HFD6VE6U76ZBCAO3YEJ74VAVCNFSM6AAAAAA6UGUY7CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBZGA2DKMRXGM . You are receiving this because your review was requested.Message ID: @.***>
I will review tomorrow your code. Could you share your test cases with me?
Sure!
I just used APKs exported from some Android UFDRs I am working with (and for some test the UFDRs themselves).
I will export some of these APKs and send them to you through Teams our internal WebDAV server.
I will export some of these APKs and send them to
you through Teamsour internal WebDAV server.
I just copied a 7-Zip (~14 GB) with some samples to \Temp\Wladimir.
could you share your test cases with me?
I'll start a crawl in BSB cases database through the night.
could you share your test cases with me?
@patrickdalla I just shared 37K APKs from 87 cases with you.
Hi @tc-wleite , IPED already have an certificate parser. I extracted the certificate data and enabled CertificateParser in ParsersConfig.xml (not enabled by default) and resulted in something like the image below. This could be a reasonable approach?
@lfcnassif I don't remember well why CertificateParser is not enabled by default. What I remember is that the certificate carver carves a lot of certificates, many of them duplicated many times. Can we reenable it?
Hi @tc-wleite , IPED already have an certificate parser. I extracted the certificate data and enabled CertificateParser in ParsersConfig.xml (not enabled by default) and resulted in something like the image below. This could be a reasonable approach?
Hi @patrickdalla! Do you mean extract certificate data as subitems of APK? It seems a good idea. In such case, metadata set in the APK could be removed, as it would be set for its children.
Just a couple of minor details:
@lfcnassif I don't remember well why CertificateParser is not enabled by default. What I remember is that the certificate carver carves a lot of certificates, many of them duplicated many times. Can we reenable it?
I don't remember either, maybe it caused a performance impact, and since most cases don't need certificate parsing information... Or maybe the impact was caused by the Certificate carver, since it recovers a lot of items. One option would be enabling the parser and keeping the carver disabled by default. By the way, Tika has a Pkcs7Parser, also disabled by default intentionally a long ago, which parses a subset of mimes handled by your parser, and ideally a comparison between them would be good, not sure if it was done.
PS: To enable CertificateParser, we should create an issue to report it in the 4.2.0 release notes.
I have just reviewed Pkcs7Parser code from tika.
Pkcs7 is a container spec to hold content and its signature info in same file/stream. Pkcs7Parser of tika only strips/ignores the signature and delegate the content parsing to the corresponding parser. Pkcs7Parser doesn't parse any signature and respectives certification information.
Pkcs7 is most used to save certification revogation list and certification files itself (when included with entire certificates of certification path). The CertificateParser uses java.security.cert.CertificateFactory that can extract the certificates these files PKCS7 formatted content.
PKCS7 is not the format of the certificate used to sign the APK.
It seems from https://issues.apache.org/jira/browse/TIKA-3205, code done after the implementation of CertificateParser, that TIKA didn't classified PEM and DER files as "x-x509-ca-cert". But now it do.
I have created in CertificateParser "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but now it seems it can use the new "application/x-x509-ca-cert" identified by Tika.
I will create an issue with this same info.
Anyway, I will finish reviewing this PR. If CategoriesToExpand.txt is configured to expand "Android Applications" category, it will create a new subitem with the certificate data. Tika is classifying this new subitem as application/x-x509-cert. When CertificateParser is reviewed later, it will parse this accordingly.
I have just reviewed Pkcs7Parser code from tika.
Pkcs7 is a container spec to hold content and its signature info in same file/stream. Pkcs7Parser of tika only strips/ignores the signature and delegate the content parsing to the corresponding parser. Pkcs7Parser doesn't parse any signature and respectives certification information.
Pkcs7 is most used to save certification revogation list and certification files itself (when included with entire certificates of certification path). The CertificateParser uses java.security.cert.CertificateFactory that can extract the certificates these files PKCS7 formatted content.
PKCS7 is not the format of the certificate used to sign the APK.
It seems from https://issues.apache.org/jira/browse/TIKA-3205, code done after the implementation of CertificateParser, that TIKA didn't classified PEM and DER files as "x-x509-ca-cert". But now it do.
I have created in CertificateParser "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but now it seems it can use the new "application/x-x509-ca-cert" identified by Tika.
I will create an issue with this same info.
Thank you @patrickdalla for your analysis!
Thanks @patrickdalla for your review. @tc-wleite is @patrickdalla approach fine to you?
Thanks @patrickdalla for your review. @tc-wleite is @patrickdalla approach fine to you?
Sure, it seems fine! I can run a last test before merging, but I will be able to do that in the afternoon.
Sure, it seems fine! I can run a last test before merging, but I will be able to do that in the afternoon.
That would be great! But please enjoy the holiday, this is not urgent.
Hi @lfcnassif! I ran a large test and everything seems fine. Just made a few minor (mostly visual) adjustments.
Thank you @tc-wleite for your adjustments and for testing again! Before merging, I would like to do 2 comments:
application/x-pem-file
to MakePreviewConfig.txt
intentional here?It was. but I reverted to put changes related to CertificateParser in the other PR, and forgot to revert this inclusion on MakePreview. You can revert if you want to merge this PR still today, or tomorrow I will do.
Em qua., 15 de nov. de 2023 20:35, Luis Filipe Nassif < @.***> escreveu:
Thank you @tc-wleite https://github.com/tc-wleite for your adjustments and for testing again! Before merging, I would like to do 2 comments:
- Maybe changes in CertificateParser would fit better in #1981 https://github.com/sepinf-inc/IPED/pull/1981, but they look fine;
- @patrickdalla https://github.com/patrickdalla is the addition of application/x-pem-file to MakePreviewConfig.txt intentional here?
— Reply to this email directly, view it on GitHub https://github.com/sepinf-inc/IPED/pull/1956#issuecomment-1813518382, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG247S7LO3RWSSMA556RVW3YEVNVVAVCNFSM6AAAAAA6UGUY7CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJTGUYTQMZYGI . You are receiving this because you were mentioned.Message ID: @.***>
Thanks @patrickdalla. I also removed the changes in CertificateParser, since most of them are included in #1981 to avoid possible merge conflicts.
Closes #1953. I managed to overcome the main limitations of the library used to parse the APKs. There are a couple of TODO's related to certificates dates metadata that can be solved during review.
Processing 1K APKs (~12 GB) now took about 1 minute, while it was taking ~25 minutes before (with the default PackageParser). Index had 2.5 GB before, and now 20 MB.
A sample output of the parser: