sepinf-inc / IPED

IPED Digital Forensic Tool. It is open-source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in corporate investigations by private examiners.

Performance of carved items temporary file access #1912

Closed wladimirleite closed 7 months ago

wladimirleite commented 10 months ago

While working on other issues, I noticed that, near the end of processing a UFDR, the last items (mostly carved) took more time than I expected. I/O and CPU usage were low, which usually indicates that something could be improved.

After some investigation, I found that the extra time came from carved items (from a few SQLite databases) that couldn't access their temporary parent file because it had already been deleted. Many items were hitting the exception shown below.

https://github.com/sepinf-inc/IPED/blob/5f5ab2d3f6a17fa4825830c62127095c1fe2c2b6/iped-engine/src/main/java/iped/engine/data/Item.java#L547-L557

So for each carved item, it seems the parent file had to be unpacked again (from the UFDR) so the carved piece could be accessed. There wasn't any disk activity because everything was in the OS cache, but it still took extra time per item.
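
To make the fallback concrete, here is a minimal, self-contained paraphrase of the access pattern described above, using only JDK types. All names here are illustrative assumptions, not IPED's actual API; the real logic is in the Item.java snippet linked above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class CarvedReadSketch {
    /** A stand-in for whatever re-extracts the parent's content from the UFDR. */
    interface ParentSource {
        byte[] extractRange(long offset, int length) throws IOException;
    }

    /** Reads a carved range, preferring the parent's temporary file. */
    static byte[] readCarvedRange(File parentTempFile, long offset, int length,
                                  ParentSource fallback) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(parentTempFile, "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return buf; // fast path: the parent temp file is still on disk
        } catch (IOException e) {
            // Slow path: the temp file was already deleted, so the parent's
            // content must be unpacked again before the carved range can be
            // read. Cheap on I/O when the data is still in the OS cache, but
            // it adds per-item overhead, which is what was being observed.
            return fallback.extractRange(offset, length);
        }
    }
}
```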

In another part of the code, I noticed that if the processing queue is long (more than 100K items), carved files are processed as soon as possible, so their parent temporary file is usually still available. Otherwise (which was my case), they are placed in the queue and there is no guarantee that the parent temporary file will still exist when they are dequeued.
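
A minimal sketch of that queue-length rule, assuming the 100K threshold mentioned above (names and structure are illustrative; the actual rule lives in IPED's Worker class):

```java
import java.util.Deque;
import java.util.function.Consumer;

class CarvedSchedulingSketch<T> {
    // Threshold taken from the "more than 100K items" observation above.
    private static final int LONG_QUEUE_THRESHOLD = 100_000;

    void handleCarvedItem(T carved, Deque<T> queue, Consumer<T> process) {
        if (queue.size() > LONG_QUEUE_THRESHOLD) {
            // Long queue: process right away, while the parent's temporary
            // file is guaranteed to still exist.
            process.accept(carved);
        } else {
            // Short queue: enqueue to keep all cores busy; by the time the
            // item is dequeued, the parent temp file may have been deleted.
            queue.addLast(carved);
        }
    }
}
```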

I made a code change to force the temporary file to be kept until its carved children are processed, and ran a comparison test with 7 UFDRs and 2 E01s.

| # | Type | Size (GB) | Total Items | Total Carved | Current Carved Corrupted | Current Processing Time (s) | Current Parent Temp Miss | Modified Processing Time (s) | Modified Carved Corrupted |
|---|------|-----------|-------------|--------------|--------------------------|-----------------------------|--------------------------|------------------------------|---------------------------|
| 1 | UFDR | 2.4 | 57,495 | 9,491 | 1,393 | 1,565 | 7,020 | 235 | 1,393 |
| 2 | UFDR | 18 | 118,297 | 16,862 | 543 | 1,760 | 1,384 | 1,677 | 529 |
| 3 | UFDR | 19 | 1,035,352 | 15,535 | 4,116 | 2,217 | 8,074 | 1,877 | 4,114 |
| 4 | UFDR | 22 | 1,082,889 | 2,092 | 40 | 2,351 | 486 | 2,393 | 40 |
| 5 | UFDR | 33 | 697,193 | 28,583 | 1,591 | 5,289 | 5,415 | 3,012 | 1,591 |
| 6 | UFDR | 37 | 1,596,925 | 68,637 | 6,693 | 6,809 | 12,936 | 5,134 | 6,689 |
| 7 | UFDR | 37 | 1,337,394 | 20,463 | 3,282 | 2,854 | 10,722 | 2,886 | 3,280 |
| 8 | E01 | 34 | 1,747,095 | 90,838 | 2,515 | 7,815 | 36,571 | 6,712 | 2,515 |
| 9 | E01 | 37 | 1,558,515 | 133,827 | 3,361 | 6,926 | 17,282 | 5,714 | 3,361 |

The evidence that motivated the investigation was the first one, for which the processing time was reduced from ~26 minutes to ~4 minutes. It was the most extreme difference. For other evidence files (2, 4 and 7) there was no significant difference, but for the rest there was a noticeable improvement.

Tests had most of the common tasks enabled, except OCR and audio transcription. The column "Current Parent Temp Miss" is the number of times the exception shown in the code snippet above was thrown. With the modified version, no item threw an exception at that point.

In some of the cases, the number of "Carved Corrupted" was reduced by a few items. I am not sure if it is related, but I was wondering whether, in very rare cases, the parent file deletion could happen (i.e. not be blocked by a "file in use" error) at a certain point of the carved child's processing and cause some parsing exceptions. Not sure if this scenario is even possible...

This happens in a part of the code I was not familiar with... @lfcnassif, does my analysis make sense? Did you try/consider this approach (avoiding deletion of the parent temporary file before carved children are processed)?

lfcnassif commented 10 months ago

Hi @tc-wleite!

@lfcnassif, does my analysis make sense? Did you try/consider this approach (avoiding deletion of the parent temporary file before carved children are processed)?

Yes and yes! Actually, processing carved files immediately should be the default behavior, both to reuse the parent temp file and to take advantage of CPU internal caches, as far as I remember. But when the processing queue has few items (likely to become empty soon), there is a rule in the Worker class to enqueue carved files, aiming to make better use of the CPU cores. Possibly that is what allows the parent temp file to be deleted. I think there is some protection, AFAIK in the Item class dispose() method and at other call sites I don't remember exactly (and I knew it wasn't perfect, which is why the catch block and comment you quoted above exist). Anyway, I thought it was a very rare situation and wasn't causing an impact as large as the one you noticed and measured. So thank you very much for catching, testing, and debugging this issue! I'm curious about your solution!

PS: The original behavior of container expansion was also to process subitems immediately, to take advantage of system/hardware caches, if the processing queue wasn't empty or almost empty. Unfortunately, that was causing issues with JEP, and I changed the behavior to always enqueue container subitems.

wladimirleite commented 10 months ago

Thanks for the detailed explanation! I will submit a draft PR so you can take a look at the modified code. For a quick test, I just commented out the temporary file deletion. When I noticed that it reduced the processing time a lot (I was using the first UFDR in the table), I decided to look for an actual solution.

I think the impact depends a lot on the evidence file content. But in general, it seems to help.

lfcnassif commented 10 months ago

Thank you @tc-wleite!

blocked by "file in use"

Maybe I was relying on this too, but I learned from you that it isn't reliable when you fixed the ImageMagick garbage temp file cleanup some time ago.

wladimirleite commented 10 months ago

@lfcnassif, the commit above contains the changes I made. I tried to keep them minimal and to reuse the existing TemporaryResources approach, as this is a critical part of the code. Maybe there is a better way of achieving the same goal...
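
For readers following along, the general idea (keep the parent temp file alive until every carved child has been processed) could look like the reference-counting sketch below. This is an assumption about the approach, not the actual commit, which reuses IPED's TemporaryResources class:

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicInteger;

class SharedTempFileSketch {
    private final File file;
    private final AtomicInteger refs = new AtomicInteger(1); // the parent's own reference

    SharedTempFileSketch(File file) {
        this.file = file;
    }

    /** Called once for each carved child created from this parent. */
    void retain() {
        refs.incrementAndGet();
    }

    /** Called when the parent, or one of its carved children, finishes processing. */
    void release() {
        if (refs.decrementAndGet() == 0) {
            file.delete(); // the last user is done: only now is deletion safe
        }
    }

    File get() {
        return file;
    }
}
```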

lfcnassif commented 10 months ago

Thank you @tc-wleite, I'll try to take a look soon...

lfcnassif commented 10 months ago

Hi @tc-wleite, I just took a look at your commit; the approach seems good to me!

But, for now, I can imagine one source of issues: Item.dispose() is called before the item's processing finishes for items that are re-enqueued by the AbstractTask.reEnqueueItem(IItem item, int queue) method. This is done to free resources, like the parsed text cache kept on the heap, which was a source of OOMEs in the past before it was cleared.
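
In other words (a hedged paraphrase with illustrative names, not the actual AbstractTask code), the problematic ordering is that dispose() runs before the item is handed to the next queue:

```java
import java.util.List;
import java.util.Queue;

class ReEnqueueSketch {
    interface Disposable {
        void dispose(); // frees heap resources, e.g. the parsed text cache
    }

    // dispose() is called *before* re-queuing, to avoid OOMEs from items
    // parked in a queue while still holding large caches. Under the new
    // scheme, disposing here could also drop the parent temp file that
    // this item's pending carved children still need.
    static <T extends Disposable> void reEnqueueItem(T item, int queueIdx,
                                                     List<Queue<T>> queues) {
        item.dispose();
        queues.get(queueIdx).add(item);
    }
}
```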

In some of the cases, the number of "Carved Corrupted" was reduced by a few items. I am not sure if it is related, but I was wondering whether, in very rare cases, the parent file deletion could happen (i.e. not be blocked by a "file in use" error) at a certain point of the carved child's processing and cause some parsing exceptions. Not sure if this scenario is even possible...

About this, I think it is unlikely to be the cause, since just before parse(...) is called on the carved item, an InputStream to its contents is opened, which should prevent deletion of the related file, AFAIK.
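
A sketch of that protection (illustrative names, assuming the stream is opened just before parsing). Note that on Windows an open handle normally blocks deletion of the backing file, while on POSIX systems the file can still be unlinked but the already-open stream keeps reading it, so the parse itself is safe either way once the stream is open:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class ParseWithOpenStreamSketch {
    interface Parser {
        void parse(InputStream is) throws IOException;
    }

    static void parseCarvedItem(Path backingFile, Parser parser) throws IOException {
        // The handle stays open for the whole parse, shielding the parse from
        // a concurrent deletion of the backing temp file.
        try (InputStream is = Files.newInputStream(backingFile)) {
            parser.parse(is);
        }
    }
}
```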

wladimirleite commented 10 months ago

But, for now, I can imagine one source of issues: Item.dispose() is called before the item's processing finishes for items that are re-enqueued by the AbstractTask.reEnqueueItem(IItem item, int queue) method. This is done to free resources, like the parsed text cache kept on the heap, which was a source of OOMEs in the past before it was cleared.

I see. Well, I think it would be possible to handle this situation (i.e. avoid disposing of the parent temporary file in the case of a re-enqueue). I am not sure it would be a good idea, though, because it could take a long time until the queue with the re-enqueued carved items starts to be processed, and the temporary file would be sitting there...

But for these cases, the behavior would be the same as in the current version, right? (i.e. the parent temporary file would be deleted and the item would need to be read in a different way, without reusing the parent temporary file).

lfcnassif commented 10 months ago

I am not sure it would be a good idea, though, because it could take a long time until the queue with the re-enqueued carved items starts to be processed, and the temporary file would be sitting there...

I also thought about this side effect; in the worst case, it could result in out-of-space issues...

But for these cases, the behavior would be the same as in the current version, right? (i.e. the parent temporary file would be deleted and the item would need to be read in a different way, without reusing the parent temporary file).

Yes, I think so. I need to analyze the possible side effects more carefully...

wladimirleite commented 9 months ago

Since there were a lot of performance changes recently, I repeated the tests:

| # | Type | Size (GB) | Total Items | Total Carved | Previous Test Processing Time (s) | Master Processing Time (s) | Modified Processing Time (s) | Processing Time Reduction |
|---|------|-----------|-------------|--------------|-----------------------------------|----------------------------|------------------------------|---------------------------|
| 1 | UFDR | 2.4 | 57,495 | 9,491 | 1,565 | 555 | 230 | 59% |
| 2 | UFDR | 18 | 118,297 | 16,862 | 1,760 | 1,494 | 1,496 | - |
| 3 | UFDR | 19 | 1,035,352 | 15,535 | 2,217 | 1,490 | 1,497 | - |
| 4 | UFDR | 22 | 1,082,889 | 2,092 | 2,351 | 1,784 | 1,558 | 13% |
| 5 | UFDR | 33 | 697,193 | 28,583 | 5,289 | 3,553 | 2,844 | 20% |
| 6 | UFDR | 37 | 1,596,925 | 68,637 | 6,809 | 4,342 | 4,196 | 3% |
| 7 | UFDR | 37 | 1,337,394 | 20,463 | 2,854 | 2,259 | 2,180 | 3% |
| 8 | E01 | 34 | 1,747,095 | 90,838 | 7,815 | 6,326 | 6,321 | - |
| 9 | E01 | 37 | 1,558,515 | 133,827 | 6,926 | 5,009 | 4,939 | 1% |

Although the recent changes reduced the processing time in most cases, keeping the parent temporary file of carved items still provided a noticeable gain in some of them.

I will investigate the re-enqueue question raised by @lfcnassif.

lfcnassif commented 9 months ago

Thank you @tc-wleite for running all those tests again! Great to see recent changes were good.

About the re-enqueue situation, if there is no better solution, I think your changes are fine. You are right, if temp files are deleted because the item was re-enqueued, the behavior will be the same as the current one.

wladimirleite commented 9 months ago

About the re-enqueue situation, if there is no better solution, I think your changes are fine. You are right, if temp files are deleted because the item was re-enqueued, the behavior will be the same as the current one.

I will create a case to test the re-enqueue, and check for possible side effects (and maybe a better way to handle them).

wladimirleite commented 9 months ago

Last weekend I ran another set of tests using other evidence files (from different real cases I am working on):

| # | Type | Size (GB) | Total Items | Total Carved | [Master] Processing Time (s) | [Master] Temp Peak Usage (MB) | [This PR] Processing Time (s) | [This PR] Temp Peak Usage (MB) | Processing Time Reduction |
|---|------|-----------|-------------|--------------|------------------------------|-------------------------------|-------------------------------|--------------------------------|---------------------------|
| 1 | UFDR | 15 | 447,015 | 52,573 | 994 | 2,818 | 882 | 3,020 | 11% |
| 2 | UFDR | 15 | 1,212,414 | 29,188 | 1,124 | 4,232 | 1,029 | 4,057 | 8% |
| 3 | UFDR | 16 | 1,125,428 | 29,758 | 2,343 | 6,243 | 1,061 | 6,233 | 55% |
| 4 | UFDR | 36 | 2,706,816 | 9,219 | 3,308 | 12,247 | 3,306 | 11,752 | - |
| 5 | UFDR | 39 | 1,333,602 | 20,459 | 1,846 | 19,435 | 1,852 | 19,144 | - |
| 6 | UFDR | 47 | 1,846,854 | 47,096 | 2,149 | 11,342 | 2,087 | 11,411 | 3% |
| 7 | UFDR | 117 | 2,209,701 | 56,071 | 3,568 | 39,229 | 3,580 | 38,808 | - |
| 8 | E01 | 18 | 3,061 | 25 | 161 | 28,575 | 143 | 30,519 | 11% |
| 9 | E01 | 473 | 460,435 | 60,853 | 3,765 | 45,517 | 3,512 | 43,317 | 7% |

This time I also measured the peak space used in the IPED temporary folder; there was no significant increase. As for processing time, there was a small reduction (between 3% and 11%) in half of the cases. For the third UFDR, there was a major speed-up; I repeated the processing for this UFDR and confirmed the times shown in the table.

So I expect a small gain in most cases, and a significant speed-up in a few. I have already fixed a minor issue and plan to run a few more tests to ensure the proposed changes do not cause any side effects.

lfcnassif commented 8 months ago

Hi @wladimirleite, thank you very much for running more tests! As I said before, I think this improvement is fine and can be merged as is.

An unrelated question: could you please check the UFDR processing logs for AccessDeniedException when processing WhatsApp databases and other files? A colleague of ours is having that exception using a recent snapshot, and I wasn't able to reproduce it... I'm concerned it could have been caused by recent improvements in UFDR temp file reuse and small file processing...

wladimirleite commented 8 months ago

Cool! I will try to finish one last test and send a PR in the next few days. I haven't seen any AccessDeniedExceptions when processing UFDRs with the master branch.

wladimirleite commented 8 months ago

I will create a case to test the re-enqueue, and check for possible side effects (and maybe a better way to handle them).

I ran a test in which many carved items needed to be re-enqueued, and it worked as expected, so I think this is ready for review.

lfcnassif commented 8 months ago

I ran a test in which many carved items needed to be re-enqueued, and it worked as expected, so I think this is ready for review.

Great, thank you very much @wladimirleite for this optimization and for all your tests!