sepinf-inc / IPED

IPED Digital Forensic Tool. It is an open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in a corporate investigation by private examiners.

Create thumbnails for formats supported by LibreOffice #419

Closed wladimirleite closed 3 years ago

wladimirleite commented 3 years ago

Related to #119.

wladimirleite commented 3 years ago

I finished the implementation of this feature. I am adding it as an experimental feature, disabled by default, because it has some known limitations/issues that I wasn't able to overcome. Still, it seems to be useful and has no side effects. My last test used 1.4M files, collected from hundreds of different cases, including the following formats:

Relevant issues I found:

  1. LibreOffice creates a lot of temporary files in the user's Temp folder (usually located in "\Users\\AppData\Local\Temp" on Windows), and these files may stay there if the conversion (Document -> Image) doesn't finish (timeout or fatal error).
  2. Files with invalid signatures (e.g. a deleted .DOC whose content was overwritten) are opened by the default application (LibreOffice Writer), which takes a lot of time and either creates a document full of garbage or times out.
  3. The conversion process is slow.
  4. There is some system-level constraint / synchronization that prevents the full CPU power from being used.

I was able to find solutions to avoid (or at least minimize) some of these issues:

  1. I found a way to point the temp folder to IPED's temp folder by changing the user profile files created for each LibreOffice instance, which solves this issue.
  2. Items that caused a parserException or that are file fragments are skipped. This doesn't solve the issue, but it avoids it in the most common (and easy to detect) situations.
  3. Unfortunately, the conversion is very slow indeed. I tried several configurations (set in the user profile), but they didn't provide any meaningful improvement. One change that helped (reduced the average time to ~50%) was to use batches of items instead of starting a conversion process for each item. The code became more complex after this change, and there is a risk that a timeout would affect other items in the same batch, but as the gain was relevant, I decided to keep it, using a batch size of 10 (larger batches didn't help). The batch size can be changed in the configuration file.
  4. I couldn't find a workaround for this issue. During processing, if all workers are running DocThumbTask, the CPU doesn't reach full usage. It doesn't seem to be a disk-related issue, as I didn't observe any long disk queue or high disk usage. In my tests, though, most IPED tasks were disabled. In a real case this issue may be less critical, as other tasks will be running, and the issue seems related to many LibreOffice instances running conversions at the same time.
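
The per-instance profile and timeout handling described in items 1 and 2 can be sketched roughly as below. This is not IPED's code (IPED is written in Java), only a minimal Python illustration. `-env:UserInstallation`, `--headless`, `--convert-to` and `--outdir` are LibreOffice command-line options; the function names, the `soffice` executable name and the default timeout are assumptions made for the example. The temp-folder redirection done through the profile files is not shown.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, profile_dir, soffice="soffice"):
    """Build a headless LibreOffice command that renders `src` to a PNG."""
    # A dedicated profile directory per instance keeps concurrent
    # conversions from sharing the same user profile.
    profile_uri = Path(profile_dir).resolve().as_uri()
    return [
        soffice,
        f"-env:UserInstallation={profile_uri}",
        "--headless",
        "--convert-to", "png",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_with_timeout(src, out_dir, profile_dir, timeout_s=60, soffice="soffice"):
    """Run one conversion; return True on success, False on timeout or error."""
    cmd = build_convert_command(src, out_dir, profile_dir, soffice)
    try:
        # subprocess.run kills the child it started when the timeout expires.
        done = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return done.returncode == 0
```

Each worker would pass its own `profile_dir`, so a hung or crashed instance only leaves residue inside a folder that the caller controls.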

lfcnassif commented 3 years ago

Some thoughts:

  1. In ImageThumbTask, there is a background thread that tries to clean up temp files left behind (those without open handles). Conversions of vector images that timed out sometimes used to leave temp files several GB in size. I'm not sure if this approach would help here.
  2. I think #401 will help with this.
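
A background cleaner of the kind mentioned in item 1 could look roughly like this. It is a sketch under assumptions, not the ImageThumbTask code: the function names, the age threshold and the polling interval are hypothetical, and file age (plus a failed delete) stands in for the "no open handles" check.

```python
import threading
import time
from pathlib import Path


def clean_stale_files(temp_dir, max_age_s, now=None):
    """Delete regular files in temp_dir older than max_age_s; return removed names."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(temp_dir).iterdir():
        try:
            if path.is_file() and now - path.stat().st_mtime > max_age_s:
                path.unlink()
                removed.append(path.name)
        except OSError:
            # The file may still be in use (on Windows an open handle blocks
            # deletion) or may have vanished; leave it for the next pass.
            continue
    return sorted(removed)


def start_cleaner(temp_dir, max_age_s=600, interval_s=60):
    """Run clean_stale_files periodically in a daemon thread; return a stop Event."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_s):
            clean_stale_files(temp_dir, max_age_s)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Calling `set()` on the returned event ends the loop at the next wake-up.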

wladimirleite commented 3 years ago

> One change that helped (reduced the average time to ~50%) was to use batches of items instead of starting a conversion process for each item. The code became more complex after this change, and there is a risk that a timeout would affect other items in the same batch, but as the gain was relevant, I decided to keep it, using a batch size of 10 (larger batches didn't help). It can be changed in the configuration file.

When running more tests with real forensic images (earlier tests focused on sets containing only PDFs and LibreOffice-supported files), I noticed that the gain from using batches when calling the LibreOffice document-to-image conversion was somewhat smaller (~10-20% reduction in average processing time). It also depends on file size, with smaller performance gains for larger files. As mentioned before, error/timeout control is more complex when using batches, so I decided to revert this change and remove the usage of batches from this initial implementation.
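
The trade-off can be illustrated with a small sketch. It is hypothetical Python, not the reverted IPED code: `convert_batch` stands for any callable that converts a whole batch in one process and raises `TimeoutError` when the batch does not finish in time.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def convert_in_batches(items, convert_batch, batch_size=10):
    """Convert items batch by batch; a timed-out batch fails all of its items."""
    results = {}
    for batch in chunked(items, batch_size):
        try:
            convert_batch(batch)
            ok = True
        except TimeoutError:
            ok = False
        # Without extra bookkeeping there is no way to tell which item of a
        # failed batch caused the timeout, so every item shares the outcome.
        for item in batch:
            results[item] = ok
    return results
```

With one item per process, a single problematic file only costs its own thumbnail, which is the simpler error handling the non-batched version keeps.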

lfcnassif commented 3 years ago

Just out of curiosity: is that a ~10-20% decrease in DocThumbTask time or in overall processing time?

wladimirleite commented 3 years ago

> Just out of curiosity: is that a ~10-20% decrease in DocThumbTask time or in overall processing time?

Just in DocThumbTask. In the previous tests (using only LibreOffice documents, with most processing options disabled), there was a ~50% decrease in DocThumbTask time (which was almost all of the processing time) when using batches.

This time I used 12 HDD images from 3 different cases, with most processing options enabled, so DocThumbTask took a reasonable share of the time (5-25% of processing time). It is worth noting that none of these disks has a massive collection of Office documents.
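
As a back-of-the-envelope check, the two ranges above can be combined into an approximate overall effect. This is only arithmetic on the quoted figures, assuming any share in the 5-25% range can pair with any reduction in the 10-20% range; the thread does not report overall numbers.

```python
def overall_saving(task_share, task_reduction):
    """Fraction of total processing time saved when one task gets faster."""
    return task_share * task_reduction


low = overall_saving(0.05, 0.10)   # smallest share, smallest reduction
high = overall_saving(0.25, 0.20)  # largest share, largest reduction
print(f"{low:.1%} to {high:.1%}")  # 0.5% to 5.0%
```

Under that assumption, batching would have saved at most about 5% of total processing time on these images.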

lfcnassif commented 3 years ago

Closed by #491