Closed wladimirleite closed 3 years ago
I finished the implementation of this feature. I am adding it as an experimental feature, disabled by default, as it has some known limitations/issues that I wasn't able to overcome. But it seems to be useful (and has no side effects). My last test used 1.4M files, collected from hundreds of different cases, including the following formats:
Relevant issues I found:
I was able to find solutions to avoid (or at least minimize) some of these issues:
Some thoughts:
One change that helped (it reduced the average time to ~50%) was to process items in batches, instead of starting a conversion process for each item. The code became more complex after this change, and there is a risk that a timeout would affect other items in the same batch, but as the gain was relevant, I decided to keep it, using a batch size of 10 (larger batches didn't help). The batch size can be changed in the configuration file.
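To illustrate the batching idea and the timeout risk mentioned above, here is a minimal sketch in Python. It is not the code from this feature: `make_batches`, `convert_all`, and the `convert_batch` callback are hypothetical names, and the per-item retry after a batch timeout is just one possible way to keep a single slow item from affecting the rest of its batch.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def make_batches(items: Sequence[str], batch_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def convert_all(
    items: Sequence[str],
    convert_batch: Callable[[List[str]], None],  # hypothetical converter callback
    batch_size: int = 10,
) -> Dict[str, bool]:
    """Convert items batch by batch; map each item to True if it succeeded."""
    results: Dict[str, bool] = {}
    for batch in make_batches(items, batch_size):
        try:
            convert_batch(batch)
            results.update({item: True for item in batch})
        except TimeoutError:
            # Assumption: on a batch timeout, retry one item at a time so that
            # only the item that actually times out is marked as failed.
            for item in batch:
                try:
                    convert_batch([item])
                    results[item] = True
                except TimeoutError:
                    results[item] = False
    return results


def fake_convert(batch: List[str]) -> None:
    """Stand-in for the external conversion call; times out on one file."""
    if "bad.doc" in batch:
        raise TimeoutError("conversion timed out")


if __name__ == "__main__":
    files = [f"file{i}.odt" for i in range(24)] + ["bad.doc"]
    outcome = convert_all(files, fake_convert)
    print(sum(outcome.values()), "converted,", len(files) - sum(outcome.values()), "failed")
```

The retry path is where the extra complexity comes from: without it, a timeout on one item would discard the work done for the other items in the same batch.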
When running more tests with real forensic images (earlier tests were focused on sets with PDFs and LibreOffice-supported files only), I noticed that the gain of using batches when calling the LibreOffice document-to-image conversion was a bit smaller (~10-20% reduction of average processing time). It also depends on the file length, with minor performance gains for larger files. As mentioned before, the error/timeout control is more complex when using batches, so I decided to revert this and, for this initial implementation, remove the usage of batches.
Just out of curiosity, is that a ~10-20% decrease in DocThumbTask time or in overall processing time?
Just in DocThumbTask. In previous tests (using only LibreOffice documents and most processing options disabled) there was a ~50% decrease in DocThumbTask (which was almost all the processing time), when using batches.
This time I used 12 HDD images from 3 different cases, with most processing options enabled. So DocThumbTask took a reasonable share of the time (5-25% of processing time). It is worth noting that none of these disks has a massive collection of Office documents.
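As a rough back-of-the-envelope check, the figures above can be combined to estimate the effect of batches on overall processing time. This is only a sketch: `overall_reduction` is a hypothetical helper, and it assumes the task's share of total time and its relative reduction simply multiply, and that the extremes of both ranges can coincide.

```python
def overall_reduction(task_share: float, task_reduction: float) -> float:
    """Fraction of total processing time saved when a task that accounts for
    task_share of the total gets faster by task_reduction."""
    return task_share * task_reduction


# DocThumbTask took 5-25% of processing time and got ~10-20% faster with batches.
low = overall_reduction(0.05, 0.10)
high = overall_reduction(0.25, 0.20)
print(f"{low:.1%} to {high:.1%}")  # prints "0.5% to 5.0%"
```

Under these assumptions, the overall gain from batches on such images would be somewhere between about 0.5% and 5% of total processing time.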
Closed by #491
Related to #119.