sepinf-inc / IPED

IPED Digital Forensic Tool. It is an open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in a corporate investigation by private examiners.

Speed up thumbnail generation for formats supported by LibreOffice #1051

Open lfcnassif opened 2 years ago

lfcnassif commented 2 years ago

Some office files have embedded thumbnails; those could be used to speed up DocThumbTask for some formats.

PS: this is like what we already do for JPEG thumbnails embedded in the EXIF section.
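A minimal sketch of the embedded-thumbnail idea, for illustration only. It assumes the input is a ZIP-based office file: ODF packages usually store a preview at `Thumbnails/thumbnail.png`, and OOXML packages commonly store one at `docProps/thumbnail.jpeg` when the file was saved with a preview. The class name `EmbeddedOfficeThumb` and its `extract` method are hypothetical and are not part of IPED or DocThumbTask.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Sketch: pull an embedded preview image out of a ZIP-based office file. */
final class EmbeddedOfficeThumb {

    // Assumed preview locations: ODF uses Thumbnails/thumbnail.png,
    // OOXML commonly uses docProps/thumbnail.jpeg (other names are possible).
    private static final List<String> CANDIDATES =
            List.of("Thumbnails/thumbnail.png", "docProps/thumbnail.jpeg");

    /** Returns the raw image bytes of the first candidate entry found, if any. */
    static Optional<byte[]> extract(InputStream officeFile) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(officeFile)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (CANDIDATES.contains(entry.getName())) {
                    // Reads only the bytes of the current entry.
                    return Optional.of(zip.readAllBytes());
                }
            }
        }
        // No embedded preview: the caller would fall back to LibreOffice rendering.
        return Optional.empty();
    }
}
```

An empty result would simply mean falling back to the existing LibreOffice-based rendering, so files without a preview would not be handled any differently than today.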

lfcnassif commented 2 years ago

@tc-wleite I'm not sure why you didn't integrate the idea of generating thumbnails for several documents at the same time in #419 (code complexity? or maybe because one timeout would affect several docs instead of one?), but I think we can try something like this: https://github.com/btimby/thumbnailer

The idea is to open a few LibreOffice connections and communicate with them using the Java NOA library already used by the LibreOfficeViewer. Each connection would still create one thumbnail at a time, but this would save the LibreOffice startup time.

lfcnassif commented 2 years ago

Edited: related, promising, and Apache-licensed: https://github.com/sbraconnier/jodconverter

wladimirleite commented 2 years ago

This seems like a good idea! I didn't implement it initially because of the additional complexity. Making the current solution work already had a few challenges, as the interaction with LibreOffice is not well documented and sometimes doesn't work as expected. I also tried to avoid long-lived processes (like something that would run during the whole IPED processing), but this may not be an issue, or it may be implemented in a different way (like start, process a batch, and terminate). I am not familiar with JODConverter, but from a quick look, it may be a good option.

lfcnassif commented 2 years ago

> may be implemented in a different way (like start, process a batch and terminate).

In the external parsing module, we restart each external parsing process after X documents are parsed, to clean up possible resource leaks. We could do the same here, and maybe also have an option to specify the number of LibreOffice processes in the conversion pool.
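A rough sketch of that combination, assuming a fixed number of workers that are each replaced after a configurable number of tasks. Everything here is hypothetical: `RecyclingPool`, its parameters, and the worker type `W` are illustration names, and a real worker would wrap a LibreOffice process or connection. Per-task timeouts are deliberately left out.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Sketch: fixed-size pool of reusable workers, each replaced after maxTasks uses. */
final class RecyclingPool<W extends AutoCloseable> {

    private static final class Slot<W> {
        W worker;   // null until first use or after recycling
        int used;   // tasks run on the current worker
    }

    private final BlockingQueue<Slot<W>> idle;
    private final Supplier<W> factory;   // hypothetical: would start a LibreOffice process
    private final int maxTasks;

    RecyclingPool(int size, int maxTasks, Supplier<W> factory) {
        this.idle = new ArrayBlockingQueue<>(size);
        this.factory = factory;
        this.maxTasks = maxTasks;
        for (int i = 0; i < size; i++) {
            idle.add(new Slot<>());      // workers are started lazily
        }
    }

    /** Borrows a worker, runs the task on it, and recycles the worker when due. */
    <R> R run(Function<W, R> task) throws Exception {
        Slot<W> slot = idle.take();      // blocks while all workers are busy
        try {
            if (slot.worker == null) {
                slot.worker = factory.get();
            }
            return task.apply(slot.worker);
        } finally {
            try {
                if (slot.worker != null && ++slot.used >= maxTasks) {
                    W old = slot.worker;
                    slot.worker = null;
                    slot.used = 0;
                    old.close();         // terminate to release any leaked resources
                }
            } finally {
                idle.add(slot);          // always return the slot to the pool
            }
        }
    }
}
```

With `size = 1` and `maxTasks = 2`, five consecutive tasks would start three workers and close two of them. A task that hangs would still block its slot in this sketch, so a timeout that kills and replaces the worker would have to be added on top.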

lfcnassif commented 2 years ago

JODConverter seems to be a very good option: it is simple and has a number of useful configuration parameters. See https://github.com/sbraconnier/jodconverter/wiki/Java-Library and https://github.com/sbraconnier/jodconverter/wiki/Configuration