Open lfcnassif opened 2 years ago
@tc-wleite I'm not sure why you didn't integrate the idea/implementation of generating thumbnails for several documents at the same time in #419 (code complexity? or maybe because one timeout would affect several docs instead of one?), but I think we can try something like this: https://github.com/btimby/thumbnailer
The idea is to open a few LibreOffice connections and communicate with them using the Java NOA library already used by the LibreOfficeViewer. Each connection would still create one thumbnail at a time, but this would save the LibreOffice start-up time.
Edited: Related, promising and Apache licensed: https://github.com/sbraconnier/jodconverter
This seems like a good idea! I didn't implement it initially because of the additional complexity. Making the current solution work already posed a few challenges, as the interaction with LibreOffice is not well documented and sometimes doesn't work as expected. I also tried to avoid long-lived processes (like something that would run during the whole IPED processing), but this may not be an issue, or it may be implemented in a different way (like start, process a batch and terminate). I am not familiar with JODConverter, but from a quick look, it may be a good option.
> may be implemented in a different way (like start, process a batch and terminate).
In the external parsing module, we restart each external parsing process after X documents are parsed, to clean possible resource leaks. Maybe we can also have an option to specify the number of LibreOffice processes in the conversion pool.
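The "restart after X documents" idea combined with a configurable pool size could look roughly like the sketch below. It is only an illustration under assumptions: `RecyclingPool` and its parameters are hypothetical names, not existing IPED or LibreOffice code, and the generic worker `W` stands in for whatever would wrap one LibreOffice process or connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical fixed-size pool whose workers are replaced after a maximum number of tasks. */
class RecyclingPool<W extends AutoCloseable> {

    /** One pool slot: the current worker (null until first use) and its task count. */
    private static final class Slot<T> {
        T worker;
        int tasksDone;
    }

    private final BlockingQueue<Slot<W>> idle;
    private final Supplier<W> factory;
    private final int maxTasksPerWorker;

    RecyclingPool(int poolSize, int maxTasksPerWorker, Supplier<W> factory) {
        this.idle = new ArrayBlockingQueue<>(poolSize);
        this.factory = factory;
        this.maxTasksPerWorker = maxTasksPerWorker;
        for (int i = 0; i < poolSize; i++) {
            idle.add(new Slot<>()); // workers are started lazily, on first use
        }
    }

    /** Runs one task on a free worker, blocking while all workers are busy. */
    <R> R execute(Function<W, R> task) throws InterruptedException {
        Slot<W> slot = idle.take();
        try {
            if (slot.worker == null) {
                slot.worker = factory.get(); // (re)start the worker
                slot.tasksDone = 0;
            }
            slot.tasksDone++;
            return task.apply(slot.worker);
        } finally {
            // Recycle the worker once it has handled its quota, to drop leaked resources.
            if (slot.worker != null && slot.tasksDone >= maxTasksPerWorker) {
                closeQuietly(slot.worker);
                slot.worker = null;
            }
            idle.add(slot);
        }
    }

    /** Closes the workers that are currently idle; busy workers are not interrupted. */
    void shutdown() {
        List<Slot<W>> slots = new ArrayList<>();
        idle.drainTo(slots);
        for (Slot<W> s : slots) {
            if (s.worker != null) {
                closeQuietly(s.worker);
                s.worker = null;
            }
        }
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception e) {
            // ignored in this sketch; a real implementation would log and kill the process
        }
    }
}
```

With this shape, a timeout or crash only costs the one worker that hit it, and the pool size and the per-worker task limit become the two options to expose.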
JODConverter seems to be a very good option: it is simple and has a number of nice configuration parameters: https://github.com/sbraconnier/jodconverter/wiki/Java-Library https://github.com/sbraconnier/jodconverter/wiki/Configuration
Some office files have embedded thumbnails; those could be used to speed up DocThumbTask for some formats.
PS: like we do for JPEG thumbnails embedded in the EXIF section.
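For the ZIP-based formats, reading an embedded thumbnail could be as simple as the sketch below. It assumes the preview is stored under `Thumbnails/thumbnail.png` (ODF packages) or `docProps/thumbnail.jpeg` (OOXML packages, where it is optional and may use other image types); the `EmbeddedThumbs` class and its method are hypothetical names, not existing IPED code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper that looks for an embedded preview image in a ZIP-based office file. */
class EmbeddedThumbs {

    // Assumed entry names; the list is not exhaustive.
    private static final String[] CANDIDATES = {
        "Thumbnails/thumbnail.png",  // ODF (odt, ods, odp)
        "docProps/thumbnail.jpeg"    // OOXML (docx, xlsx, pptx), when present
    };

    /** Returns the raw image bytes, or empty if no candidate entry is found. */
    static Optional<byte[]> extract(InputStream officeFile) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(officeFile)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                for (String name : CANDIDATES) {
                    if (name.equalsIgnoreCase(entry.getName())) {
                        return Optional.of(readCurrentEntry(zip));
                    }
                }
            }
        }
        return Optional.empty(); // not a ZIP package, or no embedded thumbnail
    }

    /** Reads the bytes of the entry the stream is currently positioned on. */
    private static byte[] readCurrentEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

An empty result, or an image that turns out to be blank or too small, would simply fall back to the LibreOffice rendering path.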