Closed bjagg closed 7 years ago
It seems to be referenced only as a library and never actually used in the code. Unless other libraries need it and those dependencies are declared in the jars, I think removing the references to this library should not affect the product.
That said, the hard part is being sure of that. Once I'm able to build with sbt, I can try removing it and see whether it still builds. Testing this at runtime will be more complex, but it seems to be imported in the Admin Console plugin and the conversion plugin. I'll try to figure out how to test it. @cbeach47, maybe you have a clue about how to test these two things.
@ddelblanco - I can definitely help out with the Admin Console, and can review how to test the conversion plugin. The Admin Console is a JNLP web start app launched from the Settings menu in the primary Equella web UI.
The dhfjava library is only referenced by the conversion service, which I just added the sbt build for. The conversion service is an optional part of EQUELLA that we might just want to disable for now, as it basically only references libraries which aren't open source.
Office2Html is primarily (if not only) used when viewing older versions of Office files.
To enable it server-side, in optional-config.properties, ensure the following two lines are uncommented and configured:
conversionService.disableConversion = false
conversionService.conversionServicePath = /home/equella/equellaServer/conversion/conversion-service.jar
To disable it server-side, in optional-config.properties (it's a bit odd, but), ensure the following line is uncommented and configured:
conversionService.disableConversion = true
and ensure the following line is commented out:
#conversionService.conversionServicePath = /eq-install-path/conversion/conversion-service.jar
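As a rough way to double-check which of the two states a given optional-config.properties is in, here is a minimal Python sketch. It is not part of EQUELLA; `parse_properties` and `conversion_enabled` are hypothetical helper names, and the sketch assumes the file only uses plain `key = value` lines with `#` comments. Since the thread does not say what the server does when `conversionService.disableConversion` is missing, the sketch only reports "enabled" when the key is explicitly `false` and the path line is uncommented.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def conversion_enabled(text):
    """Return True only for the 'enabled' state described above.

    Assumption: a missing disableConversion key is treated as not enabled,
    because the thread does not document the server's default.
    """
    props = parse_properties(text)
    explicitly_on = props.get("conversionService.disableConversion", "").lower() == "false"
    has_path = bool(props.get("conversionService.conversionServicePath"))
    return explicitly_on and has_path
```

For example, reading the file with `open(path).read()` and passing the text to `conversion_enabled` should return `True` for the first pair of lines above and `False` for the second pair.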
To enable it in the web UI for .doc files, go to the application/msword MIME type (Settings > Mime Types), set the default to 'HTML Conversion', and ensure the Enabled checkbox is ticked.
Then upload a .doc to an item, and try to view the file.
Can you elaborate on what you meant by "offers a worse result"? Is it something we need to worry about?
Yes, the library is able to extract all the text, and it looks good for a simple document. It keeps some of the formatting. But if the document has images, complex tables, or complex formatting, the results are not very friendly. For instance, images are not resized, so if a document has a small logo in a corner but the original image is big, the logo will be big in the generated HTML. If the image was cropped in the doc, it will appear uncropped in the generated HTML. If the images are floating in the document, they appear at the end of the HTML file and not in their original position.
That is with doc files. With ppt files, no images appear in the generated HTML; it only exports the text. With xls files, it only exports the first sheet, and of course no charts.
So this solution keeps the system working and is able to generate a text preview of the files, but unless we are talking about simple documents, the quality of the preview is not good at all.
IMO, this is better than nothing, and it keeps the system working. It lets it be open source. But it doesn't offer a professional result. So it can stay there, and it is something we can try to improve in the future once the open source work is finished.
Anyway, we need to remember that this is only for "doc", "xls" and "ppt" files, the OLD Office document formats (supposedly deprecated 10 years ago...). Docx, xlsx and pptx are not handled by this converter because the original code is only for the old files. I'm sure the Tika library works better with the new formats.
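To make the scope concrete, here is a small Python sketch of the distinction described above. It is only an illustration, not the converter's actual dispatch logic; `LEGACY_OFFICE_EXTENSIONS` and `handled_by_legacy_converter` are hypothetical names, and matching on the file extension alone is an assumption.

```python
import os

# The old binary Office formats this converter is said to cover.
LEGACY_OFFICE_EXTENSIONS = {".doc", ".xls", ".ppt"}


def handled_by_legacy_converter(filename):
    """Return True if the file extension is one of the old Office formats."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in LEGACY_OFFICE_EXTENSIONS
```

Under that assumption, `report.doc` would go through this converter, while `report.docx` would not.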
Merged, closed
This is not open source