Closed by mcritchlow 5 years ago
Failing PDF objects in staging: https://librarytest.ucsd.edu/dc/object/bd3707665r https://librarytest.ucsd.edu/dc/object/bd4390265d https://librarytest.ucsd.edu/dc/object/bd8417403t (complex object that contains a PDF as well as audio)
@mcritchlow I am unable to replicate the error locally on my Mac. I'll try upgrading Tika to version 1.20 and PDFBox to the latest, 2.0.13, to see how it goes. Does that sound good? @mdpeters Are all of the failing PDFs image-only files with no full text at all? Would it work if text were added to the PDFs? Thanks.
@lsitu - That sounds good to me :+1:
@lsitu - Only one of those PDF files should be image-only (https://librarytest.ucsd.edu/dc/object/bd3707665r). I created two of those PDFs specifically for testing, one without text and one with, and the third we know has text because it's a transcript.
@mcritchlow I think PR https://github.com/ucsdlib/damsrepo/pull/83 is ready to review now. Thanks.
@gamontoya Could we create a new release for damsrepo so that @mdpeters can test it on staging? Thank you.
@lsitu What's the status of this ticket?
@gamontoya It's done and I think we can close it.
During VRR testing, @mdpeters has noticed several objects with PDF/document files attached that are failing when Tika is trying to do full text extraction.
We need to determine whether this can be solved by a newer version of FITS/Tika or whether another fix is needed.
Example: https://notch8.slack.com/files/U045L1LF7/FGSGWKKHV/-.html