ucsdlib / damsrepo

DAMS Repository

PDF/Tika full text extraction errors #82

Closed mcritchlow closed 5 years ago

mcritchlow commented 5 years ago

During VRR testing, @mdpeters has noticed several objects with PDF/document files attached that are failing when Tika is trying to do full text extraction.

We need to determine whether this can be solved by a newer version of FITS/Tika, or whether a different fix is needed.

Example: https://notch8.slack.com/files/U045L1LF7/FGSGWKKHV/-.html

App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650]   SQL (0.7ms)  UPDATE "users" SET "work_authorizations_count" = COALESCE("work_authorizations_count", 0) + 1 WHERE "users"."id" = $1  [["id", 184]]
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650]    (2.4ms)  COMMIT
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650]    (0.5ms)  BEGIN
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650]    (0.5ms)  COMMIT
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650]    (0.5ms)  BEGIN
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650]   SQL (0.5ms)  UPDATE "work_authorizations" SET "updated_at" = '2019-02-25 19:13:32.655751' WHERE "work_authorizations"."id" = $1  [["id", 19]]
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650]    (1.6ms)  COMMIT
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/rightsMetadata (27.0ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/rightsMetadata (51.8ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/damsMetadata (25.4ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/DC (25.4ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/DC (23.9ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_rdf.xml (24.8ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_rdf.xml (51.6ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_4.jpg (23.3ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_4.jpg (24.8ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_3.jpg (25.2ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_3.jpg (25.9ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_2.zip (27.4ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_2.zip (3096.9ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_7.jpg (29.5ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_7.jpg (111.6ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_5.jpg (29.6ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_5.jpg (43.5ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_6.jpg (26.1ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_6.jpg (76.3ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/_1.pdf (26.4ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream content bb5157099g/_1.pdf (103.5ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] Loaded datastream profile bb5157099g/fulltext_1.pdf (24.8ms)
App 9538 output: [librarytest] [397cd053-8c86-48c7-8c19-7d0f947c9650] <html><head><title>Apache Tomcat/7.0.40 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 500 - Servlet execution threw an exception</h1><HR size="1" noshade="noshade"><p><b>type</b> Exception report</p><p><b>message</b> <u>Servlet execution threw an exception</u></p><p><b>description</b> <u>The server encountered an internal error that prevented it from fulfilling this request.</u></p><p><b>exception</b> <pre>javax.servlet.ServletException: Servlet execution threw an exception
App 9538 output: </pre></p><p><b>root cause</b> <pre>java.lang.NoSuchMethodError: org.apache.pdfbox.pdmodel.PDPage.clear()V
App 9538 output:    org.apache.tika.parser.pdf.PDF2XHTML.endPage(PDF2XHTML.java:309)
App 9538 output:    org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:457)
App 9538 output:    org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
App 9538 output:    org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
App 9538 output:    org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
App 9538 output:    org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
App 9538 output:    org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
App 9538 output:    org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
App 9538 output:    org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
App 9538 output:    edu.ucsd.library.dams.api.DAMSAPIServlet.extractText(DAMSAPIServlet.java:2933)
App 9538 output:    edu.ucsd.library.dams.api.FedoraAPIServlet.doGet(FedoraAPIServlet.java:343)
App 9538 output:    javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
App 9538 output:    javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
App 9538 output: </pre></p><p><b>note</b> <u>The full stack trace of the root cause is available in the Apache Tomcat/7.0.40 logs.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/7.0.40</h3></body></html>
App 9538 output:    (0.8ms)  BEGIN
App 9538 output:   SQL (0.7ms)  UPDATE "work_authorizations" SET "error" = $1, "updated_at" = $2 WHERE "work_authorizations"."id" = $3  [["error", "See logger for details"], ["updated_at", "2019-02-25 19:13:37.400853"], ["id", 19]]
App 9538 output:    (2.0ms)  COMMIT
App 9538 output: {"method":"GET","path":"/dc/aeon/requests/32455/set_to_active","format":"html","controller":"Aeon::RequestsController","action":"set_to_active","status":302,"duration":7106.01,"view":0.0,"db":14.65,"location":"https://librarytest.ucsd.edu/dc/aeon/queues/70","@timestamp":"2019-02-25T19:13:37.409Z","@version":"1","message":"[302] GET /dc/aeon/requests/32455/set_to_active (Aeon::RequestsController#set_to_active)"}
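
The useful part of the log above is buried inside Tomcat's HTML error page. As a rough aid for triaging similar failures, here is a minimal Python sketch that pulls the root-cause exception and the first stack frame out of such output. The function names `find_root_cause` and `first_frame` are hypothetical, and the patterns assume the "root cause" layout seen in this particular Tomcat 7 error report; other servers or versions may format it differently.

```python
import re
from typing import Optional, Tuple

# Assumed layout: Tomcat prints "<b>root cause</b> <pre>ExceptionClass: detail".
ROOT_CAUSE = re.compile(
    r"root cause</b>\s*<pre>\s*([\w.$]+(?:Error|Exception)):\s*(\S+)"
)

# A stack frame such as "org.example.Foo.bar(Foo.java:42)".
FRAME = re.compile(r"\s([\w.$]+)\(([\w$]+\.java):(\d+)\)")


def find_root_cause(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (exception class, detail) for the first root cause, if any."""
    match = ROOT_CAUSE.search(log_text)
    return (match.group(1), match.group(2)) if match else None


def first_frame(log_text: str) -> Optional[Tuple[str, str, int]]:
    """Return (method, source file, line number) of the first stack frame."""
    match = FRAME.search(log_text)
    if match is None:
        return None
    return (match.group(1), match.group(2), int(match.group(3)))
```

Run against the log in this issue, the root cause is `java.lang.NoSuchMethodError` on `org.apache.pdfbox.pdmodel.PDPage.clear()V`, raised from `PDF2XHTML.endPage`. A `NoSuchMethodError` means the class loaded at runtime lacks a method the caller was compiled against, so here the PDFBox `PDPage` class that was loaded has no `clear()` method even though Tika's `PDF2XHTML` calls it.
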
mdpeters commented 5 years ago

Failing PDF objects in staging: https://librarytest.ucsd.edu/dc/object/bd3707665r https://librarytest.ucsd.edu/dc/object/bd4390265d https://librarytest.ucsd.edu/dc/object/bd8417403t (complex object that contains a PDF as well as audio)

lsitu commented 5 years ago

@mcritchlow I am unable to replicate the error locally on my Mac. I'll try upgrading Tika to version 1.20 and PDFBox to the latest version, 2.0.13, to see how it goes. Does that sound good? @mdpeters Do all of the failing PDFs contain only images, with no full text at all? Would it work if text were added to the PDFs? Thanks.
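
Since the error does not reproduce locally, one thing worth ruling out before or after the upgrade is whether the deployed webapp carries more than one version of the same library. The following is a minimal Python sketch under that assumption; the thread does not confirm that duplicate jars are the cause. The function names and the jar file names in the usage note are hypothetical, and the pattern assumes the common `artifact-version.jar` naming.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumes jars are named "<artifact>-<version>.jar", with the version
# starting at the first dash that is followed by a digit.
JAR = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def versions_by_artifact(jar_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group jar file names by artifact and collect the versions seen."""
    found = defaultdict(set)
    for name in jar_names:
        match = JAR.match(name)
        if match:
            found[match.group("artifact")].add(match.group("version"))
    return {artifact: sorted(versions) for artifact, versions in found.items()}


def conflicts(jar_names: Iterable[str]) -> Dict[str, List[str]]:
    """Return only the artifacts that appear in more than one version."""
    return {
        artifact: versions
        for artifact, versions in versions_by_artifact(jar_names).items()
        if len(versions) > 1
    }
```

In practice the input could be `os.listdir()` of the webapp's `WEB-INF/lib` directory on the staging server. A result such as `{"pdfbox": [...two versions...]}` would suggest that Tika is resolving `PDPage` from a PDFBox jar other than the one it was built against.
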

mcritchlow commented 5 years ago

@lsitu - That sounds good to me :+1:

mdpeters commented 5 years ago

@lsitu - Only one of those PDF files should be image-only (https://librarytest.ucsd.edu/dc/object/bd3707665r). I created two of those PDFs specifically for testing, one without text and one with, and the third we know has text because it's a transcript.

lsitu commented 5 years ago

@mcritchlow I think PR https://github.com/ucsdlib/damsrepo/pull/83 is ready to review now. Thanks.

lsitu commented 5 years ago

@gamontoya Could we create a new release for damsrepo so that @mdpeters can test it on staging? Thank you.

gamontoya commented 5 years ago

@lsitu What's the status of this ticket?

lsitu commented 5 years ago

@gamontoya It's done and I think we can close it.