Upgrade to Tika 1.20 to fix the PDF full text extraction errors

lsitu commented 5 years ago

Fixes #82

Upgrade to Tika 1.20 to fix the PDF full text extraction errors.

@mcritchlow / @ucsdlib/developers Please review and comment.

mcritchlow commented 5 years ago

@lsitu - were you able to confirm this fixes the processing errors for the PDFs @mdpeters referenced? If so, this is great :+1:

lsitu commented 5 years ago

@mcritchlow I just deployed to staging and see different errors in QA. Please hold off on this and I'll take a look:

java.lang.NoSuchFieldError: CONTENT_TYPE_OVERRIDE
    org.apache.tika.detect.OverrideDetector.detect(OverrideDetector.java:34)
    org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
    org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
    edu.ucsd.library.dams.api.DAMSAPIServlet.extractText(DAMSAPIServlet.java:2933)
    edu.ucsd.library.dams.api.FedoraAPIServlet.doGet(FedoraAPIServlet.java:343)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
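
A `NoSuchFieldError` at runtime usually means the class was compiled against one version of a library but loaded alongside a different version, which is consistent with mixed jars on the classpath. One way to check is to ask the JVM where a class was actually loaded from. The following is a minimal sketch, not part of this PR: `ClassOrigin` and `locate` are hypothetical names, and the two default class names are simply taken from the stack trace above. To reflect what Tomcat sees, it would have to run inside the webapp (for example from a temporary servlet or JSP), since the webapp has its own classloader.

```java
import java.security.CodeSource;

/** Hypothetical helper: reports which jar or directory a class was loaded from. */
final class ClassOrigin {

    /** Returns the code source location of the named class, or a note if it is unavailable. */
    static String locate(String className) {
        try {
            // Load without running static initializers, using this class's loader.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            // Classes from the bootstrap loader (e.g. java.lang.String) have no CodeSource.
            return src == null ? "(bootstrap/unknown)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not loadable: " + e + ")";
        }
    }

    public static void main(String[] args) {
        // Defaults are the Tika classes named in the stack trace; pass others as arguments.
        String[] names = args.length > 0 ? args : new String[] {
            "org.apache.tika.detect.OverrideDetector",
            "org.apache.tika.parser.AutoDetectParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If two classes from the same library report jars with different version numbers, an older jar is probably still being picked up.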

lsitu commented 5 years ago

@rstanonik I just deployed damsrepo to QA through the Jenkins plan. It looks like the deployment is not clean: I still see jars from the older version in webapp/dams/WEB-INF/lib. Do we need to delete the existing webapp/dams folder from Tomcat before we run the Jenkins plan?

rstanonik commented 5 years ago

@lsitu It looks to me as if the jars in lib in QA match the jars in the feature/tika_1.20_upgrade branch.

Any chance you need to update the jars in the feature/tika_1.20_upgrade branch?

Here is how I compared them.

On lib-hydratail-qa

    cd /tmp
    git clone https://github.com/ucsdlib/damsrepo.git
    cd damsrepo
    git checkout feature/tika_1.20_upgrade
    cd src
    mkdir arf
    cd arf
    cp -rl ../lib/ .
    cp -rl ../lib1/ .
    diff -r . /usr/local/tomcat/webapps/dams/WEB-INF/lib

There was no difference.
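
A directory diff shows whether the deployed lib matches the branch, but it does not flag the case where the same library is present in two versions within one directory. A complementary check could look like the sketch below. It is an illustration only: `DuplicateJars` and `findDuplicates` are hypothetical names, and it assumes jars follow the common `<artifact>-<version>.jar` naming with the version starting with a digit, which may not hold for every jar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: finds artifacts that appear with more than one version. */
final class DuplicateJars {

    // Assumed convention: <artifact>-<version>.jar, where the version starts with a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    /** Maps each artifact name to its versions, keeping only artifacts seen two or more times. */
    static Map<String, List<String>> findDuplicates(Collection<String> fileNames) {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = JAR.matcher(name);
            if (m.matches()) {
                byArtifact.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
            }
        }
        // Drop artifacts that occur only once.
        byArtifact.values().removeIf(versions -> versions.size() < 2);
        return byArtifact;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path lib = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> names;
        try (Stream<Path> entries = Files.list(lib)) {
            names = entries.map(p -> p.getFileName().toString())
                           .sorted()
                           .collect(Collectors.toList());
        }
        findDuplicates(names).forEach((artifact, versions) ->
            System.out.println(artifact + ": " + versions));
    }
}
```

Run against a `WEB-INF/lib` directory, it prints nothing when every artifact appears once, and one line per artifact otherwise.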

lsitu commented 5 years ago

@rstanonik I think it looks good now: I don't see the stale jars in /usr/local/tomcat/webapps/dams/WEB-INF/lib, and I don't see the PDF error in QA anymore. So we just need to make sure the stale jars are gone when we move forward to staging and prod. As for the jars, I may need to add a new dependency jar (Apache commons-collections 4.0), since I am seeing a problem with full text extraction from a .zip file, though we don't need to extract text from .zip files:

java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils
    org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getEntry(ZipFileZipEntrySource.java:79)
    org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:251)
    org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:721)
    org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:198)
    org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:253)
    org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:173)
    org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:110)
    org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
    org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
    edu.ucsd.library.dams.api.DAMSAPIServlet.extractText(DAMSAPIServlet.java:2933)
    edu.ucsd.library.dams.api.FedoraAPIServlet.doGet(FedoraAPIServlet.java:343)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:728)

But we don't need to remove the /dams folder when only new jars are added. Thanks.

lsitu commented 5 years ago

@mcritchlow I don't see any errors on QA with the PDF object bb5157099g and the zip file now. So I think we are ready to merge the PR and deploy it to staging so that @mdpeters can test it there. Thanks.

mcritchlow commented 5 years ago

@lsitu that's great!

ucsdlib / damsrepo

Upgrade to Tika 1.20 to fix the PDF full text extraction errors #83