Closed by lsitu 5 years ago
@lsitu - were you able to confirm this fixes the processing errors for the PDFs @mdpeters referenced? If so, this is great :+1:
@mcritchlow I just deployed to staging and see different errors in QA. Please hold off on it and I'll take a look:
java.lang.NoSuchFieldError: CONTENT_TYPE_OVERRIDE
org.apache.tika.detect.OverrideDetector.detect(OverrideDetector.java:34)
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:112)
edu.ucsd.library.dams.api.DAMSAPIServlet.extractText(DAMSAPIServlet.java:2933)
edu.ucsd.library.dams.api.FedoraAPIServlet.doGet(FedoraAPIServlet.java:343)
javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
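A `NoSuchFieldError` like the one above usually means a class was compiled against one version of a library but resolved against another at runtime. One way to check is to print the jar each class is actually loaded from. The following is a minimal sketch, not code from damsrepo; the class name `ClassOrigin` and its methods are hypothetical helpers introduced for illustration.

```java
import java.security.CodeSource;

class ClassOrigin {
    /** Returns the jar or directory a class was loaded from, or a marker for JDK classes. */
    static String locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader (e.g. java.lang.String) have no code source.
        return source == null ? "bootstrap (no code source)" : String.valueOf(source.getLocation());
    }

    /** Same lookup by class name; reports classes that cannot be loaded instead of throwing. */
    static String locate(String className) {
        try {
            return locate(Class.forName(className));
        } catch (ClassNotFoundException | LinkageError e) {
            return "not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Pass fully qualified class names as arguments.
        for (String name : args) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run standalone, something like `java -cp "/usr/local/tomcat/webapps/dams/WEB-INF/lib/*:." ClassOrigin org.apache.tika.detect.OverrideDetector` only approximates what Tomcat's webapp class loader would do, so the result should be read as a hint rather than proof.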
@rstanonik I just deployed damsrepo to QA through the Jenkins plan. It looks like the deployment is not clean: I see jars from the older version still there in webapp/dams/WEB-INF/lib. Do we need to delete the existing webapp/dams folder from Tomcat before we run the Jenkins plan?
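Stale jars of this kind can be spotted by grouping the files in the lib folder by artifact name and flagging any artifact that appears in more than one version. The sketch below assumes jars follow an `<artifact>-<version>.jar` naming scheme where the version starts with a digit; that is only a heuristic, and `DuplicateJars` is a hypothetical helper, not part of damsrepo or the Jenkins plan.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateJars {
    // Assumed naming scheme: <artifact>-<version>.jar, with the version starting with a digit.
    private static final Pattern VERSIONED = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    /** Strips the version suffix, e.g. "tika-core-1.20.jar" becomes "tika-core". */
    static String artifactOf(String fileName) {
        Matcher m = VERSIONED.matcher(fileName);
        return m.matches() ? m.group(1) : fileName.replaceFirst("\\.jar$", "");
    }

    /** Returns only the artifacts that occur under more than one file name. */
    static Map<String, List<String>> findDuplicates(Collection<String> jarNames) {
        Map<String, List<String>> byArtifact = new TreeMap<>();
        for (String name : jarNames) {
            byArtifact.computeIfAbsent(artifactOf(name), k -> new ArrayList<>()).add(name);
        }
        byArtifact.values().removeIf(files -> files.size() < 2);
        return byArtifact;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the lib directory to scan, e.g. a WEB-INF/lib folder.
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(Paths.get(args[0]), "*.jar")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        }
        findDuplicates(names).forEach((artifact, files) -> System.out.println(artifact + ": " + files));
    }
}
```

An empty result does not guarantee a clean deployment, since a renamed or unversioned jar would slip past the name heuristic.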
@lsitu It looks to me as if the jars in lib in QA match the jars in the feature/tika_1.20_upgrade branch.
Any chance you need to update the jars in the feature/tika_1.20_upgrade branch?
Here is how I compared them.
On lib-hydratail-qa
cd /tmp
git clone https://github.com/ucsdlib/damsrepo.git
cd damsrepo
git checkout feature/tika_1.20_upgrade
cd src
mkdir arf
cd arf
cp -rl ../lib/ .
cp -rl ../lib1/ .
diff -r . /usr/local/tomcat/webapps/dams/WEB-INF/lib
There was no difference.
@rstanonik I think it looks good now: I no longer see the stale jars in /usr/local/tomcat/webapps/dams/WEB-INF/lib, and I don't see the PDF error in QA. So we just need to make sure the stale jars are gone when we move forward to staging and prod. As for the jars, I may need to add a new dependency jar (Apache commons-collections 4.0), since I am seeing a problem with full-text extraction from a .zip file, though we don't need to extract text from .zip files:
java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils
org.apache.poi.openxml4j.util.ZipFileZipEntrySource.getEntry(ZipFileZipEntrySource.java:79)
org.apache.poi.openxml4j.opc.ZipPackage.getPartsImpl(ZipPackage.java:251)
org.apache.poi.openxml4j.opc.OPCPackage.getParts(OPCPackage.java:721)
org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:198)
org.apache.tika.parser.pkg.ZipContainerDetector.detectOPCBased(ZipContainerDetector.java:253)
org.apache.tika.parser.pkg.ZipContainerDetector.detectZipFormat(ZipContainerDetector.java:173)
org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:110)
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
edu.ucsd.library.dams.api.DAMSAPIServlet.extractText(DAMSAPIServlet.java:2933)
edu.ucsd.library.dams.api.FedoraAPIServlet.doGet(FedoraAPIServlet.java:343)
javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
But we don't need to remove the /dams folder with new jars added. Thanks.
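For a `NoClassDefFoundError` such as the one for `org/apache/commons/collections4/IteratorUtils`, it can help to check whether any jar in the lib folder actually contains the missing class. A minimal sketch follows; `FindClassInJars` is a hypothetical helper written for illustration and is not part of damsrepo.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarFile;

class FindClassInJars {
    /** Lists the jars in libDir that contain the given fully qualified class. */
    static List<String> jarsContaining(Path libDir, String className) throws IOException {
        // Class files are stored in jars under their slash-separated path.
        String entry = className.replace('.', '/') + ".class";
        List<String> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getEntry(entry) != null) {
                        hits.add(jar.getFileName().toString());
                    }
                }
            }
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: lib directory, args[1]: fully qualified class name.
        System.out.println(jarsContaining(Paths.get(args[0]), args[1]));
    }
}
```

An empty list for `org.apache.commons.collections4.IteratorUtils` would suggest the dependency jar is missing from the folder, while more than one hit would point to overlapping jars.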
@mcritchlow I don't see any errors on QA with the PDF object bb5157099g and the zip file now. So I think we are ready to merge the PR and deploy it to staging so that @mdpeters can test it there. Thanks.
@lsitu that's great!
Fixes #82
Upgrade to Tika 1.20 to fix the PDF full text extraction errors.
@mcritchlow / @ucsdlib/developers Please review and comment.