Solr error during file extraction. (NUFIAWEB-S) (STAGING)

kdid commented 7 years ago

We are having trouble with Solr indexing rich-content files (PDFs and docs).

The errors we are seeing in the logs:

Solr Extract service was unsuccessful. 
'http://nufiarepo-s.library.northwestern.edu:8983/solr/nufia/update/extract?extractOnly=true&wt=json&extractFormat=text' returned code 500 for /var/www/nufia/releases/20170124163752/tmp/uploads/sufia/uploaded_file/file/30/test_red.docx 
{"error":{"msg":"lazy loading error","trace":"org.apache.solr.common.SolrException: lazy loading error\n\tat 
...
more\nCaused by: java.lang.ClassNotFoundException: 
org.apache.solr.handler.extraction.ExtractingRequestHandler\n\tat

Not sure if this config in GitHub is current, but it seems the ExtractingRequestHandler is not getting loaded....

https://github.com/nulib/nufia-vagrant/blob/master/modules/nul_solr/files/nufia/conf/solrconfig.xml

 <!-- solr lib dirs -->
  <lib dir="../lib/contrib/analysis-extras/lib" />
  <lib dir="../lib/contrib/analysis-extras/lucene-libs" />
  <!-- for full-text indexing -->
  <lib dir="../lib/contrib/extraction/lib" regex=".*\.jar" />

<requestHandler name="/update/extract" startup="lazy" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
 </requestHandler>

Please make changes on STAGING only at this point, Thanks.

davidschober commented 7 years ago

Is this one we can just swap with a default? @d-venckus and @kdid

kdid commented 7 years ago

I would guess that probably is pretty close to the default. Are those jars available on those paths?

Scroll down to "Configuration". Do we have that set up correctly? https://wiki.apache.org/solr/ExtractingRequestHandler
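One way to answer the "are those jars on those paths" question is to resolve every `<lib dir="...">` entry in solrconfig.xml and count the files that match. This is only a minimal sketch: `check_lib_dirs` is a hypothetical helper, and it assumes that relative `dir` values resolve against the core's instance directory and that `regex` has to match the whole file name.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def check_lib_dirs(solrconfig_path, instance_dir):
    """Return (dir attribute, resolved path, matching file count) per <lib dir=...>."""
    root = ET.parse(solrconfig_path).getroot()
    results = []
    for lib in root.iter("lib"):
        dir_attr = lib.get("dir")
        if dir_attr is None:
            continue  # <lib path="..."/> entries are not handled in this sketch
        # Assumption: no regex attribute means every file in the directory counts.
        pattern = re.compile(lib.get("regex", r".*"))
        resolved = (Path(instance_dir) / dir_attr).resolve()
        if resolved.is_dir():
            count = sum(
                1
                for f in resolved.iterdir()
                if f.is_file() and pattern.fullmatch(f.name)
            )
        else:
            count = 0  # directory is missing, so nothing can be loaded from it
        results.append((dir_attr, str(resolved), count))
    return results
```

Run against the staging core, a count of 0 for `../lib/contrib/extraction/lib` would be consistent with the ClassNotFoundException in the log above.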

kdid commented 7 years ago

If it's helpful here is the difference in a solr query local vs staging:

LOCAL:

curl "http://localhost:8983/solr/hydra-development/update/extract?extractOnly=true&wt=json&extractFormat=text" -F "myfile=@test.pdf"

{"responseHeader":{"status":0,"QTime":24},"test.pdf":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTest\n\n\n\n","test.pdf_metadata":["pdf:PDFVersion",["1.4"],"xmp:CreatorTool",["Writer"],"stream_content_type",["application/octet-stream"],"access_permission:modify_annotations",["true"],"access_permission:can_print_degraded",["true"],"dc:creator",["Karen Didrickson"],"dcterms:created",["2017-01-25T20:50:41Z"],"dc:format",["application/pdf; version=1.4"],"access_permission:fill_in_form",["true"],"stream_name",["test.pdf"],"pdf:encrypted",["false"],"Content-Type",["application/pdf"],"stream_size",["10961"],"X-Parsed-By",["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"],"creator",["Karen Didrickson"],"meta:author",["Karen Didrickson"],"meta:creation-date",["2017-01-25T20:50:41Z"],"stream_source_info",["myfile"],"created",["Wed Jan 25 20:50:41 UTC 2017"],"access_permission:extract_for_accessibility",["true"],"access_permission:assemble_document",["true"],"xmpTPg:NPages",["1"],"Creation-Date",["2017-01-25T20:50:41Z"],"access_permission:extract_content",["true"],"access_permission:can_print",["true"],"Author",["Karen Didrickson"],"producer",["LibreOffice 5.2"],"access_permission:can_modify",["true"]]}

STAGING:

[vagrant@nufiaweb-s testing] curl "http://nufiarepo-s.library.northwestern.edu:8983/solr/nufia/update/extract?extractOnly=true&wt=json&extractFormat=text" -F "myfile=@test.pdf"

{"error":{"msg":"lazy loading error","trace":"org.apache.solr.common.SolrException: lazy loading error\n\tat org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:262)\n\tat org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:244)\n\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1976)\n\tat org.apache.solr.servlet.SolrDispatchFilter.execute(SolrD

Also looking at query handler in solr interface local vs staging:

LOCAL: [screenshot, 2017-01-30 1:26:18 pm]

STAGING: [screenshot, 2017-01-30 1:17:55 pm]

davidschober commented 7 years ago

The local solr configs seem to work right. Possibly grab that.

d-venckus commented 7 years ago

The solution to this involves adding the additional Extraction libraries to the custom SOLR Puppet module I wrote. Not too much work now, from what I can see. Implementing this now and we can test today.

d-venckus commented 7 years ago

OK, this turned out to be a bit trickier than just adding lib paths for the additional Extraction libraries. There were a number of jar files missing that also needed to be copied over to /usr/local/solr/tomcat7/webapps/solr/WEB-INF/lib.

Extraction seems to be working now.

kdid commented 7 years ago

@d-venckus - Thank you. We will test it and let you know.

But the devs haven't been able to log in as vagrant on nufiaweb-s.library.northwestern.edu since Friday. Also Fedora/staging is down. I'll open separate issues.

kdid commented 7 years ago

@d-venckus, @davidschober -

We are still not clear of solr errors. We are still experiencing a problem with office docs. Are Tika's Apache POI dependencies in that lib/extraction folder? Can you list all the jars you're using in that directory? (listing all would be helpful so we can see versions).

You could test the error this way - from nufiaweb-s (replacing 'test.docx' with a test office document in your current directory.)

curl "http://nufiarepo-s.library.northwestern.edu:8983/solr/nufia/update/extract?extractOnly=true&wt=json&extractFormat=text" -F "myfile=@test.docx"
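If it's easier to script this across several file types, the same request can be sent from Python with only the standard library. This is a rough sketch rather than anything we run today: `build_multipart` and `extract_only` are hypothetical names, and the form field name `myfile` simply mirrors the curl command.

```python
import urllib.error
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, data):
    """Build a single-file multipart/form-data body; returns (body, content type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_only(core_url, path):
    """POST a file to <core_url>/update/extract and return (HTTP status, raw body text)."""
    path = Path(path)
    body, ctype = build_multipart("myfile", path.name, path.read_bytes())
    url = core_url + "/update/extract?extractOnly=true&wt=json&extractFormat=text"
    req = urllib.request.Request(url, data=body, headers={"Content-Type": ctype})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Solr answers with a 500 and an error body when extraction fails.
        return err.code, err.read().decode("utf-8", errors="replace")
```

For example, `extract_only("http://nufiarepo-s.library.northwestern.edu:8983/solr/nufia", "test.docx")` should send the same request as the curl line above.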

This is the error we're seeing:

"java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException","trace":"java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException\n\tat org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:793)\n\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java

(Again, issue #63 would be helpful for all of us devs so we can look at fedora/solr logs! Thanks!)
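Since these traces are long, it can help to pull out just the class names the JVM could not find. A small sketch, assuming the trace text looks like the snippets in this thread; `missing_classes` is a hypothetical helper name.

```python
import re

# Matches the two exception types seen in this thread, followed by a class name
# written with either dots or slashes.
_MISSING = re.compile(
    r"java\.lang\.(ClassNotFoundException|NoClassDefFoundError):\s*([\w./$]+)"
)


def missing_classes(trace):
    """Return unique (exception type, dotted class name) pairs, in order of appearance."""
    seen = []
    for kind, name in _MISSING.findall(trace):
        dotted = name.replace("/", ".").rstrip(".")
        if (kind, dotted) not in seen:
            seen.append((kind, dotted))
    return seen
```

The class names it returns are the ones to look for inside the jars on the Solr classpath.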

d-venckus commented 7 years ago

This might be where a solr version difference rears its head: There is no openxml4j-xxx.jar in the solr 4.10 distribution, and the poi-related jars that are there are named differently (ooxml-xxx.jar). It appears that I didn't see this error when I tested previously because I tested with a .doc file, not a .docx.

Here are the jars I included from the extraction/lib folder:

-rw-r--r-- 1 root root 95536 Jan 7 2012 apache-mime4j-core-0.7.2.jar
-rw-r--r-- 1 root root 304810 Jan 7 2012 apache-mime4j-dom-0.7.2.jar
-rw-r--r-- 1 root root 116219 May 16 2011 aspectjrt-1.6.11.jar
-rw-r--r-- 1 root root 229116 Jan 10 2010 bcmail-jdk15-1.45.jar
-rw-r--r-- 1 root root 1663318 Jan 10 2010 bcprov-jdk15-1.45.jar
-rw-r--r-- 1 root root 92027 Nov 3 2010 boilerpipe-1.1.0.jar
-rw-r--r-- 1 root root 355794 Jan 16 2014 commons-compress-1.7.jar
-rw-r--r-- 1 root root 313898 Aug 1 2005 dom4j-1.6.1.jar
-rw-r--r-- 1 root root 205430 Jan 27 2014 fontbox-1.8.4.jar
-rw-r--r-- 1 root root 9709288 Apr 1 2014 icu4j-53.1.jar
-rw-r--r-- 1 root root 521237 Mar 24 2012 isoparser-1.0-RC-1.jar
-rw-r--r-- 1 root root 153253 Aug 1 2005 jdom-1.0.jar
-rw-r--r-- 1 root root 50980 Jan 27 2014 jempbox-1.8.4.jar
-rw-r--r-- 1 root root 93310 Aug 2 2006 jhighlight-1.0.jar
-rw-r--r-- 1 root root 220813 Sep 19 2011 juniversalchardet-1.0.3.jar
-rw-r--r-- 1 root root 211185 Jul 19 2012 metadata-extractor-2.6.2.jar
-rw-r--r-- 1 root root 3997494 Jan 27 2014 pdfbox-1.8.4.jar
-rw-r--r-- 1 root root 1949542 Aug 18 2014 poi-3.10.1.jar
-rw-r--r-- 1 root root 1035419 Aug 18 2014 poi-ooxml-3.10.1.jar
-rw-r--r-- 1 root root 4946391 Aug 18 2014 poi-ooxml-schemas-3.10.1.jar
-rw-r--r-- 1 root root 1239802 Aug 18 2014 poi-scratchpad-3.10.1.jar
-rw-r--r-- 1 root root 208025 May 11 2007 rome-0.9.jar
-rw-r--r-- 1 root root 90722 Aug 22 2011 tagsoup-1.2.1.jar
-rw-r--r-- 1 root root 493374 Feb 9 2014 tika-core-1.5.jar
-rw-r--r-- 1 root root 523677 Feb 9 2014 tika-parsers-1.5.jar
-rw-r--r-- 1 root root 31659 Feb 9 2014 tika-xmp-1.5.jar
-rw-r--r-- 1 root root 47478 Feb 3 2012 vorbis-java-core-0.1.jar
-rw-r--r-- 1 root root 14752 Feb 3 2012 vorbis-java-tika-0.1.jar
-rw-r--r-- 1 root root 1229125 Oct 1 2008 xercesImpl-2.9.1.jar
-rw-r--r-- 1 root root 2730866 Aug 14 2012 xmlbeans-2.6.0.jar
-rw-r--r-- 1 root root 117333 Jul 3 2012 xmpcore-5.1.2.jar
-rw-r--r-- 1 root root 99234 Sep 22 2013 xz-1.4.jar
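To check whether any of the jars listed above already ships the class named in the error, each jar can be opened as a zip archive and searched for the matching `.class` entry. A minimal sketch; `jars_containing` is a hypothetical helper, and the directory passed in would be wherever these jars live on the server.

```python
import zipfile
from pathlib import Path


def jars_containing(class_name, lib_dir):
    """Return names of jars in lib_dir that contain class_name (dotted or slashed form)."""
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            continue  # skip files that are not readable zip archives
    return hits
```

Calling it with `org/apache/poi/openxml4j/exceptions/InvalidFormatException` would show whether the class is simply absent or is present in a jar that Solr is not loading.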

d-venckus commented 7 years ago

Not sure what to do here at this point. I don't know that I can insert a jar for the poi code that wasn't already distributed with SOLR, and expect to have it work. Anyone familiar with this?

davidschober commented 7 years ago

@d-venckus what are we looking at time-wise to upgrade SOLR to a supported/modern version?

d-venckus commented 7 years ago

@davidschober - Not entirely certain until I actually try doing it, and see what issues I run into. Right now, I'm going to try building an alternate and newer SOLR server in a vagrant box, and see what I run into.

d-venckus commented 7 years ago

Oh joy of joys. The SOLR developers abandoned the deployment of a solr war file as of version 5.2, and they discourage the installation of solr versions 5.x and up using any user-provided container. They seem to have a built-in jetty module. This is a different arrangement, and it may or may not work dependably under Tomcat. If we take the SOLR 6.3 release and install as recommended, then we need to rather severely rework the nufia_solr Puppet module that installs and manages SOLR on our servers.

I'm going to see if I can get SOLR 6.3 working properly under Tomcat anyway. My Puppet module mostly installs the app, I just need to ensure that I'm manually deploying all the jar files and components that were once packaged inside the war file.

d-venckus commented 7 years ago

Information about the revised SOLR architecture and deployment issues is described here:

https://wiki.apache.org/solr/WhyNoWar

Ain't gonna study war no more...

davidschober commented 7 years ago

@d-venckus let me know how this went. Is changing how we run SOLR possible in the next day or two?

d-venckus commented 7 years ago

OK, fine for now. Regarding SOLR, it's not optimal to move to a newer version of the SOLR package (we've been using 4.10), since that opens up an entire can of worms on the server config end. We can certainly do this later. I'm searching for the missing jar file that makes .docx extraction work. The missing class is under org/apache/poi/openxml4j, and the jar needs to be compatible with 4.10.

d-venckus commented 7 years ago

As discussed on Slack, I'll remove SOLR installation/configuration from Puppet for the time being, and will manually install SOLR v. 6.3, to comport with the new application requirements. I will rewrite the Puppet module later to re-assert management of the component, after Fair Use Week.

d-venckus commented 7 years ago

I've installed solr 6.3.0 onto nufiarepo-s, and configured it with the conf files from the i/r repo. I'm still working on altering the system startup scripts (solr doesn't account for systemd), but that should be quick. Can someone take a look at http://nufiarepo-s.library.northwestern.edu:8983/solr/ and tell me if I've missed anything?

Also, there's something of a bonus stemming from the change in deployment models that the solr devs are using. With a unitary deployment method, it makes managing a solr instance via Puppet or Chef a whole lot easier. Far fewer lines of custom code needed. I should have a good Puppet management class for solr working within a couple of days.

d-venckus commented 7 years ago

Just to add: using solr 6.3, I've tested the POI extraction issue that Karen noted above for .docx files under the old 4.10 solr, and it's now working as desired.

davidschober commented 7 years ago

Great news. Thanks @d-venckus .

davidschober commented 7 years ago

@kdid, @carrickr or @csyversen can you verify and close?

davidschober commented 7 years ago

Verifying and closing.

nulib / arch

Solr error during file extraction. (NUFIAWEB-S) (STAGING) #67