gwu-libraries / gw-sufia

GWU Libraries Self-Deposit Prototype - based on Sufia 4

Add full-text indexing. #89

Closed mjgiarlo closed 10 years ago

mjgiarlo commented 10 years ago

fixes #80 refs projecthydra/sufia#550

mjgiarlo commented 10 years ago

Let's not merge this pull request until projecthydra/sufia#550 is merged and I point the Gemfile back at the main Sufia repo's master branch.

mjgiarlo commented 10 years ago

The build failed, too, because jetty did not spin up in time. ScholarSphere's jetty.yml is set to a 60s startup_wait -- perhaps we could bump up gw-sufia's value to 60s as well.
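As a rough illustration of that bump, here is a minimal sketch that raises startup_wait to at least 60 seconds. The YAML layout shown (a top-level environment key with startup_wait and jetty_port beneath it) is an assumption for the example, not a copy of gw-sufia's actual jetty.yml, and with_minimum_startup_wait is a hypothetical helper.

```ruby
require 'yaml'

# Hypothetical jetty.yml contents; the real file's keys and layout may differ.
JETTY_YML = <<~YAML
  default:
    startup_wait: 15
    jetty_port: 8983
YAML

MIN_STARTUP_WAIT = 60 # seconds, the value mentioned for ScholarSphere

# Return a copy of the config in which every environment waits at least
# `minimum` seconds for jetty to start. Larger existing values are kept.
def with_minimum_startup_wait(config, minimum = MIN_STARTUP_WAIT)
  config.each_with_object({}) do |(env, settings), result|
    current = settings.fetch('startup_wait', 0).to_i
    result[env] = settings.merge('startup_wait' => [current, minimum].max)
  end
end

config = YAML.safe_load(JETTY_YML)
updated = with_minimum_startup_wait(config)
puts updated.to_yaml
```

In practice it is probably simpler to edit the value by hand; the sketch only shows which setting is being discussed.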

mjgiarlo commented 10 years ago

OK, build is now working. Working on the last bits of the Sufia PR. Hope to have that finalized tonight, so I can finish this PR tomorrow.

mjgiarlo commented 10 years ago

OK, projecthydra/sufia#550 has been merged, and the Gemfile in this #89 PR (fixes #80) now points at master. Assuming TravisCI passes, I believe the PR is now ready for review.

mjgiarlo commented 10 years ago

Build passes. This is ready for you, @kerchner @kilahimm.

kerchner commented 10 years ago

@mjgiarlo the expect(assigns(:document_list).count).to eq(1) assertion in the updated catalog_controller_spec test is coming up false. I did a bundle install to get the new gems, and ran rake sufia:jetty:config - is there anything else I need to do in order to get full-text indexing working?

mjgiarlo commented 10 years ago

Where are you seeing the spec failure, @kerchner? It looks like the Travis build passed.

Did you pull this PR branch locally? Did you stop jetty before running sufia:jetty:config and then restart it? That task tweaks the jetty solr, so if you didn't restart it, you wouldn't pick up the new configs.

kerchner commented 10 years ago

@mjgiarlo yes, I checked out and pulled the PR branch locally (same steps as in the "command line" link below, to the left of the Merge button). Stopped jetty, rake'd sufia:jetty:config, restarted (as per the updated README in the branch).

mjgiarlo commented 10 years ago

@kerchner Are all the full-text extraction jars in jetty/solr/lib/contrib/extraction/lib? And if you run grep extraction solr_conf/conf/solrconfig.xml, do you see a couple of relevant lines?
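Those two checks could also be scripted. The following is a minimal sketch using only the Ruby standard library; extraction_setup_report is a hypothetical helper, and both paths are passed in as parameters because the layout may differ between checkouts. It only looks for Tika and Solr Cell jars by file-name prefix, which is a heuristic and not a guarantee that extraction works.

```ruby
# Report whether the pieces Solr needs for full-text extraction appear to be
# in place: jars in the extraction lib directory, and lines mentioning
# "extraction" in solrconfig.xml (the same thing the grep looks for).
def extraction_setup_report(jar_dir, solrconfig_path)
  jars = Dir.glob(File.join(jar_dir, '*.jar')).map { |path| File.basename(path) }.sort
  config_lines =
    if File.file?(solrconfig_path)
      File.readlines(solrconfig_path).grep(/extraction/).map(&:strip)
    else
      [] # a missing config file is reported as "no relevant lines"
    end

  {
    jar_count: jars.size,
    has_tika: jars.any? { |name| name.start_with?('tika-') },
    has_solr_cell: jars.any? { |name| name.start_with?('solr-cell-') },
    config_lines: config_lines
  }
end

if __FILE__ == $PROGRAM_NAME
  report = extraction_setup_report(
    'jetty/solr/lib/contrib/extraction/lib',
    'solr_conf/conf/solrconfig.xml'
  )
  report.each { |key, value| puts "#{key}: #{value.inspect}" }
end
```

A jar_count of zero or an empty config_lines list would suggest the jetty config task has not been applied to the running instance.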

kerchner commented 10 years ago

@mjgiarlo I got the test to pass. Not sure what I was doing wrong before, but double checked gems, re-stopping/configing/starting jetty... and it works. Thanks again!

kerchner commented 10 years ago

@kilahimm reassigning the pull request to m2_001 as well (was not assigned to any milestone)

mjgiarlo commented 10 years ago

Huzzah!

kerchner commented 10 years ago

However... I'm not sure that I'm seeing it work in the app. Uploaded a pdf, and tried searching for some terms in the document... not getting any results. @mjgiarlo any ideas?

mjgiarlo commented 10 years ago

The best way to test it would be to drop into a Rails console, load the instance of GenericFile with the PDF attached, and call:

gf = GenericFile.find('ID_OF_FILE_WITH_PDF_ATTACHED')
gf.full_text.content
gf.to_solr['all_text_timv']

Line 2 of that snippet will show you the full-text content extracted from the PDF, and line 3 will show how it's been solrized.

Silly question, but assuming both of the above lines come up blank... it bears asking: does the PDF you're working with contain text content? Try uploading a Word document or a plain-text document, perhaps.
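To make the two console checks easier to interpret, here is a small sketch that classifies the outcome. fulltext_status is a hypothetical helper, not part of Sufia; it only assumes the object responds to full_text.content and to_solr as in the snippet above. The stand-in structs exist solely so the sketch runs outside a Rails console.

```ruby
# Summarize the two values checked in the console for any object that
# responds to `full_text.content` and `to_solr`.
def fulltext_status(file, solr_field = 'all_text_timv')
  extracted = file.full_text.content.to_s
  indexed = Array(file.to_solr[solr_field]).join(' ')

  if extracted.strip.empty?
    :not_extracted # extraction produced nothing (or the file has no text)
  elsif indexed.strip.empty?
    :not_solrized  # text was extracted but is missing from the Solr document
  else
    :ok
  end
end

# Stand-in objects (assumptions for illustration only).
FakeDatastream = Struct.new(:content)
FakeFile = Struct.new(:full_text, :solr_doc) do
  def to_solr
    solr_doc
  end
end

blank = FakeFile.new(FakeDatastream.new(nil), {})
puts fulltext_status(blank)
```

In a Rails console the call would presumably be fulltext_status(GenericFile.find('ID_OF_FILE_WITH_PDF_ATTACHED')). A result of :not_extracted points at the extraction step (jars, Solr config, or a PDF without text), while :not_solrized points at the indexing side.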

mjgiarlo commented 10 years ago

Do you need to restart your Rails server to pick up, e.g., the change to the catalog controller that adds "all_text_timv" to the blacklight config qf?

kerchner commented 10 years ago

I've restarted my rails server. But where is that config?

I'm attempting to get a handle on the file via the console, e.g.:

gf = GenericFile.find('g158c2374')

but I'm getting errors back from Fedora, I think?

kerchner commented 10 years ago

Oh I see it now, in solrconfig.xml

mjgiarlo commented 10 years ago

Ah, OK. Try: gf = GenericFile.find('sufia:g158c2374')
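The difference is the namespace prefix on the identifier. A minimal sketch of adding it when it is missing, where with_namespace is a hypothetical helper and the default 'sufia' namespace is taken from the identifier above:

```ruby
# Prefix a bare identifier with a namespace unless it already has one.
# An identifier containing ':' is treated as already namespaced.
def with_namespace(id, namespace = 'sufia')
  id = id.to_s.strip
  id.include?(':') ? id : "#{namespace}:#{id}"
end

puts with_namespace('g158c2374')
puts with_namespace('sufia:g158c2374')
```

Both calls should yield the same namespaced identifier, which is the form GenericFile.find accepted here.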

kerchner commented 10 years ago

Much better! However, it doesn't look like full text indexing is taking place:

2.1.1 :005 > gf.full_text.content
=> nil
2.1.1 :006 > gf.to_solr['all_text_timv']
Loaded datastream content sufia:g158c2374/rightsMetadata (23.2ms)
=> nil
2.1.1 :007 > gf.full_text.content
=> nil

mjgiarlo commented 10 years ago

Can you try uploading a plain-text document? I just tested that locally and it worked OK. You can also use this document, which the specs are using: https://github.com/gwu-libraries/gw-sufia/blob/master/spec/fixtures/document4.pdf

kerchner commented 10 years ago

Plain-text and document4.pdf - same results in both the console and through the UI (searching on: cutePDF )

mjgiarlo commented 10 years ago

OK, thanks for checking. Can you paste in the results of these two commands?

ls -l jetty/solr/lib/contrib/extraction/lib/
grep extraction jetty/solr/development-core/conf/solrconfig.xml

mjgiarlo commented 10 years ago

And, just to make doubly sure: your gw-sufia specs are passing?

kerchner commented 10 years ago

There are currently only 2 gw-sufia specs, but they are passing - including controllers/catalog_controller_spec.rb

mjgiarlo commented 10 years ago

I believe there should be 3 specs in gw-sufia: 2 in the controller spec and 1 in the view spec. Are you only seeing 2 tests? If so, something is definitely not up to date with your instance!

Travis sees three as well: https://travis-ci.org/gwu-libraries/gw-sufia/builds/29422712

kerchner commented 10 years ago

Sorry, there are 2 specs in the one catalog_controller_spec.rb file. As per this version --> https://github.com/gwu-libraries/gw-sufia/blob/b4de162a5f95907b04913e0dfad1c36d65c61ca5/spec/controllers/catalog_controller_spec.rb

mjgiarlo commented 10 years ago

OK. Can you paste the results per this comment? https://github.com/gwu-libraries/gw-sufia/pull/89#issuecomment-48372878

kerchner commented 10 years ago

kerchner@gwdev-kerchner:~/projects/gw-sufia (master)$ ls -l jetty/solr/lib/contrib/extraction/lib/
total 33056
-rw-rw-r-- 1 kerchner kerchner   95536 Jun 30 15:58 apache-mime4j-core-0.7.2.jar
-rw-rw-r-- 1 kerchner kerchner  304810 Jun 30 15:58 apache-mime4j-dom-0.7.2.jar
-rw-rw-r-- 1 kerchner kerchner  229116 Jun 30 15:58 bcmail-jdk15-1.45.jar
-rw-rw-r-- 1 kerchner kerchner 1663318 Jun 30 15:58 bcprov-jdk15-1.45.jar
-rw-rw-r-- 1 kerchner kerchner   92027 Jun 30 15:58 boilerpipe-1.1.0.jar
-rw-rw-r-- 1 kerchner kerchner  241367 Jun 30 15:58 commons-compress-1.4.1.jar
-rw-rw-r-- 1 kerchner kerchner  313898 Jun 30 15:58 dom4j-1.6.1.jar
-rw-rw-r-- 1 kerchner kerchner  185566 Jun 30 15:58 fontbox-1.7.0.jar
-rw-rw-r-- 1 kerchner kerchner 7407144 Jun 30 15:58 icu4j-49.1.jar
-rw-rw-r-- 1 kerchner kerchner  521237 Jun 30 15:58 isoparser-1.0-RC-1.jar
-rw-rw-r-- 1 kerchner kerchner  153253 Jun 30 15:58 jdom-1.0.jar
-rw-rw-r-- 1 kerchner kerchner   51088 Jun 30 15:58 jempbox-1.7.0.jar
-rw-rw-r-- 1 kerchner kerchner  220813 Jun 30 15:58 juniversalchardet-1.0.3.jar
-rw-rw-r-- 1 kerchner kerchner   90929 Jun 30 15:58 metadata-extractor-2.4.0-beta-1.jar
-rw-rw-r-- 1 kerchner kerchner 4326608 Jun 30 15:58 netcdf-4.2-min.jar
-rw-rw-r-- 1 kerchner kerchner 3908404 Jun 30 15:58 pdfbox-1.7.0.jar
-rw-rw-r-- 1 kerchner kerchner 1820323 Jun 30 15:58 poi-3.8.jar
-rw-rw-r-- 1 kerchner kerchner  933010 Jun 30 15:58 poi-ooxml-3.8.jar
-rw-rw-r-- 1 kerchner kerchner 4706775 Jun 30 15:58 poi-ooxml-schemas-3.8.jar
-rw-rw-r-- 1 kerchner kerchner 1186887 Jun 30 15:58 poi-scratchpad-3.8.jar
-rw-rw-r-- 1 kerchner kerchner  208025 Jun 30 15:58 rome-0.9.jar
-rw-rw-r-- 1 kerchner kerchner   29813 Jun 30 15:58 solr-cell-4.0.0.jar
-rw-rw-r-- 1 kerchner kerchner   90722 Jun 30 15:58 tagsoup-1.2.1.jar
-rw-rw-r-- 1 kerchner kerchner  463945 Jun 30 15:58 tika-core-1.2.jar
-rw-rw-r-- 1 kerchner kerchner  482074 Jun 30 15:58 tika-parsers-1.2.jar
-rw-rw-r-- 1 kerchner kerchner   47478 Jun 30 15:58 vorbis-java-core-0.1.jar
-rw-rw-r-- 1 kerchner kerchner   14752 Jun 30 15:58 vorbis-java-tika-0.1.jar
-rw-rw-r-- 1 kerchner kerchner 1229125 Jun 30 15:58 xercesImpl-2.9.1.jar
-rw-rw-r-- 1 kerchner kerchner 2666695 Jun 30 15:58 xmlbeans-2.3.0.jar
-rw-rw-r-- 1 kerchner kerchner   94672 Jun 30 15:58 xz-1.0.jar

kerchner@gwdev-kerchner:~/projects/gw-sufia (master)$ grep extraction jetty/solr/development-core/conf/solrconfig.xml
  <lib dir="../lib/contrib/extraction/lib" regex=".*\.jar" />
<requestHandler name="/update/extract" startup="lazy" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" >

mjgiarlo commented 10 years ago

Is this happening on an instance I can ssh into and poke at?

kerchner commented 10 years ago

Unfortunately not - it's on my personal dev server, but I'll work tonight on setting up one of the shared instances to use the full text indexing.