gwu-libraries / gw-sufia

GWU Libraries Self-Deposit Prototype - based on Sufia 4

Add full text index from ScholarSphere to Sufia #80

Closed: kilahimm closed this issue 10 years ago

kilahimm commented 10 years ago

Port or enable ScholarSphere's full-text indexing for uploads in Sufia.

kilahimm commented 10 years ago

Mike, is there anything in ScholarSphere that we can look at for this?

mjgiarlo commented 10 years ago

Yep. Here are the changes:

First, we tweaked the Solr schema to include the all_text_timv field. It looks like that's already in gw-sufia.

Then add all_text_timv to the qf (query fields) parameter in solrconfig.xml.

Declare a full_text datastream in your GenericFile model:

https://github.com/psu-stewardship/scholarsphere/blob/develop/app/models/generic_file.rb#L20

Exclude the full_text datastream from calls to #per_version:

https://github.com/psu-stewardship/scholarsphere/blob/develop/app/models/generic_file.rb#L47

Add the #extract_content method to the GenericFile model:

https://github.com/psu-stewardship/scholarsphere/blob/develop/app/models/generic_file.rb#L95

Add full_text content to the GenericFile's solr document:

https://github.com/psu-stewardship/scholarsphere/blob/develop/app/models/generic_file.rb#L119

Override the #characterize method to make sure #extract_content is called:

https://github.com/psu-stewardship/scholarsphere/blob/develop/app/models/generic_file.rb#L41

That's how we went about doing this in ScholarSphere anyway. I'd be in support of this moving into Sufia proper.
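For readers who can't follow the links, the model-side steps above can be sketched roughly as follows. This is a plain-Ruby stand-in under stated assumptions, not the ScholarSphere or Sufia code: `SketchGenericFile`, `Datastream`, `versioned_datastream_names`, and the injected `extractor` lambda are hypothetical names invented for illustration. A real GenericFile is an ActiveFedora model, and text extraction would be done by an external service rather than a lambda.

```ruby
# Illustrative sketch only: plain Ruby objects standing in for an
# ActiveFedora model. All class and method names are hypothetical,
# except full_text, extract_content, characterize, to_solr and
# all_text_timv, which are the names mentioned in this thread.

# Hypothetical stand-in for a datastream that simply holds content.
Datastream = Struct.new(:content)

class SketchGenericFile
  attr_reader :datastreams

  # extractor: any callable that turns raw file content into plain text
  # (assumption: stands in for whatever extraction service is used).
  def initialize(raw_content, extractor)
    @extractor = extractor
    @datastreams = {
      'content'   => Datastream.new(raw_content),
      'full_text' => Datastream.new(nil) # the declared full_text datastream
    }
  end

  def full_text
    datastreams['full_text']
  end

  # Derived text is excluded from per-version processing.
  def versioned_datastream_names
    datastreams.keys - ['full_text']
  end

  # Store the extracted plain text in the full_text datastream.
  def extract_content
    full_text.content = @extractor.call(datastreams['content'].content)
  end

  # Overridden so that extraction runs whenever the file is characterized.
  # (Characterization proper is omitted from this sketch.)
  def characterize
    extract_content
  end

  # Add the extracted text to the Solr document under all_text_timv.
  def to_solr(solr_doc = {})
    solr_doc['all_text_timv'] = full_text.content
    solr_doc
  end
end

# Usage: a toy extractor that strips markup from the uploaded content.
strip_tags = ->(raw) { raw.gsub(/<[^>]+>/, '').strip }
file = SketchGenericFile.new('<p>Hello full text</p>', strip_tags)
file.characterize
puts file.to_solr['all_text_timv']
```

The point of the shape is that the text only reaches the index if extraction has run before to_solr is called, which is why characterize is overridden to call extract_content.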

kilahimm commented 10 years ago

Mike, could you take this on as a ticket? I'm going to add it as the first issue in m3_001.

mjgiarlo commented 10 years ago

Sure!

mjgiarlo commented 10 years ago

Submitted projecthydra/sufia#550.

I'll point gw-sufia at that branch and test integration this weekend.

kerchner commented 10 years ago

@kilahimm - reassigning to m2_001 (was m3_001)

kerchner commented 10 years ago

Reopening - Spec test works, but full-text indexing is not returning any results in the app.

mjgiarlo commented 10 years ago

I'll be taking a look at this before long. Apologies for the delay!

kerchner commented 10 years ago

Closing - this works fine; it just required a restart of the resque workers.

mjgiarlo commented 10 years ago

@kerchner does this imply we may need to improve the documentation a bit so that downstream users know they need to restart workers to pick up this functionality?

kerchner commented 10 years ago

@mjgiarlo I considered that, but I don't feel like I have a solid enough grasp on why resque needs to be restarted, and whether that is or is not always the case. I might just add this to our local documentation, since it might be unnecessary and/or only needed when upgrading from earlier versions of the code. Based on your knowledge of how you implemented it here, does it make sense that it would need to be restarted, or could it just have been due to other variables (changing code base, etc.)?