sciencehistory / chf-sufia

sufia-based hydra app

Relevance tuning #310

Closed hackartisan closed 7 years ago

hackartisan commented 7 years ago

(Not for now: maybe boost collections and parent works over other records.)

catlu commented 7 years ago

Initial thoughts: title and alternate title ranked highest; then creator (I'd want Louis Pasteur's notes to show up before stamps about him) and genre (I'd want things that are scientific instruments to show up before things about them); then subjects; then description; then the rest of the fields (I can't currently think of medium, place, language, or department keywords that should be weighted more than description).

It would also be cool if collections and parent works came up higher than the rest of the works, but not sure if that fits in here.
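As a rough illustration of the field ordering above, here is a minimal sketch that turns a ranked list of fields into an edismax `qf` string. The field names and boost values are hypothetical placeholders, not taken from our actual Solr config; only the relative ordering follows the list above.

```python
# Hypothetical field names and boost values, for illustration only.
# The relative order follows the "initial thoughts" ranking:
# title/alternate title > creator/genre > subject > description > the rest.
FIELD_BOOSTS = [
    ("title_tesim", 10.0),
    ("alternative_tesim", 10.0),
    ("creator_tesim", 6.0),
    ("genre_tesim", 6.0),
    ("subject_tesim", 4.0),
    ("description_tesim", 2.0),
    ("medium_tesim", 1.0),
    ("language_tesim", 1.0),
]


def build_qf(boosts):
    """Join (field, weight) pairs into a Solr qf value such as 'a^2 b^1.5'."""
    return " ".join(f"{field}^{weight:g}" for field, weight in boosts)


# Query parameters that could be sent to Solr with the user's query.
solr_params = {"defType": "edismax", "qf": build_qf(FIELD_BOOSTS)}
```

The actual numbers would need tuning against real queries; the point is only that each tier gets a clearly higher weight than the tier below it.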

jrochkind commented 7 years ago

Shall I work on this following @catlu's "initial thoughts", or wait for more discussion?

hackartisan commented 7 years ago

@jrochkind go for it!

jrochkind commented 7 years ago

Re: Make sure stemming is happening in the right places

I think our search fields are all *_tesim, which is type text_en.

text_en uses the EnglishMinimalStemFilterFactory.

"EnglishMinimalStemFilterFactory is less aggressive than PorterStemFilterFactory". EnglishMinimal seems to be a newer addition to Solr than PorterStem, so perhaps it is preferred as a basic default? Something generated it into our schema.xml, so someone chose it, whether Solr or BL/Hydra. I can't find any docs on the differences in what the two actually do.

According to the config, stemming does seem to be enabled, but I can't find any searches that demonstrate it. If it really isn't happening, I'm not sure why.

Should we try the more aggressive PorterStemFilterFactory and see if we like it better? I think I've used PorterStemFilterFactory in the past; I don't think EnglishMinimal existed when I worked on this previously. The change would require a Solr restart.

@hackmastera

jrochkind commented 7 years ago

Re: "Review stopwords list -- i.e. do we have one?"

It looks to me like we do not, and that is fine: I just searched for "and" as a query and got proper results with no problem.

Check that checkbox off @hackmastera ?

hackartisan commented 7 years ago

@jrochkind Sure, give it a shot. I think I have heard people complain about not getting the stemming behavior they expect, although it's possible that was a conflation of stemming with the fact that almost nothing was getting searched.

Also, 👍 re: stopwords

jrochkind commented 7 years ago

You can investigate stemming in the Solr admin 'analysis' screen, which shows you what gets stemmed... but I can't find anything that gets stemmed, nothing I enter gets stemmed, doh!

http://hydra.chemheritage.org:8983/solr/#/collection1/analysis?analysis.fieldvalue=description&analysis.fieldtype=text_en&verbose_output=1
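For checking several terms without clicking through the admin UI, here is a small sketch that builds the equivalent request URL for Solr's field analysis handler. It assumes the `/analysis/field` request handler is enabled on the core, which I have not verified for our setup; the helper name `field_analysis_url` is made up for this example. It only builds the URL and does not call Solr.

```python
from urllib.parse import urlencode


def field_analysis_url(base, core, field_type, text):
    """Build a URL for Solr's field analysis handler (assumed to be enabled)."""
    params = urlencode({
        "analysis.fieldtype": field_type,   # e.g. text_en
        "analysis.fieldvalue": text,        # the text to run through the analyzer
        "wt": "json",
    })
    return f"{base}/solr/{core}/analysis/field?{params}"


url = field_analysis_url(
    "http://hydra.chemheritage.org:8983", "collection1",
    "text_en", "scientific instruments",
)
```

Fetching that URL should return the token stream after each analysis step, so you can see exactly which filter (if any) changes a word.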

jrochkind commented 7 years ago

Ah, EnglishMinimal does take off plurals. Sometimes. It removes the 's' to normalize singular/plural, so it does do something. I think the more aggressive Porter one is what people want.
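To make the "sometimes" concrete, here is a toy plural normalizer in the same spirit. This is a simplified approximation written for illustration, NOT Lucene's actual EnglishMinimalStemmer, and its exact rules are my own assumptions. It shows why a minimal stemmer conflates "instrument"/"instruments" but leaves forms like "stemming" alone.

```python
def strip_plural(word: str) -> str:
    """Toy plural normalizer; an approximation, not Lucene's implementation."""
    w = word.lower()
    # Leave short words and words that do not end in 's' untouched.
    if len(w) < 4 or not w.endswith("s"):
        return w
    # Words like 'glass' or 'apparatus' are not plurals; keep them.
    if w.endswith(("ss", "us")):
        return w
    # 'studies' -> 'study'
    if w.endswith("ies"):
        return w[:-3] + "y"
    # Default: drop the trailing 's' ('notes' -> 'note').
    return w[:-1]


examples = {w: strip_plural(w) for w in
            ["instruments", "studies", "glass", "notes", "stemming"]}
```

A Porter-style stemmer goes much further than this, also stripping suffixes such as "-ing" and "-ed", which is probably the behavior people expect.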

This will probably require a reindex with downtime. After the Solr field type is changed, the index will behave unpredictably until it is reindexed. (If we had a master/slave Solr index, this would be easier to do without downtime.)

jrochkind commented 7 years ago

The PorterStemmer is going to do weird things to non-English text though. We can try it on staging I guess. Maybe we should wait until we have other relevancy things tuned a bit more, so people trying it with Porter can have a better starting place to see if it makes things terrible. And get some German-language speakers to enter queries, if we care about that. (We've got a lot of German I think).

In an ideal world, we'd have our text strings tagged as to language, so we could do the proper stemming and other analysis based on language... but we don't.

hackartisan commented 7 years ago

Oh yeah I forgot about solr's analysis tool!

In general where we have other languages it is probably just titles (@catlu can correct me if I'm wrong) and they'll have translated titles as well. I think it's okay to treat this as an English-language catalog. We can also include a couple of German title searches in our formal eval to see how it goes.

Not sure what you mean by "other relevancy things tuned a bit more" -- you mean do a round of eval before changing the stemming?

jrochkind commented 7 years ago

I think I basically meant merging and deploying #457. If that's on production, and we deploy the equivalent but with PorterStemFilterFactory on staging, then the two can be easily compared.
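One simple way to do that comparison, sketched under the assumption that we can get an ordered list of result ids for the same query from each environment (how those lists are fetched is left out, and the function name `top_n_overlap` is made up here):

```python
def top_n_overlap(prod_ids, staging_ids, n=10):
    """Compare the top-n result ids from production and staging for one query."""
    prod_top, staging_top = prod_ids[:n], staging_ids[:n]
    shared = set(prod_top) & set(staging_top)
    return {
        # Ids in both top-n lists, in production order.
        "shared": [doc for doc in prod_top if doc in shared],
        # Ids that only production ranks in the top n.
        "only_production": [doc for doc in prod_top if doc not in shared],
        # Ids that only staging (e.g. with Porter stemming) ranks in the top n.
        "only_staging": [doc for doc in staging_top if doc not in shared],
    }


# Made-up ids, purely to show the shape of the result.
diff = top_n_overlap(["a", "b", "c"], ["b", "d", "a"], n=3)
```

The "only_staging" list is the interesting one to eyeball: those are the records the stemming change pulled into the first page of results.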

Is there any easy way to refresh staging with production data, @sanfordd ?

hackartisan commented 7 years ago

I propose moving the last checkbox to its own issue and discussing whether it might be something we could do between soft and hard launch, alongside other user testing.

Update: formal eval moved to #494