sciencehistory / chf-sufia

sufia-based hydra app

Switch our text_en fields to use the "porter" stemmer instead of the "EnglishMinimal" #480

Closed jrochkind closed 7 years ago

jrochkind commented 7 years ago

@MDiMeo @catlu

Known issue: this may do very weird things to non-English text. These stemming algorithms are always language-specific, and the more aggressive "porter" stemmer will behave even more strangely on non-English text.

jrochkind commented 7 years ago

Something else I've done in the past is set up Solr so exact matches are ranked higher than "stemmed" matches.

This can be done, but it takes some additional work. It has not been done yet.
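
One common way to get this effect in Solr is to index the same text into two fields, one stemmed and one unstemmed, and give the unstemmed field a larger weight in the eDisMax `qf` parameter. A minimal sketch of building such a query, not what this app has configured: `text_exact` (an unstemmed copy of `text_en`), the boost values, and the helper name are all assumptions for illustration.

```python
from urllib.parse import urlencode


def build_search_params(user_query, exact_boost=10, stemmed_boost=1):
    """Build eDisMax params that weight an unstemmed field above a stemmed one.

    `text_exact` is a hypothetical unstemmed copy of `text_en`;
    the boost values are illustrative, not tuned.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        # Matches in text_exact contribute more to the score than
        # matches that only exist in the stemmed text_en field.
        "qf": f"text_exact^{exact_boost} text_en^{stemmed_boost}",
    }


params = build_search_params("trading show")
query_string = urlencode(params)
print(query_string)
```

With this kind of setup a stemmed-only match still appears in the results (recall is kept), but a document containing the literal query terms should tend to rank above it.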

jrochkind commented 7 years ago

Examples of aggressive "Porter" stemming (used Solr admin analysis tool to investigate)

I think the "Porter" stemmer is rule-based rather than dictionary-based: it applies some basic rules to reduce words to their 'root'. It is definitely not always going to do what a human might think best, sometimes matching things that one would think ought not to be matched, and other times not matching things one would like to match.
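
To illustrate what "rule-based" means here, a toy suffix-stripping sketch. This is not the real Porter algorithm (which has several ordered steps and conditions on the remaining stem); the suffix list and `toy_stem` are invented for illustration only.

```python
# Toy suffix-stripping rules, NOT the real Porter algorithm.
# Rules are tried in order; the first one that applies wins.
SUFFIXES = ("ism", "ing", "ed", "s", "e")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ("trading", "trade", "showing", "show", "socialism", "social"):
    print(w, "->", toy_stem(w))
```

Because the rules know nothing about meaning, they conflate "trading" with "trade" (probably wanted) just as readily as "socialism" with "social" (possibly not wanted).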

jrochkind commented 7 years ago

Okay, aggressive stemming using "Porter" is live at staging.hydra.chemheritage.org.

An example search that shows some of the pros and cons is: << trading show >>

Note the first four results all have "Trade Show" in the title. Because of aggressive stemming, the query "trading show" matched.

However, note the fifth result: "Detail view of blast furnaces showing feed buckets ascending". Why did this match? It doesn't actually have the word "trading" or "show" in it anywhere, but it has the word "trade" in one place and the word "showing" in two others. Because of aggressive stemming, this is a match.

So there are potential pluses and minuses here: recall increases but precision drops. Relevancy tweaks often run into this; improving some searches, or searches in some ways, risks harming others, or in other ways.

Boosting exact matches over stemmed matches is possible, and may ameliorate things somewhat.

jrochkind commented 7 years ago

A search for socialism brings up hits with just social in them, in some cases above hits with socialism. I think we've decreased precision too much with this setup as it is.

sanfordd commented 7 years ago

I'll second the social/socialism problem (staging gives each the same results, while production is more precise).

On larger sets of objects, the difference seems to be smaller.

I used electronic, which got 206 on Production and 222 on Staging. The stemming on staging pushes the Electron Inertia Apparatus and Electron Microscope up to the front page; I'm a bit torn on whether that is useful or a bit of a distraction. Overall I think this is beneficial.

Medicine shows a 1 object difference on the two (staging has 54 to production's 53) due to a change in Production, but otherwise they get the same results.

Alchemy grabs 22 on staging versus 17 on production, primarily paintings. Oddly the paintings do have alchemy as a subject and are being missed on production. This might be related to the same production change on medicine with a few objects having their status changed. If so, then the results end up the same.

Valence pulls the same 6 objects up on both.

I did a test with eugenics: production doesn't generate any results, but staging's stemming matches the name Eugene with eugenics. That's definitely not helpful.

Interestingly, on staging gene doesn't pull up anything from genetic, nor does genetic go back to gene. Not sure why that's the case.

Organic works on production, but on staging it grabs organization as well. That certainly doesn't help make the results better.

Overall I agree with @jrochkind that this sacrifices too much precision; a few words spin off into some highly unrelated territory.

catlu commented 7 years ago

Typed in "ad" on both prod and staging; both bring up advertisements, but staging's first result is for "adding" in the title.

Biochem brings up 0 results on Prod, 4 on Staging related to biochemical.

Experiment brings up 110 on Prod, 112 on Staging, but the second result on Staging is a hit for experimenting, so it is less precise there.

Lab brings up 64 results in both with no difference.

Flower results in an extra hit from Staging for flowering.

Industry has 870 results in Prod, 912 in Staging, but Staging has moved hits for industrial up near the top.

I'm inclined to agree with both @jrochkind and @sanfordd here, as I have a preference for more exact matches, but boosting exact matches over the stemming (with stemming on) may be a better adjustment.

archivistsarah commented 7 years ago

For the most part, my searches don't bring up very different results on production vs staging.

Beckman = 868 prod; 867 staging. Staging gives me a less-relevant photo of a building high in the results (prod doesn't have this in the 1st 10 results).

pH = 235 in both. The order differs once you get 30+ results back, but nothing unexpected or confusing.

spectro = 0 results in both

spectrometry = 16 results in both; order is the same

portraits of scientists = 190 results in both; nothing unexpected in the orders

MDiMeo commented 7 years ago

Most of my searches (book wheel, Boyle, Pharma*, Cornell University) produced the same results with helpful relevancy rankings. The title having more weight than others and having indexed the additional fields has really helped me find what I expect. I also think we should somehow highlight the wild card option.

Two searches produced different results: "Glass Blowing" resulted in the same 32 objects in both, but they were in a different order. Staging may have been slightly more relevant here because some of the photos of people actually glass blowing were higher than the vase. But honestly they all seemed relevant, and the differences in ordering were minor.

"Scientific Education" had 3 on production and 8 on staging. The 5 extras that I got on staging were not relevant. For example, I got a letter from Arnold Beckman that wasn't related to scientific education because Scientific apparatus was the subject and it mentioned in the description that Noyes was educated in Germany. This seems to me a good example of stemming gone wrong.
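
To put rough numbers on that precision drop, a small worked sketch. It assumes (this is not stated above) that all 3 production results are relevant and that the same 3 appear among staging's 8; the `precision` helper is just for illustration.

```python
def precision(relevant_returned, total_returned):
    """Fraction of returned results that are relevant."""
    return relevant_returned / total_returned


# Assumption: the 3 production hits are all relevant and are also
# among the 8 staging hits; the 5 staging extras are not relevant.
production = precision(3, 3)
staging = precision(3, 8)
print(f"production: {production:.3f}, staging: {staging:.3f}")
# prints "production: 1.000, staging: 0.375"
```

Under those assumptions recall is unchanged for this query (the same relevant objects are found), so the stemming only added noise here.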

jrochkind commented 7 years ago

Decided not to do this for now.