search stemming is overzealous

alexduryee commented 2 weeks ago

Following up from the November 4 community call, where we discussed search term stemming in Solr and how it's currently too aggressive. Users have found that results for the following terms are getting buried due to stemming:

eugenics matches eugene
organs matches organization

There's no way for the user to search exact terms without stemming, since quotation marks only group phrases and won't bypass stemming.

Duke's approach to this was to include an unstemmed index field (https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/solrconfig.xml#L133), which is weighted above the stemmed ones.

Questions to discuss:

Do alternative Solr stemmers provide a better search experience?
Does Duke's approach meet user expectations? Are terms still being buried?
How important is exact-term searching via quotation marks? Can that be implemented?

corylown commented 2 weeks ago

Some initial responses:

Solr includes a variety of stemming options. It could be worth investigating whether switching to one of these different stemming strategies improves things. I think we've used EnglishMinimalStemFilterFactory in other projects and it's less aggressive.
Duke's approach (indexing both stemmed and unstemmed copies of fields and giving a boost to the unstemmed matches) is the typical approach to this problem. The person searching doesn't have to know any specific querying techniques for it to work, and relevance ranking pushes unstemmed matches to the top of the results and less exact matches further down. I think we should consider implementing this strategy in ArcLight. There may also be fields that are being stemmed that we should stop stemming altogether (for example, any fields for names).
Quotes indicate a phrase query to Solr's query parser. I'm wary of trying to implement something in ArcLight that would try to use quotes in a query to mean something different from what Solr expects. You'd have to manage query parsing at the application level and translate to meaningful queries for Solr.
Another option that is available to implementers (I'm not sure adding this to ArcLight out of the box makes sense) would be to configure a fielded search option that includes only unstemmed copies of fields, for cases where the searcher knows they don't want stemming. I think Duke's approach is better, but this would provide the expert searcher with more control.
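
To make the first two points concrete, here is a rough sketch of what they might look like in Solr config. The field type name `text_en_minimal`, the field names `text_unstem_im` and `text_tesim`, and the boost value are all placeholders for illustration, not taken from ArcLight's or Duke's actual configuration, and the analyzer chain is only one plausible arrangement.

```xml
<!-- schema.xml: hypothetical field type using the less aggressive stemmer.
     EnglishMinimalStemFilterFactory is mostly limited to plural endings;
     worth confirming on the Solr analysis screen with the problem terms. -->
<fieldType name="text_en_minimal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- solrconfig.xml, inside the edismax handler defaults: weight the
     unstemmed copy above the stemmed one. Field names and the ^5 boost
     are assumptions and would need relevance testing. -->
<str name="qf">
  text_unstem_im^5
  text_tesim
</str>
```

With weights along these lines, a stemmed-only match is still returned but should rank below documents that contain the term exactly as typed.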

bibliotechy commented 2 weeks ago

For reference in this conversation, Blacklight core ships with a single boosted unstemmed field in the default search. Details below.

I think the inclusion of this in Blacklight core makes a strong case that it would not be heavy-handed to also include it in ArcLight by default.

In the solrconfig.xml:

<str name="pf">
  all_text_timv^10
</str>

In the schema

all_text_timv is defined as a text field.

<field name="all_text_timv" type="text" stored="false" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

text fieldType is defined with no stemming in the analysis

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>  <!-- NFKC, case folding, diacritics removed -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

all_text_timv is the destination of multiple copy fields

<copyField source="*_tsim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_tesim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_ssim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_si" dest="all_text_timv" maxChars="3000"/>
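
Building on the same `all_text_timv` field, the unstemmed-only search option mentioned earlier in the thread could be sketched as a separate request handler. This is only an illustration: the handler name `/search_exact` is made up here and is not part of Blacklight or ArcLight, and an implementer might prefer to wire this up as a search field in the application instead.

```xml
<!-- solrconfig.xml: hypothetical handler that queries only the unstemmed
     copy field, for searchers who explicitly do not want stemming. -->
<requestHandler name="/search_exact" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Match against the unstemmed field only -->
    <str name="qf">all_text_timv</str>
    <!-- Keep the phrase boost from the default config shown above -->
    <str name="pf">all_text_timv^10</str>
  </lst>
</requestHandler>
```

Because `all_text_timv` is populated with `maxChars="3000"` per copy field, very long source values would be truncated in this index, which is something to check before relying on it for exact-term search.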

projectblacklight / arclight

search stemming is overzealous #1563