projectblacklight / arclight

A Rails engine supporting discovery of archival material
https://samvera.atlassian.net/wiki/spaces/samvera/pages/405211890/ArcLight

search stemming is overzealous #1563

Open alexduryee opened 2 weeks ago

alexduryee commented 2 weeks ago

Following up from the November 4 community call: there was discussion of search term stemming in Solr and how it is currently too aggressive. Users have found that the following terms are getting buried in search results due to stemming:

Duke's approach to this was to add an unstemmed index field (https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/solrconfig.xml#L133), which is weighted above the stemmed fields.
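
To make the idea concrete, here is a minimal sketch of weighting an unstemmed field above a stemmed one in an edismax handler. This is not a copy of Duke's configuration: the handler name and the field names text_unstem and text_stem are hypothetical, and the boost value is only illustrative.

```xml
<!-- Hypothetical sketch, not Duke's actual solrconfig.xml. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- text_unstem (hypothetical) is analyzed without a stemming filter;
         text_stem (hypothetical) is analyzed with one.
         The higher boost favors exact-form matches over stemmed matches. -->
    <str name="qf">
      text_unstem^10
      text_stem
    </str>
  </lst>
</requestHandler>
```

With a setup along these lines, stemmed matches still contribute to recall, but documents containing the exact form of the query term should score higher.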

Questions to discuss:

corylown commented 2 weeks ago

Some initial responses:

bibliotechy commented 2 weeks ago

For reference in this conversation, Blacklight core ships with a single boosted unstemmed field in the default search. Details below.

I think the inclusion of this in Blacklight core makes a strong case that it would not be heavy-handed to also include it in ArcLight by default.


In the solrconfig.xml:

<str name="pf">
  all_text_timv^10
</str>

In the schema:

all_text_timv is defined as a text field.

<field name="all_text_timv" type="text" stored="false" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>

The text fieldType is defined with no stemming filter in its analysis chain:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>  <!-- NFKC, case folding, diacritics removed -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
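
For contrast, a stemmed field type would typically differ only by adding a stemming filter to the same chain. The sketch below is an assumption for illustration, not a type taken from Blacklight or ArcLight: the name text_stemmed is hypothetical, and solr.PorterStemFilterFactory is just one of the stemmers Solr provides.

```xml
<!-- Hypothetical stemmed counterpart to the "text" type above. -->
<fieldType name="text_stemmed" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- The stemming step: this filter is what the unstemmed type leaves out. -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```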

all_text_timv is the destination of multiple copy fields:

<copyField source="*_tsim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_tesim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_ssim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_si" dest="all_text_timv" maxChars="3000"/>