Open alexduryee opened 2 weeks ago
Some initial responses:
EnglishMinimalStemFilterFactory has been used in other projects and it's less aggressive.
For reference in this conversation, Blacklight core ships with a single boosted unstemmed field in the default search. Details below.
I think the inclusion of this in Blacklight core makes a strong case that it would not be heavy-handed to also include it in Arclight by default.
In the solrconfig.xml:
<str name="pf">
all_text_timv^10
</str>
In the schema, all_text_timv is defined as a text field:
<field name="all_text_timv" type="text" stored="false" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
The text fieldType is defined with no stemming in the analysis:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/> <!-- NFKC, case folding, diacritics removed -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
all_text_timv is the destination of multiple copy fields:
<copyField source="*_tsim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_tesim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_ssim" dest="all_text_timv" maxChars="3000"/>
<copyField source="*_si" dest="all_text_timv" maxChars="3000"/>
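For comparison with the unstemmed text type above, a field type built on the less aggressive stemmer mentioned at the top of this comment might look roughly like the following. This is only a sketch: the name text_en_minimal is hypothetical and is not taken from Blacklight or Arclight, and the tokenizer and folding filter are simply reused from the text type shown above.

```xml
<!-- Hypothetical field type; the name text_en_minimal is an assumption for illustration -->
<fieldType name="text_en_minimal" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Light English stemming (mainly plurals) instead of a more aggressive stemmer -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Whether a lighter stemmer alone resolves the reported matches would need to be checked against real queries in the Solr analysis screen.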
Following up from the November 4 community call, there was discussion around search term stemming in Solr, and how it's currently too aggressive. Users have found that the following terms are getting buried due to stemming:
eugenics matches eugene
organs matches organization
There's no way for the user to search exact terms without stemming, since quotation marks only group phrases and won't bypass stemming.
Duke's approach to this was to include an unstemmed index field (https://gitlab.oit.duke.edu/dul-its/dul-arclight/-/blob/main/solr/arclight/conf/solrconfig.xml#L133), which is weighted above the stemmed ones.
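The general idea of weighting an unstemmed field above stemmed ones can be sketched as an edismax qf setting in solrconfig.xml. The field names and boost value below are hypothetical placeholders, not copied from Duke's or Arclight's configuration:

```xml
<lst name="defaults">
  <str name="defType">edismax</str>
  <!-- Hypothetical field names: exact (unstemmed) matches get the higher boost,
       so a literal hit on "eugenics" should outrank a stem-only hit -->
  <str name="qf">
    all_text_unstemmed^10
    all_text_stemmed
  </str>
</lst>
```

With a setup along these lines, stemmed matches still appear in the results but should rank below documents that contain the exact term.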
Questions to discuss: