Closed jenniferward closed 2 years ago
This is not easy to do; nevertheless, I would like to have a project to fix this in 6.0
The solution may be already implemented in standard Solr: https://stackoverflow.com/questions/23170209/how-to-ignore-accent-search-in-solr
Moreover, I'd say that it would be good to be implemented in all fields, not only titles.
Since 5.x I have applied this patch to each new (test) Muscat instance, and it solves the problem well. It allows searching with or without diacritics in any field, regardless of whether the original bibliographic record contains diacritics, and regardless of upper or lower case. Reindexing the records should fix it for older indices.
diff --git a/solr/configsets/sunspot/conf/schema.xml b/solr/configsets/sunspot/conf/schema.xml
index aae9274..960bea5 100644
--- a/solr/configsets/sunspot/conf/schema.xml
+++ b/solr/configsets/sunspot/conf/schema.xml
@@ -64,6 +64,7 @@
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
+ <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
@@ -92,6 +93,7 @@
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
+ <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
</analyzer>
</fieldType>
<!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
@@ -122,6 +124,7 @@
when you want your sorting to be case insensitive
-->
<filter class="solr.LowerCaseFilterFactory" />
+ <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<!-- The TrimFilter removes any leading or trailing whitespace -->
<filter class="solr.TrimFilterFactory" />
<!-- Remove leading articles -->
Thanks! I will try it out. The filter in Solr seems a good idea; I never managed to get back to this issue.
My example works on muscat-test now. Thanks, @fjorba !
Glad it helps!
We just discovered (thank you Guido) that searches do not work if the term (1) has a diacritic and (2) is truncated: no results. Examples:
Subject headings: rév does not find révolution, but rev finds révolution
Personal names: vásque does not find Vásquez
Secondary literature: Aufführ (for Aufführungen)
Sources: zöllne* (for Zöllner)
In live Muscat, truncated searches with diacritics do indeed work.
In my modest opinion, I would resolve this case in Muscat itself. I didn't find a solution in Solr, but I think a couple of lines of Ruby would suffice. Something like:
if the search string contains a *, remove all diacritics and send it to Solr
I don't feel competent enough in Ruby yet to solve it quickly and well, but that's how I'd do it.
There is a more advanced folding filter than the ASCII Folding Filter that would work for expanded character sets, the ICU Folding Filter:
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#icu-folding-filter
This works with the expanded Unicode set, so it is much more likely to catch a wide variety of characters and convert them to ASCII in a standardized way. It can be added to both the index and query analyzers. Adding it to the index analyzer will do exactly what @fjorba proposes, but instead of removing the diacritic it will "fold" it to an equivalent ASCII character.
This will also handle extended Unicode ligatures, like expanding "æ" to "ae", etc.
Thanks for the pointer. According to this documentation,
To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Solr Plugins). See solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add.
Factory class: solr.ICUFoldingFilterFactory
That would probably complicate the Solr setup. I'm not saying that it is not needed, but that it has a higher cost.
Maybe there is a third possibility. You are also right to point out (or I didn't write it well) that removing the diacritics shouldn't remove the whole characters. I mean, the result of my proposal for searching Genèv would be Genev, not Genv. That is what can be done in Ruby, and probably in a correct way. In Python, I usually solve it this way; I don't yet know how to do it in Ruby, although I have no doubt that it can be done:
import unicodedata
[...]
flat = ''.join(c for c in unicodedata.normalize('NFD', s)
               if unicodedata.category(c) != 'Mn')
Solving this case in Ruby would probably lead to a simpler Muscat installation than solving it in Solr.
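For what it's worth, the same normalization can be written in Ruby using only the standard library. This is a minimal sketch (the helper name is illustrative, not from the Muscat codebase):

```ruby
# Strip combining marks (Unicode category Mn) from a search string,
# mirroring the Python unicodedata approach above. NFD decomposition
# splits each accented character into its base letter plus combining
# marks, which the gsub then removes, so "Genèv" becomes "Genev".
# Note: unlike ICU folding, this does not expand ligatures such as "æ".
def strip_diacritics(s)
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_diacritics('Genèv')        # => Genev
puts strip_diacritics('révolution*')  # => revolution*
```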
The additional jars are distributed with Solr itself -- I've done this many times before, and it's quite trivial:
https://github.com/bodleian/iiif_manifest_server/blob/master/solr/conf/solrconfig.xml#L4-L6
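For reference, loading the analysis-extras jars is typically just a couple of lib directives in solrconfig.xml. A sketch along those lines (the exact dir paths depend on the Solr version and install layout):

```xml
<!-- Load the ICU analysis jars shipped in Solr's contrib directory.
     Paths are illustrative and depend on the local Solr layout. -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
```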
To match the missing diacritics by handling them in Ruby would mean that the diacritics would also need to be removed in Solr at index time. If the index contains the word Genève without a folding filter, then trying to match it with a search string of Genev* would not work, since it would, at best, be able to match against a search of Genèv*. This would probably mean doing more in the Muscat internals to make sure the indexed content matched the expected query content.
On the other hand, this is precisely the use case for a folding filter on the field's query analyzer: to normalize the extended character sets into a standardized form at query time. By doing it on both the index and query analyzers, you ensure that users' queries are transformed by the same process by which the content was indexed.
For reference, here is a fulltext field definition that does exactly what I am describing:
<fieldType name="text_fulltext" class="solr.TextField" autoGeneratePhraseQueries="true" multiValued="true" termVectors="true" stored="true">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory" />
<filter class="solr.WordDelimiterGraphFilterFactory" splitOnNumerics="0" preserveOriginal="1" />
<!-- synonym filters only need to be on index OR query, not both -->
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.FlattenGraphFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.CommonGramsQueryFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory" />
<filter class="solr.WordDelimiterGraphFilterFactory" splitOnNumerics="0" preserveOriginal="1" />
</analyzer>
</fieldType>
I personally think we should not do this in Ruby, as it can be done more cleanly in Solr. The main reason this issue has lingered is that the current implementation of Solr indexing in Muscat is not completely optimal, and I would like to restructure it from the ground up, taking these kinds of issues into consideration when making the new configuration.
Obviously a Solr solution would be better. My Ruby suggestion was only for the case of a search string that has both a diacritic and a wildcard, where my simpler ASCIIFoldingFilterFactory solution was not working.
Especially if it doesn't complicate the Solr installation, as it is so simple now.
I'm afraid that waiting for a complete solution that includes truncation has meant that an easy one (https://github.com/rism-ch/muscat/issues/690#issuecomment-678727210) has still not been applied, and Muscat, out of the box, is unable to search ignoring diacritics, as widely expected. As I'm only capable of creating a patch for the easy solution, should I create a pull request for it?
We are working on upgrades to Solr which, once in place, will implement the Unicode folding filters and provide diacritic-insensitive search.
See #306 for an example. I implemented @ahankinson's proposed solution and it seems to work very well, I will dig more into this with @ahankinson and report back on when we can implement this.
Just for reference, this is implemented with a new field
<fieldType name="text_alphanumeric_sort" class="solr.ICUCollationField" locale="" numeric="true" strength="secondary" alternate="shifted" sortMissingLast="true" />
And then it can be attached to a field like this:
<dynamicField name="*_ans_s" type="text_alphanumeric_sort" indexed="true" stored="false" multiValued="false"/>
Ideally a copyField would be used, but we cannot do that easily right now with Sunspot. I'm also closing #306 and #978 since they are fixed by this patch; right now it is in test with full_name in Person, and in the future it will be applied to the other authority files and Sources (so I'm leaving the relevant ticket, #881, open).
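For illustration, the copyField wiring mentioned above would look something like this (the source field name is hypothetical and depends on how Sunspot names the indexed field):

```xml
<!-- Hypothetical: copy the Sunspot-indexed full_name field into the
     collation-based sort field matched by the *_ans_s dynamicField. -->
<copyField source="full_name_s" dest="full_name_ans_s"/>
```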
@xhero I think this can be closed now?
If our colleagues are happy with how it works, yes!
@docudoctor and @alexandermarxen please confirm
Yes, thank you very much! It works very well!
This came up at the workshop; I'm not sure how realistic it is. There was a wish to have diacritics ignored in searches, so that a search for Bar would also find Bär. The reason is that it is not always intuitive when diacritics are needed on a word, and keyboards do not have the diacritics for foreign languages. Even when using a tool like a character map, it is not always easy to tell which character is meant.
See also: #622 #306