rism-digital / muscat

🗂️ A Rails application for the inventory of handwritten and printed music scores
http://muscat-project.org

Search for titles without diacritics #690

Closed jenniferward closed 2 years ago

jenniferward commented 5 years ago

This came up at the workshop, and I'm not sure how realistic it is: there was a wish to have diacritics ignored in searches, so a search for Bar would also find Bär. The reason is that it is not always intuitive when a word needs diacritics, and keyboards do not have the diacritics for foreign languages. Even with a tool like a character map, it is not always easy to tell which character is meant.

See also: #622 #306

xhero commented 5 years ago

This is not very easy to do; nevertheless, I would like to have a project to fix this once we are in 6.0.

fjorba commented 4 years ago

The solution may be already implemented in standard Solr: https://stackoverflow.com/questions/23170209/how-to-ignore-accent-search-in-solr

Moreover, I'd say it would be good to implement this for all fields, not only titles.

fjorba commented 4 years ago

I apply this patch to each new (test) Muscat instance, and it has solved the problem well since 5.x. It allows searching with or without diacritics in any field, regardless of whether the original bibliographic record has diacritics and regardless of uppercase or lowercase. Reindexing the records should fix it for older indices.

diff --git a/solr/configsets/sunspot/conf/schema.xml b/solr/configsets/sunspot/conf/schema.xml                                       
index aae9274..960bea5 100644                                                                                                        
--- a/solr/configsets/sunspot/conf/schema.xml                                                                                        
+++ b/solr/configsets/sunspot/conf/schema.xml                                                                                        
@@ -64,6 +64,7 @@                                                                                                                    
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StandardFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
+        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>                                                    
         <filter class="solr.PorterStemFilterFactory"/>
       </analyzer>
     </fieldType>
@@ -92,6 +93,7 @@                                                                                                                    
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.StandardFilterFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
+        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>                                                    
       </analyzer>
     </fieldType>
     <!-- A specialized field for geospatial search. If indexed, this fieldType must not be multivalued. -->
@@ -122,6 +124,7 @@                                                                                                                  
                  when you want your sorting to be case insensitive
               -->
             <filter class="solr.LowerCaseFilterFactory" />
+            <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>                                                
             <!-- The TrimFilter removes any leading or trailing whitespace -->
             <filter class="solr.TrimFilterFactory" />
             <!-- Remove leading articles -->
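
To see why this lets a search for Bar find Bär, here is a rough Python sketch of what lowercasing plus a folding filter with preserveOriginal="true" would put into the index. It is only an approximation under stated assumptions: the real ASCIIFoldingFilterFactory uses its own character mapping table, whereas this sketch merely strips combining marks after NFD normalization, and the names `fold` and `index_tokens` are made up for illustration.

```python
import unicodedata


def fold(token):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize('NFD', token)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def index_tokens(text):
    """Emulate lowercasing + folding with preserveOriginal="true".

    For each whitespace-separated word, emit the folded form and, if it
    differs, the original form as well (hypothetical helper, not Solr code).
    """
    tokens = []
    for word in text.lower().split():
        folded = fold(word)
        tokens.append(folded)
        if folded != word:
            tokens.append(word)
    return tokens


# "\u00e4" is the precomposed letter ä.
indexed = index_tokens("Der B\u00e4r tanzt")
print(indexed)
print("bar" in indexed, "b\u00e4r" in indexed)
```

Because both the folded and the original token end up in the index, a query for either spelling has something to match.
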

xhero commented 3 years ago

Thanks! I will try it out; the filter in Solr seems a good idea. I never managed to get back to this issue.

jenniferward commented 3 years ago

My example works on muscat-test now. Thanks, @fjorba !

fjorba commented 3 years ago

Glad it helps!

jenniferward commented 3 years ago

We just discovered (thank you, Guido) that searches do not work if the term (1) has a diacritic and (2) is truncated: they return no results. Examples:

Subject headings: rév* does not find révolution, but rev* finds révolution
Personal names: vásque* does not find Vásquez
Secondary lit: Aufführ* (for Aufführungen)
Sources: zöllne* (for Zöllner)

In live Muscat, truncated searches with diacritics do indeed work.

fjorba commented 3 years ago

In my modest opinion, I would resolve this case in Muscat itself. I didn't find a solution in Solr, but I think that a couple of lines in Ruby would suffice. Something like:

if search string contains a *, remove all diacritics and send it to Solr

I don't feel competent enough in Ruby yet to solve it quickly and well, but that's how I'd do it.

ahankinson commented 3 years ago

There is a more advanced folding filter than the ASCII Folding Filter that would work for expanded character sets, the ICU Folding Filter:

https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#icu-folding-filter

This works with the expanded Unicode set, so it is much more likely to catch a wide variety of characters and convert them to ASCII characters in a standardized way. It can be added to both the index and query analyzers. Doing it in the index analyzer will do exactly what @fjorba proposes, but instead of removing the diacritic it will "fold" the character to an equivalent ASCII character.

This will also handle extended Unicode ligatures, like expanding "æ" to "ae", etc.

fjorba commented 3 years ago

Thanks for the pointer. According to this documentation,

To use this filter, you must add additional .jars to Solr’s classpath (as described in the section Solr Plugins). See solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add.

Factory class: solr.ICUFoldingFilterFactory

That would probably complicate the Solr setup. I'm not saying that it is not needed, but that it has a higher cost.

Maybe there is a third possibility. You are also right to point out (or I didn't write it well) that removing the diacritics shouldn't remove the whole characters. I mean, the result of my proposal for searching Genèv* would be Genev*, not Genv*. That can be done in Ruby, and probably correctly. In Python I usually solve it the following way; I don't know yet how to do it in Ruby, although I have no doubt that it can be done:

import unicodedata
[...]
    # Decompose s to NFD, then drop the combining marks (category 'Mn')
    flat = ''.join([c for c in unicodedata.normalize('NFD', s)
                 if unicodedata.category(c) != 'Mn'])

Solving this case in Ruby would probably lead to a simpler Muscat installation than solving it in Solr.
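
The rule suggested earlier in the thread (fold the search string only when it contains a *) could look roughly like this in Python. This is a sketch, not Muscat code: `strip_diacritics` and `prepare_query` are hypothetical names, and a Ruby port would need its own equivalent of `unicodedata`.

```python
import unicodedata


def strip_diacritics(s):
    """Remove combining marks but keep the base letters."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def prepare_query(query):
    """Fold only truncated queries; leave all others for Solr to analyze."""
    if '*' in query:
        return strip_diacritics(query)
    return query


# "\u00e8" is è and "\u00f6" is ö, written as escapes to keep them precomposed.
for q in ["Gen\u00e8v*", "z\u00f6llne*", "Gen\u00e8ve"]:
    print(q, "->", prepare_query(q))
```

Letters that have no canonical decomposition, such as ø or ß, pass through this approach unchanged, so it covers fewer cases than a folding filter would.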

ahankinson commented 3 years ago

The additional jars are distributed with Solr itself -- I've done this many times before, and it's quite trivial:

https://github.com/bodleian/iiif_manifest_server/blob/master/solr/conf/solrconfig.xml#L4-L6

Handling the missing diacritics in Ruby would mean that the diacritics also need to be removed in Solr at index time. If the index contains the word Genève, without a folding filter, then a search string of Genev* would not match it; at best, the indexed word could be matched by a search for Genèv*. This would probably mean doing more in the Muscat internals to make sure the indexed content matches the expected query content.

On the other hand, this is precisely the use case for a folding filter on the field's query analyzer: to normalize the extended character sets into a standardized form at query time. By applying it in both the index and query analyzers, you ensure that users' queries are transformed by the same process by which the content was indexed.
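
A toy model of this argument, in Python: an index is treated as a plain set of terms, and a truncated search as a prefix match. Everything here (the `fold` helper, the `prefix_search` function, the sample term) is illustrative and only approximates what Solr's analyzers do.

```python
import unicodedata


def fold(s):
    """Lowercase and strip combining marks (a stand-in for a folding filter)."""
    decomposed = unicodedata.normalize('NFD', s.lower())
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def prefix_search(index_terms, prefix):
    """Return the indexed terms that start with the given prefix."""
    return sorted(t for t in index_terms if t.startswith(prefix))


# "\u00e8" is the precomposed letter è.
raw_index = {"gen\u00e8ve"}            # indexed without folding
folded_index = {fold("Gen\u00e8ve")}   # folded at index time

# Folding only the query misses the unfolded index entry.
print(prefix_search(raw_index, fold("Gen\u00e8v")))
# Folding on both sides matches, with or without the diacritic in the query.
print(prefix_search(folded_index, fold("Gen\u00e8v")))
print(prefix_search(folded_index, fold("Genev")))
```

The first search comes back empty, while the two searches against the folded index both find the folded term, which is the point of running the same folding step in both analyzers.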

ahankinson commented 3 years ago

For reference, here is a fulltext field definition that does exactly what I am describing:

    <fieldType name="text_fulltext" class="solr.TextField" autoGeneratePhraseQueries="true" multiValued="true" termVectors="true" stored="true">
        <analyzer type="index">
            <tokenizer class="solr.ICUTokenizerFactory" />
            <filter class="solr.ICUFoldingFilterFactory"/>
            <filter class="solr.CommonGramsFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.KStemFilterFactory" />
            <filter class="solr.WordDelimiterGraphFilterFactory" splitOnNumerics="0" preserveOriginal="1" />
            <!-- synonym filters only need to be on index OR query, not both -->
            <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
            <filter class="solr.FlattenGraphFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.ICUTokenizerFactory" />
            <filter class="solr.ICUFoldingFilterFactory"/>
            <filter class="solr.CommonGramsQueryFilterFactory" ignoreCase="true" words="stopwords.txt" />
            <filter class="solr.KStemFilterFactory" />
            <filter class="solr.WordDelimiterGraphFilterFactory" splitOnNumerics="0" preserveOriginal="1" />
        </analyzer>
    </fieldType>

xhero commented 3 years ago

I personally think we should not do this in Ruby, as it can be done more cleanly in Solr. The main reason the fix for this issue has been lingering is that the current implementation of Solr indexing in Muscat is not completely optimal, and I would like to restructure it from the ground up, taking these kinds of issues into consideration when making the new configuration.

fjorba commented 3 years ago

Obviously a Solr solution would be better, especially if it doesn't complicate the Solr installation, which is so simple now. My Ruby suggestion was only for search strings that have both a diacritic and a wildcard, where my simpler ASCIIFoldingFilterFactory solution was not working.

fjorba commented 3 years ago

I'm afraid that waiting for a complete solution that includes truncation has meant that an easy one (https://github.com/rism-ch/muscat/issues/690#issuecomment-678727210) is still not applied, and Muscat, out of the box, is unable to search ignoring diacritics, as is widely expected. As I'm only able to create a patch for the easy one, should I create a pull request for it?

ahankinson commented 3 years ago

We are working on upgrades to Solr which, once in place, will implement the Unicode folding filters and provide diacritic-insensitive search.

xhero commented 2 years ago

See #306 for an example. I implemented @ahankinson's proposed solution and it seems to work very well. I will dig more into this with @ahankinson and report back on when we can implement this.

xhero commented 2 years ago

Just for reference, this is implemented with a new field type:

<fieldType name="text_alphanumeric_sort" class="solr.ICUCollationField" locale="" numeric="true" strength="secondary" alternate="shifted" sortMissingLast="true" />

And then it can be attached to a field like this:

<dynamicField name="*_ans_s" type="text_alphanumeric_sort" indexed="true" stored="false" multiValued="false"/>

Ideally a copyField would be used, but we cannot do that easily right now with Sunspot. I'm also closing #306 and #978 since they are fixed by this patch. Right now it is in test with full_name in Person; in the future it will be applied to the other auth files and Sources (so I'm leaving the relevant ticket open, #881).

ahankinson commented 2 years ago

@xhero I think this can be closed now?

xhero commented 2 years ago

If our colleagues are happy with how it works, yes!

jenniferward commented 2 years ago

@docudoctor and @alexandermarxen please confirm

alexandermarxen commented 2 years ago

Yes, thank you very much! It works very well!