Search of items containing diacritics

hacketiwack commented 2 years ago

The search needs improvements. Especially when it comes to search for titles or artist names with diacritics.

Current situation In search fields of any Owntone server clients - Apple Remote app, Owntone web interface, etc. - it is not possible to search for an item name - tracks, artists, playlists, podcasts, albums, audiobooks, etc. - without specifying exactly name of the item with its correct spelling: e.g. searching "Lumière" will not bring the same results as searching "Lumiere", or "Crăciun fericit" should be findable with "Craciun fericit". We're not talking here about the transliteration cases, which should be the discussion for another enhancement.

Expected situation When a user search for an item in their library, they should not have to enter any accented characters (with diacritics) in order to find them. A search should be able to find all items being written with or without accents.

Idea I've looked at the source code and, in my opinion, one of the elegant way to handle this enhancement is to amend the method sqlext_daap_unicode_xcollation in the file sqlext.c. This collation function should:

Normalize the strings to compare with the Normalization Form NFKD (method u8_normalize of GNU libunistring).
Iterate over the strings to remove the diacritics (method uc_is_property_diacritic)
Compare the strings as it is already the case today.

ejurgensen commented 2 years ago

Yes, I agree there is an issue. The solution you propose sounds sensible. I’m impressed that you were able to navigate the code and find the sql extension. Are you up for making a pr with this?

hacketiwack commented 2 years ago

C might not be my native language :-P I'll try to do my best to create a working PR :-)

whatdoineed2do commented 2 years ago

@hacketiwack - I gave this a go and updated the sqlext.c borrowing code from GNOME's tracker project thats been stable for about 7yrs.

The extra trick is in the json module - all search queries come through and map to a LIKE query but it appears SQLite doesnt support collating with a like. Certainly when this matches what i observe in sqlite3 with the modified .so loaded.

$ curl -s 'http://localhost:3689/api/search?type=artists&media_kind=music&query=foo'| jq
{
  "artists": {
    "items": [
      {
        "id": "2284445307090104385",
        "name": "foo",
        "name_sort": "foo",
        "album_count": 4,
        "track_count": 8,
        "length_ms": 150247,
        "time_played": "2022-02-04T15:30:08Z",
        "time_added": "2022-01-29T13:57:03Z",
        "in_progress": false,
        "media_kind": "music",
        "data_kind": "file",
        "uri": "library:artist:2284445307090104385",
        "artwork_url": "./artwork/group/23?v=1643559909"
      }
    ],
    "total": 1,
    "offset": 0,
    "limit": -1
  }
}

$ curl -s 'http://localhost:3689/api/search?type=artists&media_kind=music&query=foó'| jq
{
  "artists": {
    "items": [
      {
        "id": "6275129402329078952",
        "name": "foó",
        "name_sort": "foó",
        "album_count": 1,
        "track_count": 1,
        "length_ms": 6034,
        "time_added": "2022-02-05T10:30:41Z",
        "in_progress": false,
        "media_kind": "music",
        "data_kind": "file",
        "uri": "library:artist:6275129402329078952",
        "artwork_url": "./artwork/group/41?v=1644057070"
      }
    ],
    "total": 1,
    "offset": 0,
    "limit": -1
  }
}

vs

$ sqlite3 ~/dev/_install/var/cache/owntone/songs3.db 
SQLite version 3.34.0 2020-12-01 16:14:00
Enter ".help" for usage hints.
sqlite> .load /home/ray/dev/_install/lib/owntone/owntone-sqlext.so
sqlite> select artist,title from files where artist = "foo";
foo|48k sine
foo|MZZZ
foo|abcd
foo|efgh
foo|XYZ
foo|AbCd
foo|one
foo|three
foó|with diacratic
sqlite>

So the Q will be if @ejurgensen will be happy to downgrade the searches to approx equals instead like. I didnt check what is requested from Apple Remote appl but the web interface search won't work with just the sqlext change (but still worthy of a PR #1415 which I'll send through)

whatdoineed2do commented 2 years ago

Changes in httpd_json's search_*() queries modified like this:

diff --git a/src/httpd_jsonapi.c b/src/httpd_jsonapi.c
index 5989c440..60c35a34 100644
--- a/src/httpd_jsonapi.c
+++ b/src/httpd_jsonapi.c
@@ -4290,9 +4290,9 @@ search_tracks(json_object *reply, struct httpd_request *hreq, const char *param_
   if (param_query)
     {
       if (media_kind)
-   query_params.filter = db_mprintf("(f.title LIKE '%%%q%%' AND f.media_kind = %d)", param_query, media_kind);
+   query_params.filter = db_mprintf("((f.title LIKE '%%%q%%' OR f.title = '%q') AND f.media_kind = %d)", param_query, media_kind);
       else
-   query_params.filter = db_mprintf("(f.title LIKE '%%%q%%')", param_query);
+   query_params.filter = db_mprintf("(f.title LIKE '%%%q%%' OR f.title = '%q')", param_query);
     }
   else
     {
@@ -4351,9 +4351,9 @@ search_artists(json_object *reply, struct httpd_request *hreq, const char *param
   if (param_query)
     {
       if (media_kind)
-   query_params.filter = db_mprintf("(f.album_artist LIKE '%%%q%%' AND f.media_kind = %d)", param_query, media_kind);
+   query_params.filter = db_mprintf("((f.album_artist LIKE '%%%q%%' OR f.album_artist = '%q') AND f.media_kind = %d)", param_query, media_kind);
       else
-   query_params.filter = db_mprintf("(f.album_artist LIKE '%%%q%%')", param_query);
+   query_params.filter = db_mprintf("(f.album_artist LIKE '%%%q%%' OR f.album_artist = '%q')", param_query);
     }
   else
     {
@@ -4413,9 +4413,9 @@ search_albums(json_object *reply, struct httpd_request *hreq, const char *param_
   if (param_query)
     {
       if (media_kind)
-   query_params.filter = db_mprintf("(f.album LIKE '%%%q%%' AND f.media_kind = %d)", param_query, media_kind);
+   query_params.filter = db_mprintf("((f.album LIKE '%%%q%%' OR f.album = '%q') AND f.media_kind = %d)", param_query, media_kind);
       else
-   query_params.filter = db_mprintf("(f.album LIKE '%%%q%%')", param_query);
+   query_params.filter = db_mprintf("(f.album LIKE '%%%q%%' OR f.album = '%q')", param_query);
     }
   else
     {
@@ -4475,9 +4475,9 @@ search_composers(json_object *reply, struct httpd_request *hreq, const char *par
   if (param_query)
     {
       if (media_kind)
-   query_params.filter = db_mprintf("(f.composer LIKE '%%%q%%' AND f.media_kind = %d)", param_query, media_kind);
+   query_params.filter = db_mprintf("((f.composer LIKE '%%%q%%' OR f.composer = '%q') AND f.media_kind = %d)", param_query, media_kind);
       else
-   query_params.filter = db_mprintf("(f.composer LIKE '%%%q%%')", param_query);
+   query_params.filter = db_mprintf("(f.composer LIKE '%%%q%%' OR f.composer = '%q')", param_query);
     }
   else
     {
@@ -4532,7 +4532,7 @@ search_playlists(json_object *reply, struct httpd_request *hreq, const char *par

   query_params.type = Q_PL;
   query_params.sort = S_PLAYLIST;
-  query_params.filter = db_mprintf("((f.type = %d OR f.type = %d OR f.type = %d) AND f.title LIKE '%%%q%%')", PL_PLAIN, PL_SMART, PL_RSS, param_query);
+  query_params.filter = db_mprintf("((f.type = %d OR f.type = %d OR f.type = %d) AND (f.title LIKE '%%%q%%' OR f.title = '%q'))", PL_PLAIN, PL_SMART, PL_RSS, param_query);

   ret = fetch_playlists(&query_params, items, &total);
   if (ret < 0)

will, with the sqlext, give @hacketiwack what you're looking for:

$ curl -s 'http://localhost:3689/api/search?type=artists&media_kind=music&query=foo'| jq
{
  "artists": {
    "items": [
      {
        "id": "2284445307090104385",
        "name": "foo",
        "name_sort": "foo",
        "album_count": 4,
        "track_count": 8,
        "length_ms": 150247,
        "time_played": "2022-02-04T15:30:08Z",
        "time_added": "2022-01-29T13:57:03Z",
        "in_progress": false,
        "media_kind": "music",
        "data_kind": "file",
        "uri": "library:artist:2284445307090104385",
        "artwork_url": "./artwork/group/23?v=1643559909"
      },
      {
        "id": "6275129402329078952",
        "name": "foó",
        "name_sort": "foó",
        "album_count": 1,
        "track_count": 1,
        "length_ms": 6034,
        "time_added": "2022-02-05T10:30:41Z",
        "in_progress": false,
        "media_kind": "music",
        "data_kind": "file",
        "uri": "library:artist:6275129402329078952",
        "artwork_url": "./artwork/group/41?v=1644057070"
      }
    ],
    "total": 2,
    "offset": 0,
    "limit": -1
  }
}

hacketiwack commented 2 years ago

Excellent @whatdoineed2do! Thanks

whatdoineed2do commented 2 years ago

The fix to apple remote appl is potentially tricky. The search request comes through bookended witth wildcards: for example the search for Foo arrives at the search as *Foo* and thats how the lexer picks this entire token

daap: DAAP request: '/databases/1/groups?meta=dmap.itemname,dmap.itemid,dmap.persistentid,daap.songartist,daap.songyear,daap.songtracknumber,daap.groupalbumcount,daap.songartistid&type=music&group-type=artists&sort=album&include-sort-headers=1&query=('daap.songartist:*Foo*'+'daap.songartist!:'+('com.apple.itunes.extended-media-kind:1','com.apple.itunes.extended-media-kind:32'))&session-id=1274859047'
daap: DAAP request: '/databases/1/groups?meta=dmap.itemname,dmap.itemid,dmap.persistentid,daap.songartist,daap.songyear,daap.songtracknumber,com.apple.itunes.cloud-id,daap.songartistid,daap.songalbumid,dmap.persistentid,daap.songtime,daap.songdatereleased,daap.songgenre,dmap.downloadstatus&type=music&group-type=albums&sort=album&include-sort-headers=0&query=('daap.songalbum:*Foo*'+'daap.songalbum!:'+('com.apple.itunes.extended-media-kind:1','com.apple.itunes.extended-media-kind:32'))&session-id=1274859047'

@ejurgensen - a simple way to do this is to modifer the lexer that creates the query and stripping the bookends * and adding the extra clause:

diff --git a/src/parsers/daap_parser.y b/src/parsers/daap_parser.y
index 3c9109d0..70fe61c0 100644
--- a/src/parsers/daap_parser.y
+++ b/src/parsers/daap_parser.y
@@ -292,6 +292,8 @@ static void sql_append_dmap_clause(struct daap_result *result, struct ast *a)
   bool is_equal = (a->type == DAAP_T_EQUAL);
   char escape_char;
   char *key;
+  char *wildcard_stripped;
+  int len;

   if (!k || k->type != DAAP_T_KEY || !(key = (char *)k->data))
     {
@@ -335,11 +337,27 @@ static void sql_append_dmap_clause(struct daap_result *result, struct ast *a)
     }
   else if (!dqfm->as_int && v->type == DAAP_T_WILDCARD)
     {
+      sql_append(result, "(");
+      if (is_equal)
+        {
+          len = strlen((char *)v->data);
+          if (len > 2)
+            {
+              // bookended with '*', which we strip
+              wildcard_stripped = strdup(((char *)v->data)+1);
+              wildcard_stripped[len-2] = '\0';
+
+              sql_append(result, "%s = '%s' OR ", dqfm->db_col, wildcard_stripped);
+              free(wildcard_stripped);
+            }
+         }
+
       sql_like_escape((char **)&v->data, &escape_char);
       sql_str_escape((char **)&v->data);
       sql_append(result, "%s", dqfm->db_col);
       sql_append(result, is_equal ? " LIKE " : " NOT LIKE ");
       sql_append(result, "'%s'", (char *)v->data);
+      sql_append(result, ")");
       if (escape_char)
         sql_append(result, " ESCAPE '%c'", escape_char);
       return;

resulting in:

daap: Sorting songlist by album
daap: DAAP browse query filter: ('daap.songartist:*Foo*' 'daap.songartist!:' ('com.apple.itunes.extended-media-kind:1','com.apple.itunes.extended-media-kind:32'))
daap: SQL filter w/client mod: ((f.album_artist = 'Foo' OR f.album_artist LIKE '%Foo%') AND (1 = 1) AND (f.media_kind = 1 OR (1 = 0))) AND (f.data_kind <> 1)

And this brings back the correct result on the IOS app.

Is this acceptable for the DAAP parser? i'm not overly sure how to upd the lex tokens to automatially pull this apart

ejurgensen commented 2 years ago

The proposed solution looks odd to me, I don't see how adding an "OR xxx" is an actual solution. Not sure what is meant by approx equals? At first glance the changes to parser also don't look very nice.

whatdoineed2do commented 2 years ago

The collation change in sqlext appears is not enough to suppport diacritic artist / album /... searches (sqliite only supports collation search critieria using =) and thus need changes in jsonapi and daap lexer to enhance the existing artist / album /... searches which currently use search criteria like

To expand: We add diacritic handling in the existing sqlext DAAP collation. BUT this will only work with queres like this select ... where artist = 'xxx' and query will find the crème brûlée using the text creme brulee

This 'select .... where artist like '%creme% collate DAAP will not find crème brûlée even with the diacritic update to sqlext. Tested this on sqlite cmd line after loading the DAAP sql extension.

To get around this limitation (can't do LIKE) make the existing search for album, artist query from: select album from files where album like '%xxx%' to select album from files where (album = 'xxx' OR album like '%xxx%')

The = will take care of the diacritics with approx equals (sorry, bad phrasing): allowing xxx = creme brulee find crème brûlée.

The lexer mod tries to take care of search strings coming through as *xxx* and tries to use the xxxin the search as above; its v possible that the lexer can be enhanced to give another rules to handle this stripping of wildcards but my yacc/lex is limited.

hacketiwack commented 2 years ago

Thanks for the additional explanations @whatdoineed2do. I'm not a C/C++ programmer (too rusty) and therefore my view on the subject might be wrong: does the code need to go through linting? Was it what rubbed you the wrong way @ejurgensen?

ejurgensen commented 2 years ago

allowing xxx = creme brulee find crème brûlée

This hints at what is wrong, because xxx = creme won't find anything. The point of this kind of a search is to get substring matches. That's why LIKE is used, and why just adding "OR xxx =" isn't a solution.

Since collate doesn't work for LIKE, I think a proper solution would be to create a custom operator. Alternatively, something like unaccent(f.title) LIKE unaccent("%pattern%") might also work, but it is less extensible and not quite so pretty. Then there is also performance to think about.

ejurgensen commented 2 years ago

I did some testing of the unicode LIKE operator in sqlean, and it seems pretty good. It also seems to handle "æ" LIKE "Æ" and even "o" LIKE "Ø". I will check what it would take to use it in OwnTone. Perhaps it is also possible to replace the current DAAP collation with sqlean's, which apparently is dependency free.

hacketiwack commented 2 years ago

It looks very promising! Looking forward to it!

ejurgensen commented 2 years ago

Good news and bad news. The good: sqlean's unicode LIKE implementation seems to work, and you can try it in this branch. The bad: I have mixed feelings about including this sqlean code. It's a peculiar mix of very advanced stuff and basic programming errors, some of which are easily identified just be turning on compiler warnings. It looks like it is quite old, perhaps its a result of going through multiple maintainers. I did report the errors, but so far no response. The errors are patched in the branch I linked to.

Also, their COLLATE isn't attractive in my view - for one, it sorts numbers before letters.

Writing our own LIKE seems a bit risky, so the best would be to find an existing, battle-hardened implementation based on libunistring. My Googling hasn't uncovered that yet, but perhaps you would have better luck.

ejurgensen commented 2 years ago

Looks like sqlean is partially copied from sqlite's icu, which could explain why it is as it is. Using icu would add a new dependency, which I don't like, so instead I think I will try to convert the current dependency to a libunistring dependency. Hopefully that isn't too difficult or risky.

It seems like going this route means that LIKE lookups won't be able to use indexes. Obviously bad for performance, but on the other hand, I don't imagine the current wildcard lookups have great performance either, so maybe it will be ok.

hacketiwack commented 2 years ago

Thanks @ejurgensen. Yes, it seems that there's no straightforward way to do it. I will investigate as well from my side if there are different options.

ejurgensen commented 2 years ago

I've made a libunistring-based LIKE implementation in this new branch, but alas, it has poor performance. It is 4-5x slower than both the current LIKE and the sqlean version. It makes the search noticeably slower on my RPi. The culprit is libunistring's normalization and case folding (u32_casefold).

I'll see if I can find some way of speeding it up.

chme commented 2 years ago

Instead of doing the normalization on every comparison, would it not make sense to do it during the scan (on insert/update)?

We could add a column for the normalized value (or maybe we can reuse the existing sort fields?), during the scan store the normalized value in the db. Before doing a search, we then would have to normalize the search parameter and do the LIKE statement against the new column with the normalized value.

There is also a sqlite extension for full text search: https://www.sqlite.org/fts5.html I have zero experience with it, but it looks like it can handle diacritics.

ejurgensen commented 2 years ago

Unfortunately, reusing the existing sort columns wouldn't work, because a search for e.g. "test 2" would fail, since it is stored as something like "test 00002". So for stored normalization and case folding, it would be necessary to add quite a few mirror columns + additional logic on when to use these columns, and I think that is a considerable downside.

Except for libunistring, it seems normalizing on the fly isn't that expensive. Both sqlean and libicu are on par with vanilla LIKE. Need to test some more, however.

It's also a nice quality of on-the-fly that a regular select title from files where title like "%xxx%" yields a sensible result.

fts5 certainly looks very interesting. As I understand it would add the capability to make fast combined searches (e.g. beatles hard days), which especially for the web UI would be nice. It does look a bit daunting, though :-)

ejurgensen commented 2 years ago

Looks like using fts5's functions for case folding and diacritics removal is a winner. I copied the functions into sqlext.c and then called those instead of the slow libunistring call, and now it is plenty fast. It's in the db_unicode_search1 branch. I will test it a bit more, but so far it is promising.

Now to listen to some of the French music I realize I have after testing diacritics.

hacketiwack commented 2 years ago

That's absolutely awesome. I need to build this branch to test it myself. Thanks a lot.

ejurgensen commented 2 years ago

I've merged it now. Thanks for bringing this forward.

hacketiwack commented 2 years ago

I'm looking forward to using this feature! Thanks a lot!

aleszczynskig commented 2 years ago

Me too. Good work all finding a solution. Finally I can search for my ørganek album and it may hit.

owntone / owntone-server

Search of items containing diacritics #1390