realm / realm-dart

Realm is a mobile database: a replacement for SQLite & ORMs.
Apache License 2.0

Korean and Japanese characters - queries / full text search #1377

Closed dotjon0 closed 10 months ago

dotjon0 commented 1 year ago

What happened?

We have questions regarding Korean and Japanese languages support around Realm queries / full text search if we may:

Question 1: We have tested Japanese characters in a standard Realm query and this works fine, so Korean and Japanese characters appear to be supported. Can you confirm whether Korean and Japanese characters are officially supported (the documentation does not cover this), or is this a loophole?

Question 2: Does Realm full text search support Korean and Japanese characters? The documentation at https://www.mongodb.com/docs/realm/sdk/flutter/realm-database/read-and-write-data/#filter-with-full-text-search states "Tokens can only consist of characters from ASCII and the Latin-1 supplement (western languages). All other characters are considered whitespace. Words split by a hyphen (-) like full-text are split into two tokens." This suggests Korean and Japanese characters are treated as whitespace, which would mean Korean and Japanese characters are not supported. Is this analysis correct? (We have not yet tested Korean and Japanese in full text search.)

Question 3: If Realm Full Text Search (in question 2 above) does not support Korean and Japanese characters, are there plans to support this?

Question 4: Which languages are officially supported for Realm (a) queries and (b) full text search? And which languages work but are not officially supported (or not yet)?

Many thanks as always!

Repro steps

n/a

Version

1.3

What Atlas Services are you using?

Atlas Device Sync

What type of application is this?

Flutter Application

Client OS and version

13.4.1 macOS

nirinchev commented 1 year ago

Realm strings (and queries by extension) use UTF-8, which is why Japanese and Korean characters are supported. When doing equality matches, strings are treated as sequences of bytes, so if you're looking for an exact match, we just compare the byte sequences. It's more complicated with capitalization and accents, but you're lucky that these just don't exist in Japanese and Korean.
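As a rough illustration of what "compare the byte sequences" implies, here is a minimal sketch in plain Python (not the Realm API; the helper names `bytes_equal` and `bytes_contains` are made up for this example). It also shows one caveat that follows if matching really is purely byte-wise: the same Korean syllable can be encoded in precomposed or decomposed Unicode form, and the two forms only compare equal after normalization. Whether Realm normalizes strings is not stated in this thread, so treat that part as an assumption to test.

```python
import unicodedata


def bytes_equal(a: str, b: str) -> bool:
    """Exact match modeled as a comparison of UTF-8 byte sequences."""
    return a.encode("utf-8") == b.encode("utf-8")


def bytes_contains(haystack: str, needle: str) -> bool:
    """Substring match modeled as a search over UTF-8 bytes."""
    return needle.encode("utf-8") in haystack.encode("utf-8")


# Japanese and Korean text matches byte-for-byte like any other UTF-8 text.
same = bytes_equal("東京", "東京")
found = bytes_contains("東京タワー", "タワー")

# Caveat (assumption: no normalization happens before comparison):
# the precomposed syllable and its decomposed jamo sequence render the
# same but are different byte sequences.
precomposed = "한"
decomposed = unicodedata.normalize("NFD", precomposed)
raw_match = bytes_equal(precomposed, decomposed)
normalized_match = bytes_equal(
    unicodedata.normalize("NFC", precomposed),
    unicodedata.normalize("NFC", decomposed),
)
```

If the byte-wise model holds, normalizing text to one form (for example NFC) before writing and before querying would keep the two sides consistent.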

Case insensitive queries as well as FTS queries are only supported for ASCII and Latin-1 supplement characters. This means that case insensitive and FTS queries won't produce expected results for languages with letters outside these character ranges. So no, Japanese and Korean are not supported with FTS unfortunately.
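To make the tokenization rule concrete, here is a small sketch in plain Python that models the documented behavior quoted in the question ("Tokens can only consist of characters from ASCII and the Latin-1 supplement ... All other characters are considered whitespace"). It is not Realm's tokenizer: the function name `tokenize` is hypothetical, and treating only letters and digits as token characters is an assumption.

```python
def tokenize(text: str) -> list[str]:
    """Split text into lowercase tokens made of ASCII / Latin-1 letters and digits.

    Assumption: every other character (including hyphens and all CJK
    characters) acts as a separator, as the quoted documentation suggests.
    """
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        if ord(ch) <= 0xFF and ch.isalnum():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


latin_tokens = tokenize("full-text search")
japanese_tokens = tokenize("日本語の検索")
mixed_tokens = tokenize("Realm 検索 query")
```

Under this model a purely Japanese or Korean string produces no tokens at all, so there is nothing for a full text query to match against.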

Finally, there are no short- or medium-term plans to extend FTS support to other languages unless we're talking about a commercial arrangement that needs to go through your company's account executive. Adding support for additional languages is quite time consuming and will have a non-negligible impact on the binary size, so we need to have a pretty solid business case to invest in this.

If you don't need to run offline queries on your local data, you could use Atlas Search and run queries on Atlas data and use the http-based MongoDB client. I have an example of how to do this in C# here: https://github.com/realm/realm-dotnet-samples/tree/ni/atlas-search/AtlasSearch. Note that this does not go through Realm/Sync at all, so the data you're fetching is a static snapshot and needs to be re-fetched to refresh it. Unfortunately, you won't be able to directly port that to flutter because we don't have a Remote MongoDB Client for Flutter - adding it is tracked in https://github.com/realm/realm-dart/issues/789 among others. That one is on our short-/medium-term plans though, so if using Atlas Search is a valid option for you, then we can take a look at more concrete timelines for when you want to go live vs when we expect that work to land.

dotjon0 commented 10 months ago

I am so sorry @nirinchev I did not come back on this! I remember reading it! So thank you very very much for your feedback and will see what we can do given your very useful input! Thanks again!

Remote MongoDB Client for Flutter sounds interesting, what approx timescales are we talking or is it too early to say?

nirinchev commented 10 months ago

It's too early right now, but if having it would be critical for your launch, definitely bring it up with your AE and let them know when you'd need it and they'll work with product and eng to prioritize it against other work.

dotjon0 commented 10 months ago

Thanks @nirinchev we have raised with our AE, CS and product.

deverlex commented 10 months ago

So languages other than US/UK English aren't supported. Sigh.

nirinchev commented 10 months ago

I'm going to close this as there are no immediate plans to add more language support and the Atlas Search support work is tracked in #789.

dotjon0 commented 10 months ago

@nirinchev Device Sync is marketed as "Apps that work on the go — online and offline" https://www.mongodb.com/atlas/app-services/device-sync - there are lots of huge gaps with the RQL and FTS of the Realm Flutter/Dart SDK, this being one of them... Atlas Search just does not work 'offline' so the 'app will not work on the go offline'. Just to feedback to the Realm team that it is very disappointing yet another RQL and FTS related feature is not planned in the short or long term, along with #1407 and #1421

nirinchev commented 10 months ago

As explained in the other issue, we don't feel there's sufficient demand for offline FTS functionality that would justify the increased library size and maintenance complexity of the Realm SDK. And as suggested, running FTS queries using Lucene or other 3rd party library outside of Realm would be the recommended workaround.
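For a sense of what searching outside of Realm could look like for Japanese or Korean text, here is a minimal sketch of a character-bigram inverted index in plain Python. Bigram indexing is a common approach for CJK text (Lucene's CJK analysis uses it as well), but everything here is illustrative: the function names, the in-memory `docs` dictionary and the sample strings are assumptions, and a real app would load the strings from its own Realm objects and persist the index.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of overlapping 2-character slices, ignoring whitespace."""
    compact = "".join(text.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each bigram to the ids of the documents that contain it."""
    index: dict[str, set[int]] = {}
    for doc_id, text in docs.items():
        for gram in bigrams(text):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index: dict[str, set[int]], docs: dict[int, str], query: str) -> list[int]:
    """Find documents containing the query as a substring (whitespace ignored)."""
    compact_query = "".join(query.split())
    if not compact_query:
        return []
    grams = bigrams(compact_query)
    if grams:
        # Candidates must contain every bigram of the query.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Single-character query: no bigrams, fall back to scanning everything.
        candidates = set(docs)
    # Confirm candidates, since sharing bigrams does not guarantee adjacency.
    return sorted(d for d in candidates if compact_query in "".join(docs[d].split()))


docs = {
    1: "東京タワーの近くのレストラン",
    2: "서울에서 가장 맛있는 식당",
    3: "京都のレストラン",
}
index = build_index(docs)
```

The index only narrows down candidates; the final substring check is what decides a match, so the index can be rebuilt or dropped without changing results.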

dotjon0 commented 10 months ago

@nirinchev completely get re FTS, but can it not be supported via RQL as we could make this work?

Is it possible to have a list of languages that are supported by RQL and FTS, including case-insensitive and case-sensitive parameters? It would be helpful to understand which languages / regions the Realm Flutter/Dart SDK is suitable for.

re: "FTS queries using Lucene or other 3rd party library outside of Realm would be the recommended workaround" - would you kindly be able to give any pointers to any docs, examples, etc so we have some form of direction / starting point re FTS?

FYI we've also raised this ticket with our AE.

nirinchev commented 10 months ago

Not sure what you mean by "can it not be supported via RQL". Japanese/Korean are supported in RQL as strings in Realm are UTF-8 encoded and so are the queries. Since neither of these languages has letter case or diacritics, the regular contains/like queries would work just fine as I mentioned in my first reply in this issue. Perhaps I'm misunderstanding something - can you clarify what query you're trying to run against Japanese strings and what error/unexpected behavior you're getting there?

dotjon0 commented 10 months ago

Thanks @nirinchev - sorry for the confusion, perhaps my question was not clear: out of the 43 languages supported by MongoDB Atlas Search at https://www.mongodb.com/docs/atlas/atlas-search/analyzers/language/, which are supported via (a) RQL, and (b) FTS, within the Realm Flutter/Dart SDK please (including case-insensitive and case-sensitive)? So this question is not just limited to Korean and Japanese languages/characters. You also mentioned "more complicated with capitalization and accents" and "diacritics" in parsing, which may suggest (?) that certain languages are not supported by Realm RQL, thus our question above.

nirinchev commented 10 months ago

Diacritic-insensitive queries are not supported in any language, although the FTS tokenization process will remove diacritics from characters in the ASCII and Latin-1 supplement range, so for example the diacritic will be removed from é turning it into e, but it won't be removed from ğ (the Turkish yumuşak ge). So, to give you an example:

| Text | Search term | Operator | Match |
| --- | --- | --- | --- |
| allé | alle | TEXT | true |
| düğün | dugun | TEXT | false |
| allé | alle | CONTAINS | false |
| düğün | dugun | CONTAINS | false |
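The rows above can be reproduced with a small model in plain Python. This is only a sketch of the described behavior, not Realm's implementation: the helper names are hypothetical, input is assumed to be in precomposed (NFC) form, and "diacritic removal" is approximated with Unicode decomposition restricted to code points up to U+00FF.

```python
import unicodedata


def strip_latin1_diacritics(text: str) -> str:
    """Remove diacritics only from characters in the ASCII / Latin-1 range."""
    out = []
    for ch in text:
        if ord(ch) <= 0xFF:
            # Decompose (e.g. é -> e + combining acute) and drop the marks.
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
        else:
            # Characters such as ğ (U+011F) are outside the range and kept as-is.
            out.append(ch)
    return "".join(out)


def text_match(stored_word: str, term: str) -> bool:
    """Model of a TEXT match on a single word: compare diacritic-stripped forms."""
    return strip_latin1_diacritics(stored_word) == strip_latin1_diacritics(term)


def contains_match(stored: str, term: str) -> bool:
    """Model of CONTAINS: plain substring match with no diacritic handling."""
    return term in stored
```

With this model, `allé` folds to `alle` and matches, while `düğün` folds only to `duğun`, which still differs from `dugun`.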

Similarly, case insensitive queries are only going to ignore the casing of letters in the ASCII and Latin-1 supplement character range:

| Text | Search term | Operator | Match |
| --- | --- | --- | --- |
| Hello | hello | TEXT | true |
| Здравей | здравей | TEXT | false |
| Hello | hello | CONTAINS[c] | true |
| Здравей | здравей | CONTAINS[c] | false |
| Здравей | Здравей | CONTAINS[c] | true |

Case-sensitive queries are supported for all UTF-8 characters. So it's not really language-based, but rather character-based, and it's also not precise to talk about "supported" vs "not supported" languages since all characters can be stored and queried. It's only when it comes to ignoring the character case that we have limitations about the characters where this actually happens.
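The character-based limitation can be modeled in a few lines of plain Python. This is a sketch of the behavior described above rather than Realm's code; `fold_case_latin1` and `contains_ci` are hypothetical names, and the cut-off at U+00FF is taken from the "ASCII and Latin-1 supplement" wording.

```python
def fold_case_latin1(text: str) -> str:
    """Lowercase only characters in the ASCII / Latin-1 supplement range."""
    return "".join(ch.lower() if ord(ch) <= 0xFF else ch for ch in text)


def contains_ci(stored: str, term: str) -> bool:
    """Model of CONTAINS[c]: substring match after the limited case folding."""
    return fold_case_latin1(term) in fold_case_latin1(stored)


latin_hit = contains_ci("Hello", "hello")
cyrillic_miss = contains_ci("Здравей", "здравей")
cyrillic_exact = contains_ci("Здравей", "Здравей")
```

Cyrillic letters pass through the folding step unchanged, so a query only matches when its casing is identical to the stored text. One possible app-side workaround, if needed, is to store an additional pre-lowercased copy of the field and query that with a case-sensitive operator.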

dotjon0 commented 10 months ago

Thank you very much for your in-depth explanation @nirinchev, very useful to understand the challenge in front of us.

So effectively the missing parts to get RQL and FTS compatible with all characters from the 43 languages Atlas Search supports and with case-insensitive and case-sensitive support are listed below (i.e. so there is a basic RQL & FTS search function in Realm that both support all common languages globally regardless of case):

Is this right? I hope I've understood correctly; is anything else missing? We just want to be really clear about what is needed so we can communicate this effectively in a meeting we have with your senior management team shortly. If it's possible to get rough estimates of the time required for the above, that would be very useful, although if it needs fully scoping to get this info, of course do not worry for now.

Perhaps the above has not been raised before as a feature request / bug because it is assumed that support for this is already in place (as in our case...), and then it only comes up when an end user reports an issue with search... You do kind of assume, especially with FTS, that this is just in place out of the box... Essentially any tech company whose customers have end-users globally will hit this problem with Realm Flutter/Dart SDK RQL/FTS, so we would imagine it's a common roadblock which may go unnoticed for a while...

A further question if we may, re 'FTS' in the Realm Flutter/Dart SDK: how does RQL differ from FTS? Are you automatically replicating data to a 2nd 'FTS search' Realm local database under the covers which you are then using for FTS (where all the data has diacritics removed from characters)?

nirinchev commented 10 months ago

We don't have estimates for the time required to add additional language support to our query engine. Note that this is not just a matter of doing the work - while in and of itself it's not trivial, it's a problem that has been addressed in multiple products already, so it's not an unknown area. The main issue is related to the binary size of the Realm SDK. Since the Core database is a native module, there are no mechanisms right now for the Flutter compiler to strip unused code from it. This means that every user gets the complete library, regardless of whether they use 1 feature or 100. Since app size negatively correlates with download rates, we, as a library provider, need to be extra careful about the size of our SDK to ensure our customers don't unnecessarily lose business due to features they don't use.

Regarding how FTS works, Realm builds an index on the property that you define and uses it when executing FTS queries against the dataset. The index is stored alongside the data in the same Realm file.

dotjon0 commented 10 months ago

@nirinchev thank you for further insight.

The below lists whether or not 'app size negatively correlates with download rates' for a range of use cases (without any data backing this) - this is purely based on our experience, apart from 'B2B mobile':

So effectively by not offering the 'enhanced RQL/FTS' to decrease app size, this is only actually applicable for 50% of the scenarios above. Likely blocking most B2B SaaS vendors, especially B2B enterprise SaaS vendors, from even considering Realm...

Perhaps one of the routes below would solve this and cater for all use cases above:

It is probably worth bearing in mind that localisation into multiple languages is essential for B2B applications here.

To add another angle on the B2C side: in the context of ecommerce, 54% of consumers prefer slower delivery options in order to save carbon emissions (see here). Will this big movement of consumers demanding sustainability extend to consumers preferring more sustainable B2C applications with 'larger app sizes' (with enhanced RQL/FTS support)? Enhanced RQL/FTS would avoid 'double fetching data' from Device Sync and Atlas Search, i.e. it would make local search realistic in many cases. So the question is whether 'app size negatively correlates with download rates' will continue to hold for B2C apps; we will find out in time, I guess! Of course, on the B2B front sustainability is a very hot topic, and we would imagine the data volumes/transfers in B2B apps are typically far larger than in B2C apps.

It seems we are touching on various use cases, and the impact of both UX and sustainability.

Thanks re insight on how FTS works.