Open 1ec5 opened 1 month ago
@1ec5 Thanks for the detailed bug report.
Do you happen to know the behavior of the length expression with MapLibre GL JS?
Not as bad, but still suboptimal and very different than the native implementation: maplibre/maplibre-style-spec#778.
String expression operators operate on bytes rather than characters, with unexpected results for any input that isn’t pure ASCII. The style specification unhelpfully only mentions lengths and indices but doesn’t define them in terms of anything. Even so, a typical developer is very likely to assume it counts the number of characters. A byte count would be the least expected value.
Example
Create a new iOS project with the following view controller code:
Expected and actual behavior
The label should indicate the length of the name constant in characters:
name
A
ñ
丐
𦨭
🇺🇳
丐𦨭市镇
Here’s what the map looks like when name is 丐𦨭市镇:
Impact
This would be especially surprising to an iOS/macOS developer, since Objective-C and Swift are both very opinionated about how strings are stored and measured:
NSString.length (Objective-C)
String.count (Swift)
A
ñ
丐
𦨭
🇺🇳
丐𦨭市镇
The gold standard is to count graphemes, as in Swift, but counting UTF-16 code units would at least be a little more reasonable and consistent with GL JS.
Platform information
Diagnosis
As detailed in https://github.com/maplibre/maplibre-gl-js/pull/4550#issuecomment-2290904561, MVT-compliant tiles encode strings as UTF-8. Each implementation is free to store the string however it pleases; apparently mbgl is storing it as an std::string (aka std::basic_string<char>). It isn’t necessarily a problem that mbgl stores the string as raw bytes, but this implementation detail should not be exposed to the developer. Unfortunately, the length operator simply calls size() on the raw byte string:
https://github.com/maplibre/maplibre-native/blob/ac606a1af2632d531cfb6121427b34785d1056e6/src/mbgl/style/expression/length.cpp#L16
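To illustrate the gap between size() and a character count, here is a minimal sketch that counts Unicode code points in a UTF-8 std::string. The function name codePointCount is hypothetical and not part of mbgl, and the sketch assumes the input is well-formed UTF-8. It counts code points, not graphemes, so it is only one possible definition of "length".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of mbgl.
// Counts Unicode code points in a UTF-8 byte string by counting every byte
// that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is well-formed UTF-8; nothing is validated.
std::size_t codePointCount(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For 丐𦨭市镇 (UTF-8 bytes E4 B8 90, F0 A6 A8 AD, E5 B8 82, E9 95 87), size() returns 13 while codePointCount returns 4. The flag 🇺🇳 is two code points, so it would still count as 2 here rather than 1.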
Some other string operators also appear to operate on raw bytes, even expecting a raw byte offset as input:
https://github.com/maplibre/maplibre-native/blob/ac606a1af2632d531cfb6121427b34785d1056e6/src/mbgl/style/expression/index_of.cpp#L80
https://github.com/maplibre/maplibre-native/blob/ac606a1af2632d531cfb6121427b34785d1056e6/src/mbgl/style/expression/slice.cpp#L92
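To show why a raw byte offset is a hazard for slicing, here is a rough sketch that maps a code point index to a byte offset before calling substr(). Both function names are hypothetical and not taken from mbgl. Indexing by code point is an assumption made for illustration (GL JS would count UTF-16 code units instead), and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of mbgl.
// Returns the byte offset where the code point at position `index` begins,
// or utf8.size() if the string holds `index` code points or fewer.
// Assumption: well-formed UTF-8.
std::size_t byteOffsetOfCodePoint(const std::string& utf8, std::size_t index) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80) {  // ASCII or lead byte starts a code point
            if (seen == index) {
                return i;
            }
            ++seen;
        }
    }
    return utf8.size();
}

// Hypothetical helper: slices the half-open range [start, end),
// with both bounds measured in code points rather than bytes.
std::string sliceByCodePoints(const std::string& utf8,
                              std::size_t start, std::size_t end) {
    const std::size_t from = byteOffsetOfCodePoint(utf8, start);
    const std::size_t to = byteOffsetOfCodePoint(utf8, end);
    if (to <= from) {
        return std::string();
    }
    return utf8.substr(from, to - from);
}
```

With name set to 丐𦨭市镇, sliceByCodePoints(name, 1, 2) yields the four bytes of 𦨭, whereas name.substr(1, 1) yields the single byte B8, which is the middle of 丐 and not valid UTF-8 on its own.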
At least, std::string should be replaced by a multibyte container such as std::u8string or std::u16string to handle extremely common cases like accented Latin text and Arabic text. But really this implementation should be using ICU, which is already a dependency or available from the platform on every supported platform, as far as I can tell.
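As a rough sketch of the std::u16string option, the following decodes UTF-8 into UTF-16 code units by hand, so that size() on the result would line up with what a UTF-16 based implementation reports. The function name toUtf16 is hypothetical, and the sketch assumes well-formed UTF-8 with no error handling. It does not attempt grapheme segmentation, which is where a library like ICU would come in.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of mbgl.
// Decodes UTF-8 into UTF-16 code units.
// Assumption: the input is well-formed UTF-8; malformed input is not detected.
std::u16string toUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t codePoint = 0;
        std::size_t length = 1;
        // The lead byte determines the sequence length and its payload bits.
        if (lead < 0x80)      { codePoint = lead;        length = 1; }
        else if (lead < 0xE0) { codePoint = lead & 0x1F; length = 2; }
        else if (lead < 0xF0) { codePoint = lead & 0x0F; length = 3; }
        else                  { codePoint = lead & 0x07; length = 4; }
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < length && i + k < utf8.size(); ++k) {
            const unsigned char next = static_cast<unsigned char>(utf8[i + k]);
            codePoint = (codePoint << 6) | (next & 0x3F);
        }
        i += length;
        if (codePoint < 0x10000) {
            out.push_back(static_cast<char16_t>(codePoint));
        } else {
            // Code points above the BMP become a surrogate pair.
            codePoint -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (codePoint >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (codePoint & 0x3FF)));
        }
    }
    return out;
}
```

For 丐𦨭市镇, toUtf16(name).size() is 5, because 𦨭 (U+26A2D) needs the surrogate pair D85A DE2D. That is still not the grapheme count of 4, and 🇺🇳 comes out as 4 code units.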