maplibre / maplibre-native

MapLibre Native - Interactive vector tile maps for iOS, Android and other platforms.
https://maplibre.org
BSD 2-Clause "Simplified" License
1.01k stars 297 forks source link

String expressions operate on bytes instead of characters #2730

Open 1ec5 opened 1 month ago

1ec5 commented 1 month ago

String expression operators operate on bytes rather than characters, with unexpected results for any input that isn’t pure ASCII. The style specification unhelpfully only mentions lengths and indices but doesn’t define them in terms of anything. Even so, a typical developer is very likely to assume it counts the number of characters. A byte count would be the least expected value.

Example

Create a new iOS project with the following view controller code:

import UIKit
import MapLibre

// Set this constant to any non-ASCII string.
let name = "ñ"

class ViewController: UIViewController, MLNMapViewDelegate {
    var mapView: MLNMapView!

    override func viewDidLoad() {
        super.viewDidLoad()

        mapView = MLNMapView(frame: view.bounds)
        mapView.delegate = self
        mapView.autoresizingMask = [.flexibleWidth, .flexibleHeight]
        mapView.styleURL = MLNStyle.predefinedStyle("Streets")?.url
        view.addSubview(mapView)
    }

    func mapView(_ mapView: MLNMapView, didFinishLoading style: MLNStyle) {
        let feature = MLNPointFeature()
        feature.coordinate = CLLocationCoordinate2D(latitude: 0, longitude: 0)
        feature.attributes = [
            "name": name,
        ]
        let source = MLNShapeSource(identifier: "test", shape: feature)
        style.addSource(source)

        let lengthLayer = MLNSymbolStyleLayer(identifier: "length", source: source)
        lengthLayer.text = NSExpression(mglJSONObject: ["to-string", ["length", ["get", "name"]]])
        lengthLayer.textFontNames = NSExpression(forConstantValue: ["Open Sans Semibold"])
        lengthLayer.textFontSize = NSExpression(forConstantValue: 30)
        lengthLayer.textOffset = NSExpression(forConstantValue: CGPoint(x: 0, y: 0))
        style.addLayer(lengthLayer)
    }
}

Expected and actual behavior

The label should indicate the length of the name constant in characters:

name UTF-8 Expected label Actual label
A 41 1 1
ñ C3 B1 1 2
E4 B8 90 1 3
𦨭 F0 A6 A8 AD 1 4
🇺🇳 F0 9F 87 BA F0 9F 87 B3 1 or 2 or 4 8
丐𦨭市镇 E4 B8 90 F0 A6 A8 AD E5 B8 82 E9 95 87 4 13

Here’s what the map looks like when name is 丐𦨭市镇:

The number 13 centered over Null Island, over a map of the world.

Impact

This would be especially surprising to an iOS/macOS developer, since Objective-C and Swift are both very opinionated about how strings are stored and measured:

String UTF-8 NSString.length
(Objective-C)
String.count
(Swift)
A 41 1 1
ñ C3 B1 1 1
E4 B8 90 1 1
𦨭 F0 A6 A8 AD 2 1
🇺🇳 F0 9F 87 BA F0 9F 87 B3 4 1
丐𦨭市镇 E4 B8 90 F0 A6 A8 AD E5 B8 82 E9 95 87 5 4

The gold standard is to count graphemes, as in Swift, but at least counting UTF-16 characters would be a little more reasonable and consistent with GL JS.

Platform information

Diagnosis

As detailed in https://github.com/maplibre/maplibre-gl-js/pull/4550#issuecomment-2290904561, MVT-compliant tiles encode strings as UTF-8. Each implementation is free to store the string however it pleases; apparently mbgl is storing it as an std::string (aka std::basic_string<char>). It isn’t necessarily a problem that mbgl stores the string as raw bytes, but this implementation detail should not be exposed to the developer. Unfortunately, the length operator simply calls size() on the raw byte string:

https://github.com/maplibre/maplibre-native/blob/ac606a1af2632d531cfb6121427b34785d1056e6/src/mbgl/style/expression/length.cpp#L16

Some other string operators also appear to operate on raw bytes, even expecting a raw byte offset as input:

https://github.com/maplibre/maplibre-native/blob/ac606a1af2632d531cfb6121427b34785d1056e6/src/mbgl/style/expression/index_of.cpp#L80 https://github.com/maplibre/maplibre-native/blob/ac606a1af2632d531cfb6121427b34785d1056e6/src/mbgl/style/expression/slice.cpp#L92

At least std::string should be replaced by a multibyte container such as std::u8string or std::u16string to handle extremely common cases like accented Latin text and Arabic text. But really this implementation should be using ICU, which is already a dependency or available from the platform on every supported platform, as far as I can tell.

louwers commented 1 month ago

@1ec5 Thanks for the detailed bug report.

Do you happen to know the behavior of the length expression with MapLibre GL JS?

1ec5 commented 1 month ago

Do you happen to know the behavior of the length expression with MapLibre GL JS?

Not as bad, but still suboptimal and very different than the native implementation: maplibre/maplibre-style-spec#778.