The built-in fuzzy matcher appears to be broken if a query contains a space but the entity name doesn't

Snowysauce commented 1 month ago

See #10033 for background details. This is my understanding of the situation based on the report there and further testing in iD on a blank node in the Philippines:

Desired entity	Query	Result
PizzaExpress	PizzaExpress	preset appears in iD
PizzaExpress	Pizza Express	preset does not appear
Pizza Hut	Pizza Hut	preset appears
Pizza Hut	PizzaHut	preset appears

Since we have the built-in fuzzy matcher that is intended to handle these slight variations in text, any attempts to add these variations as matchNames (e.g., "Pizza Express" and "PizzaHut") are removed by the build script, making workarounds impossible.

Snowysauce commented 1 month ago

Further testing: the fuzzy matcher is working as expected for existing objects (e.g., a new node manually set with amenity=restaurant and name=Pizza Express in iD will immediately return a suggestion for the "PizzaExpress" preset). This may end up being an issue on the consumer end, not ours (namely the search functions of iD, StreetComplete, etc.)

bhousel commented 1 month ago

iD's preset matcher is totally different code from the NSI matcher.
I wouldn't expect it to do anything special with whitespace in the query string - if people want it to do that, it would be an issue for iD?

Snowysauce commented 1 month ago

I wouldn't expect it to do anything special with whitespace in the query string - if people want it to do that, it would be an issue for iD?

It seems that way from my perspective. The OP of the issue came here from StreetComplete, so that app appears to behave the same way as iD. In any event, after the tests I've done, I agree that it looks like our code is working fine and that the issue found by OP lies with the products consuming the NSI data.

peternewman commented 1 month ago

Am I right in thinking (or do you know @bhousel ) is it the case that in iD it uses the generic "terms" search of the preset for normal operation, but it seems to use NSI's matcher and simplify.js when offering to upgrade the tags "Convenience Store 711 looks like a common feature with nonstandard tags"?

Given this preset output:

    "advertising/totem/7eleven-445a61": {
      "name": "7-Eleven",
      "locationSet": {"include": ["001"], "exclude": ["pl"]},
      "geometry": ["point", "line"],
      "matchScore": 2,
      "imageURL": "https://graph.facebook.com/7ElevenMexico/picture?type=large",
      "terms": ["7-11", "7-eleven", "seven eleven"],
      "tags": {"advertising": "totem", "brand:wikidata": "Q259340"},
      "addTags": {
        "advertising": "totem",
        "brand": "7-Eleven",
        "brand:wikidata": "Q259340"
      }
    },

I'd agree it would be possible for both iD, StreetComplete and any other consumers of the presets to do something around whitespace processing (at the very naïve end, just store a version of each term with all whitespace, and possibly other non alphanumeric, characters removed, then do the same to your search term and see if they match), but given iD at least generates results after two characters, I suspect it would just generate a lot more noise (e.g. an chor..., the atre..., I'm sure there are some much better real world examples).

Whereas to my mind it would make a lot more sense if NSI added these (for a few specific exceptions such as this), given it has the domain knowledge (otherwise you really want to split the thing based on "words", but that's hard when a brand isn't necessarily a real word). It can ignore them for its fuzzy matching, but basic search consumers would find them as terms without needing any clever and infallible logic their end. The raw NSI data could have a separate field so matchNames remains clean, and say matchSpaceNames is used rarely. I note that "7 11" also fails as a search term, but likewise works for upgrade and doesn't feel like a huge leap.

In terms of PizzaExpress itself, I'd argue its particularly worthy given they seem rather confused about their identity, whilst their website text religiously uses PizzaExpress as one word, their logo/branding is as two lines: Pizza Express

Which I think is why I keep getting tripped up by this particular issue.

If not, is it right to say that node-diacritics and https://github.com/osmlab/name-suggestion-index/blob/main/lib/simplify.js is all the pre-processing that needs to be done to match an alternative version like the upgrade tags does?

Similar example/test (to make a note of it as much as anything), which I'd agree probably should be pushed to the end consumers of the data given its fairly trivial is "Punjab & Sind Bank" in India works as a preset, but "Punjab and Sind Bank", both work for upgrading tags.

bhousel commented 1 month ago

Am I right in thinking (or do you know @bhousel ) is it the case that in iD it uses the generic "terms" search of the preset for normal operation, but it seems to use NSI's matcher and simplify.js when offering to upgrade the tags "Convenience Store 711 looks like a common feature with nonstandard tags"?

Yes, this is correct - iD's preset code uses the terms, name, tags and other criteria to match a preset to an OSM feature.

iD's validation code uses the NSI matcher, which really only looks at tags.

This sort of makes sense because the preset picker needs to work in whatever language the user speaks, but the validators should just work directly with OSM tags.

osmlab / name-suggestion-index

The built-in fuzzy matcher appears to be broken if a query contains a space but the entity name doesn't #10040