Closed orangejulius closed 3 years ago
I'm :+1: for this, just interested to see if it causes any acceptance test failures before merging.
If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.
I don't remember the history of this but it's likely the convention was copied across to other parts of the codebase.
I thought things like `Ville-Lumière`, `Frisco`, `Big Apple` were supposed to be filed under 'colloquial' 🤷♂️
It seems 'variant' is where toponyms go to die 😆
> If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.
Agreed, I was just thinking about that. The impact of all those extra names is probably even higher for Placeholder since it considers matches across the entire parent hierarchy.
Okay, the results from this are in and they look pretty good. There is a decent increase to the overall score of our autocomplete acceptance tests, and a big increase in some other test cases like `top_us_cities` and `us_states`. There's almost no difference to test suites that look at addresses, which is expected for a change to the WOF importer.
As far as I can see there are almost no significant regressions from this change. A few individual autocomplete characters here and there, but nothing that looks like a trend.
If I had to summarize, overall it looks like removing variant names has three main positive effects
`name` field to be longer, and thus gives a lower score)
Here are some cases that show off one or more of these.
This is a query that has often been tough to get right since there are several results we want near the top, and lots of chances for duplicates or undesirable records to sneak in.
The autocomplete results don't really tell the whole story; here are the results from the query before/after:
The test says that both New York city and county should appear in the results. I'd argue we should add New York State to that list. But in any case the removal of variant names means that the desired results for the WOF city and county records score higher than before. This boosts them above the `East New York` locality and the `New York City` result from Geonames (hopefully we can remove that one completely via deduplication someday).
I think this also shows that when we really fix our scoring in https://github.com/pelias/pelias/issues/862, we'll see even more and better results like this.
A common trend in city and state results is fixing issues where a record simply wasn't ever displayed because it would be deduplicated. For example, the state of Missouri would essentially never come up in results because it would be deduped with Missouri Township, MO, which has `Missouri` in its list of variant names.
Our deduplication code currently prefers more granular results (for example, `locality` over `county` or `region`) in these cases. We might want to make that a little bit more strict with something like https://github.com/pelias/api/issues/1557. A `region` and a `locality` with wildly different populations should probably not be considered duplicates if we can avoid that causing issues with places like Berlin.
There are still some deduplication-related issues here that we should look at; many of them can be fixed with data updates.
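To make the Missouri case concrete, here's a rough sketch of the behaviour described above. It is not the actual Pelias API deduplication code: the `Place` class, the `GRANULARITY` ranking, the `max_population_ratio` parameter and the population figures are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical ranking of layers: a higher number means a more granular layer.
GRANULARITY = {"region": 1, "county": 2, "locality": 3}


@dataclass
class Place:
    name: str
    layer: str
    population: int
    variants: frozenset = frozenset()

    def all_names(self):
        # Every name the record can match on, lowercased for comparison.
        return {self.name.lower()} | {v.lower() for v in self.variants}


def is_duplicate(a, b, max_population_ratio=None):
    # Two records are duplicate candidates when they share any name.
    if not (a.all_names() & b.all_names()):
        return False
    # Optional stricter rule (an assumption, in the spirit of the linked issue):
    # records from different layers with wildly different populations are
    # treated as distinct places. Unknown (zero) populations skip the check.
    if max_population_ratio is not None and a.layer != b.layer:
        small, large = sorted((a.population, b.population))
        if small > 0 and large / small > max_population_ratio:
            return False
    return True


def dedupe(places, max_population_ratio=None):
    kept = []
    for place in places:
        for i, existing in enumerate(kept):
            if is_duplicate(place, existing, max_population_ratio):
                # Prefer the more granular record of the pair.
                if GRANULARITY[place.layer] > GRANULARITY[existing.layer]:
                    kept[i] = place
                break
        else:
            kept.append(place)
    return kept


# Illustrative figures only, not real population data.
state = Place("Missouri", "region", 6_000_000)
township = Place("Missouri Township", "locality", 1_000, frozenset({"Missouri"}))
township_no_variants = Place("Missouri Township", "locality", 1_000)

with_variants = dedupe([state, township])
without_variants = dedupe([state, township_no_variants])
with_ratio_guard = dedupe([state, township], max_population_ratio=10)
```

With the variant name present the state is swallowed by the township. Dropping the variant, or adding the population guard, keeps both. A place like Berlin, where the region and locality have the same name and similar populations, would still be collapsed to the locality under either rule.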
Individually these are all not necessarily amazing changes, but I noticed a decent trend of cities showing up one or two characters earlier in results. When we're talking queries that are only 2-4 characters, that's actually a big deal!
I was expecting a bit of a decrease in index size for this change, since there are a reasonable number of variant names out there. But it turned out to only be about 5MB. I suppose there might be a slight performance increase because fewer documents will match any given query, but I'm expecting it to not be something we can notice. We should just be able to go by the various improvements and feel confident merging this :)
Who's on First variant names are a useful collection of unofficial names for places, but they tend to be pretty messy. This PR explores the effect of removing them from indexing.
While there might be occasionally useful names in there, it seems like the majority are exact or near duplicates of more official names, or names that are so colloquial that they are not particularly useful (do we really need to support returning NYC for queries for "the big apple"?).
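For anyone unfamiliar with how these names are stored, here's a minimal sketch of the idea. It is not the importer code from this PR. It assumes WOF name properties use keys shaped like `name:eng_x_preferred` or `name:eng_x_variant`; the `indexable_names` helper and the sample record are hypothetical.

```python
import re

# Assumed key shape: "name:<language>_x_<kind>", e.g. "name:eng_x_variant".
NAME_KEY = re.compile(r"^name:(?P<lang>.+)_x_(?P<kind>[a-z]+)$")


def indexable_names(properties, skip_kinds=("variant",)):
    """Return only the name properties whose kind is not in skip_kinds."""
    names = {}
    for key, values in properties.items():
        match = NAME_KEY.match(key)
        if match is None:
            continue  # not a name property at all
        if match.group("kind") in skip_kinds:
            continue  # e.g. drop every *_x_variant list
        names[key] = list(values)
    return names


# Made-up sample record, only to show the shape of the data.
sample = {
    "wof:name": "New York",
    "name:eng_x_preferred": ["New York"],
    "name:eng_x_variant": ["Big Apple", "New York City"],
    "name:fra_x_preferred": ["New York"],
}

kept = indexable_names(sample)
```

Passing a different `skip_kinds` tuple would make it easy to experiment with dropping other kinds of names the same way.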
Here are some variant names for some key places, just to record the kind of data that's in there:
NYC:
San Francisco:
China: