Remove variant names from index

orangejulius commented 3 years ago

Who's on First variant names are a useful collection of unofficial names for places, but they tend to be pretty messy. This PR explores the effect of removing them from indexing.

While there might be occasionally useful names in there, it seems like the majority are exact or near duplicates of more official names, or names that are so colloquial that they are not particularly useful (do we really need to support returning NYC for queries for "the big apple"?).

Here are some variant names for some key places, just to record the kind of data that's in there:

NYC:

Bigapple
NY City
NY Cty
New York City
New York Cty
Newyork
Newyorkcity
Novaiorque
Nycity
Thebigapple
Big Apple

San Francisco:

S Francisco
S. Francisco
SFO
Sanfran
Sanfrancisco
Frisco

China:

China - Peoples Republic
China Peoples Rep
China, People's Republic
Chinese
PR China
PR of China
People's Republic of China
Peoples Republic of China

missinglink commented 3 years ago

I'm :+1: for this, just interested to see if it causes any acceptance test failures before merging.

missinglink commented 3 years ago

If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.

I don't remember the history of this but it's likely the convention was copied across to other parts of the codebase.

missinglink commented 3 years ago

I thought things like Ville-Lumière, Frisco, Big Apple were supposed to be filed under 'colloquial' 🤷‍♂️

It's seems 'variant' is where toponyms go to die 😆

orangejulius commented 3 years ago

If this is successful we should open corresponding issues on placeholder/parser/spatial to ensure they follow suit.

Agreed, I was just thinking about that. The impact of all those extra names is probably even higher for Placeholder since it considers matches across the entire parent hierarchy.

orangejulius commented 3 years ago

Okay, the results from this are in and they look pretty good. There is a decent increase to the overall score of our autocomplete acceptance tests, and a big increase in some other test cases like top_us_cities and us_states. There's almost no difference to test suites that look at addresses, which is expected for a change to the WOF importer.

As far as I can see there are almost no significant regressions from this change. A few individual autocomplete characters here and there, but nothing that looks like a trend.

If I had to summarize, overall it looks like removing variant names has three main positive effects

Removing useless variant names allows desired results to score higher (the extra variant names mean Elasticsearch considers the name field to be longer, and thus gives a lower score)
Desired records that were previously erronously deduplicated due to variant names that happened to mach will now be displayed (the variant name could be either on the undesired record that remained or on the desired record that was removed)
Records that aren't desired are often removed from results, because the variant name that was causing it to be shown in results is now no longer included

Here's some cases that show off one or more of these.

New York, New York

This is a query that has often been tough to get right since there are several results we want near the top, and lots of chances for duplicates or undesirable records to sneak in.

The autocomplete results don't really tell the whole story, here's the results from the query before/after:

The test says that both New York city and county should appear in the results. I'd argue We should add New York State to that list. But in any case the removal of variant names mean that the desired results for the WOF city and county records score higher than before. This boosts them above the East New York locality and the New York City result from Geonames (hopefully we can remove that one completely via deduplication someday).

I think this also shows that when we really fix our scoring in https://github.com/pelias/pelias/issues/862, we'll see even more and better results like this.

Missouri

A common trend in city and state results is fixing issues where a record simply wasn't ever displayed because it would be deduplicated. For example, the state of Missouri would essentially never come up in results because it would be deduped with Missouri Township, MO, which has Missouri in its list of variant names.

Our deduplication code currently prefers more granular results (for example, locality over county or region) in these cases. We might want to make that a little bit more strict with something like https://github.com/pelias/api/issues/1557. A region and a locality with wildly different populations should probably not be considered duplicates if we can avoid that causing issues with places like Berlin.

There's still some deduplication related issues here that we should look at, many of them can be fixed with data updates.

Various cities now showing up earlier in autocomplete

Individually these are all not necessarily amazing changes, but I noticed a decent trend of cities showing up one or two characters earlier in results. When we're talking queries that are only 2-4 characters, that's actually a big deal!

Final notes

I was expecting a big of a decrease in index size for this change, since there are a reasonable number of variant names out there. But it turned out to only be about 5MB. I suppose there might be a slight performance increase because fewer documents will match any given query, but I'm expecting it to not be something we can notice. We should just be able to go by the various improvements and feel confident merging this :)

pelias / whosonfirst