pelias / api

HTTP API for Pelias Geocoder
http://pelias.io
MIT License

Deduplication issue with transit stops #1460

Open bboure opened 4 years ago

bboure commented 4 years ago

Hi there,

I encountered an issue with the dedupe strategy. Autocomplete/search for "Manneken Pis" does not return the little peeing guy I am expecting. Instead, only the bus stop is returned.

After some debugging, I found that the two records are considered duplicates.

Then, the bus stop is preferred because it has a zipcode.

This issue can possibly occur in many places, as bus/metro/train stops often have the name of a nearby famous venue.

Suggestion: should we add a dedupe rule, maybe on category and/or addendum? Venues with different categories should not be considered duplicates. However, this could let real duplicates through, since venues in OSM are often duplicated and do not necessarily have the same tags.

Alternatively, another solution could also involve popularity. In this case, the statue has a higher popularity than the bus stop. But if you are actually looking for the bus stop, then this does not work either.
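To illustrate the popularity idea, here is a minimal sketch. It assumes each record may carry a numeric `popularity` field; the function name `preferByPopularity` and the sample values are made up for this example and are not existing Pelias code.

```javascript
// Hypothetical tie-breaker between two records already flagged as duplicates.
// Assumption: `popularity` is an optional number; missing values count as 0.
function preferByPopularity(a, b) {
  const popA = Number.isFinite(a.popularity) ? a.popularity : 0;
  const popB = Number.isFinite(b.popularity) ? b.popularity : 0;
  // Return the more popular record; keep `a` on a tie.
  return popB > popA ? b : a;
}

// Illustrative records only; the popularity values are invented.
const statue = { name: 'Manneken Pis', popularity: 5000 };
const busStop = { name: 'Manneken Pis', popularity: 10 };

const kept = preferByPopularity(busStop, statue);
```

With these sample values the statue is kept, which shows the limitation mentioned above: someone searching for the bus stop would never see it.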

Any ideas?

Thanks

https://pelias.github.io/compare/#/v1/autocomplete?focus.point.lat=50.843183&focus.point.lon=4.371755&text=Manneken+pis&debug=0

orangejulius commented 4 years ago

Hi @bboure, thanks for another well-researched and well-described issue. I have actually seen the exact same behavior, specifically affecting transit stops. Transit is a case where returning the transit stop itself, not merely another record with the same name, even if very nearby, is important, so we should probably fix this.

I agree with you that the best solution is probably to not consider records duplicates if they have different category values.

Deduplicating based on addendum data is an interesting idea. I could see it leading to a lot of noise, but also being useful, especially for custom data brought in through the csv-importer.

If you want to add logic to consider records with different category values ineligible for deduplication, I think we would gladly accept that PR.

@missinglink any thoughts here?

missinglink commented 4 years ago

Yeah agreed with what you both said, some thoughts..

orangejulius commented 4 years ago

Dang, couldn't have said it any better, each of those points is spot on

bboure commented 4 years ago

Thank you both for your feedback. I'll open a PR.

Question: when should we consider two records duplicates? Should the categories be completely equal (arrays of the same length with the same content), or should we consider records duplicates if any of their categories match?

And what if one of the venues has no categories at all?
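To make the options concrete, here is a rough sketch of both comparisons. It assumes `category` is an optional array of strings on each record; all function names and the sample category values are hypothetical and not existing Pelias code. The handling of the empty case is just one possible answer, not a decision.

```javascript
// Normalise a record's categories to an array (assumption: optional field).
function categoriesOf(record) {
  return Array.isArray(record.category) ? record.category : [];
}

// Option 1 (strict): both records have the same set of categories,
// regardless of order.
function sameCategories(a, b) {
  const setA = new Set(categoriesOf(a));
  const setB = new Set(categoriesOf(b));
  if (setA.size !== setB.size) {
    return false;
  }
  return [...setA].every((c) => setB.has(c));
}

// Option 2 (lenient): the records share at least one category.
function sharesCategory(a, b) {
  const setB = new Set(categoriesOf(b));
  return categoriesOf(a).some((c) => setB.has(c));
}

// One possible treatment of the empty case: if either record has no
// categories there is nothing to compare, so categories do not block dedupe.
function categoriesAllowDedupe(a, b, compare = sharesCategory) {
  if (categoriesOf(a).length === 0 || categoriesOf(b).length === 0) {
    return true;
  }
  return compare(a, b);
}

// Illustrative records only; the category values are invented.
const statue = { name: 'Manneken Pis', category: ['entertainment'] };
const busStop = { name: 'Manneken Pis', category: ['transport', 'transport:bus'] };
const untagged = { name: 'Manneken Pis' };
```

Here `categoriesAllowDedupe(statue, busStop)` is false under either comparison, so both records would survive, while `categoriesAllowDedupe(statue, untagged)` is true and leaves the decision to the other dedupe rules.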

bboure commented 4 years ago

@orangejulius @missinglink I had some additional thoughts about something that may be a bit out of scope here, but related.

When two records are considered the same and both come from OSM, should the isPreferred() function prefer ways over relations, and relations over nodes?

I have seen some places duplicated in osm, and generally one of the duplicates is an old node that was replaced by a relation or a way.
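As a sketch of what that ordering could look like: this assumes OSM records carry a `source_id` such as `way/123`, where the prefix is the OSM element type. The names `OSM_TYPE_RANK`, `osmTypeRank` and `isPreferredOsmType` are hypothetical, and this is not how isPreferred() is currently written.

```javascript
// Hypothetical ranking: ways over relations, relations over nodes.
const OSM_TYPE_RANK = new Map([
  ['way', 3],
  ['relation', 2],
  ['node', 1],
]);

// Extract the element type from an assumed "type/id" source_id.
// Unknown or missing types get rank 0.
function osmTypeRank(record) {
  const type = String(record.source_id || '').split('/')[0];
  return OSM_TYPE_RANK.get(type) || 0;
}

// True when `candidate` should replace `existing` based on element type alone.
function isPreferredOsmType(existing, candidate) {
  return osmTypeRank(candidate) > osmTypeRank(existing);
}

// Illustrative records only; the ids are invented.
const oldNode = { source: 'openstreetmap', source_id: 'node/1' };
const newWay = { source: 'openstreetmap', source_id: 'way/2' };
```

With these records, `isPreferredOsmType(oldNode, newWay)` is true and the reverse is false, so the old node would lose to the way that replaced it.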