mapbox / vtquery

Query some gosh darn vector tiles
BSD 2-Clause "Simplified" License

dedupe is too strong #65

Open mapsam opened 6 years ago

mapsam commented 6 years ago

Deduplication of features is too strong right now. Consider two buildings (polygons) that are two unique features but have few properties (or no properties) and no ID. For all intents and purposes these should be treated as two unique features to avoid removing important data.

Perhaps it's best to only dedupe based on IDs for now, while we think about other ways to best dedupe with properties.
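As a rough illustration of the ID-only idea, here is a minimal Python sketch. It is not vtquery's actual implementation (the library itself is not written in Python); the `dedupe_by_id` name and the dictionary shape of a feature are assumptions made for this example, and it ignores the question of whether IDs should be scoped per layer.

```python
def dedupe_by_id(features):
    """Drop a feature only when an earlier feature carries the same non-null ID.

    Features without an ID are always kept, even if their properties match.
    """
    seen_ids = set()
    kept = []
    for feature in features:
        feature_id = feature.get("id")
        if feature_id is None:
            # No ID: we cannot prove it is a duplicate, so keep it.
            kept.append(feature)
        elif feature_id not in seen_ids:
            seen_ids.add(feature_id)
            kept.append(feature)
    return kept


# Hypothetical query results: two ID-less buildings and one park seen twice.
results = [
    {"id": None, "properties": {"type": "building"}},
    {"id": None, "properties": {"type": "building"}},
    {"id": 7, "properties": {"type": "park"}},
    {"id": 7, "properties": {"type": "park"}},
]
deduped = dedupe_by_id(results)
```

With this approach both buildings survive, while the park that appears twice with the same ID collapses to a single result.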

cc @flippmoke

mapsam commented 6 years ago

Just a few extra examples of where deduping is working and not working.

Two buildings with the same properties, but only the closest shows up in the results. They likely don't have IDs, so we are using the properties to dedupe.

vtquery-diff-features

Two parks/baseball pitches that have the same exact properties but are deemed as different features. Likely due to their IDs being unique.

vtquery-same-features

This makes me think we should only compare properties of features across tiles, not features and properties in the same tile. This still doesn't handle the situation where two distinct buildings in different tiles have the same properties and would be considered duplicates, though. It would continue to solve tile boundary duplicates.
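To make the cross-tile idea concrete, here is a small Python sketch under stated assumptions: each feature is a dictionary with a hypothetical `tile` key holding a `(z, x, y)` tuple, the input is already sorted by distance, and `dedupe_across_tiles` is an illustrative name rather than anything in vtquery.

```python
def dedupe_across_tiles(features):
    """Treat two features as duplicates only if they come from different
    tiles and have identical properties. Features from the same tile are
    never merged. Input order is preserved, so the first (closest) wins.
    """
    kept = []
    for feature in features:
        is_duplicate = any(
            other["tile"] != feature["tile"]
            and other["properties"] == feature["properties"]
            for other in kept
        )
        if not is_duplicate:
            kept.append(feature)
    return kept


# Hypothetical results: two look-alike buildings in one tile, plus a park
# that straddles a tile boundary and therefore shows up in two tiles.
results = [
    {"tile": (14, 1, 1), "properties": {"type": "building"}},
    {"tile": (14, 1, 1), "properties": {"type": "building"}},
    {"tile": (14, 1, 1), "properties": {"type": "park", "name": "A"}},
    {"tile": (14, 2, 1), "properties": {"type": "park", "name": "A"}},
]
deduped = dedupe_across_tiles(results)
```

Both buildings are kept because they share a tile, and the park is reported once. Two genuinely different buildings in neighboring tiles with identical properties would still be merged, which is the limitation noted above.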

flippmoke commented 6 years ago

I am not sure there is a clear answer to the "right" way to do deduping. I think part of this is that it really depends on the type of data that exists:

> Deduplication of features is too strong right now.

If you wanted to find the one closest building in OSM right now, it would be ideal to dedupe. If you wanted to find all the closest buildings, I feel that deduping might not be correct. The problems you have seen with multiple tiles do in fact make the results appear strange, and I think it is something we should heavily consider.

The problem comes down to the wide variety of data that we can have in vector tiles. If you are attempting to find a specific rubber ducky that is closest to you, it can be quite complex. You could have a standard sized rubber ducky that fits quite well into a single tile, and it may be the only rubber ducky around.

image

However, you might also have a jumbo rubber ducky that spreads across multiple tiles and has false edges on it from the other tiles you query. In this case deduping is very good.

image

Additionally, there might be a set of rubber duckies in your tile and you want to know all the rubber duckies in your area. In this case deduping might be too aggressive because it would think all rubber duckies are the same, because their properties are the same.

image

If all our rubber duckies have unique ids on them, then we do want to dedupe:

image

However, if they do not -- then we might be overwhelmed by the number of rubber duckies if we do not enable deduping.

image

Very simply put, it is not always smooth sailing when you are looking for rubber duckies:

image

Therefore, I suggest that we allow users to decide if they want to dedupe or not. We could even set a flag for what type of deduping occurs.
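One way such a flag could look is sketched below in Python. The mode names (`"none"`, `"id"`, `"properties"`) and the `dedupe` function are purely hypothetical; nothing here reflects an option that vtquery actually exposes.

```python
import json


def _id_key(feature):
    # Features without an ID have no key and are therefore never merged.
    return feature.get("id")


def _properties_key(feature):
    # Serialize properties with sorted keys so key order does not matter.
    return json.dumps(feature["properties"], sort_keys=True)


def dedupe(features, mode="none"):
    """Dedupe query results according to a user-chosen (hypothetical) mode."""
    if mode == "none":
        return list(features)
    key_functions = {"id": _id_key, "properties": _properties_key}
    if mode not in key_functions:
        raise ValueError(f"unknown dedupe mode: {mode!r}")
    key_of = key_functions[mode]

    seen = set()
    kept = []
    for feature in features:
        key = key_of(feature)
        if key is None:
            kept.append(feature)
        elif key not in seen:
            seen.add(key)
            kept.append(feature)
    return kept


# Three duckies with identical properties; two of them share an ID.
duckies = [
    {"id": 1, "properties": {"kind": "rubber ducky"}},
    {"id": 1, "properties": {"kind": "rubber ducky"}},
    {"id": 2, "properties": {"kind": "rubber ducky"}},
]
all_duckies = dedupe(duckies, mode="none")
unique_ids = dedupe(duckies, mode="id")
unique_props = dedupe(duckies, mode="properties")
```

The same input yields three, two, or one ducky depending on the mode, which is the kind of choice a user-facing flag would hand to the caller.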

mapsam commented 6 years ago

@flippmoke 🦆 ❤️

Totally agree. There's no perfect solution (unless we start unioning geometries, which isn't out of the question, but is out of scope of this issue).

I like the idea of providing options to the user, and think we can do a good job at keeping it simple in the code base, especially since we have the logic written already.

Examples of dedupe options (not saying we have to implement them all):

Maybe another way of breaking it out is: