Open littlepea opened 2 years ago
I suggest we add a PointDeduplicator as the last step before writing the CSV. We can use this function to calculate the distance between two points: https://stackoverflow.com/a/44743104

So the flow would be: collect the points as we do now, run them through the PointDeduplicator, and only then write the CSV.
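The linked answer computes distance from latitude/longitude coordinates; a standard haversine formula (which may differ in details from the exact snippet there) together with a PointDeduplicator step could look roughly like the sketch below. The class name, the dict-based point shape ("lat"/"lon" keys), and the threshold value are assumptions for illustration, not existing project code:

```python
from math import radians, sin, cos, asin, sqrt


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) pairs, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(a))  # mean Earth radius ~6,371,000 m


class PointDeduplicator:
    """Drops points that fall within `threshold_m` of an already accepted point."""

    def __init__(self, threshold_m=100.0):  # threshold value is an assumption
        self.threshold_m = threshold_m

    def deduplicate(self, points):
        kept = []
        for p in points:  # each point is assumed to be a dict with "lat"/"lon" keys
            if all(
                haversine_m(p["lat"], p["lon"], k["lat"], k["lon"]) >= self.threshold_m
                for k in kept
            ):
                kept.append(p)
        return kept
```

The pipeline would then call deduplicate(points) right before the CSV writer.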
Perhaps points could have a status field with the values new, published, possible_duplicate, and confirmed_duplicate? That way the automatic flow would mark suspicious points, and a curator would make the final decision. Most examples of duplicates I've looked at are tough: we might have to sacrifice either completeness of the data (if we get rid of some false positives) or at least accuracy (the correct name, phone, or address is hard to determine).
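A rough sketch of what that flagging step could look like, assuming dict-based points with a "name" key and a pluggable distance function (the enum, the function name, and the same-name-within-a-threshold heuristic are illustrative, not decided):

```python
from enum import Enum


class Status(Enum):
    NEW = "new"
    PUBLISHED = "published"
    POSSIBLE_DUPLICATE = "possible_duplicate"
    CONFIRMED_DUPLICATE = "confirmed_duplicate"


def mark_possible_duplicates(new_points, published_points, distance_m, threshold_m=100.0):
    """Flag suspicious new points instead of dropping them.

    `distance_m(a, b)` is any callable returning the distance in metres
    between two point dicts (e.g. a wrapper around the haversine sketch above).
    """
    for p in new_points:
        p["status"] = Status.NEW
        for q in published_points:
            same_name = p["name"].strip().lower() == q["name"].strip().lower()
            if same_name and distance_m(p, q) < threshold_m:
                p["status"] = Status.POSSIBLE_DUPLICATE  # curator makes the final call
                break
    return new_points
```

A curator would then review everything left as possible_duplicate and move each point to either published or confirmed_duplicate.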
What about just applying the existing rules of the app backend? (same name + within 100m distance)
You mean the backend would do the filtering of the data from the CSV files? Yeah, that sounds interesting: that way we would preserve the original data and get rid of duplicates without needing manual curation, which could still be added later if needed.
If a new point has the same name as an existing one and is within 100 m of it, it will be rejected as a duplicate.
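For reference, that rule as a standalone check might look roughly like this (the function name, the name normalisation, and the point shape are assumptions, not the actual backend code):

```python
def is_rejected_as_duplicate(new_point, existing_points, distance_m, max_distance_m=100.0):
    """Backend-style rule: same name + within `max_distance_m` metres -> duplicate.

    `distance_m(a, b)` is any callable returning the distance in metres
    between two point dicts (e.g. the haversine sketch further up).
    """
    name = new_point["name"].strip().lower()
    return any(
        name == p["name"].strip().lower()
        and distance_m(new_point, p) < max_distance_m
        for p in existing_points
    )
```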