safe-refuge / safeway-data

Data mining tools for the Safeway app

Ignore duplicate points #45

Open littlepea opened 2 years ago

littlepea commented 2 years ago

If a new point has the same name as an existing one and lies less than 100 m from it, it will be rejected as a duplicate.

littlepea commented 2 years ago

We can use this function to calculate the distance: https://stackoverflow.com/a/44743104
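The linked answer is not reproduced here. As an independent sketch, the haversine formula is one common way to get the distance in metres between two latitude/longitude pairs. The function name `haversine_m` and the constant `EARTH_RADIUS_M` are illustrative and not taken from the repository.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Two points 0.0005 degrees of latitude apart are closer than 100 m,
# so with equal names they would count as duplicates under the proposed rule.
is_close = haversine_m(50.0, 30.0, 50.0005, 30.0) < 100
```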

So the flow would be:

  1. Group points by name (slugified) - O(n)
  2. For each group with more than one point (possible duplicates), calculate the pairwise distances between the points in the group - O(k^2) for a group of k points
  3. Within each pair of duplicates, remove the point with the least complete information

I suggest we add PointDeduplicator as the last step before writing the CSV.
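A minimal sketch of how such a `PointDeduplicator` might look, under several assumptions: the `Point` fields, the `slugify` and `completeness` helpers, and the `deduplicate` method are all hypothetical, and "completeness" is taken to mean the number of non-empty fields. The distance here is a flat-earth (equirectangular) approximation, which should be adequate at the 100 m scale; any other distance function could be swapped in.

```python
import math
import re
from collections import defaultdict
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Point:  # hypothetical shape; the real point model may differ
    name: str
    lat: float
    lon: float
    address: Optional[str] = None
    phone: Optional[str] = None


def slugify(name: str) -> str:
    """Lowercase the name and collapse non-alphanumeric runs into '-'."""
    return re.sub(r"[\W_]+", "-", name.lower()).strip("-")


def distance_m(a: Point, b: Point) -> float:
    """Equirectangular approximation of the distance in metres."""
    x = math.radians(b.lon - a.lon) * math.cos(math.radians((a.lat + b.lat) / 2))
    y = math.radians(b.lat - a.lat)
    return 6_371_000 * math.hypot(x, y)


def completeness(p: Point) -> int:
    """Number of fields that carry a value (assumed completeness measure)."""
    return sum(1 for f in fields(p) if getattr(p, f.name) not in (None, ""))


class PointDeduplicator:
    def __init__(self, max_distance_m: float = 100.0):
        self.max_distance_m = max_distance_m

    def deduplicate(self, points: List[Point]) -> List[Point]:
        # Step 1: group by slugified name, O(n).
        groups = defaultdict(list)
        for p in points:
            groups[slugify(p.name)].append(p)

        kept = []
        for group in groups.values():
            # Steps 2 and 3: visit the most complete points first and keep a
            # point only if it is at least max_distance_m away from every
            # point already kept in its group, O(k^2) per group.
            survivors = []
            for candidate in sorted(group, key=completeness, reverse=True):
                if all(distance_m(candidate, s) >= self.max_distance_m
                       for s in survivors):
                    survivors.append(candidate)
            kept.extend(survivors)
        return kept
```

Note that the output is ordered by name group rather than by original position, so a caller that cares about row order in the CSV would need to re-sort.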

moorchegue commented 2 years ago

Perhaps points could have a status field with the values new, published, possible_duplicate, and confirmed_duplicate? That way the automatic flow would mark suspicious points, and a curator would make the final decision. Most of the duplicate examples I've looked at are tough: we might have to sacrifice either the completeness of the data (if we get rid of some false positives) or at least its accuracy (the correct name, phone, or address is hard to determine).
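As a rough sketch of that idea, the four statuses could be an enum, with an automatic pass that only flags points and a separate curator step that resolves them. Everything here except the four status names is an assumption: the `Point` shape, the `flag_suspects` and `resolve` helpers, and in particular the choice to move a point cleared by the curator straight to published.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class PointStatus(str, Enum):
    NEW = "new"
    PUBLISHED = "published"
    POSSIBLE_DUPLICATE = "possible_duplicate"
    CONFIRMED_DUPLICATE = "confirmed_duplicate"


@dataclass
class Point:  # hypothetical; only the fields needed for the example
    name: str
    status: PointStatus = PointStatus.NEW


def flag_suspects(points: List[Point],
                  is_suspect: Callable[[Point], bool]) -> List[Point]:
    """Automatic pass: mark suspicious new points, never delete them."""
    for p in points:
        if p.status is PointStatus.NEW and is_suspect(p):
            p.status = PointStatus.POSSIBLE_DUPLICATE
    return points


def resolve(point: Point, is_duplicate: bool) -> Point:
    """Curator's final decision on a flagged point."""
    if point.status is not PointStatus.POSSIBLE_DUPLICATE:
        raise ValueError("only flagged points need a decision")
    # Assumption: a point cleared by the curator is published directly.
    point.status = (PointStatus.CONFIRMED_DUPLICATE if is_duplicate
                    else PointStatus.PUBLISHED)
    return point
```

Because nothing is deleted, the original data stays intact and the duplicate rule can be tightened or loosened later without re-importing.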

littlepea commented 2 years ago

What about just applying the existing rules of the app backend? (same name + within 100m distance)

moorchegue commented 2 years ago

You mean the backend would filter the data from the CSV files? Yeah, that sounds interesting. That way we would preserve the original data and get rid of duplicates without the need for manual curation, which could be added later if needed.