openchargemap / ocm-data

Snapshots of current Open Charge Map data [deprecated]
https://openchargemap.org

Non unique POI ids #7

Open Minishlink opened 3 years ago

Minishlink commented 3 years ago

Hello,

It seems the POI IDs/UUIDs are not unique in the dataset. I see several instances of POI 155399, for example, and you can see it in the API results too:

https://api.openchargemap.io/v3/poi/?output=json&chargepointid=155399&maxresults=200

How do you choose which POI to display? Do you take the first one you encounter, or is it based on more elaborate filtering?

Thanks

Minishlink commented 3 years ago

Also noticed that ~4% of the POIs have duplicate coordinates (after filtering out duplicate IDs).

webprofusion-chrisc commented 3 years ago

Thanks. Duplicate IDs should be impossible, so that's a bug in the API results. There is only one instance of 155399 in the database, and that's enforced by a unique constraint.

Duplicate coordinates are possible or even likely, but if you are using the raw data set they may not all have a published status. Some imports contain 50% duplicates, and we then merge these into POIs. We are relying less on imports nowadays, though.

webprofusion-chrisc commented 3 years ago

Yes, the duplicate IDs were a caching failure. Our API servers sync independently from the master API, and one of the API servers appears to have started serving bad results. The caches have now been reset.

Minishlink commented 3 years ago

Thanks! Did you update the export with the caching fix?

With the latest export as of now, I have the following data (after filtering out non-live POIs, POIs that have no coordinates, and POIs that have AddressCleaningRequired):

As for the coordinates, after filtering these duplicated IDs, I have:

It seems to me that these are low-precision coordinates, which may explain the duplicates. I'll look into it more tomorrow.

Minishlink commented 3 years ago

Here is a more precise sample list of IDs with duplicated coordinates:

[ 3344, 3485 ], [ 3345, 3486 ], [ 3346, 3487 ], [ 3347, 3488 ], [ 3348, 3489 ]
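For anyone wanting to reproduce a list like this from the export, here is a minimal sketch, not the method actually used in this thread. It assumes each POI is a dict with an `ID` field and an `AddressInfo` object holding `Latitude` and `Longitude`; the function name `group_by_coordinates` and the `decimals` rounding parameter are hypothetical.

```python
from collections import defaultdict


def group_by_coordinates(pois, decimals=5):
    """Return lists of POI IDs that share the same (rounded) coordinates."""
    groups = defaultdict(list)
    for poi in pois:
        # Assumed layout: coordinates live under "AddressInfo".
        address = poi.get("AddressInfo") or {}
        lat = address.get("Latitude")
        lon = address.get("Longitude")
        if lat is None or lon is None:
            continue  # skip POIs without coordinates
        key = (round(lat, decimals), round(lon, decimals))
        groups[key].append(poi["ID"])
    # Keep only coordinates used by more than one POI.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Lowering `decimals` widens the match, which can help check whether low-precision coordinates are what causes POIs to collide.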

webprofusion-chrisc commented 3 years ago

Thanks, I think you've found a bug in our caching system - the exported data comes from one of the caches. The duplicates should be entirely identical, but if not, one would have a greater DateLastStatusUpdate. I'll get this fixed soon, thanks for finding the problem!
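Until the export is fixed, a consumer of the data could apply the rule described above on their side. The following is a rough sketch under assumptions, not OCM's own code: it assumes each POI is a dict with an `ID` and an ISO 8601 `DateLastStatusUpdate` string present on every record, and the helper names `parse_timestamp` and `dedupe_by_id` are hypothetical.

```python
from datetime import datetime


def parse_timestamp(value):
    # Assumes an ISO 8601 string; a trailing "Z" is rewritten so that
    # datetime.fromisoformat also accepts it on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def dedupe_by_id(pois):
    """Keep one POI per ID, preferring the greatest DateLastStatusUpdate."""
    latest = {}
    for poi in pois:
        current = latest.get(poi["ID"])
        if current is None or (
            parse_timestamp(poi["DateLastStatusUpdate"])
            > parse_timestamp(current["DateLastStatusUpdate"])
        ):
            latest[poi["ID"]] = poi
    return list(latest.values())
```

When the duplicates are entirely identical, the first one encountered is kept, so the result is the same either way.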

webprofusion-chrisc commented 3 years ago

Regarding the duplicate coordinates with low OCM IDs, they are old data from 10 years ago. Ultimately, if nobody (users) wants to clean up the data, remove duplicate positions, etc., then it just doesn't get cleaned up. Over the years we have developed deduplication techniques for use during imports, but that data was from long before that.