osm-fr / osmose-backend

Part of Osmose that runs the analysis and sends the results to the frontend.
GNU General Public License v3.0

Proposal for enhanced duplicate address verification #2089

Open polish96 opened 12 months ago

polish96 commented 12 months ago

I would like to propose an enhancement to the "item 1010 - duplicated node" check. Currently, this check does not detect duplicated addresses, that is, cases where two nodes or ways share the same building number, street name, and locality name.

I suggest extending the "item 1010 - duplicated node" check so that it also compares building numbers, street names, and locality names. This would let Osmose flag nodes or ways with duplicated addresses, making it a more comprehensive verification tool for the OSM community.

In the event that the proposed enhancement cannot be seamlessly integrated into the existing "item 1010 - duplicated node" check, an alternative solution could involve creating a new validation rule specifically designed to address the identified issue. This new rule could be assigned a distinct item number for clear reference and tracking purposes.

Famlam commented 12 months ago

Could you give example data? It's not fully clear to me what you mean.

polish96 commented 12 months ago

> Could you give example data? It's not fully clear to me what you mean.

For example: https://www.openstreetmap.org/node/3987932537/ is a duplicate of https://www.openstreetmap.org/way/315587859/ and https://www.openstreetmap.org/node/3987932526 is a duplicate of https://www.openstreetmap.org/node/3987932530.

The algorithm should check:

  1. Whether "addr:city" is the same in both elements.
  2. Whether "addr:street" is the same in both elements.
  3. Whether "addr:housenumber" is the same in both elements (the comparison should be case-sensitive, so building numbers like 31A and 31a are considered different).
  4. Whether both elements are within a distance of no more than 5 km from each other (it's possible that two cities with the same name, street name, and building number are located next to each other).

If all the above conditions are met, the system should signal that the address is likely duplicated and needs to be checked and corrected.
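The four conditions above can be sketched in Python as follows. This is only an illustration under stated assumptions, not Osmose code: `Element`, `haversine_m`, and `likely_duplicate` are hypothetical names, every element is reduced to a single coordinate (a way would need a representative point such as its centroid), and the 5 km default is simply the value proposed in this comment.

```python
from dataclasses import dataclass, field
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in meters
ADDRESS_KEYS = ("addr:city", "addr:street", "addr:housenumber")


@dataclass
class Element:
    """Hypothetical stand-in for an OSM node or way, reduced to one point."""
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)


def haversine_m(a, b):
    """Great-circle distance between two elements, in meters."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def likely_duplicate(a, b, max_distance_m=5000):
    """Return True when both elements carry the same address and lie close together."""
    for key in ADDRESS_KEYS:
        value = a.tags.get(key)
        # Exact, case-sensitive comparison: "31A" and "31a" are different.
        # An element missing any of the three tags never matches.
        if value is None or value != b.tags.get(key):
            return False
    return haversine_m(a, b) <= max_distance_m
```

Comparing every pair of elements this way is quadratic in the number of addresses, so a real check would need some form of pre-grouping or spatial indexing.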

Famlam commented 12 months ago

Ok, I understand. Addresses are always difficult, as for example two different shops/offices may exist with the same address (and buildings can contain multiple addresses). So if we implement it, we should probably limit it to "bare" addresses only, e.g. no other tags besides addr:* tags (and possibly selected tags like source etc). I'm unfortunately not too familiar with all the different address-tagging conventions worldwide...
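As a rough sketch of what a "bare" address filter could look like in Python: the function name `is_bare_address` and the `ALLOWED_EXTRA_KEYS` set are hypothetical, and `source` is the only extra key included because it is the only one named here. Which other tags should be tolerated is an open question.

```python
# Assumption: only "source" is tolerated next to addr:* tags.
ALLOWED_EXTRA_KEYS = {"source"}


def is_bare_address(tags):
    """Return True if the element carries addr:* tags and nothing else of note."""
    has_address = any(key.startswith("addr:") for key in tags)
    only_allowed = all(
        key.startswith("addr:") or key in ALLOWED_EXTRA_KEYS for key in tags
    )
    return has_address and only_allowed
```

With this predicate, a node tagged only with `addr:street` and `addr:housenumber` passes, while the same node with an additional `shop` tag is skipped.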

> Whether both elements are within a distance of no more than 5 km from each other

It's not impossible that two cities with the same street name have that street close to the shared city border, so we'd have to limit it to e.g. nodes within building perimeters or so, maybe with a buffer of a couple of meters, definitely not km ;) . (Also for performance reasons)
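To illustrate the meter-scale idea, here is a simplified Python sketch. It swaps the building-perimeter test for a plain point-to-point distance threshold, which is a deliberate simplification, and it groups elements by address first so that only elements sharing an address are compared. The names `approx_distance_m` and `find_close_duplicates`, the input tuple layout, and the 5 m default buffer are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations
from math import cos, hypot, radians

METERS_PER_DEGREE = 111_320  # approximate length of one degree of latitude


def approx_distance_m(p, q):
    """Equirectangular approximation, adequate for distances of a few meters."""
    (lat1, lon1), (lat2, lon2) = p, q
    dy = (lat2 - lat1) * METERS_PER_DEGREE
    dx = (lon2 - lon1) * METERS_PER_DEGREE * cos(radians((lat1 + lat2) / 2))
    return hypot(dx, dy)


def find_close_duplicates(addresses, buffer_m=5.0):
    """addresses: iterable of (element_id, (lat, lon), tags) tuples.

    Returns pairs of element ids that share city, street, and house number
    and lie within buffer_m meters of each other.
    """
    groups = defaultdict(list)
    for element_id, point, tags in addresses:
        key = (tags.get("addr:city"), tags.get("addr:street"),
               tags.get("addr:housenumber"))
        if None not in key:  # skip elements with an incomplete address
            groups[key].append((element_id, point))

    pairs = []
    for members in groups.values():
        # Only elements with an identical address are ever compared.
        for (id_a, p_a), (id_b, p_b) in combinations(members, 2):
            if approx_distance_m(p_a, p_b) <= buffer_m:
                pairs.append((id_a, id_b))
    return pairs
```

Grouping by address keeps the pairwise comparison confined to small buckets, which addresses the performance concern to some extent. A real analyser would more likely rely on a spatial index or on database-side geometry functions.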

polish96 commented 11 months ago

> So if we implement it, we should probably limit it to "bare" addresses only, e.g. no other tags besides addr:* tags (and possibly selected tags like source etc).

I agree. I think we should limit it to addresses without additional tags.

frodrigo commented 11 months ago

Note, there are already multiple checks here: https://github.com/osm-fr/osmose-backend/blob/dev/analysers/analyser_osmosis_relation_associatedStreet.py#L615

polish96 commented 11 months ago

> Note, there are already multiple checks here: https://github.com/osm-fr/osmose-backend/blob/dev/analysers/analyser_osmosis_relation_associatedStreet.py#L615

If you mean "item 2060 - street numbers", I am aware that it exists. Unfortunately, I have looked through all the classes belonging to item 2060 and could not find one that detects duplicated addresses.