osmcode / osmium-tool

Command line tool for working with OpenStreetMap data based on the Osmium library.
https://osmcode.org/osmium-tool/
GNU General Public License v3.0

Improve the way `linear_tags` and `area_tags` work #264

Closed dagnelies closed 1 year ago

dagnelies commented 1 year ago

First of all, thanks for maintaining this great tool.

The default behavior is to export "closed ways" as both a LineString and a MultiPolygon, which creates a lot of duplication by default. For instance, every little building is a "closed way" and will be present in the export twice, as LineString and MultiPolygon, which inflates the data quite a lot. Although building is the most obvious tag and causes the most duplication, other tags suffer from the same issue: highways, barriers...

After digging in the docs, you may think: "Ok, I can use the linear_tags and area_tags for that." but it doesn't work out. The way they work now is to also act as a "filter".

If you specify:

{
   "linear_tags": ["highway"],
   "area_tags": ["building"]
}

You will lose all other ways/areas not having one of the two tags.

IMHO it would be better to leave the filtering to the include_tags/exclude_tags options only. This would imply altering the behavior of linear_tags/area_tags:

The meaning of true/false/null would remain unaffected.

Just for reference, here is the current docs regarding area handling:

For a closed way (with the last node location the same as the first node location) the tags are checked: If the way has an area=yes tag, an area is created. If the way has an area=no tag, a linestring is created. An area tag with a value other than yes or no is ignored. The configuration settings area_tags and linear_tags can be used to augment the area check. If any of the tags matches the area_tags, an area is created. If any of the tags matches the linear_tags, a linestring is created. If both match, an area and a linestring is created. This is important because some objects have tags that make them both, an area and a linestring.
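As a rough illustration of the documented rule quoted above, here is a minimal Python sketch. It is not osmium's implementation: the function names are made up for this example, only plain `key`, `key=value` and `key!=value` expressions are handled, `true`/`false` are modelled as "match everything"/"match nothing", and `null` is left out entirely.

```python
def tag_matches(expr, tags):
    """Simplified matcher for 'key', 'key=value' or 'key!=value' (assumption)."""
    if "!=" in expr:
        key, value = expr.split("!=", 1)
        # The key must be present and carry a different value.
        return key in tags and tags[key] != value
    if "=" in expr:
        key, value = expr.split("=", 1)
        return tags.get(key) == value
    return expr in tags


def rule_matches(rule, tags):
    """rule is True (match everything), False (match nothing) or a list of expressions."""
    if isinstance(rule, bool):
        return rule
    return any(tag_matches(expr, tags) for expr in rule)


def geometries_for_closed_way(tags, linear_tags=True, area_tags=True):
    """Return the set of geometries the documented rule would create for a closed way."""
    area = tags.get("area")
    if area == "yes":
        return {"area"}
    if area == "no":
        return {"linestring"}
    # Any other value of the area tag is ignored; fall through to the tag lists.
    result = set()
    if rule_matches(area_tags, tags):
        result.add("area")
    if rule_matches(linear_tags, tags):
        result.add("linestring")
    return result


# Defaults (both settings true): a closed building way is created twice.
default_result = geometries_for_closed_way({"building": "yes"})

# Config from this issue: listed tags are de-duplicated, unlisted ones match nothing.
issue_config = {"linear_tags": ["highway"], "area_tags": ["building"]}
building_result = geometries_for_closed_way({"building": "yes"}, **issue_config)
park_result = geometries_for_closed_way({"leisure": "park"}, **issue_config)
```

With the config from this issue, a closed way tagged building=yes yields only an area, while one tagged leisure=park matches neither list and yields nothing, which is the filtering effect described above.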

Also, perhaps adding the building tag to areas would be a sensible default since it makes up a very substantial part of the duplication.

joto commented 1 year ago

The whole linear vs area thing is complex and can't really be solved with simple lists of tags. So Osmium can only solve some rather simple use cases here, if you need something more I suggest using something like osm2pgsql which has a complete configuration language built in which allows you much more freedom.

Because we need to keep backwards compatibility, any kind of change also has to be considered carefully. So if we want to change this at all, it has to be done in a way that old configs will still do the same thing.

dagnelies commented 1 year ago

Indeed, my suggestion would affect backwards compatibility. Therefore, I understand the reluctance ...even though I still find it really meaningful. Both because it's slightly unintuitive that these act as filters too and because they are impractical to use. You cannot meaningfully decide which tag should be what without losing all other unlisted tags. This is quite harsh and makes usage of these options impractical ...I honestly wonder if people use them.

There is currently no way to avoid large data duplication. We are talking about lots of dupe data here, for example ~60% of ways are buildings and duplicated, which is quite a lot. But we cannot de-duplicate them without losing the other tags as a side-effect. :/

For full backwards compatibility, indeed another option would be required.

{
   "linear_tags": ["highway"],
   "area_tags": ["building"],
   "include_unlisted_tags_as_both": true
}

...but that would make usage slightly awkward IMHO.

dagnelies commented 1 year ago

By the way, the side effect of the "breaking change" would be filtering less data than before in the worst case, while the fix would be to simply add the list in the "include_tags" option. (Edit: just tested it, would work as expected)

dagnelies commented 1 year ago

The result would also be more intuitive IMHO since the area_tags/linear_tags would strictly be responsible for how to handle geometry, while include_tags/exclude_tags would strictly be for filtering output. Instead of the current case where the area_tags/linear_tags are implicitly also an include_tags.

It's your call. I just wanted to state my point of view as a user.

joto commented 1 year ago

The include/exclude_tags do a different thing. They do not filter objects, but only those specific tags. They are used for getting rid of tags such as source, which most people don't need and which would otherwise clutter up the output. But they don't prevent the object with those tags from being written out.

What most people probably want is to set area_tags to some list of tags and then set linear_tags to null. This way you get all data as either area or linear with no duplication and nothing filtered out.
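A minimal sketch of what such a config could look like, generated from Python so that the `null` is spelled correctly. The two area keys are placeholders chosen for illustration only; this comment does not prescribe a particular list, and the variable names are made up.

```python
import json

# Placeholder area keys for illustration; pick a list that fits your data.
config = {
    "area_tags": ["building", "landuse"],
    "linear_tags": None,  # Python None is serialised as JSON null
}

# Pretty-printed JSON text that could be saved as a config file.
config_json = json.dumps(config, indent=2)
print(config_json)
```

Saved to a file, the result could then be passed to `osmium export` with its `-c`/`--config` option.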

dagnelies commented 1 year ago

> What most people probably want is to set area_tags to some list of tags and then set linear_tags to null. This way you get all data as either area or linear with no duplication and nothing filtered out.

That would be a possibility ...but it's kind of tricky to find out what the list should be.

dagnelies commented 1 year ago

For the sake of completeness, here is the most common configuration found in various repositories:

"linear_tags":  ["highway", "barrier", "natural=coastline"],
"area_tags":    ["aeroway", "amenity", "building!=no", "landuse", "leisure", "man_made", "natural!=coastline"],

dieterdreist commented 1 year ago

On 4 Apr 2023, at 12:26, Arnaud Dagnelies @.***> wrote:

> For the sake of completeness, here is the most common configuration found in various repositories:

These are keys, not tags. Doing it like this is an oversimplification that will not work well, e.g. highway can describe a linear feature like a road or an area like a highway service area. The same is true for other keys.

dagnelies commented 1 year ago

I do not claim it is ideal/perfect. It is merely the result of a repository-wide GitHub search for such configurations, showing what people use in practice right now. It is indeed a very rough approximation.

That said, even if it is an approximation to avoid duplicated data, I see "misinterpretation" of areas / ways as the smaller issue. The bigger issue IMHO is that, the way people currently use it, it simply removes all ways where none of the keys/tags are in the lists.

joto commented 1 year ago

Thanks for the ideas @dagnelies, but we are keeping the current behaviour. Closing here.

dagnelies commented 1 year ago

Ok. Just for the sake of completeness, here is what I used in the end for my project to distinguish between ways and polygons:

"linear_tags":  ["highway", "natural=coastline", "waterway", "barrier", "wall", "footway", "bridge", "tunnel", "railway", "power", "crossing","area=no"],
"area_tags":    ["building", "surface", "landuse", "natural!=coastline", "amenity", "leisure", "water", "parking", "sport", "crossing", "golf","area!=no", "boundary", "wetland"],

This is likely not 100% perfect either, but should roughly keep most features while cutting down a sizeable amount of duplicates.