ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata
316 stars 46 forks source link

Possible bug in trim_osmdata() #253

Open Mashin6 opened 2 years ago

Mashin6 commented 2 years ago

There might be possibly a bug in how trim_osmdata() filters the data. It seems to be removing some objects that should be in the final dataset.

library(tidyverse)
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.2.1, PROJ 7.2.1
library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright

townsQ <- getbb("Southeastern Connecticut COG", featuretype = "boundary") %>% 
            opq()
townsQ$prefix <- '[out:xml][timeout:250];\n rel[name="Southeastern Connecticut COG"]; \n map_to_area->.a;(\n'
townsQ$features <- "[\"admin_level\"=\"8\"](area.a)"

towns <- townsQ %>% 
            osmdata_sf()

towns$osm_multipolygons |> ggplot() + geom_sf()


area <- getbb("Southeastern Connecticut COG", featuretype = "boundary", format_out = "polygon")
towns <- opq(bbox=area, timeout = 250) %>%
                add_osm_feature(key="admin_level", value="8") %>%
                osmdata_sf() %>%
                trim_osmdata(area)

towns$osm_multipolygons |> ggplot() + geom_sf()

packageVersion("osmdata")
#> [1] '0.1.8'
R.Version()$version.string
#> [1] "R version 4.1.1 (2021-08-10)"

Created on 2021-11-23 by the reprex package (v2.0.1)

mpadge commented 2 years ago

Thanks @Mashin6, and apologies for the delay in responding here. It's not a bug, it reflects the operation of the exclude parameter in the trim_osmdata() function:

library (ggplot2)
library (osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
area <- getbb("Southeastern Connecticut COG", featuretype = "boundary", format_out = "polygon")
towns <- opq(bbox = area) %>%
                add_osm_feature(key="admin_level", value="8") %>%
                osmdata_sf()
plot1 <- function (towns, exclude = TRUE) {
    towns |>
        trim_osmdata(area, exclude = exclude) |>
        magrittr::extract2 ("osm_multipolygons") |>
        ggplot() +
        geom_sf()
}
plot1 (towns, exclude = TRUE)
#> Loading required namespace: sf

plot1 (towns, exclude = FALSE)

Created on 2022-01-02 by the reprex package (v2.0.1.9000)

Mashin6 commented 2 years ago

Though e.g. Colchester is completely within SCCOG, because the same ways that define COG boundary also define the boundary of the town. So I would still expect it to be included. Hmm unless the polygon from Nominatim has wrong shape. I guess more reason to have server side trimming.

www.openstreetmap.org/relation/11065863 www.openstreetmap.org/relation/11059185

mpadge commented 2 years ago

Yeah, right, it's an issue with how to interpret multipolygons. The trim operation for Colchester is extracting the internal component of the multipolygon. In cases where internal components of multipolygons are all distinct and do not share any boundaries, they will generally (in OSM terms) define holes within the surrounding main polygon. It might thus be more appropriate to default to extracting the primary surrounding polygon, rather than the internal components. I'll re-open to implement that.

mpadge commented 2 years ago

@Mashin6 The above commit improves the trim_osmdata() function, but your issue still remains. It is, however, due to a bug in the OSM data set, as this reprex illustrates:

library (ggplot2)
library (sf) # load for sub-setting
#> Linking to GEOS 3.9.1, GDAL 3.3.1, PROJ 8.0.1; sf_use_s2() is TRUE
library (osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
area <- getbb("Southeastern Connecticut COG", featuretype = "boundary", format_out = "polygon")
towns <- opq(bbox =area) |>
                add_osm_feature(key="admin_level", value="8") |>
                osmdata_sf()
mp <- towns$osm_multipolygons$geometry [[22]] |>
    st_sfc (crs = 4326)
ggplot (mp) + geom_sf ()

bb <- st_sfc (st_polygon (list (area)), crs = 4326)
st_within (mp, bb)
#> Sparse geometry binary predicate list of length 1, where the predicate
#> was `within'
#>  1: (empty)

That result shows that the multipolygon does not lie within the bounding box. The following shows in a bit more detail why that happens (using sp for simple return of direct integer values), including first necessary step of transforming to non-elliptical CRS:

mp <- st_transform (mp, 3857) [[1]] [[1]] [[1]] # multipolygon
bb <- st_transform (bb, 3857) [[1]] [[1]]
pip <- sp::point.in.polygon (mp [, 1], mp [, 2],
                             bb [, 1], bb [, 2])
table (pip)
#> pip
#>   0   1   3 
#>   2  83 705

Created on 2022-01-03 by the reprex package (v2.0.1.9000)

The point.in.polygon function returns:

And so that result shows that 2 points on the boundary of Colchester lie geographically outside the bounding area. The overpass API correctly filters because it uses relation memberships based on OSM ID values. That works in this case even though it arguably shouldn't because the result is geographically invalid. The osmdata approach remains geographically valid, and will return identical results to server-side filtering as long as those non-geographic results are also geographically correct. In this (hopefully outlier) case, they are not geographically correct, and so osmdata does what I would consider to be the appropriate thing. Does that make sense?

Mashin6 commented 2 years ago

Thanks for the analysis. I think you are right, I checked and Nominatim version of geometry for SCCOG is different from current OSM version. It seems Nominatim does not update geometry in cases where only nodes are moved, but way or relation is not touched. Might be worth adding a warning message to trim_osmdata() about possibility that results might not always be correct.

_Side note, map_to_area; in OverPass does spatial subsetting of objects and does not need IDs. But it also includes objects that are only partially inside the area._

mpadge commented 2 years ago

I notionally agree with you about adding a note, and tried doing that, but the function actually does return results that are strictly correct, and according to the above example, in fact more correct than the internal overpass trimming. Correctness can only be judged in geometric terms, and the function in current form is geometrically correct. Feel free to have a go yourself if you can think of a smart way to phrase such a note (as an extension of the Note in trim_osmdata), but please ensure it clearly states that results will always be geometrically correct, even if they are ... unexpected, or whatever adjective might be used to describe cases like the above.

My attempt went something like this:

The operation is also a strictly geometric trimming, and may not necessarily yield expected results when used in combination with bounding polygons obtained from other sources, potentially including Nominatim.

That would only make sense if it were clear what expected results might look like, which it is not. Don't know how better to do that for now.

Mashin6 commented 2 years ago

Right, this is a very special case of using trim_osmdata. But since it is suggested in the help page of the function, having a sentence or two might save users some time when scratching their heads why were some objects being removed.

What about this?

Caution is advised when using polygons obtained from Nominatim via getbb(..., format_out = "polygon"|"sf_polygon"). These shapes can be outdated and thus could cause that trimming operation might not give results expected based on the current state of OSM data.

mpadge commented 2 years ago

That sounds great - could you please add that via a pull request? Thanks!