ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata
316 stars 46 forks source link

Bugs in getbb with MULTIPOLYGONS #195

Closed agila5 closed 4 years ago

agila5 commented 4 years ago

Hi! I was working on creating an example of transforming a sfobject with LINESTRING geometry into a linnet object and I think I found a bug in getbb. This is the problem.

# updates
remotes::install_github("ropensci/osmdata")
#> Skipping install of 'osmdata' from a github remote, the SHA1 (9560e249) has not changed since last install.
#>   Use `force = TRUE` to force installation

# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright
library(tmap)
tmap_mode("view")
#> tmap mode set to interactive viewing

# data
getbb("Isle of Wight", format_out = "data.frame") 
#>    place_id
#> 1 198178336
#> 2 198559528
#> 3    472111
#>                                                                  licence
#> 1 Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright
#> 2 Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright
#> 3 Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright
#>   osm_type    osm_id                                    boundingbox        lat
#> 1 relation    154350  50.5746776, 50.7675562, -1.591801, -1.0627189 50.6710482
#> 2 relation   2534184 36.6441148, 37.154046, -76.9302665, -76.445301 36.8953677
#> 3     node 158313642 36.8876493, 36.9276493, -76.727735, -76.687735 36.9076493
#>                 lon
#> 1 -1.33271111331995
#> 2       -76.7248143
#> 3        -76.707735
#>                                                   display_name    class
#> 1           Isle of Wight, South East, England, United Kingdom boundary
#> 2                Isle of Wight County, Virginia, United States boundary
#> 3 Isle of Wight, Isle of Wight County, Virginia, United States    place
#>             type importance
#> 1 administrative  0.8901035
#> 2 administrative  0.8013287
#> 3         hamlet  0.5990423
#>                                                                                       icon
#> 1 https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png
#> 2 https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png
#> 3           https://nominatim.openstreetmap.org/images/mapicons/poi_place_village.p.20.png

There are three matches with my query and I'm interested only in the first one. So I extracted the polygon format and I found something weird.

str(getbb("Isle of Wight", format_out = "polygon"))
#> List of 2
#>  $ : num [1:1785, 1:2] -76.9 -76.9 -76.9 -76.9 -76.9 ...
#>  $ : num [1:23, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...

There are only two matches (but that's not a problem since the third row of the data.frame output was just a point so it makes sense to exclude that point from the result). The problem is that the bb associated with the Isle of Wight located in the South East of England is the second element of the list and the result is clearly wrong since there are only 23 coordinates. For example, this is a map of that result:

tm_shape(getbb("Isle of Wight", format_out = "sf_polygon", poly_num = c(2, 1))) +
  tm_polygons(col = "darkred") +
  tm_basemap("OpenStreetMap") +
  tm_view(set.view = 15)

It's not clear from the image that that POLYGON is in the Isle of Wight but you can recognize it if you run the code locally.

Then I checked the URL associated with that query, i.e. here and I think you can see that the geotext element of the first MULTIPOLYGON object is quite different from the result I get from getbb

IMO the problem is here https://github.com/ropensci/osmdata/blob/9560e249a1b4f9edb3a5bbd4b49e082fbad0fdd8/R/getbb.R#L257 which just calls https://github.com/ropensci/osmdata/blob/9560e249a1b4f9edb3a5bbd4b49e082fbad0fdd8/R/getbb.R#L343-L346

If I run that function locally in debug mode I get

# which(grepl(")", p))
# [1]   23   42 8373 8398

which explains why the second element for the list has 23 rows but I think that's not the intended result. Unfortunately I couldn't think of any easy fix of that problem.

Moreover when I run the getbb function locally in debug mode I found this line https://github.com/ropensci/osmdata/blob/9560e249a1b4f9edb3a5bbd4b49e082fbad0fdd8/R/getbb.R#L239 which IMO is also wrong since seq(obj) returns a sequence with length equals to the number of columns in a dataframe (i.e.)

seq(iris)
#> [1] 1 2 3 4 5

while I think in that line of code you are looking for matches in the rows of the obj object.

I hope it's clear.

Created on 2019-12-01 by the reprex package (v0.3.0)

mpadge commented 4 years ago

Thanks @agila5 - good sleuthing. I'll check it out asap

mpadge commented 4 years ago

I agree with you that there's something strange going on there, but I'm not so sure it's actually a bug. The multipolygon extraction in get1bdymultipolygon actually seems right, because it does return the first element on the nominatim data. This is what the data on the server look like:

image

And here's a reprex of what current code does:

library(magrittr)
q_url <- paste0 ("https://nominatim.openstreetmap.org/?q=",
                 "Isle%20of%20Wight&featuretype=settlement&",
                 "polygon_text=1&format=json&limit=10")
obj <- httr::RETRY("POST", q_url, times = 10) %>%
    httr::content(as = "text", encoding = "UTF-8", type = "application/xml") %>%
    jsonlite::fromJSON()
bn <- as.numeric(obj$boundingbox[[1]])
bb_mat <- matrix(c(bn[3:4], bn[1:2]), nrow = 2, byrow = TRUE)
dimnames(bb_mat) <- list(c("x", "y"), c("min", "max"))

indx_multi <- which (grepl ("MULTIPOLYGON", obj$geotext))
gt_p <- gt_mp <- NULL
gt_mp <- obj$geotext [indx_multi] %>%
    gsub ("MULTIPOLYGON\\(\\(\\(", "", .) %>%
    gsub ("\\)\\)\\)", "", .) %>%
    strsplit (split = ',')
indx_na <- rev (which (is.na (gt_mp)))
for (i in indx_na)
    gt_mp [[i]] <- NULL
# gt_mp is then a list of 1 item
# gt_mp <- lapply (gt_mp, function (i) get1bdymultipoly (i))
p <- gt_mp [[1]]
p <- p [1:min (which (grepl (")", p)))]
p <- vapply (p, function (i) gsub (")", "", i),
               character (1), USE.NAMES = FALSE)
t (cbind (vapply (p, function (i)
                  as.numeric (strsplit (i, split = ' ') [[1]]),
                  numeric (2), USE.NAMES = FALSE)))
#>            [,1]     [,2]
#>  [1,] -1.591801 50.66224
#>  [2,] -1.591789 50.66221
#>  [3,] -1.591729 50.66221
#>  [4,] -1.591604 50.66222
#>  [5,] -1.591516 50.66223
#>  [6,] -1.591391 50.66224
#>  [7,] -1.591220 50.66228
#>  [8,] -1.591024 50.66230
#>  [9,] -1.590832 50.66233
#> [10,] -1.590618 50.66236
#> [11,] -1.590445 50.66241
#> [12,] -1.590364 50.66243
#> [13,] -1.590480 50.66247
#> [14,] -1.590564 50.66252
#> [15,] -1.590731 50.66250
#> [16,] -1.590939 50.66246
#> [17,] -1.591193 50.66240
#> [18,] -1.591481 50.66235
#> [19,] -1.591564 50.66233
#> [20,] -1.591630 50.66229
#> [21,] -1.591705 50.66228
#> [22,] -1.591793 50.66227
#> [23,] -1.591801 50.66224

Created on 2019-12-04 by the reprex package (v0.3.0)

So it is correctly returning the first element of the multipolygon, which has 23 coordinates. Maybe there's something wrong with the actual multipolygon structure as specified in nominatim? Do you have any further insights, or information about what might actually be wrong here?


(And the second issue regarding seq: https://github.com/ropensci/osmdata/blob/9560e249a1b4f9edb3a5bbd4b49e082fbad0fdd8/R/getbb.R#L239 is okay, since obj is always a list with at least 1 item, and seq will always just enumerate the length of that list. Could arguably be safer to use seq_along, but not necessary there.)

mpadge commented 4 years ago

So the one that you want here is [[3]], which your result above indicated:

# which(grepl(")", p))
# [1]   23   42 8373 8398

meaning item sizes of:

  1. 23
  2. 42 - 23 - 1 = 18
  3. 8873 - 42 + 1 = 8830
  4. 8398 - 8373 + 1 = 24
  5. 8848 - 8398 + 1 = 49

Easy there to think that [[3]] is the one you want, but that's not a reliable diagnostic. Nor is it appropriate to automatically select the largest polygon, because for administrative boundaries in particular that can prevent selection of smaller scale boundaries by always preferring equivalent, larger scale ones... That said, this should then work, but clearly does not at all:

bb <- getbb("Isle of Wight", format_out = "sf_polygon", poly_num = 1) 

yet clearly fails:

image

... so there is something going on. Moreoever, that should work with poly_num = 3, but fails with subscript out of bounds

agila5 commented 4 years ago

Hi @mpadge, thanks for your analysis. I'm checking this problem again and I still don't get one point. What's your expected output for str(getbb("Isle of Wight", format_out = "polygon")) ? Because when I look at the URL of that query here I see:

  1. A multipolygon object with 5 elements;
  2. A polygon object
  3. A point

so I would expect to obtain 1) a list of 6 polygons or 2) a list of 2 elements where the first/second element is a list of 6 polygons. Does it make sense? At the moment I get the same output as in the original comment, i.e.

str(getbb("Isle of Wight", format_out = "polygon"))
#> List of 2
#>  $ : num [1:1785, 1:2] -76.9 -76.9 -76.9 -76.9 -76.9 ...
#>  $ : num [1:23, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...

Created on 2019-12-04 by the reprex package (v0.3.0)

which means that I get the fully polygon object and just the first polygon of the multipolygons.

(And the second issue regarding seq is okay, since obj is always a list with at least 1 item, and seq will always just enumerate the length of that list. Could arguably be safer to use seq_along, but not necessary there.

I think that you are missing my point. I was not pointing out the difference between seq and seq_along. I was saying that here

https://github.com/ropensci/osmdata/blob/ec45cb1608921d0eb3016c4f73513f5614d67ee0/R/getbb.R#L239

you are testing something wrt indx_multi, which was defined as

https://github.com/ropensci/osmdata/blob/ec45cb1608921d0eb3016c4f73513f5614d67ee0/R/getbb.R#L223

This means that indx_multi is a numeric vector of indexes of which ROWS of obj$geotext contains the string MULTIPOLYGON, right?. This implies that the other rows are not MULTIPOLYGONS, right? Is so, then IMO the following check

https://github.com/ropensci/osmdata/blob/ec45cb1608921d0eb3016c4f73513f5614d67ee0/R/getbb.R#L239

is wrong since you are testing a condition wrt to the COLUMNS of obj . For example, I think you are doing something like this:

indx_setosa <- which(grepl("setosa", iris$Species))
indx <- which(!(seq(iris) %in% indx_setosa))
indx
#> integer(0)

Created on 2019-12-04 by the reprex package (v0.3.0)

while what you intended is

indx_setosa <- which(grepl("setosa", iris$Species))
indx <- which(!(seq(nrow(iris)) %in% indx_setosa))
indx
#>   [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
#>  [19]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86
#>  [37]  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104
#>  [55] 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122
#>  [73] 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
#>  [91] 141 142 143 144 145 146 147 148 149 150

Created on 2019-12-04 by the reprex package (v0.3.0)

I hope it makes sense. If I'm wrong just ignore this issue and excuse me for messing with two separate questions in just one github issue.

mpadge commented 4 years ago

yes, @agila5, you're right that this is indeed a bug in getbb, which arises because the function was coded with the -- incorrect -- assumption that multipolygon boundaries would always be constructed as outer/inner polygons, but of course they can simply be lists of spatially disjoint polygons, as is the case with the Isle of Wight. So yes, getbb() should in this case return all of the polygons. The above commit now gives:

library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
bb <- getbb ("Isle of Wight", format_out = "polygon")
str (bb)
#> List of 2
#>  $ : num [1:1785, 1:2] -76.9 -76.9 -76.9 -76.9 -76.9 ...
#>  $ :List of 5
#>   ..$ : num [1:23, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:19, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:8331, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:25, 1:2] -1.1 -1.1 -1.1 -1.1 -1.1 ...
#>   ..$ : num [1:50, 1:2] -1.09 -1.09 -1.09 -1.09 -1.09 ...

Created on 2020-01-05 by the reprex package (v0.3.0)

This may present problems when automatically piping outputs to subsequent functions, so it may be better to unlist the nested lists, but then again the result directly reflects the nominatim structure, and unlisting would destroy that. So i suspect it's better to leave subsequent selection of a desired bounding polygon to the user, but please feel free to suggest or discuss any alternative.


I'll now re-open the issue and invite you to submit a PR to fix your second fault, which you did indeed identify. Changing that one line from seq(obj) to seq(nrow(obj)) is exactly the fix needed. Once you've done that, then I'll also add you to the official list of package contributors, also as a reflection of the contribution all of your insights have been here. Thanks in advance!

agila5 commented 4 years ago

This may present problems when automatically piping outputs to subsequent functions, so it may be better to unlist the nested lists, but then again the result directly reflects the nominatim structure, and unlisting would destroy that. So i suspect it's better to leave subsequent selection of a desired bounding polygon to the user, but please feel free to suggest or discuss any alternative.

I'm not sure but I think that the current API is fine. I don't know about the "typical" usage of getbb function but I've never had any problem with the output format, just with the results.

On a related topic, I think that it should be specified in the help page that the poly_num parameter works only on sf_output format and it cannot be used to select, for example, the third polygon from the second list of polygons. For example

# package
library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
str(getbb("isle of wight", format_out = "polygon", poly_num = c(2, 3)))
#> List of 2
#>  $ : num [1:1785, 1:2] -76.9 -76.9 -76.9 -76.9 -76.9 ...
#>  $ :List of 5
#>   ..$ : num [1:23, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:19, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:8331, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:25, 1:2] -1.1 -1.1 -1.1 -1.1 -1.1 ...
#>   ..$ : num [1:50, 1:2] -1.09 -1.09 -1.09 -1.09 -1.09 ...

Created on 2020-01-05 by the reprex package (v0.3.0)

I'm not sure if that's intended (but I think so) and I can add one sentence to the help page in the PR if you want.

mpadge commented 4 years ago

We actually should just have the sf version also return the full result, including all polygons and multipolygons. Reopening until that's done

mpadge commented 4 years ago

@agila5 That commit gives the following output:

setwd ("/data/mega/code/repos/ropensci/osmdata")
devtools::load_all (".", export_all = TRUE)
#> Loading osmdata
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
library (sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
bb <- getbb ("Isle of Wight", format_out = "polygon")
str (bb)
#> List of 2
#>  $ : num [1:1785, 1:2] -76.9 -76.9 -76.9 -76.9 -76.9 ...
#>  $ :List of 5
#>   ..$ : num [1:23, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:19, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:8331, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   ..$ : num [1:25, 1:2] -1.1 -1.1 -1.1 -1.1 -1.1 ...
#>   ..$ : num [1:50, 1:2] -1.09 -1.09 -1.09 -1.09 -1.09 ...
bbsf <- getbb ("Isle of Wight", format_out = "sf_polygon")
bbsf
#> $polygon
#> Simple feature collection with 1 feature and 0 fields
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: -76.93027 ymin: 36.64411 xmax: -76.4453 ymax: 37.15405
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#>                         geometry
#> 1 POLYGON ((-76.93027 36.7048...
#> 
#> $multipolygon
#> Simple feature collection with 1 feature and 0 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -1.591801 ymin: 50.57468 xmax: -1.062719 ymax: 50.76756
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#>                         geometry
#> 1 MULTIPOLYGON (((-1.591801 5...
str (bbsf)
#> List of 2
#>  $ polygon     :Classes 'sf' and 'data.frame':   1 obs. of  1 variable:
#>   ..$ geometry:sfc_POLYGON of length 1; first list element: List of 1
#>   .. ..$ : num [1:1785, 1:2] -76.9 -76.9 -76.9 -76.9 -76.9 ...
#>   .. ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
#>   ..- attr(*, "sf_column")= chr "geometry"
#>   ..- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: 
#>   .. ..- attr(*, "names")= chr(0) 
#>  $ multipolygon:Classes 'sf' and 'data.frame':   1 obs. of  1 variable:
#>   ..$ geometry:sfc_MULTIPOLYGON of length 1; first list element: List of 1
#>   .. ..$ :List of 5
#>   .. .. ..$ : num [1:23, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   .. .. ..$ : num [1:19, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   .. .. ..$ : num [1:8331, 1:2] -1.59 -1.59 -1.59 -1.59 -1.59 ...
#>   .. .. ..$ : num [1:25, 1:2] -1.1 -1.1 -1.1 -1.1 -1.1 ...
#>   .. .. ..$ : num [1:50, 1:2] -1.09 -1.09 -1.09 -1.09 -1.09 ...
#>   .. ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
#>   ..- attr(*, "sf_column")= chr "geometry"
#>   ..- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: 
#>   .. ..- attr(*, "names")= chr(0)

Created on 2020-01-07 by the reprex package (v0.3.0) so all polygon formats return all nominatim polygon and multipolygon objects - and the sf bounding boxes immediately show that the polygon object is not the UK Isle of Wight, while the multipolygon is the desired result, and contains all islands.

agila5 commented 4 years ago

Ok, great and thanks! My only note is on the poly_num parameter because I think it's useless now. Do you plan to reintroduce it adding the possibility to automatically select one of the polygons returned by the query? If not, I think you should just remove that and update the docs.

agila5 commented 4 years ago

Hi @mpadge. I'm sorry to bother you again but I just need to ask another question.

When I use the getbb function (as in the next reprex) with format_out = "polygon" I see a list with two polygons and I'm interested only in the first one but the output of getbb function with format_out = "sf_polygon" is one sf object with two rows (i.e. the two polygons). Obviously is trivial to filter only the first polygon but I just wanted to ask if that's intended behaviour or not.

# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(osmdata)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright

getbb("Lecco, Italy", format_out = "data.frame")
#>    place_id
#> 1 198659884
#> 2 198006636
#> 3 254490413
#>                                                                  licence
#> 1 Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright
#> 2 Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright
#> 3 Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright
#>   osm_type   osm_id                                  boundingbox        lat
#> 1 relation    45639  45.6493752, 46.151716, 9.2437341, 9.5411056 45.9005485
#> 2 relation    46229 45.8121895, 45.9061432, 9.3667651, 9.4848958 45.8553764
#> 3     node 62505597   45.6953764, 46.0153764, 9.229605, 9.549605 45.8553764
#>                lon                    display_name    class           type
#> 1 9.41202482143963        Lecco, Lombardia, Italia boundary administrative
#> 2         9.389605 Lecco, Lombardia, 23900, Italia    place           city
#> 3         9.389605 Lecco, Lombardia, 23900, Italia    place           city
#>   importance
#> 1  0.7340101
#> 2  0.6544993
#> 3  0.6544993
#>                                                                                       icon
#> 1 https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png
#> 2              https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png
#> 3              https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png
str(getbb("Lecco, Italy", format_out = "polygon"))
#> List of 2
#>  $ : num [1:7763, 1:2] 9.24 9.25 9.25 9.25 9.25 ...
#>  $ : num [1:820, 1:2] 9.37 9.37 9.37 9.37 9.38 ...
getbb("Lecco, Italy", format_out = "sf_polygon")
#> Simple feature collection with 2 features and 0 fields
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: 9.243734 ymin: 45.64938 xmax: 9.541106 ymax: 46.15172
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#>                         geometry
#> 1 POLYGON ((9.243734 45.76648...
#> 2 POLYGON ((9.366765 45.87497...

Created on 2020-01-08 by the reprex package (v0.3.0)

mpadge commented 4 years ago

Yeah, that is intentional now, and I think this is okay behaviour, but am totally open to suggestions for alternative approaches. The poly_num argument has gone, and there is no longer any default selection of first polygon in cases returning multiple polygons. Default behaviour in all cases in now to return everything. Does that make sense?