ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata
317 stars 45 forks source link

osmdata_sp failure #85

Closed mpadge closed 7 years ago

mpadge commented 7 years ago
> bbox <- c (-118.27, 34.02, -118.23, 34.07) # LA
> highways <- osmdata::opq (bbox = bbox) %>%
      osmdata::add_osm_feature (key = 'highway') %>%
      osmdata::osmdata_sp (quiet = FALSE)
Issuing query to Overpass API ...
Rate limit: 2
Query complete!
converting OSM data to sp format
Warning message:
In (function (coords, hole = as.logical(NA))  :
  less than 4 coordinates in polygon

This might be just a case of ill-formed OSM data, but that's very unlikely, so better check it out ...

mdsumner commented 7 years ago

Reproduce with this:

Polygon(cbind(c(0, 0, 1), c(0, 1, 1)))

https://github.com/cran/sp/commit/088469131f0697d148aa960a77b7537f50a76be5#diff-e2cb7a7e598b8bafc380e1bfcdcb339fR38

It's common in the real world to get shapes that only have as many coordinates as they need (like a triangle), but it's standard in GISy things to close the ring explicitly. I think it's so that you can be sure that you aren't a line, though a line can intersect its start point - there is a kind of rationale but it escapes me. Imagine a polygon attempted with only two coordinates and an automated additional third to close it, the pathologies you need to catch get a bit out of hand. :) At any rate, sp has only recently added this warning, it used to silently "close" by adding the extra coordinate.

mpadge commented 7 years ago

@mdsumner Curious stuff this. sp just closes 3-vertex polygons by sticking the first vertex on the end. The warning was generated by a polygon

nm1  x1  y1
nm2  x2  y2
nm1  x1  y1

and sp just makes this (following the warning) into:

nm1  x1  y1
nm2  x2  y2
nm1  x1  y1
     x1  y1

so there are then 3! repeated vertices in a four-vertex polygon. (And rownames are crucial in osmdata so insertion of non-named rows is unacceptable.) The commit simply excludes these junk cases from the sp output (so sp geometries will categorically differ from sf, but so be it).

mdsumner commented 7 years ago

that was another pathology I was trying to remember. This stuff was impossible to understand "empirically", the code had to be understood. You remind me too, this was a huge breakthrough for me personally: https://stat.ethz.ch/pipermail/r-sig-geo/2016-March/024214.html

It meant that the very simplest case, a triangle with three unique coordinates, would trigger auto-closing and also auto-re-winding in the most confusing way. It's an edge case, but also the first thing anyone tries. To the credit of the authors sp solved a much larger gap at the time, it wasn't ever designed for "componentry", but for various reasons no one's really dug into it much until recently.

This "sub-feature identity" (terminology?) of vertices, segments, paths is something that ggplot2/fortify ruffled pretty strongly too, and it's a big missing piece of SF. Is that "need for rownames" a ubiquitous feature in OSM? (I need to get across this OSM stuff)

mpadge commented 7 years ago

You can probably see that I'm not even close to your "very simplest case" yet - i've just dealt with processing non-polygons with < 3 unique vertices. That now leaves me free to move on to your 3-vertex cases ...

rownames are crucial in OSM because they contain the OSM ids that allow everything to be hierarchically inter-related. It's what distinguishes osmdata from GDAL output - the latter dumps all rownames, while osmdata keeps them, and in doing so, retains the ability to trace any and all hierarchical relationships through a given set of data. While I generally agree with the tidyverse aversion of any and all forms of row naming, it's actually hugely useful in Rcpp because it allows objects (lists, vectors, matrices) of arbitrary class to always have character-valued (row)names. Mighty handy!

mdsumner commented 7 years ago

Ah I see - I understand the value, but the actual rownames in R are a bit flaky - I'm terrified of R's base rowname management as a "unique ID". Are they mighty handy because they are robust, or because they "exist long enough" to allow tracing "far enough"? (Consider that this stance was developed pre-dplyr join-nirvana, though - probably rownames are fine, but I was too lost in match/merge etc.).

This is why sc committed to memory heavy unique IDs for every row (it was a blocker originally, but I just went with it, to resolve later), it means at least in theory that an arbitrary set of inputs can be combined by rbinding each analogous table together. I see it as a unresolved issue, how to apply and persist unique IDs when they are valuable, but drop/compress them when a structural array-like-index is sufficient (i.e. cbind, or match).

mpadge commented 7 years ago

new sc issue is my attempt to move this important discussion to a more appropriate place. OSM auto-generates all IDs and ensures absolute uniqueness at all times throughout the entire world (they are long long integers, so the world should never exceed that!). It's a non-issue there, but a very important general issue almost everywhere else, and yeah, rownames are not a solution.