Closed mpadge closed 7 years ago
Reproduce with this:
Polygon(cbind(c(0, 0, 1), c(0, 1, 1)))
It's common in the real world to get shapes that only have as many coordinates as they need (like a triangle), but it's standard in GISy things to close the ring explicitly. I think it's so that you can be sure that you aren't a line, though a line can intersect its start point - there is a kind of rationale but it escapes me. Imagine a polygon attempted with only two coordinates and an automated additional third to close it, the pathologies you need to catch get a bit out of hand. :) At any rate, sp has only recently added this warning, it used to silently "close" by adding the extra coordinate.
@mdsumner Curious stuff this. sp
just closes 3-vertex polygons by sticking the first vertex on the end. The warning was generated by a polygon
nm1 x1 y1
nm2 x2 y2
nm1 x1 y1
and sp
just makes this (following the warning) into:
nm1 x1 y1
nm2 x2 y2
nm1 x1 y1
x1 y1
so there are then 3! repeated vertices in a four-vertex polygon. (And rownames are crucial in osmdata
so insertion of non-named rows is unacceptable.) The commit simply excludes these junk cases from the sp
output (so sp
geometries will categorically differ from sf
, but so be it).
that was another pathology I was trying to remember. This stuff was impossible to understand "empirically", the code had to be understood. You remind me too, this was a huge breakthrough for me personally: https://stat.ethz.ch/pipermail/r-sig-geo/2016-March/024214.html
It meant that the very simplest case, a triangle with three unique coordinates, would trigger auto-closing and also auto-re-winding in the most confusing way. It's an edge case, but also the first thing anyone tries. To the credit of the authors sp solved a much larger gap at the time, it wasn't ever designed for "componentry", but for various reasons no one's really dug into it much until recently.
This "sub-feature identity" (terminology?) of vertices, segments, paths is something that ggplot2/fortify ruffled pretty strongly too, and it's a big missing piece of SF. Is that "need for rownames" a ubiquitous feature in OSM? (I need to get across this OSM stuff)
You can probably see that I'm not even close to your "very simplest case" yet - i've just dealt with processing non-polygons with < 3 unique vertices. That now leaves me free to move on to your 3-vertex cases ...
rownames are crucial in OSM because they contain the OSM ids that allow everything to be hierarchically inter-related. It's what distinguishes osmdata
from GDAL
output - the latter dumps all rownames, while osmdata
keeps them, and in doing so, retains the ability to trace any and all hierarchical relationships through a given set of data. While I generally agree with the tidyverse aversion of any and all forms of row naming, it's actually hugely useful in Rcpp
because it allows objects (lists, vectors, matrices) of arbitrary class to always have character-valued (row)names. Mighty handy!
Ah I see - I understand the value, but the actual rownames in R are a bit flaky - I'm terrified of R's base rowname management as a "unique ID". Are they mighty handy because they are robust, or because they "exist long enough" to allow tracing "far enough"? (Consider that this stance was developed pre-dplyr join-nirvana, though - probably rownames are fine, but I was too lost in match/merge etc.).
This is why sc committed to memory heavy unique IDs for every row (it was a blocker originally, but I just went with it, to resolve later), it means at least in theory that an arbitrary set of inputs can be combined by rbinding each analogous table together. I see it as a unresolved issue, how to apply and persist unique IDs when they are valuable, but drop/compress them when a structural array-like-index is sufficient (i.e. cbind, or match).
new sc
issue is my attempt to move this important discussion to a more appropriate place. OSM auto-generates all IDs and ensures absolute uniqueness at all times throughout the entire world (they are long long integers
, so the world should never exceed that!). It's a non-issue there, but a very important general issue almost everywhere else, and yeah, rownames are not a solution.
This might be just a case of ill-formed OSM data, but that's very unlikely, so better check it out ...