ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata

sparse storage for sf data.frame? #79

Closed mpadge closed 6 years ago

mpadge commented 7 years ago

osmdata returns potentially enormous tables that are mostly full of NA values (for the non-geom bits), so the R objects can be wastefully huge. It'd be very easy to write a sparse storage routine for these non-geom bits of the sf data.frame to make the objects much smaller.

Note: I just chatted to Edzer about this, and he said that sf is purely about the geom column, and the package will never be adapted to do anything with the other columns, so according to him this is entirely out of scope for sf.
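To make the waste concrete, here is a small base-R sketch (the column names and fill rate are invented for illustration, not real osmdata output) comparing a dense, mostly-NA tag table with a "sparse" long-form equivalent that keeps only the non-NA cells:

```r
# Dense form: 10,000 rows x 50 tag columns, with only ~2% of cells filled.
set.seed(1)
n <- 10000; k <- 50
m <- matrix(NA_character_, n, k,
            dimnames = list(NULL, paste0("tag", seq_len(k))))
m[sample(length(m), 10000)] <- "some tag value"
dense <- as.data.frame(m, stringsAsFactors = FALSE)

# Sparse form: keep only (row, key, value) triples for the non-NA cells.
idx <- which(!is.na(m), arr.ind = TRUE)
sparse <- data.frame(row = idx[, "row"],
                     key = colnames(m)[idx[, "col"]],
                     value = m[idx],
                     stringsAsFactors = FALSE)

object.size(dense)   # ~4 MB of mostly-NA pointers
object.size(sparse)  # roughly an order of magnitude smaller
```

The geom column would stay where it is; only the attribute columns get packed.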

Robinlovelace commented 7 years ago

Yes, but that surely does not stop you having a list column for a non-geometry column?

mdsumner commented 7 years ago

It's interesting how sparse these data frames are. I'm thinking of a tidyr::gather-like approach that also stores the id or row number of the original, drops the NAs, and expands back into the sf data frame only when really necessary.

In a sense this is a "groupings higher than feature level" problem (like counties nested within states), so it does matter whether those kinds of groupings exist in these data.

It seems like all non-geometry data could be character; is it important that they are factor? I suspect that during download the data is in dense form, and it's only in the expansion into a data frame that the NAs become fleshed out, so it's probably better to look at the pre-sf form first. Does that sound sensible?

I'm not sure if there's a "pack data frame" capability already, but I imagine there is. (widyr might be useful)
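A base-R sketch of that gather-then-expand round trip (`pack_tags()` / `unpack_tags()` are made-up names for illustration; tidyr::gather / tidyr::spread would express the same thing):

```r
# Collapse a wide, NA-heavy tag table into long (id, key, value) form,
# keeping only the non-NA cells.
pack_tags <- function(df) {
    long <- data.frame(id = rep(seq_len(nrow(df)), ncol(df)),
                       key = rep(names(df), each = nrow(df)),
                       value = unlist(df, use.names = FALSE),
                       stringsAsFactors = FALSE)
    long[!is.na(long$value), ]
}

# Expand back to the original wide form only when really necessary.
unpack_tags <- function(long, nrow, cols) {
    wide <- matrix(NA_character_, nrow, length(cols),
                   dimnames = list(NULL, cols))
    wide[cbind(long$id, match(long$key, cols))] <- long$value
    as.data.frame(wide, stringsAsFactors = FALSE)
}

df <- data.frame(name = c("cafe", NA, NA),
                 highway = c(NA, "residential", NA),
                 stringsAsFactors = FALSE)
long <- pack_tags(df)   # 2 rows instead of 3 x 2 = 6 cells
wide <- unpack_tags(long, nrow(df), names(df))
identical(df, wide)     # TRUE
```

The id column is what would let higher-level groupings survive the round trip.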

mpadge commented 7 years ago

Ah, actually the sparse data frames are pretty deeply hard-coded at the initial point at which the hierarchical OSM structure is translated into internal data structures (example here). This was initially done because of envisioning a direct sf output, but I avail myself to stand accused of some combination of laziness or lack of programming vision. I guess a better solution would/could have been to have kept the hierarchical but efficient structure until later in the processing.

For the moment, I'd suggest the most effective solution will be to keep the core as it is and just implement an end-point work-around, which brings me to the meta-point: @mdsumner I actually suspect this problem is general and potentially pervasive once people start using sf for bigger challenges. The end form spat out by osmdata is entirely GDAL-like, just with more columns. There must nevertheless be many applications in which GDAL will spit out similarly ultra-sparse data.frames stored as concomitantly wasteful full data.frames. This suggests to me the utility of a general solution, which I suspect ought to be housed somewhere in your work, yet I'm not sure where would be best: spbabel? (Flagging again that Edzer entirely justifiably said sf is just about the geom column, and that the other columns are entirely unrelated to sf functionality and remain mere post-impositions of GDAL.)

Related issue: I foresee a similarly general need for an easy way to convert sp to sf, and sf does not currently meet the criterion of easy. A package wants sf, so the developers simply add Imports: sf; users then try to install the package and get heaps of error messages about out-of-date GDAL or no GEOS or whatever. What do they do? Turn away and find something else, of course. There needs to be a simple sp2sf (and sf2sp, of course), for which spbabel would seem nicely placed. If that were so, then sparse storage would also fit nicely within that.
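For the record, both directions of the coercion already exist in sf itself, just not under names as memorable as sp2sf / sf2sp (those names are hypothetical). A minimal sketch, assuming the sf and sp packages are installed:

```r
library(sp)
library(sf)

# sp -> sf: st_as_sf() has a method for Spatial* objects
sp_pts <- SpatialPointsDataFrame(coords = cbind(x = 1:3, y = 4:6),
                                 data = data.frame(id = 1:3))
sf_pts <- st_as_sf(sp_pts)

# sf -> sp: coerce back through the "Spatial" class
sp_again <- as(sf_pts, "Spatial")

inherits(sf_pts, "sf")                        # TRUE
inherits(sp_again, "SpatialPointsDataFrame")  # TRUE
```

Of course this only helps once sf installs cleanly, which is the real pain point described above.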

And so ... if you suggest a likely place, we can happily move this discussion there, and ultimately work towards resolving this much more minor issue along the way.

mdsumner commented 7 years ago

Non-geom columns aren't completely untouched; for instance, the default plot will be painful for osmdata layers, and you have to st_geometry(x) to get a single-panel plot (or use ggplot2 trunk geom_sf). The columns can also be tagged with their interpretation: are they "aggregate", "constant", or "identity"? See https://github.com/r-spatial/sf/blob/3d74cc99f2fbc861bbf19090b5798144da270920/R/agr.R This is powerful stuff, but not relevant to everyone; it's an aspect that constructors need to know about, along with sf_column, and also the n_empty, crs, precision and bbox that live on the geometry.
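Those interpretation tags are the attribute-geometry relationship ("agr") in sf, queried and set via st_agr(); a minimal sketch assuming sf is installed (the example columns are invented):

```r
library(sf)

# A tiny two-point sf object with two attribute columns
pts <- st_sf(name = c("a", "b"),
             count = c(10L, 20L),
             geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1))))

st_agr(pts)   # both columns NA by default: interpretation unspecified
st_agr(pts) <- c(name = "identity", count = "aggregate")
st_agr(pts)   # now tagged per column
```

Functions that subset or merge geometries can then warn when an "aggregate" or "identity" column no longer makes sense.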

On the related issue, I also want a tight sf-core that is only about the classes, their constructors, and their decompositions and translations. My best attempt at decomposition is in https://github.com/mdsumner/scsf/tree/master/R but there's a different approach in spbabel. Some of the official constructors are slow (or were) due to validation at the part level, which is very important for "wild data" coming in, but if you know the part is fine it can be sped up a lot. I have these little DIY constructors in a few places; an example is here: https://github.com/mdsumner/spex/blob/master/R/qm_rasterToPolygons.r#L52

It's nice with sf how straightforward this is. I'm also not afraid to go around official APIs like I used to be; I don't think it's that dangerous or "unmaintainable", and it would be much easier and safer in one lightweight package.

I don't have strong opinions on where or how to do this, but I'm very keen to agree on a shared approach, and there are many implementations already, unfortunately. I believe Edzer would consider PRs along these lines, but I am unclear whether separation from sf would be; it would be quite a lot of work, so I think we should just do it and wait and see if sf might use it in future. My efforts are themselves fragmented, but I think a "sf-builder / sf-decomposer" package from scratch is a good idea (we should only do this once!), agreeing to keep it in sync with the official project. I would happily import an sf-core package into many of my projects, but it needs to be clear what the dangers are and what compromise is acceptable given divergence from the official tool.

I can collate a list of all the decomposers I know of; there are quite a few, leaflet and fasterize being two obvious ones. It might be worth spending enough time on this to make sure a shared core would be able to cater to other implementations. Maybe that doesn't matter given how relatively straightforward it is, and the constructors are the thing to really be concerned about.

mpadge commented 6 years ago

This will all be solved by silicate

Robinlovelace commented 6 years ago

Awesome stuff. I've still not done the Bristol data download but check out awesome work by @geoMADE here: https://github.com/ATFutures/who

mpadge commented 6 years ago

Yeah, I saw that great work - thanks @geoMADE

MAnalytics commented 6 years ago

Thank YOU, for the opportunity that you gave me.

Monsuru.
