ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

"Tidier" raster package #54

Nowosad closed 7 years ago

Nowosad commented 7 years ago

The raster package consists of methods and classes for raster processing in R. It lets you read and write raster data, perform raster algebra and raster manipulations, work on large datasets thanks to its ability to process data in chunks, visualize raster data, and much more. However, this package and its development have some flaws:

My idea is to:

Potential problems:

noamross commented 7 years ago

Much needed! The R Consortium-funded stars project is a follow-up to the sf project and addresses some of the issues above. Part of it is to "develop and discuss a migration path for the raster package (which has 43K lines of R, C and C++ code), and its functionality, into...new infrastructure." (https://github.com/edzer/stars/blob/master/PROPOSAL.md). Perhaps @edzer could comment on this component and how work at the unconf could dovetail with the project?

edzer commented 7 years ago

When your goal is to get work done with the raster package, none of the "flaws" @Nowosad mentions are real flaws; all are in the "nice to have" category. The stars project tries to deliver these niceties but also to go beyond raster's limitations (all mentioned in the stars proposal), because raster's own capacity to get past them is quite limited. Hence, I would advise against putting large effort into forking. It's also quite unusual in the R community to fork packages, and I wonder whether @Nowosad contacted Robert to discuss his plans and a way to work on it jointly.

It would be great if the unconf could come up with examples of things that you currently cannot do with raster because of its functionality (if there are any!!!), as well as mock-ups of workflows ("currently not working command sequences") that you'd like to work, with a detailed description of what these should do. This is hard work: most people who think "hey that would be nice" have never given a thought to what exactly this will look like. sf demonstrates this: all issues mentioning dplyr cause a lot of discussion, as it is far from trivial how to do things for spatial data. Then: raster data (or arrays in general) are even harder than features, because unlike features they are arrays in more than one dimension.

I'm looking forward to what you will be able to get, here!!

Have to board now on my way to EGU!

mdsumner commented 7 years ago

I have a few pending fixes for raster, one contributed by someone else to speed up extract when the source is tiled on disk. Glad to hear there is interest in pushing this, and I'm happy to contribute. Roxygenizing, adding tests, and being gitable are very much needed and would help others to contribute. But I will be interested to hear Robert's thoughts ...

mdsumner commented 7 years ago

A major limitation in raster right now is that there's no efficient way to extract bands with a crop from a native binary .grd, though adding one wouldn't be too hard. I do this by using ff, but it doesn't scale because of its 32-bit index.
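To make the crop-by-band idea concrete, here is a minimal Python sketch; it is not raster's or ff's implementation. It assumes a headerless, band-sequential file of little-endian 32-bit floats, and the function `read_window` and the demo grid are hypothetical. It seeks straight to each row of the requested window, so only the cropped cells are read, and Python's arbitrary-size integers keep the offsets valid past the 2^31 - 1 limit of a signed 32-bit index.

```python
import os
import struct
import tempfile

ITEMSIZE = 4  # bytes per cell, assuming 32-bit floats


def read_window(path, nrows, ncols, band, row0, row1, col0, col1):
    """Read rows [row0, row1) and columns [col0, col1) of one zero-based band."""
    width = col1 - col0
    window = []
    with open(path, "rb") as fh:
        for row in range(row0, row1):
            # Cell offset of the first wanted value in this row
            # (assumed band-sequential layout).
            cell = band * nrows * ncols + row * ncols + col0
            fh.seek(cell * ITEMSIZE)
            values = struct.unpack("<%df" % width, fh.read(width * ITEMSIZE))
            window.append(list(values))
    return window


# Demo grid (hypothetical): 2 bands, 3 rows, 4 columns, holding the values 0..23.
nbands, nrows, ncols = 2, 3, 4
ncells = nbands * nrows * ncols
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as fh:
    fh.write(struct.pack("<%df" % ncells, *range(ncells)))

# Crop rows 1..2 and columns 1..2 out of the second band.
window = read_window(path, nrows, ncols, band=1, row0=1, row1=3, col0=1, col1=3)
os.remove(path)
print(window)

# A signed 32-bit index tops out at 2**31 - 1 cells, which a
# 50,000 x 50,000 single-band grid already exceeds.
max_int32 = 2**31 - 1
exceeds_int32 = 50_000 * 50_000 > max_int32
```

The cost of a read scales with the size of the window rather than the size of the file, which is the property the crop-plus-band extraction needs.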

As Edzer points out, determining the cost-benefit here is hard. I tend to bend raster by working around its weaknesses and leveraging its many strengths, but teaching that is challenging or too specific to certain data, in my experience. Documenting common workflows and best practices given existing vagaries could also be valuable.

noamross commented 7 years ago

One thought is contacting Robert for his thoughts on moving to GitHub to make it easier for people to submit patches. I suspect one barrier is that this can quickly lead to an overwhelming number of issues filed for a package that large and widely used. Over at #45 they're talking about the role people can play as a "GitHub concierge", and thus contribute to package development by triaging and managing issues, etc. This, even more than setting up testing and CI infrastructure, would go a long way towards easing ongoing development of the package. Perhaps some folks are interested in playing this role for raster.

Robinlovelace commented 7 years ago

I also think this is a worthy topic for consideration and would like to add to the mix the velox package, which is fast but has an unusual API. Cc @mem48, who has used velox: I'm sure Malcolm will be happy to see this discussion also.

mdsumner commented 7 years ago

Definitely agree about velox, I tried adding focal methods too and it wasn't too hard. I love the R6 interface, it's a really powerful way to wrap up known collections of data, with simplified interfaces.

I also have a tidync branch in ncdump that provides a dbplyr-like abstraction over netcdf, more general than raster's (though Lee's powerful for now) and that works to feed raster or ggplot2 or just raw arrays. It's still a bit unstable but it's working well.

mdsumner commented 7 years ago

"Less powerful" is what I meant ...

Robinlovelace commented 7 years ago

Interesting stuff @mdsumner, the raster world is more complex than I thought! From your perspective what would the most productive thing to work on briefly, or 'hack', be in this space? On velox, note that it has also been updated infrequently: https://github.com/hunzikp/velox/graphs/contributors

mdsumner commented 7 years ago

I'd say maintenance of raster is very valuable. Opening it up to community input with Git, and hopefully isolating out the grid logic for low-overhead use by other packages are big wins.

Personally I need db abstraction over netcdf and hdf5, and support for curvilinear coordinates. I've been using a combination of dplyr and raster on ocean model output to good effect, but the efficient index-based workflows are hard to explain - also the need/opportunity is rare. ...
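One possible reading of an "index-based workflow", sketched in Python purely as an illustration: resolve the points of interest to cell indices once, then reuse those indices for every time step instead of repeating the spatial lookup. The helper `cell_index`, the zero-based row-major layout, and the toy data are all assumptions, not taken from raster or dplyr.

```python
def cell_index(row, col, ncols):
    """Zero-based, row-major cell index (hypothetical helper)."""
    return row * ncols + col


ncols = 4
# Three hypothetical time steps of a 3 x 4 grid, each stored as a flat list.
steps = [[t * 100 + i for i in range(12)] for t in range(3)]

# Resolve the points of interest to cell indices once ...
points = [(0, 1), (2, 3)]
cells = [cell_index(r, c, ncols) for r, c in points]

# ... then reuse the same indices for every time step.
series = [[step[i] for i in cells] for step in steps]
print(series)
```

The lookup cost is paid once, so extracting from many layers reduces to plain indexing.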

noamross commented 7 years ago

FYI I wrote Robert Hijmans a couple of weeks ago to ask what he thought about moving raster to GitHub, with the idea that we could make that migration an unconf project and maybe also discuss setting up some volunteers to be issue handlers (per #45), so as to manage the likely increased inflow. But I haven't heard back.

mdsumner commented 7 years ago

Oh thanks for that, I might have written too if you hadn't said so. I am keen to lead the GitHub move but worried that commitments sap my focus for months on end like recently.

Nowosad commented 7 years ago

@noamross Same here - I wrote Robert Hijmans about two weeks ago and didn't get an answer.

Nowosad commented 7 years ago

Thanks for the discussion. I'm closing the repo - we should try to contact Robert Hijmans before we start moving this package to github.

mdsumner commented 7 years ago

I suggest we don't actually need to move it. I am planning to start working against this mirror of the r-forge package:

https://github.com/rforge/raster

Without any more information, my approach will be to

1) fork the rforge mirror
2) create a branch
3) derive patches for rforge
4) commit patches to rforge svn (hopefully with Robert's oversight)

This way any changes I make and work with will be transparent and available, and easy to integrate in any future move.

More than happy to take suggestions, but I am happy to work this way. I will contact Robert directly when I have a patch I want to put in. (I have write-access to the r-forge repo but I've always sought review and oversight before committing)

Nowosad commented 7 years ago

Great, @mdsumner! Let us know how it works.