Add geom_stars for ggplot plotting

adrfantini commented 6 years ago

A while back sf has implemented a geom_sf which provides a ggplot2 plotting interface for sf objects. And it is awesome.

stars imho should also provide something similar.

edzer commented 6 years ago

Good point! geom_raster would be the first candidate to look at; I also remember that @cpsievert worked on an (rOpenSci?) package for ggplot'ting of rasters: what was the motivation for that?

We would like to keep the crs & degree axes handling integrated in geom_sf, and seamless integration of geom_sf and geom_stars @hadley. We could start here by providing st_as_grob methods as in sf/R/grid.R ?

adrfantini commented 6 years ago

What one usually does when plotting raster objects with ggplot2 is turning them into data.frame and using geom_raster (which is what rastervis::gplot() does). However this has some limitations for non-regular grids and, more importantly for stars, for projected coordinates, which is something that geom_sf handles perfectly.

Currently, I think that the best way for plotting projected gridded data with ggplot is to transform the data to sf polygons (e.g. with spex::polygonize()) and using geom_sf(). This is of course extremely slow and inefficient, but allows to visualize datasets with different projections on the same grid without interpolation, something that is very useful (to me, at least).

It's be nice if geom_stars could do something like this in a more efficient way.

mdsumner commented 6 years ago

Have a look at this, grid is actually very good performance wise

https://rpubs.com/cyclemumner/299631

cpsievert commented 6 years ago

I also remember that @cpsievert worked on an (rOpenSci?) package for ggplot'ting of rasters: what was the motivation for that?

https://github.com/ropensci/plotdap#readme covers the use cases pretty succiently.

Currently the raster support converts rasters to sf polygons 😬 https://github.com/ropensci/plotdap/blob/49b38ffe0a61cd93c936f0296334e46208efed4e/R/plotdap.R#L317-L319

adrfantini commented 6 years ago

@cpsievert Note that there are MUCH faster polygonizing methods, see #13.

mdsumner commented 6 years ago

I would focus on using a quadmesh version of the raster, and possibly with a conversion to triangles of that, then use ggplot. No need for raster re-projection, you can reproject the quads or triangles, and a lot less handling of redundant coordinates.

library(sf)
## create a dummy raster for plotting with ggplot2
example(st_read)
r <- fasterize::fasterize(nc, 
                raster::raster(raster::extent(sf::st_bbox(nc)[c(1, 3, 2, 4)]), 
                               nrows = 300, ncols = 600, 
                               crs = st_crs(nc)$proj4string), field = "BIR74")

## dense mesh form, every unique corner coordinate in vb, indexed by ib per rgl idioms
qm <- quadmesh::quadmesh(r, z = r)
gtab <- tibble::tibble(x = qm$vb[1, qm$ib], y = qm$vb[2, qm$ib], 
                       quad = rep(seq_len(ncol(qm$ib)), each = 4), 
                       BIR74 = qm$vb[3, qm$ib]) %>% dplyr::filter(!is.na(BIR74))

## transform the expanded set, this could be done before or after expansion/filtering NA
gtab[c("easting", "northing")] <- 
  proj4::ptransform(as.matrix(gtab[c("x", "y")]) * pi/180, sf::st_crs(nc)$proj4string, 
                    "+init=epsg:2163")
library(ggplot2)
ggplot(gtab, aes(x = easting, y = northing, group = quad, fill = BIR74)) + geom_polygon()

The thing ggplot2 is missing is idioms to keep those indexes separate, what is essentially two tables - here one of primitives that index into the other, the corner coordinates. (The feature table is really a third table, and most map extensions to ggplot2 just handle that delayed join in different ways). I'd delay creating that long gtab table for as late as possible, since that's when the mesh expands - but many operations can be done on the coordinates or primitive indexes before that.

adrfantini commented 6 years ago

I'm currently still following the stars -> polygonize -> sf -> ggplot workflow, but it is extremely slow for large datasets. ggplot is not as fast as base graphics in general, but in this case we are talking orders of magnitude. Performance might be a major limitation if we intend to plot all the grid cells separately as geom_polygons.

adrfantini commented 5 years ago

Now that we have a very nice st_xy2sfc, maybe that can act as a bridge in the stars -> sf -> ggplot workflow? I'm not sure if a stars -> ggplot direct plotting is doable; going through sf is great but can be slow.

edzer commented 5 years ago

See vignette 3. The sf path is taken by geom_stars in case of a curvilinear grid; building upon the example now in the first vignette, we get

library(ggplot2)
ggplot() + geom_stars(data = prec_slice)

adrfantini commented 5 years ago

Oh wow, missed that. That's really, really awesome!

adrfantini commented 5 years ago

I'm playing around with this, look at these two plots:

library(ggplot2)
library(stars)
m = matrix(1:20, nrow = 5, ncol = 4)
x = c(0, 0.5, 1, 2, 4, 5)  # 6 numbers: boundaries!
y = c(0.3, 0.5, 1, 2, 2.2) # 5 numbers: boundaries!
(r = st_as_stars(list(m = m), dimensions = st_dimensions(x = x, y = y)))
st_crs(r) = 4326
ggplot() + geom_stars(data=r)
ggplot() + geom_stars(data=st_xy2sfc(r, as_points=FALSE))

The first one does not seem to be recognised as EPSG:4326, the second does. Is this intended behaviour?

edzer commented 5 years ago

Well, yes: you can't expect miracles to happen overnight. The first one spawns into geom_raster or geom_rect (the latter in this case), the last into geom_sf. Only geom_sf knows about CRS.

You (we) could decide to always go through geom_sf, but this will be very slow for somewhat larger rasters. Optimizing the number of pixels to plot is still out of reach, it seems, although I've brought it up with several people at the rstudio developer day in Austin, last month.

adrfantini commented 5 years ago

From my perspective, miracles are already happening overnight! ;) I understand, maybe this could be clarified a bit in the vignette?

I think this issue is now solved, if you prefer I can open one new issue with the above, or we can just keep it here, as you wish.

Re speed, other than downsampling, I'm not sure there's something we can do. Personally for publication plots this is not an issue, I don't mind waiting 5 or 10 minutes for a figure to be created. But yes, it can be annoying when testing and refining code.

edzer commented 5 years ago

I agree; let's leave this open until we've got the st_xy2sfc in geom_stars trick documented somewhere.

mdsumner commented 5 years ago

It's less about the number of pixels and more about having to specify the x and y and pixel value for each explicitly, right? rasterGrob will plot a matrix very efficiently, it's more that geom_raster needs a x, y, z table to infer the offset and delta from - but that requires a ggplot2 extension afaict - it should pass the matrix and transform through without tabular expansion. Did you discuss that aspect? I feel like that detail has never been properly aired.

adrfantini commented 5 years ago

I think it should be fast enough for anything that does not need to go through sf; ggplot plots matrices just fine, even using geom_raster is not too slow, but it can be atrocious to plot thousands of polygons.

edzer commented 5 years ago

fast enough .. until the moment you try to plot 10 x 10 subplots with each a sentinel 2 scene having 10000 x 10000 pixels. Right now downsampling is manual; crude device size indications are needed for the case someone forgets.

mdsumner commented 5 years ago

I also thought rasterGrob had low level resampling? rasterImage does

adrfantini commented 5 years ago

Personally that's a very rare case for me (we are talking about 1e12 pixels), but of course ymmv. I understand the wanting to replicate a feature that stars::plot already has, tho.

edzer commented 5 years ago

I guess ggplot2 uses rasterGrob inside geom_raster, but when doing that, how do you make sure the colors match the legend? Writing ggplot2 geoms seems like you'd have to learn a new language first.

adrfantini commented 5 years ago

One issue now is that downsample is currently ignored if sf=TRUE, as far as I can tell.

edzer commented 5 years ago

Have you tried? It should: https://github.com/r-spatial/stars/blob/master/R/tidyverse.R#L125-L126

adrfantini commented 5 years ago

I have:

library(stars)
library(ggplot2)
library(dplyr)
x = system.file("tif/L7_ETMs.tif", package = "stars") %>%
    read_stars %>%
    slice(x, 1:50) %>%
    slice(y, 1:50)
gg1 = ggplot() + geom_stars(data = x, downsample=5, sf=TRUE) + facet_wrap(~band)
gg2 = ggplot() + geom_stars(data = x, downsample=5, sf=FALSE) + facet_wrap(~band)

... and it now works. I copied the commands from the previous R session, where it did not work. I am baffled, but better this way. One thing to note re downsampling is that downsampling works in any dimension (not just x and y) and can be specified on a per dimension basis with an integer vector of the same lenght of the number of dimensions: e.g. c(1, 5, 1) works for the above 3D object. This could be clarified in the docs: for example, in the st_as_stars docs it only mentions if length 2, specifies downsampling rate in x and y. Maybe it could be worth to expose stars:::st_downsample directly, and link to its documentation?

rgzn commented 5 years ago

I'm curious, is the geom_stars() function supposed to work with proxy objects?

The base plot function seems to work fine but doing an equivalent ggplot call returns an error and hangs my system:

risk = read_stars(risk_tif, proxy = TRUE)
ggplot() + geom_stars(data = risk)

Error: Argument 3 must be length 84812236, not 1

edzer commented 5 years ago

Thanks! With this patch it should work if you specify a downsampling rate, as in

ggplot() + geom_stars(data = risk, downsample = 20)

to plot every 20-th row and column. As opposed to base plot, in ggplot we still don't know how many pixels need to be filled by the plot. @paleolimbot @thomasp85 @mdsumner

A potentially more user friendly approach (as of: it would not hang your system) would be to make some guess of the upper limit of pixels on the device (2000x2000?) and use that to set a lower limit, along with emitting a message or warning.

mdsumner commented 5 years ago

Great! I had meant to ask why downsample though? It seems asking for the window size is more useful, as in "downsample" can be defined in those terms , but the GDAL rasterio engine is clearly source-dim / output-dim in terms of interface.

From following @paleolimbot's lead in ggspatial it seems this is not easy, but I think it should be something we could ask of ggplot2 - sure, the actual device may not be instantiated, but doesn't that just mean the invalidation model is pulling data too early? Does invalidation allow for a fresh update, if the device is resized?

But, we can get dev.size("px") and divide that up by facet columns and rows as a rough heuristic? Frankly, I don't know what the right way is - these are just thoughts of something that's puzzled me for a long time.

edzer commented 5 years ago

One issue may be that ggplot creates objects, that can be saved, after which you open a very large pdf device, and then print it. I guess the dev.size() call would not reflect that large pdf one?

mdsumner commented 5 years ago

definitely, that's what I mean by invalidation - though now I'm confused about when the data-pull occurs so I might be totally off

paleolimbot commented 5 years ago

I think downsampling early is the way to go, although specifying pixel dimension is probably better since it would be reusable between rasters of different size. I have personally never been able to get dev.size to return the correct pixel dimensions for any graphics device within ggplot2, and so I think that lazy rendering is a battle for another day.

edzer commented 5 years ago

@mdsumner it happens in geom_stars(), and that means upon creation, not printing (plotting); after adding a print line in geom_stars():

> library(ggplot2)
> g = ggplot() + geom_stars(data = p, downsample = 100) + facet_wrap(~band)
[1] "downsampled!!"
# set up massive plot device here
> print(g) # too low resolution now?
# done

mdsumner commented 5 years ago

ah right, well that just means a standard default then, a basic heuristic of device dims and distribute that over facets? I hadn't fully grasped this before, it's helpful

edzer commented 5 years ago

Facets is the other: can we get the number of facets when we are in geom_stars()?

paleolimbot commented 5 years ago

If you override draw_layer() instead of draw_panel() or draw_group() then yes!

paleolimbot commented 5 years ago

That won't tell you rows or cols though (and it's probably not worth it to try to find them)

edzer commented 5 years ago

Rows/cols are not as important as the total number. Can you point us to an example where this is done?

paleolimbot commented 5 years ago

Copy and paste this:

https://github.com/tidyverse/ggplot2/blob/master/R/geom-.r#L74-L90

...and keep your other methods as-is. You can use length(unique(data$PANEL)) within Geom$draw_layer() to get the number of facets. You can add things to args and they will get passed down to your other draw methods.

adrfantini commented 5 years ago

When using geom_stars(..., sf=FALSE) the axis keep the same units as the data.
Instead, when using geom_sf or geom_stars(..., sf=TRUE) the axis and the graticule are shown in (I think) epsg:4326, which is nice if using spatial data. However, of course, it is also much slower.

Is there a technical reasoning behind this? As a user of primarily spatial data, I would like the graticule and the axis labels to always be in degrees if the stars object has a valid CRS.

LMK if I should open a new issue for this.

EDIT: MRE

system.file("tif/L7_ETMs.tif", package = "stars") %>% read_stars() -> x
ggplot() + geom_stars(data = x[,,,1], sf = TRUE)
ggplot() + geom_stars(data = x[,,,1])

edzer commented 5 years ago

You get an idea what is involved in doing this by studying the ggplot2 files that have -sf in them here. ggplot2 uses its own programming language essentially, which I haven't mastered so far.

paleolimbot commented 5 years ago

@edzer I'm afraid I gave you a more complex solution that was necessary. If you haven't discovered this already, data$PANEL is always a factor(), so you can examine the levels() to get the number of panels (without overriding Geom$draw_layer():

library(ggplot2)

GeomPoint2 <- ggproto(
  "GeomPoint2", GeomPoint,
  draw_panel = function(self, data, panel_params, coord, na.rm = FALSE) {
    message("There are ", length(levels(data$PANEL)), " panel(s_ in this plot")
    ggproto_parent(GeomPoint, self)$draw_panel(data, panel_params, coord, na.rm)
  }
)

geom_point2 <- function() {
  layer(geom = GeomPoint2, stat = "identity", position = "identity")
}

p <- ggplot(mpg, aes(cty, hwy)) + geom_point2()
p
#> There are 1 panel(s_ in this plot

p + facet_wrap(vars(class))
#> There are 7 panel(s_ in this plot
#> There are 7 panel(s_ in this plot
#> There are 7 panel(s_ in this plot
#> There are 7 panel(s_ in this plot
#> There are 7 panel(s_ in this plot
#> There are 7 panel(s_ in this plot
#> There are 7 panel(s_ in this plot

^{Created on 2019-06-27 by the reprex package (v0.2.1)}

edzer commented 5 years ago

Thanks, I see, but how do I get access to that number of panels inside geom_point2?

paleolimbot commented 5 years ago

You can't. Is there a reason you need that information at spec time?

edzer commented 5 years ago

Then we could downsample the rasters before plotting.

paleolimbot commented 5 years ago

The "ggplot way" is to provide all the information necessary to do calculations in the constructor, then actually do the calculations as late as possible. Is there an exported downsampling function for stars objects already? I wonder if it would be cleaner to have something like sf_obj %>% downsample_to_px(500), then pass that object into geom_stars() (as you've currently written it) instead of adding an extra argument to geom_stars().

To have this really work with ggplot, the input to geom_stars() needs to be a data frame with a list-column of stars objects (like sfc for rasters), and there should be some kind of constructor that can turn a dimension (like a time dimension) into rows (so that you can facet by it). Each stars object in the list should be something that can be unambiguously turned into a grob when it gets passed as the data argument to a ggplot (i.e., only x, y, and band dimensions). I think PostGIS has a data structure something like this for rasters in databases, so there's some precedent there, even if it's a bit un-stars like.

If it's possible to make some kind of sfc for stars objects (with st_transform(), st_bbox(), and st_as_grob() methods), I'd be happy to write the Geom and Stat to handle them.

edzer commented 5 years ago

Sounds great! We have st_transform and st_bbox in place.

I can see this work, although still, the st_as_grob would like to know the dimensions of the plotting area, in pixels, so that it can for instance downsample a 10,000 x 10,000 sentinel-2 image to a 200 x 200 pixel area before assigning colors, rather than creating the full grob taking massive time and memory.

Also, for the st_as_grob I'd need to know the color scale and the way pixel values are to be mapped into colors. Are there examples of how this looks like? Are there also functions passed on for this mapping?

paleolimbot commented 5 years ago

I think that resampling is a separate task, and from my end it would be helpful to have this as a separate function (preferably a function that can do max_dims = c(200, 200) or something). I might have to call this twice...once to get the data range (where I could probably do 100x100 to estimate the min/max) and once before passing to st_as_grob().

In my dreams, the resampling happens during st_transform() (or maybe another function that also takes the destination CRS bounds), but I think that is a battle for another day.

For colors, it's probably best to take a (vectorized) function that take continuous values in and colours out so that you don't have to explicitly depend on ggplot2 scales. Then on my end I'll do something like this:

library(ggplot2)
scale <- scale_fill_viridis_c(limits = c(10, 20))
mapper <- scale$map
mapper(9:21)
#>  [1] "grey50"  "#440154" "#46286D" "#414487" "#3B5E8B" "#2A788E" "#2B9089"
#>  [8] "#22A884" "#5ABC6D" "#7AD151" "#C0DD40" "#FDE725" "grey50"

...and you can have something simpler for testing.

paleolimbot commented 5 years ago

Actually, from reading your code, it seems like min_dims or suggested_dims might be more appropriate than max_dims. As long as the user can request no downsampling somehow.

mdsumner commented 5 years ago

st_warp does the resampling already, it's gg's job to provide the dest argument - the extent, crs and dimensions of the target panel - as is done here: https://github.com/paleolimbot/ggspatial/blob/master/R/layer-spatial-stars.R#L308 That could work even if source and target crs haven't changed so there's no need for special cases.

I'm confused why that part is still being discussed, am I missing something? I got close enough to understanding the gg model but I can't spend much time on it atm sadly. I really want to!

paleolimbot commented 5 years ago

If st_as_grob() takes a dest template argument (and somewhere there is a function that constructs a raster template), I could certainly get on board with that! I would just prefer to have the details of raster resampling, cropping, and reprojection abstracted away from the ggplot code (and maintained by people that know what they are doing, which is to say, not me).

paleolimbot commented 5 years ago

This might be helpful for understanding the gg model: https://github.com/paleolimbot/ggdebug#how-it-all-comes-together

meteosimon commented 3 years ago

Hi there, I apologize in advance for double-posting, if this is the case, but something similar has been discussed in the thread a while ago.

When I use geom_stars, I combine it with geom_sf of the bounding box of the stars object. This make sure that the crs is recognized, however it changes the axis-labels (which is ok for me).

Here the difference:

library(stars)
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
library(ggplot2)

system.file("tif/L7_ETMs.tif", package = "stars") %>% read_stars() -> x

ggplot() + geom_stars(data = x)

ggplot() + geom_stars(data = x) +
  geom_sf(data = st_as_sfc(st_bbox(x)), col = NA, fill = NA)

Now, I tried to wrap the geom_stars and the geom_sf in a new geom, say geom_stars2 ... but I failed horribly. Is there an easy way to wrap these two geoms? Sure, this is a ggplot task, but I (and maybe others) would only need it for this issue.

^{Created on 2021-07-05 by the reprex package (v1.0.0)}

r-spatial / stars

Add geom_stars for ggplot plotting #21