r-spatial / stars

Spatiotemporal Arrays, Raster and Vector Data Cubes
https://r-spatial.github.io/stars/
Apache License 2.0

input function generalization #62

Open mdsumner opened 6 years ago

mdsumner commented 6 years ago

Thoughts on the read_gdal/read_ncdf distinction:

We need an abstraction over the ncsub and RasterIO idioms. I don't think either of them makes a good user interface, but ncsub is at least the general form if you take it to cbind(start, count, outlength).
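
For reference, a sketch of the two idioms as stars exposes them (argument names as in read_ncdf and read_stars; the file name is hypothetical, and a 2D variable is assumed):

library(stars)
f <- "example.nc"                      # hypothetical file with a 2D variable
# NetCDF idiom: a 1-based start/count pair per dimension
x1 <- read_ncdf(f, ncsub = cbind(start = c(1, 1), count = c(100, 100)))
# GDAL idiom: source offset/window plus an output (buffer) size
x2 <- read_stars(f, RasterIO = list(nXOff = 1, nYOff = 1,
                                    nXSize = 100, nYSize = 100,
                                    nBufXSize = 50, nBufYSize = 50))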

Should we use that as the general form and provide a set of helpers to convert between geotransform / extent-dim-res / coord-ranges / RasterIO / start-count-stride? I've been slowly collecting these little components, not being sure how to express them all in a single place.

Users will want a bounding box in coordinate space and an output size or resolution - and often these will be created by drawing or dragging sliders rather than entered numerically, or by a bit of back-and-forth guesswork reading the extent/proxy. Does it make sense to use the NetCDF start/count model and add a NearestNeighbour resampling option for outlength (after reading from the source), so that the GDAL version can convert those to offset/window/outsize and also allow additional resampling options?
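
As a concrete sketch of that conversion (a hypothetical helper, not stars code): going from a coordinate range on one regular axis to the 1-based NetCDF start/count form; GDAL's 0-based offset is the same start minus one.

# origin `offset`, cell size `delta`; illustration only
range_to_start_count <- function(cmin, cmax, offset, delta) {
  i0 <- floor((cmin - offset) / delta)
  i1 <- floor((cmax - offset) / delta)
  c(start = min(i0, i1) + 1L, count = abs(i1 - i0) + 1L)
}
range_to_start_count(100.25, 110.75, offset = 100, delta = 0.25)
# start count
#     2    43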

Moved from https://github.com/r-spatial/stars/issues/54, which included this reply from @edzer:

> - transforming a curvilinear grid should be a simple matter of transforming the two coordinate matrices
> - transforming any non-curvilinear grid should, at least optionally, return a curvilinear grid without doing any resampling

mdsumner commented 6 years ago

Specific tasks that are prerequisites for more complete handling include:

- heuristics to determine that 2D vars are curvilinear coords of 3D+ vars

adrfantini commented 6 years ago

> heuristics to determine that 2D vars are curvilinear coords of 3D+ vars

If a netCDF file follows the CF Conventions, this should be indicated by the coordinates attribute - if the file is properly made, which is not always the case. When it is not, stars should have a way to fix things (which is more or less already in place right now).

mdsumner commented 6 years ago

Ah ok, thanks! I noticed something like that today; we were discussing means of matching lon,lat (in [x,y]) to var[x,y], and I think it's reasonable to set up a heuristic to find it. If you can point out some examples (ncdump -h will suffice, but source files too if possible) that would help - I'll compare them with the examples I have.

Are there ever 3D coords? (I've not seen them; ROMS uses lon,lat in [x,y] and applies a stretching in z for every pixel relative to the bottom.)

I also think a heuristic will be fairly safe - it's only a bit iffy when there are dozens of variables and dimensions - and it will save having to use that curvilinear = ll step. (I didn't really catch on when that was being discussed elsewhere.) I'm working on making this easier to manage from ncmeta, along with groups (needed for s5p).
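
A sketch of what such a heuristic could look like (hypothetical, not ncmeta or stars code), taking each variable's dimension names as input:

# flag 2D variables whose dimensions equal the leading two dimensions
# of some 3D+ variable as candidate curvilinear coordinate arrays
var_dims <- list(lon = c("x", "y"),            # assumed example metadata
                 lat = c("x", "y"),
                 tas = c("x", "y", "time"))
is_coord_candidate <- function(v, vars = var_dims) {
  length(vars[[v]]) == 2 &&
    any(vapply(vars, function(d) length(d) > 2 &&
                 identical(d[1:2], vars[[v]]), logical(1)))
}
vapply(names(var_dims), is_coord_candidate, logical(1))
#  lon   lat   tas
# TRUE  TRUE FALSE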

adrfantini commented 6 years ago

#12 and #54 contain links to properly CF-compliant files with 2-, 3- and 4-D variables, with spatial coordinates in LCC projection, which is the EURO-CORDEX grid we usually use.

Do you want me to fetch some examples of badly formatted models with unclear / missing grid information?

mdsumner commented 6 years ago

Yeah, I see now - that kind of robustness is not what I'm used to from ~18 years with NetCDF in the wild. Any examples that are motivating or useful to you will always be of interest!

I have to rethink some things; I need to make sure the way I go about this puts the good new formats first.

adrfantini commented 6 years ago

Two examples of models I had to deal with recently: https://drive.google.com/open?id=1nVyq6Nw7c0WaG1nE8p_0pwX0hoRNBnN5 https://drive.google.com/open?id=1p6-IR6NdH_tAlMYtgNcFXFjXWo3iaNAR

One does not have any projection information but relies only on lat-lon arrays; the other has some indication of the projection, but it is not formatted according to common practice.

Here instead is another example of a dataset I use often, which is in a LAEA projection and has a proper grid description, so it should work out of the box: https://drive.google.com/open?id=1GiW_POsYnItSW59NXzlq2NDbLN8wvpU-

adrfantini commented 6 years ago

Another example can be found in #52, which presents a challenge since ob_tran does not seem to be well supported at the moment.

edzer commented 6 years ago

wasn't ob_tran short for obscure transformation?

mdsumner commented 6 years ago

Great! I'm trying to be systematic, making a record of your examples (and learning piggyback along the way):

https://github.com/mdsumner/weird.nc/releases

(The bundle weird.tar.gz shows up in the Releases artefacts; it seems to float around, so I think there's only the latest copy.)

adrfantini commented 6 years ago

Obscene Bollocks maybe? Unfortunately it is an extremely common transformation; many models and observations run with a rotated pole.

edzer commented 6 years ago

I think the problem lies in unclear (to me!) PROJ support; to get an idea, just grep the rgdal code and docs for ob_tran. If the transformation and its inverse are simple, we may want to do it outside PROJ when we have long/lat coordinates, for clarity.

edzer commented 6 years ago

See also this ticket.

mdsumner commented 6 years ago

Time for a new issue ... but seriously, those LCC grids threw me a bit; I have to rethink some things. Once I've got ncmeta back in order I'll have a closer look at the coordinates attribute. :)

mdsumner commented 6 years ago

I wonder if we should leverage GDAL for the grid interpretation, but then allow applying that to a NetCDF I/O workflow? A lot of smart work has gone into GDAL to sort that out. The typical criticisms of GDAL are that 1) it flattens arrays to bands and 2) it over-interprets rather than using what's there - but for ob_tran, these projected grids, geolocation arrays, and ground control points generally, it's a very powerful and smart framework; in NetCDF you're basically inferring all that stuff and upgrading to it. We might take the GDAL 2D interpretation and apply it to stars' n-d arrays read directly, rather than wrapping bands back up.
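
A sketch of that division of labour (file and variable names hypothetical): let GDAL interpret the grid, but do the n-d array I/O through the NetCDF library directly.

library(sf)       # for gdal_utils()
library(RNetCDF)
# 1. grid interpretation via GDAL: CRS, geotransform, geolocation arrays
info <- gdal_utils("info", "NETCDF:\"curvilinear.nc\":sst", quiet = TRUE)
# 2. raw n-d read via NetCDF, untouched by GDAL's band flattening
nc  <- open.nc("curvilinear.nc")
sst <- var.get.nc(nc, "sst", start = c(1, 1, 1), count = c(10, 10, 5))
close.nc(nc)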

I'm not ready for this but I will try it out at some point.

adrfantini commented 6 years ago

> Time for a new issue ... but seriously, those LCC grids threw me a bit; I have to rethink some things. Once I've got ncmeta back in order I'll have a closer look at the coordinates attribute. :)

For your convenience: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#coordinate-system

In short, 'good' netCDFs should have a coordinates and a grid_mapping attribute for each projected variable. The latter should indicate which variable contains the projection information. There are alternative methods (e.g. a crs global attribute is also accepted).
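
For a compliant file, that machinery is two attribute reads away - a sketch, with hypothetical file and variable names:

library(RNetCDF)
nc <- open.nc("test.nc")
att.get.nc(nc, "tas", "coordinates")   # e.g. "lon lat": the 2D coord vars
att.get.nc(nc, "tas", "grid_mapping")  # e.g. "Lambert_Conformal": the
                                       # variable holding the projection
close.nc(nc)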

adrfantini commented 6 years ago

> I think the problem lies in unclear (to me!) PROJ support; to get an idea, just grep the rgdal code and docs for ob_tran. If the transformation and its inverse are simple, we may want to do it outside PROJ when we have long/lat coordinates, for clarity.

As far as I understand it, ob_tran is basically just a coordinate translation, so it should be relatively easy: https://proj4.org/operations/projections/ob_tran.html
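
For example, the usual PROJ recipe for a rotated-pole grid - the pole parameters below are the assumed EURO-CORDEX values, and the +to_meter term converts grid degrees to the radians ob_tran expects:

library(sf)
rot <- paste("+proj=ob_tran +o_proj=longlat +o_lon_p=0 +o_lat_p=39.25",
             "+lon_0=18 +to_meter=0.01745329251994328")
# unrotate a rotated-grid coordinate to true lon/lat
sf_project(from = rot, to = "EPSG:4326", pts = cbind(0, 0))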

mdsumner commented 5 years ago

Continuing from https://github.com/r-spatial/stars/issues/14#issuecomment-463576200 - the distinction between read_stars and read_ncdf.

I've been thinking about this; my intention is to i) enable proxy read for read_ncdf, ii) leverage the existing GDAL (and @edzer) logic for the auxiliary data, and iii) consider ways to generalize the front end(s). I do think we need both the NetCDF start/count and the GDAL source-window/out-window idioms, and we need some kind of wrappers that allow pure-index operation (as now) as well as coordinate-space operations. I don't think stars provides that well yet, but in raster it's very powerful - extent, index-to-coord, index ops - and I'll probably experiment with a raster front end to the stars read to clarify my thoughts.
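
The raster operations referred to, for comparison (file name hypothetical): extent-based subsetting plus the index/coordinate converters a stars front end could mirror.

library(raster)
r <- raster("example.tif")
r_sub <- crop(r, extent(100, 110, -45, -40))  # coordinate-space subset
cellFromXY(r, cbind(105, -42))                # coordinate -> cell index
xyFromCell(r, 1:4)                            # cell index -> coordinate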

adrfantini commented 5 years ago

I might add that currently, for some files, read_ncdf and read_stars do not agree on dimension direction and naming, which is not ideal: (test file link)

> filename = 'test_ncdf_stars.nc'
> read_stars(filename) %>% adrop %>% st_dimensions()
  from  to offset delta refsys point values    
x    1 464  -40.5  0.25     NA    NA   NULL [x]
y    1 201   75.5 -0.25     NA    NA   NULL [y]

> read_ncdf(filename) %>% adrop %>% st_dimensions()
          from  to offset delta refsys point values    
longitude    1 464  -40.5  0.25     NA    NA   NULL [x]
latitude     1 201  25.25  0.25     NA    NA   NULL [y]

Does stars prefer one of the two dimension orderings? IIRC the CF conventions specify one, but I can't seem to find the specifics.

mdsumner commented 5 years ago

Good example, thanks - definitely an open problem as far as I'm concerned.

edzer commented 5 years ago

Good point. Currently, read_stars can't, as GDAL doesn't give you dimension names, and also doesn't tell you the order in which they come in the source file.

mdsumner commented 5 years ago

Gee, that's scary and surprising. I have pretty solid opinions about what should happen here (order before interpretation, etc.) - just FYI, I don't feel comfortable about our ability to pursue this online; it definitely needs personal interaction, with arm-waving and eyebrows as key discussion tools. :)

edzer commented 5 years ago

OK, I'm in - use a Google Hangout? Together with @adrfantini, @dblodgett-usgs, you and me?

mdsumner commented 5 years ago

Oh yeah, my afternoon next week I guess - Tuesday is probably ok.

edzer commented 5 years ago

sounds good; what time in CEST?

mdsumner commented 5 years ago

Oh god - I see 1700 for me and 0700 for you, but I'm forever confused about time zones ...

adrfantini commented 5 years ago

We are on CET, UTC+1; you are on UTC+11, right? 0700 CET is too early for me: too much noise in the house from people getting ready for school, and too early to be at work (commute time kills me...).

dblodgett-usgs commented 5 years ago

I'm on CST; 0700 CET is 1200 CST. I much prefer morning in Europe / evening in AU.

edzer commented 5 years ago

OK, @mdsumner please suggest a time and a medium.

mdsumner commented 5 years ago

Sorry, things got very busy this week; I'll try to suggest something next week.

edzer commented 5 years ago

Here's a suggestion: pkg stars will

adrfantini commented 5 years ago

Now it's you reading my mind.

mdsumner commented 5 years ago

But what does partial read look like? Requiring an n-dimensional bbox is rather awkward, and it implies the need for index-bboxes as well as coordinate-bboxes, and then stride decimation for each axis. In tidync I'm happy that each dimension is available for coordinate- or index-expression filtering, without needing every dimension to be specified - though that doesn't include stride steps/output dims (and algorithm choice) like GDAL allows.

That's the elephant in the room to me, I have no idea what that should look like.
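
For reference, the tidync model mentioned above (file name hypothetical): each dimension can be filtered by a coordinate or index expression, and unmentioned dimensions are left whole.

library(tidync)
src <- tidync("example.nc")
sub <- hyper_filter(src,
                    lon  = lon > 100 & lon <= 110,  # coordinate expression
                    time = index <= 12)             # index expression
hyper_array(sub)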

edzer commented 5 years ago

Ultimately, I'd like to pass a dimensions object to the read function and have it return a stars object with the data read according to those specifications, with hints on how to resample/interpolate in space and/or time. I think that is what gdalcubes does right now; I'm meeting with @appelmar next week to discuss integration. AFAICS this abstracts over large / messy (multi-CRS) / warp-needing datasets.
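
The gdalcubes model referred to, roughly (collection file and parameters hypothetical): a cube_view describing CRS, extent, resolution, and resampling/aggregation is applied to an image collection at read time.

library(gdalcubes)
v <- cube_view(srs = "EPSG:3035",
               extent = list(left = 4e6, right = 5e6,
                             bottom = 2e6, top = 3e6,
                             t0 = "2018-01", t1 = "2018-12"),
               dx = 1000, dy = 1000, dt = "P1M",
               aggregation = "median", resampling = "bilinear")
raster_cube(image_collection("collection.db"), v)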

The second challenge is to not downsample but to process all the underlying data. The general approach, however, is similar to the first problem: index all the underlying data and go through it. The challenge here is that the strategy depends on what you do: st_write.stars_proxy now makes spatial chunks, but you may need to chunk over bands, attribute variables, or time.

mdsumner commented 5 years ago

So, user-friendly helpers for dimensions - perhaps that's the thing that does all the proxying, and read_stars breaks the lazy contract.

edzer commented 5 years ago

Yes; your stars_proxy object tells you where all the data are, and st_as_stars could take a dimensions object with your data cube view and read in only that part.
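
A sketch of how that could read (the dimensions argument to st_as_stars is the proposal here, not an existing interface; file name hypothetical):

library(stars)
p <- read_stars("large.tif", proxy = TRUE)  # lazy: no pixels read yet
d <- st_dimensions(p)                       # start from the source dimensions
# ... adjust d to the desired window / resolution (the "data cube view") ...
# x <- st_as_stars(p, dimensions = d)       # proposed: realize only that part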