mdsumner opened 6 years ago
Specific tasks that are prerequisites for more complete handling:
- heuristics to determine that 2D vars are curvilinear coords of 3D+ vars
If a netCDF file follows the CF-Conventions, this should be indicated by the `coordinates` attribute. If the file is properly made... which is not always the case. In that case, stars should have a way to fix things (which is more or less already in place right now).
Ah ok, thanks! I noticed something like that today; we were discussing means of matching lon,lat (in [x,y]) to var[x,y], and I think it's reasonable to set up a heuristic to find it. If you can point out some examples (ncdump -h will suffice, but source files too if possible) that would help; I'll compare with the examples I have.
Are there ever 3D coords? (I've not seen them; ROMS uses lon,lat in [x,y] and applies a stretching in z for every pixel relative to the bottom.)
I also think a heuristic will be fairly safe; it's only a bit iffy when there are dozens of variables and dimensions, and it will save using that `curvilinear = ll` step. (I didn't really catch on when that was being discussed elsewhere.) I'm working on making this easier to manage from ncmeta, along with groups (needed for s5p).
Do you want me to fetch some examples of badly formatted models with unclear / missing grid information?
Yeah, I see now - that kind of robustness is not what I'm used to from ~18 or so years with NetCDF in the wild. Any examples that are motivating or useful to you will always be of interest!
I have to rethink some things, I need to make sure the way I go about things puts the good new formats first.
Two examples of models I had to deal with recently: https://drive.google.com/open?id=1nVyq6Nw7c0WaG1nE8p_0pwX0hoRNBnN5 https://drive.google.com/open?id=1p6-IR6NdH_tAlMYtgNcFXFjXWo3iaNAR
One does not have any projection information, but relies only on lat-lon arrays; the other has some indication of the projection, but it is not formatted according to common practice.
Here is instead another example of a dataset I use often, which is projected on LAEA and has a proper grid description so should work OOTB: https://drive.google.com/open?id=1GiW_POsYnItSW59NXzlq2NDbLN8wvpU-
Another example can be found in #52, which presents a challenge since `ob_tran` seems not to be supported very well ATM.
Wasn't `ob_tran` short for obscure transformation?
Great! I'm trying to be systematic and make a record of your examples and learn piggyback:
https://github.com/mdsumner/weird.nc/releases
(the bundle weird.tar.gz shows up in the Releases artefacts; it seems to float around, so I think there's only the latest copy)
Obscene Bollocks maybe? Unfortunately it is an extremely common transformation, many models and observations run with rotated pole.
I think the problem lies in unclear (to me!) PROJ support; to get an idea just grep the rgdal code and docs on ob_tran. In case the transformation and its inverse are simple, when we have long/lat coordinates we may want to do it outside PROJ, for clarity.
Time for a new issue ... but seriously those LCC grids threw me a bit, I have to rethink some things. Once I've got ncmeta back in order I'll have a closer look at the `coordinates` attribute. :)
I wonder if we should leverage GDAL for the grid interpretation, but then allow applying that to a NetCDF I/O workflow? There's a lot of smart work gone into GDAL to sort that out. The typical criticism of GDAL is that 1) it flattens arrays to bands 2) it over-interprets rather than using what's there - but with ob_tran, these projected grids, geolocation arrays and ground control points generally it's a very powerful and smart framework - in NetCDF you're basically inferring all that stuff and upgrading to it. We might take the GDAL 2D interpretation and apply it to stars' n-d arrays read directly, rather than by wrapping bands back up.
I'm not ready for this but I will try it out at some point.
> Time for a new issue ... but seriously those LCC grids threw me a bit, I have to rethink some things. Once I've got ncmeta back in order I'll have a closer look at the `coordinates` attribute. :)
For your convenience: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#coordinate-system
In short, 'good' netCDFs should have a `coordinates` and a `grid_mapping` attribute for each projected variable. The latter should indicate which variable contains the projection information. There are alternative methods (e.g. a `crs` global attribute is also accepted).
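As a minimal CDL sketch of what that looks like in an `ncdump -h` header (variable names and projection parameter values here are illustrative, not taken from the files linked above):

```
float tas(y, x) ;
    tas:coordinates = "lon lat" ;    // names the 2D auxiliary lon/lat coordinate variables
    tas:grid_mapping = "crs" ;       // points at the variable holding the projection info
int crs ;
    crs:grid_mapping_name = "lambert_azimuthal_equal_area" ;
    crs:longitude_of_projection_origin = 10. ;
    crs:latitude_of_projection_origin = 52. ;
    crs:false_easting = 4321000. ;
    crs:false_northing = 3210000. ;
```

A heuristic reader would follow `grid_mapping` to `crs`, and fall back to scanning for 2D lon/lat arrays only when these attributes are absent.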
> I think the problem lies in unclear (to me!) PROJ support; to get an idea just grep the rgdal code and docs on ob_tran. In case the transformation and its inverse are simple, when we have long/lat coordinates we may want to do it outside PROJ, for clarity.
As far as I understand it, `ob_tran` is basically just a coordinate translation, so it should be relatively easy.
https://proj4.org/operations/projections/ob_tran.html
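For concreteness, a rotated-pole grid usually appears in PROJ as an `ob_tran` string along these lines (the pole values below are placeholders, not taken from any of the files in this thread):

```
+proj=ob_tran +o_proj=longlat +o_lon_p=0 +o_lat_p=39.25 +lon_0=18 +to_meter=0.0174532925199433
```

Here `o_lat_p`/`o_lon_p` position the rotated pole, `lon_0` recentres the longitudes, and `to_meter` keeps the output in degrees; the CF `rotated_latitude_longitude` grid mapping carries the same information via `grid_north_pole_latitude`/`grid_north_pole_longitude`.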
Continuing from https://github.com/r-spatial/stars/issues/14#issuecomment-463576200 - the distinction between `read_stars` and `read_ncdf`.
I've been thinking about this; my intention is to i) enable proxy read for read_ncdf, ii) leverage the existing GDAL (and @edzer) logic for the auxiliary data, iii) consider ways to generalize the front ends. I do think we need both the NetCDF start/count and the GDAL source-window/out-window idioms, and we need some kind of wrappers that allow pure-index operation (as now) as well as coordinate-space operations. I think that's not well provided by stars yet, but in raster it's very powerful: extent, index-to-coord, index-ops. I'll probably experiment with a raster front-end to the stars read to clarify my thoughts.
I might add that currently `read_ncdf` and `read_stars` do not agree on dimension direction and naming, which is not ideal: (test file link)
```r
filename = 'test_ncdf_stars.nc'

> read_stars(filename) %>% adrop %>% st_dimensions()
          from  to offset delta refsys point values
x            1 464  -40.5  0.25     NA    NA   NULL [x]
y            1 201   75.5 -0.25     NA    NA   NULL [y]

> read_ncdf(filename) %>% adrop %>% st_dimensions()
          from  to offset delta refsys point values
longitude    1 464  -40.5  0.25     NA    NA   NULL [x]
latitude     1 201  25.25  0.25     NA    NA   NULL [y]
```
Does `stars` prefer one of the two dimension orderings? IIRC the CF conventions did specify one, but I can't seem to find the specifics.
Good example, thanks; definitely an open problem AFAIC.
Good point. Currently, `read_stars` can't, as GDAL doesn't give you dimension names, and also doesn't tell you the order in which they come in the source file.
Gee, that's scary and surprising. I have pretty solid opinions about what should happen here (order before interpretation etc); just FYI, I don't feel comfortable about our ability to prosecute this online, it definitely needs personal interaction, arm-waving and eyebrows as key discussion tools. :)
OK, I'm in - use a google hangout? together with @adrfantini @dblodgett-usgs you and me?
oh yeah, my afternoon next week I guess - Tuesday probably ok
sounds good; what time in CEST?
oh god, I see 1700 for me and 0700 for you but forever confused about TZ ...
We are on CET, UTC+1, you are on UTC+11 right? 0700 CET is too early for me: too much noise in the house from people getting ready for school etc., and it's too early to be at work (commute time kills me...).
I'm on CST 0700 CET is 1200 CST. I much prefer morning in Europe / Evening in AU.
OK, @mdsumner please suggest a time and a medium.
Sorry, things got very busy this week I'll try to suggest something next week
Here's a suggestion: pkg `stars` will have `read_ncdf`, `read_gdal` and `read_stars`; `read_stars`, when given a NetCDF file, will try `read_ncdf` first, and otherwise, or when it fails, try `read_gdal`.
Now it's you reading my mind.
But what does partial read look like? Requiring an n-dimensional bbox is rather awkward, and implies the need for index-bboxes as well as coordinate-bboxes, and then stride decimation for each axis. In tidync I'm happy that each dimension is available for coordinate or index expression filtering, without needing every dimension to be specified; it doesn't include stride steps/output-dims (and algorithm choice) like GDAL allows.
That's the elephant in the room to me; I have no idea what that should look like.
Ultimately, I'd like to pass a `dimensions` object to the read function, and have it return a `stars` object with the data read according to those specifications, with hints on how to resample/interpolate in space and/or time. I think that is what gdalcubes does right now; meeting with @appelmar next week to discuss integration. AFAICS this abstracts over large / messy (multi-CRS) / warp-needing datasets.
The second challenge is to not downsample but process all the underlying data. The general approach however is similar to the first problem: index all underlying data, and go through it. The challenge here is that the strategy would depend on what you do: `st_write.stars_proxy` now makes spatial chunks, but you may need to chunk over bands, attribute variables, or time.
So, user-friendly helpers for `dimensions`; perhaps that's the thing that does all the proxy-ing, and `read_stars` breaks the lazy contract.
Yes; your `stars_proxy` object tells where all the data is, and `st_as_stars` could take a `dimensions` object with your data cube view, and read in only that part.
Thoughts on the `read_gdal`/`read_ncdf` distinction:
We need an abstraction over the `ncsub` and `RasterIO` idioms. I don't see that either of them is a good user interface, but at least `ncsub` is the general form if you take it to `cbind(start, count, outlength)`.
Use that as the general form and provide a bunch of helpers to convert between geotransform / extent-dim-res / coord-ranges / RasterIO / start-count-stride? I've kind of been collecting these little components slowly, not being sure how to express them all in a single place.
Users will want a bounding box in coordinate space, and an output size or resolution; and often these will be created by drawing or dragging sliders rather than directly numerically, or by a bit of back-and-forth guesswork reading the extent/proxy. Does it make sense to use the NetCDF start/count model, add a NearestNeighbour resampling option for outlength (after getting from the source), and then have the GDAL version convert those to offset/window/outsize and also allow additional resampling options?
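As a notation-only sketch of how the two subsetting idioms line up (not an actual API in either package; `nXOff`/`nXSize`/`nBufXSize` are the GDAL `RasterIO` argument names, the left column follows the ncsub start/count/outlength convention discussed above):

```
NetCDF ncsub (1-based)      GDAL RasterIO (0-based)
start_x                     nXOff     = start_x - 1
count_x                     nXSize    = count_x
outlength_x                 nBufXSize = outlength_x    (count_x / outlength_x ~ decimation factor)
```

The per-axis conversion is trivial; the hard part is the user-facing layer that turns a coordinate-space bbox plus output resolution into these triples.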
Moved from https://github.com/r-spatial/stars/issues/54, which included this reply from @edzer