r-spatial / stars

Spatiotemporal Arrays, Raster and Vector Data Cubes
https://r-spatial.github.io/stars/
Apache License 2.0
558 stars 94 forks source link

Three different stars object descriptions from the same file #673

Closed pvanlaake closed 7 months ago

pvanlaake commented 7 months ago

I have a file of monthly CMIP6 data:

> fn <- "~/CC/CMIP6/Amon/historical/tasmax_Amon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412_v20191108.nc"

I can read this file with three different functions: read_stars(), read_mdim() and read_ncdf(). All three, however, print different results:

> read_stars(fn)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
                                       Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
Daily Maximum Near-Surface ... [K] 211.3256 263.8677 283.3676 278.6807 297.9296 315.6577
dimension(s):
     from   to offset delta  refsys                                      values x/y
x       1  192      0 1.875      NA                                        NULL [x]
y       1  144     90 -1.25      NA                                        NULL [y]
time    1 1980     NA    NA POSIXct 1850-01-16 12:00:00,...,2014-12-16 12:00:00    
> read_mdim(fn)
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
              Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
tasmax [K] 208.223 264.0227 283.8458 278.1252 297.9282 315.6577
dimension(s):
     from   to offset delta refsys                    values x/y
lon     1  192      0 1.875     NA                      NULL [x]
lat     1  144    -90  1.25     NA                      NULL [y]
time    1 1980     NA    NA   Date 1850-01-16,...,2014-12-16    
> suppressMessages(read_ncdf(fn))
stars object with 3 dimensions and 1 attribute
attribute(s), summary of first 1e+05 cells:
              Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
tasmax [K] 208.223 264.0227 283.8458 278.1252 297.9282 315.6577
dimension(s):
     from   to refsys                                      values x/y
lon     1  192 WGS 84                      [192] 0.9375,...,359.1 [x]
lat     1  144 WGS 84                      [144] -89.38,...,89.38 [y]
time    1 1980 CFtime 1850-01-16T12:00:00,...,2014-12-16T12:00:00

(Never mind the CFtime refsys in read_ncdf(), I am tinkering with the code to replace PCICt with CFtime. That's for another issue.)

Note that this is not a feature of the specific file that I am using here, it is consistent over whatever CF-compliant NetCDF file I throw at stars.

Apart from some differences that are in the "raise-my-eyebrows" domain (read_stars() renames dimensions to x and y, standard_name versus long_name for the variable, read_mdim() produces a Date for the time dimension, different representations of the x-y coordinate system), there is one more vexing issue.

read_stars() produces different summary statistics from the other two functions. Upon closer examination of the data in the underlying file I noted that read_stars() flips the y-axis, which is also observable from the inversion of the offset and delta values in the printed information. If I plot() a time slice then the result is ok for all functions so it does not appear to be an error. But then what to make of this?

edzer commented 7 months ago

Thanks, these are all good observations. The three different functions use three different code paths to get to the same array:

The summary stats are "summary of first 1e+05 cells", which depends on the order in the array and hence on the sign of y$delta: for read_stars this starts north, the others south. You can pass on n=Inf to print.stars to get the summary of the entire array, this should be identical for all.

Another observation is that only read_ncdf returns irregular x and y dimension values (no offset and delta): maybe they are nearly regularly spaced, and the tolerance for deciding they are to be considered regularly spaced is too tight.

The fact that each of them, plotted, gives the same image confirms that these are semantically nearly identical representations of the data.

pvanlaake commented 7 months ago

Interesting then that read_stars() and read_mdim() return different data organisation while both are based on GDAL.

Is there an interest to more closely align the information collected through each function? As mentioned, I am tinkering with ncdf.R and could do a proper fork and then send you a PR.

edzer commented 7 months ago

Interesting then that read_stars() and read_mdim() return different data organisation while both are based on GDAL.

Yes: one code base, two different data models

Is there an interest to more closely align the information collected through each function? As mentioned, I am tinkering with ncdf.R and could do a proper fork and then send you a PR.

read_stars() will always end up regular x and y dimensions, regardless whether they are regular in the netcdf file (another constraint of the RasterDataSet). It would be good if read_mdim() and read_ncdf agreed on this issue though, i.e. used the same tolerance.

I'm not sure what you are exactly tinkering with; @mdsumner and @dblodgett-usgs wrote read_ncdf(), so some approval from them would also be good.

pvanlaake commented 7 months ago

I am the developer of the CFtime package, supporting the full range of defined CF Metadata Convention calendars. I have worked with Michael over the last 6 months or so to make that functionality available in ncmeta. The dev version on GitHub now includes "extended" attributes, the only one of which is "time" (currently). Michael is working on a new release of ncmeta that will include this. Michael is thus well aware of my efforts and also my intention to look at stars. I am not sure how well Dave has been following recent developments but as ctb to ncmeta he may be automatically notified of changes.

I'll make a proposal in a new issue to present my ideas in more detail.

Closing here now.