There's no proxy for read_ncdf, did you try that with read_stars?
(I will get tidync on CRAN next week and show how I would go about this)
Can't test now since the cluster's 64GB queue is busy, but I will. I expect read_stars(..., proxy = TRUE) will work OK for accessing the file, but then any calculation (st_apply) will probably fail. Thinking about it, it would be awesome if st_apply were chunk-aware and could perform computations one chunk at a time on proxy items, saving on RAM, but I understand this is a very specific use case which only happens for NetCDF (although I run into this issue - performing chunk-aware computations - very often).
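For the record, this kind of chunk-wise computation can already be done by hand; a minimal sketch (assuming ncdf4, the E-OBS file from this issue, and an arbitrary block of 365 time steps) that accumulates a temporal mean of pp without ever holding the full array in RAM:
library(ncdf4)
f <- "pp_ens_mean_0.25deg_reg_v19.0e.nc"
nc <- nc_open(f)
nt <- nc$dim$time$len
block <- 365                         # arbitrary chunk length along time
sum_pp <- NULL; n_ok <- NULL
for (start in seq(1, nt, by = block)) {
  nread <- min(block, nt - start + 1)
  # read only [all longitudes, all latitudes, nread time steps]
  slab <- ncvar_get(nc, "pp", start = c(1, 1, start), count = c(-1, -1, nread),
                    collapse_degen = FALSE)
  if (is.null(sum_pp)) {
    sum_pp <- array(0, dim(slab)[1:2])
    n_ok   <- array(0, dim(slab)[1:2])
  }
  sum_pp <- sum_pp + apply(slab, 1:2, sum, na.rm = TRUE)
  n_ok   <- n_ok   + apply(slab, 1:2, function(x) sum(!is.na(x)))
}
mean_pp <- sum_pp / n_ok             # per-cell temporal mean
nc_close(nc)
Each block here is roughly 270MB in memory, so peak usage stays small; the obvious downside is that none of the stars dimension/CRS handling is available inside the loop, which is exactly what a chunk-aware st_apply would fix.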
Made a start here, at least for reading via proxy for netcdf:
https://github.com/mdsumner/stars/commit/b19fe0dc84354974cae430204a50c54420a952e0
The crux for slicing is to pass the dimension expressions along to tidync::hyper_filter, and this new function defaults to proxy (but won't work in the other ways stars_proxy does yet). Use the proxy to find out the form of the source:
remotes::install_github("mdsumner/stars@mdsumner-dev")
f <- "pp_ens_mean_0.25deg_reg_v19.0e.nc"
stars::read_stars_tidync(f)
stars_proxy object with 1 attribute in file:
$names
[1] "pp_ens_mean_0.25deg_reg_v19.0e.nc"
dimension(s):
from to offset delta refsys point values
longitude 1 464 -40.375 0.25 NA NA NULL
latitude 1 201 25.375 0.25 NA NA NULL
time 1 25202 0 1 NA NA NULL
Use the hyper_filter functionality to hone in, and set proxy to FALSE to read:
stars::read_stars_tidync(f, time = index < 5, proxy = FALSE)
stars object with 3 dimensions and 1 attribute
attribute(s):
pp
Min. : 911.7
1st Qu.:1007.7
Median :1016.5
Mean :1015.6
3rd Qu.:1024.7
Max. :1046.7
NA's :325714
dimension(s):
from to offset delta refsys point values
latitude 1 201 25.375 0.25 NA NA NULL
longitude 1 464 -40.375 0.25 NA NA NULL
time 1 4 0 1 NA NA NULL
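Presumably (my extrapolation from the hyper_filter semantics, not something tested above) other dimensions can be filtered in the same call, by index or by coordinate value, e.g.:
stars::read_stars_tidync(f, longitude = longitude > 0, time = index < 5, proxy = FALSE)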
I'm interested in how stars might use this kind of interface, and the problem is always about showing the user the available dimensions, their lengths/ranges and so forth.
(The stars/dims object creation is way better now too, so I'm also going to refactor a few things.)
Manually processing different chunks is an option and it works well, but I was wondering if some automated procedure can be applied. In general, however, I do not understand why an 18GB object can manage to bring a 64GB machine to its knees even when just using show; that is a limitation that needs to be looked into. Maybe being lazy (proxy = TRUE) by default would be a good idea in the long term.
Do you know how to determine the internal tiling scheme of NetCDF?
The ncdump -sh output of a NetCDF-4 file will include "secret" attributes like:
lon:_Storage = "chunked" ;
lon:_ChunkSizes = 609, 659 ;
lon:_DeflateLevel = 5 ;
lon:_Shuffle = "true" ;
lon:_Endianness = "little" ;
The _ChunkSizes attribute tells you the chunking scheme for that variable.
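If ncdump is on the PATH, those hidden attributes can also be pulled from R without any NetCDF package; a rough sketch (using the wrfout file shared further down):
# -h prints the header only, -s adds the "special" virtual attributes
hdr <- system2("ncdump", c("-hs", "wrfout_hist_prcp_2002-09-01.nc4"), stdout = TRUE)
grep("_ChunkSizes|_Storage|_DeflateLevel", hdr, value = TRUE)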
I don't think RNetCDF hits those hidden attributes though... ncdf4 does, although maybe not in all installs or versions of NetCDF-4? See the code below for an example.
download.file("https://github.com/r-spatial/stars/files/3112497/wrfout_hist_prcp_2002-09-01.nc4.zip",
"wrfout_hist_prcp_2002-09-01.nc4.zip")
unzip("wrfout_hist_prcp_2002-09-01.nc4.zip")
nc <- RNetCDF::open.nc("wrfout_hist_prcp_2002-09-01.nc4")
finq <- RNetCDF::file.inq.nc(nc)
dinq <- RNetCDF::dim.inq.nc(nc, 1)
vinq <- RNetCDF::var.inq.nc(nc, "lat")
lat_att_names <- lapply(c(0:(vinq$natts-1)), function(x) RNetCDF::att.inq.nc(nc, "lat", x))
lat_atts <- lapply(c(0:(vinq$natts-1)), function(x) RNetCDF::att.get.nc(nc, "lat", x))
ncd <- ncdf4::nc_open("wrfout_hist_prcp_2002-09-01.nc4")
ncd$format
ncd$var$prcp$chunksizes
ncdf4::ncatt_get(ncd, "prcp", "_ChunkSizes")
Excellent, thanks! I usually do this with ncdf4 and never encountered problems. I'm seeing:
> ncd <- ncdf4::nc_open("wrfout_hist_prcp_2002-09-01.nc4")
> ncd$format
[1] "NC_FORMAT_NETCDF4_CLASSIC"
> ncd$var$prcp$chunksizes
[1] NA
> ncdf4::ncatt_get(ncd, "prcp", "_ChunkSizes")
$hasatt
[1] FALSE
$value
[1] 0
Was that expected?
Yeah... the _ChunkSizes attribute is a "special" attribute that isn't accessible that way, I guess. See the -s flag here for more. I would expect it to be in the ncd$var$prcp$chunksizes element, but it's NA for some reason on my computer.
Weird. I suspect that _ChunkSizes is NA in ncdf4 since the chunk size is equal to the variable size - that is, there is only one chunk in the file.
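One way to check that hypothesis (a sketch with ncdf4, not something verified here) is to compare the reported chunk sizes against the dimension lengths of the variable:
ncd <- ncdf4::nc_open("wrfout_hist_prcp_2002-09-01.nc4")
v <- ncd$var$prcp
sapply(v$dim, function(d) d$len)   # dimension lengths of prcp
v$chunksizes                       # NA (or equal to the lengths) would suggest a single chunk
ncdf4::nc_close(ncd)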
Ahh, that makes sense. But checking another file that is chunked for real -- too large to share the binary here -- I see the same thing. The documentation of the ncvar4 class on page 16 here doesn't include any description of the chunksizes element of the class. Ahh, here it is. Looks like the format has to be NC_FORMAT_NETCDF4, not NC_FORMAT_NETCDF4_CLASSIC, but the CLASSIC data model CAN use chunking!
I just did: nccopy -4 wrfout_hist_prcp_2002-09-01.nc4 wrfout_hist_prcp_2002-09-01.nc
and got:
> ncd <- ncdf4::nc_open("wrfout_hist_prcp_2002-09-01.nc")
> ncd$format
[1] "NC_FORMAT_NETCDF4"
> ncd$var$prcp$chunksizes
[1] 659 609 1
> ncdf4::ncatt_get(ncd, "prcp", "_ChunkSizes")
$hasatt
[1] FALSE
$value
[1] 0
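Given that format quirk, a small helper along these lines could paper over the difference (a hypothetical sketch, not part of stars or ncdf4; it falls back to the dimension lengths when no explicit chunking is reported):
get_chunking <- function(path, varname) {
  nc <- ncdf4::nc_open(path)
  on.exit(ncdf4::nc_close(nc))
  v <- nc$var[[varname]]
  cs <- v$chunksizes
  if (length(cs) == 0L || all(is.na(cs))) {
    # nothing reported: treat the whole variable as a single (contiguous) chunk
    cs <- sapply(v$dim, function(d) d$len)
  }
  list(format = nc$format, chunksizes = cs)
}
get_chunking("wrfout_hist_prcp_2002-09-01.nc",  "prcp")   # 659 609 1, as above
get_chunking("wrfout_hist_prcp_2002-09-01.nc4", "prcp")   # falls back to the dimension lengths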
Another point on this is that the variable 'pp' is internally a short integer, but gets expanded into double by the "unpacking" (application of scale/offset). In tidync that is controlled by the "raw_datavals" argument, i.e.
f <- "pp_ens_mean_0.25deg_reg_v19.0e.nc"
a <- tidync(f) %>% hyper_array(raw_datavals = TRUE)
which halves the size compared to the unpacked version. Is that of interest? (I'm working on making read_ncdf always use tidync, and default to proxy = TRUE, so I'll be changing a few things and hopefully expose this control over the raw read.)
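For what it's worth, the factor of two is easy to see directly (a sketch assuming the tidync dev version; the time filter is only there to keep the example quick):
library(tidync)
library(magrittr)   # for %>%, in case the pipe is not re-exported
f <- "pp_ens_mean_0.25deg_reg_v19.0e.nc"
packed   <- tidync(f) %>% hyper_filter(time = index < 100) %>% hyper_array(raw_datavals = TRUE)
unpacked <- tidync(f) %>% hyper_filter(time = index < 100) %>% hyper_array()
object.size(packed)     # short integers come back as 4-byte R integers
object.size(unpacked)   # scale/offset applied, 8-byte doubles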
The decision to load everything in 8-byte doubles, rather than 4-byte integers when possible, was intentional. My reasoning was that a factor 2 is not worth the trouble (of e.g. having to postpone the scale/offset). When chunking is done anyway, one can get down to a factor-anything at that point.
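Just to put numbers on that factor for the file in this issue (plain arithmetic, using the dimensions printed above):
cells <- 201 * 464 * 25202   # latitude * longitude * time
cells * 8 / 2^30             # ~17.5 GiB when held as doubles
cells * 4 / 2^30             # ~8.8 GiB if kept as 4-byte integers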
> Another point on this is that the variable 'pp' is internally a short integer, but gets expanded into double by the "unpacking" (application of scale/offset). In tidync that is controlled by the "raw_datavals" argument, i.e.
> f <- "pp_ens_mean_0.25deg_reg_v19.0e.nc"
> a <- tidync(f) %>% hyper_array(raw_datavals = TRUE)
> which halves the size compared to the unpacked version. Is that of interest?
Yeah, I sometimes use that same flag in ncdf4::ncvar_get; the way I see it, since it's just a flag that sits there, it might as well stay.
> (I'm working on making read_ncdf always use tidync, and default to proxy = TRUE, so I'll be changing a few things and hopefully expose this control over the raw read.)
That is great. What I do not understand is whether there is any plan to make processing-by-chunk the default once proxy = TRUE. I do not even know if that's possible.
@mdsumner, which branch should I test right now for read_ncdf(proxy = TRUE)? Can I go with the netcdf-dev that you just defaulted to?
In an ideal world, I think, one would entirely get away from proxy=FALSE unless the object is created in-memory, but even then you could write it to disk first and continue with proxy=TRUE. Having said that, that would ask for ultimate flexibility in chunking approaches, and clever handling of cases that cannot be chunked at all (multidimensional segmentation? watersheds?). I'm not sure we can, or anyone could, ever get there, and still hide the chunking details from the user. See also the limitations a system like GEE has.
> In an ideal world, I think, one would entirely get away from proxy=FALSE

You mean TRUE?
> unless the object is created in-memory, but even then you could write it to disk first and continue with proxy=TRUE. Having said that, that would ask for ultimate flexibility in chunking approaches, and clever handling of cases that cannot be chunked at all (multidimensional segmentation? watersheds?). I'm not sure we can, or anyone could, ever get there, and still hide the chunking details from the user. See also the limitations a system like GEE has.
There will always be corner cases. In my personal experience, handling chunks gracefully and intelligently would enormously help the handling of large datasets - but this of course might be biased towards what I do, which is climate analysis.
Some software assumes a given chunking scheme; CDO, for example, silently assumes that datasets have a chunking along time of length 1. This makes it extremely fast (much faster than anything else) for all those files that comply (>90% in climate), even for some complex functions. On the other hand:
While the limitations are great, the plusses are such that CDO is by far the most dominant tool for data analysis in climate science.
Yes, netcdf-dev; there is a new read_stars_tidync function. Units for dims and vars, and consolidation of the two funs is coming - was a good day :)
I think chunking is easy to do once the base tools are right, but auto-determining the best strategy is very hard (it's what DB systems do, by locking down the context very hard). I have some very big climate output use cases here and am hoping some colleagues chip in soon too. Multiple sources is more important to abstract over first, IMO.
What do you mean by multiple sources?
More than one file. NetCDF has this concept of an unlimited dimension, which just spreads a long series over multiple files. With tidync and raadtools I've learnt to not create a dataset, but rather provide a curated series of files of known structure. There's a lot of lazy potential here!
@mdsumner -- note that an "unlimited" dimension actually has more to do with how the file is structured than with spreading across multiple files. An unlimited dimension has the data and time coordinates interleaved -- so if you want to keep writing records with unique time stamps, you can. It just so happens that if you want to tack files together or split them apart, it's super easy if the file is already unlimited along time.
The problem with an unlimited dimension is that, to scan a 1d variable along the unlimited dimension (e.g. time), you have to move the disk head over all the interleaved data! Faster with SSD, but still lots of atomic reads for what would (in a file without an unlimited dimension) be a contiguous block of bits.
So, for virtual file aggregation, as would be the case for a proxy=TRUE case, a collection of unlimited dimensions with time as the outer (most rapidly varying) dimension would be preferable, as the initial scan to create the collection would be much faster.
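As an aside, whether a dimension is the unlimited/record one is easy to check from R; a sketch with ncdf4 (ncdim4 objects carry an unlim flag), using the file from earlier in the thread:
ncd <- ncdf4::nc_open("wrfout_hist_prcp_2002-09-01.nc")
sapply(ncd$dim, function(d) d$unlim)   # TRUE for the record (unlimited) dimension, if any
ncdf4::nc_close(ncd)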
@adrfantini -- the chunking of time being most common, IMHO, is a reflection of time being the most common default outer dimension (required in COARDS?). In NetCDF3, this just means each timestep worth of X/Y data is contiguous on disk so if you were to use that chunking pattern and not compress anything, it would function essentially the same.
> @adrfantini -- the chunking of time being most common, IMHO, is a reflection of time being the most common default outer dimension (required in COARDS?). In NetCDF3, this just means each timestep worth of X/Y data is contiguous on disk, so if you were to use that chunking pattern and not compress anything, it would function essentially the same.
Time is usually unlimited and its chunk is of length one for a very simple reason: climate models write out the results every timestep, one timestep at a time. So they just increment the time dimension by 1, and to avoid rewrites this implies that the time chunk must be 1. Not all climate model output is subsequently postprocessed and rechunked, and even when it is, this chunking pattern is usually retained, since it is somewhat of an unwritten standard.
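(For completeness: a file can be rechunked into - or out of - that time-chunk-of-1 layout with nccopy; a sketch from R, with chunk sizes that are only illustrative, matched to the E-OBS grid above.)
# rewrite as NetCDF-4 with one full latitude/longitude field per chunk along time
system2("nccopy",
        c("-4", "-c", "time/1,latitude/201,longitude/464",
          "pp_ens_mean_0.25deg_reg_v19.0e.nc", "pp_rechunked.nc"))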
Too true. Time being unlimited is also an artifact of nco requiring unlimited dimensions for gluing multiple files together or splitting them apart.
> Yes, netcdf-dev; there is a new read_stars_tidync function. Units for dims and vars, and consolidation of the two funs is coming - was a good day :)
@mdsumner mdsumner/stars@netcdf-dev? Because I did that and I do not have any read_stars_tidync function.
Oh sorry, it's only internal - I didn't think of that. It's still too early unless you like ~suffering~ chasing development - devtools::load_all() is the fastest way to load everything and avoid the tidync:::read_stars_tidync syntax.
And you also can't pull a lazy_stars from this function, you can only run it again with proxy=FALSE.
> Oh sorry, it's only internal - I didn't think of that. It's still too early unless you like ~suffering~ chasing development - devtools::load_all() is the fastest way to load everything and avoid the tidync:::read_stars_tidync syntax.
I'm having issues with this:
> devtools::install_github('mdsumner/stars@netcdf-dev')
....
> devtools::load_all('~/R/x86_64-pc-linux-gnu-library/3.5/stars/R/stars')
Loading stars
Skipping missing files: init.R, stars.R, read.R, sf.R, dimensions.R, values.R, plot.R, tidyverse.R, transform.R, ops.R, write.R, raster.R, sp.R, spacetime.R, ncdf.R, proxy.R, factors.R, rasterize.R, subset.R, warp.R, aggregate.R, xts.R, intervals.R, geom.R
Warning messages:
1: S3 methods ‘$<-.stars’, ‘[.dimension’, ‘[.dimensions’, ‘[.stars’, ‘[.stars_proxy’, ‘[<-.stars’, ‘dimnames<-.stars’, ‘st_crs<-.stars’, ‘Math.stars’, ‘Math.stars_proxy’, ‘Ops.stars’, ‘Ops.stars_proxy’, ‘adrop.stars’, ‘adrop.stars_proxy’, ‘aggregate.stars’, ‘aggregate.stars_proxy’, ‘aperm.stars’, ‘aperm.stars_proxy’, ‘as.data.frame.stars’, ‘c.stars’, ‘c.stars_proxy’, ‘cut.array’, ‘cut.matrix’, ‘cut.stars’, ‘dim.dimensions’, ‘dim.stars’, ‘dim.stars_proxy’, ‘dimnames.stars’, ‘drop_units.stars’, ‘image.stars’, ‘is.na.stars’, ‘merge.stars’, ‘merge.stars_proxy’, ‘plot.stars’, ‘plot.stars_proxy’, ‘predict.stars’, ‘predict.stars_proxy’, ‘print.dimensions’, ‘print.stars’, ‘print.stars_proxy’, ‘print.stars_raster’, ‘seq.dimension’, ‘split.stars’, ‘split.stars_proxy’, ‘st_apply.stars’, ‘st_apply.stars_proxy’, ‘st_ar [... truncated]
2: In setup_ns_exports(path, export_all, export_imports) :
Objects listed as exports, but not present in namespace: geom_stars, make_intervals, read_ncdf, read_stars, st_apply, st_as_stars, st_contour, st_dimensions, st_get_dimension_values, st_rasterize, st_redimension, st_set_dimensions, st_sfc2xy, st_warp, st_xy2sfc, write_stars
Would you please be able to provide an MRE?
EDIT:
Also, I cannot find tidync:::read_stars_tidync after installing mdsumner/tidync.
I wouldn't expect load_all to work like that; I was expecting you to clone the repo and go from there.
Also, I made a mistake! It's stars:::read_stars_tidync.
I am opening a large-ish NetCDF file for testing purposes. It is a 25km E-OBS daily precipitation dataset, 201*464*25202 in size (332MB on disk, compressed). It can be downloaded from here (direct link to the file here).
I am working on a machine with 64GB of RAM. I can load the dataset:
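(A minimal sketch of the kind of call involved - assuming read_ncdf, though read_stars without proxy behaves the same for this purpose:)
library(stars)
f <- "pp_ens_mean_0.25deg_reg_v19.0e.nc"
y <- read_ncdf(f)   # the full ~17GB array is materialised in memory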
The file takes up 18803524336 bytes of RAM (~17GB), which is consistent with its size (201*464*25202*8 = 18803514624). When trying to access its content, even by just show(y), stars requests a lot of memory, causing the system to hang and kill the R process. However, I can work around the problem by issuing gc() manually before the show(y): this works OK.
So, a few points:
- Why is stars requesting multiple GB of RAM just for show(y)? Is this only for the summary statistics on the array? The RAM usage in top goes from 18GB to 53GB just for the show(y)!
- Calling gc() manually was not usually required, but here it seems that R cannot handle the automatic cleaning before getting killed. Is there anything that can be done at the stars level for this? Maybe calling gc() at the end of some functions?
- Any st_apply call will fail due to excessive RAM requirements, despite the dataset being only 18GB and the system having 64GB. There is a lot of overhead involved.
I understand this is probably an R limitation and not a stars limitation.