hypertidy / ncmeta

Tidy NetCDF metadata
https://hypertidy.github.io/ncmeta/

Tibble issue with some NetCDF4 files. #42

Closed dblodgett-usgs closed 2 years ago

dblodgett-usgs commented 2 years ago

The issue is coming from here: https://github.com/hypertidy/ncmeta/blob/master/R/nc_var.R#L30

What's your preferred fix here @mdsumner?

f <- file.path(tempdir(), "temp.nc")
download.file("http://wrf-se-ak-ar5.s3.amazonaws.com/ccsm/hist/daily/1980/WRFDS_1980-01-01.nc", f, mode = "wb")

ncmeta::nc_meta(f)
#> Error:
#> ! Tibble columns must have compatible sizes.
#> • Size 2: Columns `filter_id` and `filter_params`.
#> • Size 3: Column `chunksizes`.
#> ℹ Only values of size one are recycled.

# it's because of this
nc <- RNetCDF::open.nc(f)
vi <- RNetCDF::var.inq.nc(nc, 1)
(vi <- vi[lengths(vi) > 1])
#> $dimids
#> [1] 3 2 0
#> 
#> $chunksizes
#> [1] 320 250   1
#> 
#> $filter_id
#> [1] 2 1
#> 
#> $filter_params
#> $filter_params[[1]]
#> [1] 4
#> 
#> $filter_params[[2]]
#> [1] 5

tibble::as_tibble(vi)
#> Error:
#> ! Tibble columns must have compatible sizes.
#> • Size 2: Columns `filter_id` and `filter_params`.
#> • Size 3: Columns `dimids` and `chunksizes`.
#> ℹ Only values of size one are recycled.

Created on 2022-07-25 by the reprex package (v2.0.1)

mdsumner commented 2 years ago

thanks! PR #43

dblodgett-usgs commented 2 years ago

Still getting the issue since some are size 3 and others size 2.

mdsumner commented 2 years ago

does that mean dimids and chunksizes are not related? or is there another column with length > 1?

I'll just make list cols from each perhaps
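
For example, a minimal sketch of that list-column idea (not necessarily what PR #43 does):

# sketch only: wrap any non-scalar element in a list so every tibble column has size one
nc <- RNetCDF::open.nc(f)
vi <- RNetCDF::var.inq.nc(nc, 1)
idx <- lengths(vi) != 1           # also catches NULL entries (length 0)
vi[idx] <- lapply(vi[idx], list)
tibble::as_tibble(vi)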

mdsumner commented 2 years ago

I do need to revisit this properly and work out a sensible schema - the same could be done for GDAL too, and perhaps some logic reused.

dblodgett-usgs commented 2 years ago

I honestly don't know what the filter_params and filter_id are, but those are what's causing the issue. The dimids and chunksizes should be the same size.

mdsumner commented 2 years ago

ok! thanks, I thought I had it, will look more closely 🙏

mjwoods commented 2 years ago

Hi all, the dimids and chunksizes should always be the same length. The filter_id and filter_params should also be the same length, but list members of filter_params can have different vector lengths (depending on the argument list used by each filter routine). The lengths of filter_id and dimids are not related to each other.

mjwoods commented 2 years ago

Correction - chunksizes can be NULL if a variable uses contiguous storage (i.e. is not chunked).
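
For illustration, a quick sketch of those invariants as checks against the reprex file from the first comment (assuming `nc` is still open):

vi <- RNetCDF::var.inq.nc(nc, 1)
# chunksizes is NULL for contiguous storage, otherwise matches dimids in length
stopifnot(is.null(vi$chunksizes) ||
          length(vi$chunksizes) == length(vi$dimids))
# filter_id and filter_params always have matching lengths
stopifnot(length(vi$filter_id) == length(vi$filter_params))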

mjwoods commented 2 years ago

Here are the relevant definitions from the RNetCDF help on var.inq.nc:

ckluss commented 2 years ago

@mjwoods does it mean that the data are corrupt, or that there is an issue in the RNetCDF library? I have the same issue with data from the German weather service (DWD), see https://stackoverflow.com/questions/73307392/reading-netcdf-files-tibble-columns-must-have-compatible-sizes (so perhaps the problem could be opened as a new RNetCDF issue?)

mjwoods commented 2 years ago

Hi @ckluss , RNetCDF is behaving as intended, because it is returning descriptive information about the filters applied to the variables in your dataset (i.e. compression). This information is provided by recent NetCDF library versions, and I added support for this feature to RNetCDF about a year ago. Unfortunately, the change has broken ncmeta when used on netcdf4 datasets with compressed variables. This breakage was not picked up by the existing tests, but I think we are close to a solution now. Once @mdsumner is satisfied that the solution works properly, I hope he can release an update for ncmeta. That should fix your problem.

mdsumner commented 2 years ago

this was auto-closed by commit

clairedavies commented 2 years ago

Hi, I think I am having a similar problem on a 64-bit Windows machine, using R 4.2.1 & ncdf4_1.19.zip from https://cran.r-project.org/web/packages/ncdf4/index.html.

If I run either of these code lines, RStudio just hangs:

nc <- ncdf4::nc_open("http://thredds.aodn.org.au/thredds/dodsC/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2016/20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc")
nc <- tidync::tidync("http://thredds.aodn.org.au/thredds/dodsC/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2016/20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc")

However, if I download the file and run it locally, it works fine. But I have thousands of these to run through, so downloading isn't practical.

nc <- ncdf4::nc_open("20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc")

With tidync on the downloaded file I get the following error:

nc <- tidync::tidync("20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc")
Error: Tibble columns must have compatible sizes.

Any tips for finding a way around this issue? Thanks

mdsumner commented 2 years ago

that is a different problem, caused in ncmeta - I'll have a look in coming days 🙏

mjwoods commented 2 years ago

Hi @clairedavies , tidync::tidync works for me on Windows using the remote dataset in your example. You may need to install the latest versions of tidync and ncmeta.

Like you, I found that ncdf4::nc_open hangs with the remote dataset. You could try using RNetCDF::open.nc instead, which works for me on Windows (using the latest RNetCDF version). Please let me know if that works for you. If it does work, you could modify your code to replace ncdf4 commands by their equivalents from RNetCDF. For example, print.nc displays the structure of the dataset, var.get.nc reads variables, and att.get.nc reads attributes.
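
A rough sketch of those equivalents (the attribute name here is a guess):

library(RNetCDF)
u <- "http://thredds.aodn.org.au/thredds/dodsC/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2016/20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc"
nc <- open.nc(u)
print.nc(nc)                              # structure of the dataset
sst <- var.get.nc(nc, "sea_surface_temperature",
                  start = c(1, 1, 1), count = c(50, 50, 1))
att.get.nc(nc, "sea_surface_temperature", "units")
close.nc(nc)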

clairedavies commented 2 years ago

Thank you both for the responses. I updated the packages and tidync::tidync works, but RStudio still hangs on tidync::hyper_tibble().

If I use RNetCDF, all I seem to get is NAs:

nc <- RNetCDF::open.nc("http://thredds.aodn.org.au/thredds/dodsC/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2016/20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc")
sst <- RNetCDF::var.get.nc(nc, variable = "sea_surface_temperature", start = c(20, 10, 1), count = c(50, 50, 1))

mdsumner commented 2 years ago

it's just a very sparse dataset, so your start/count doesn't intersect the data at all - IMO you need a higher-level tool than either RNetCDF or tidync for this source. With raster you get immediate helpful feedback and oversight of what's there. This file has poorly built lon/lat arrays, so some tools detect it as irregular when it is not. It's entirely intended to be a regular grid in 70, 190, -70, 20 (xmin, xmax, ymin, ymax) with 0.02 resolution.

For example

library(raster)
r <- raster::raster(f, varname = "sea_surface_temperature")
r
class      : RasterLayer 
dimensions : 4500, 6000, 2.7e+07  (nrow, ncol, ncell)
resolution : 0.02, 0.02  (x, y)
extent     : 70, 190, -70, 20  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : 20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc 
names      : sea.surface.foundation.temperature 
z-value    : 2016-01-05 09:20:00 
zvar       : sea_surface_temperature 

crop(r, extent(140, 160, -60, -40))
class      : RasterLayer 
dimensions : 1000, 1000, 1e+06  (nrow, ncol, ncell)
resolution : 0.02, 0.02  (x, y)
extent     : 140, 160, -60, -40  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
source     : memory
names      : sea.surface.foundation.temperature 
values     : 270.0998, 295.1398  (min, max)
time       : 2016-01-05 09:20:00 

Using raster and crop gives immediate and friendly tools for dealing with the data as a real map.

I'm sympathetic to how confusing this can be, because even raster's replacement won't work with this source, but for complex reasons that keep changing and are only a distraction imo. tidync is really for exploring the structure of a file, but this one is really simple: just a regular grid with a few variables. hyper_tibble is a fast way of expanding subsets of the data to a data frame, and this one is just too big to do as a whole.
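
If you do want tidync for a small subset, a sketch along these lines may work (the dimension names lon/lat are assumed here):

library(tidync)
u <- "http://thredds.aodn.org.au/thredds/dodsC/IMOS/SRS/SST/ghrsst/L3S-1d/dn/2016/20160105092000-ABOM-L3S_GHRSST-SSTfnd-AVHRR_D-1d_dn.nc"
sub <- tidync(u) |>
  activate("sea_surface_temperature") |>
  hyper_filter(lon = lon >= 140 & lon <= 160,
               lat = lat >= -60 & lat <= -40) |>
  hyper_tibble()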


clairedavies commented 2 years ago

Thanks again - appreciate the help

mjwoods commented 2 years ago

Hi @clairedavies , as @mdsumner says, the NA values seem to represent missing values (e.g. land) on the lon/lat grid. Note that the variables in this dataset have been 'packed' as a form of compression, so you probably want to retrieve the unpacked data using the argument unpack=TRUE in var.get.nc.
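
For example (the start/count window here is only illustrative - pick one over the ocean):

sst <- RNetCDF::var.get.nc(nc, variable = "sea_surface_temperature",
                           start = c(3500, 1500, 1), count = c(50, 50, 1),
                           unpack = TRUE)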

clairedavies commented 2 years ago

Yes, I noticed that, even more confusing. I think I'm sorted now with RNetCDF. Thanks for all the help

Tananaevs commented 2 years ago

Hi all,

I still get the same issue with tidync::tidync and some of the CMIP6 climate model outputs, with ncmeta 0.3.0, tidync 0.2.4 & RNetCDF 2.6-1 on R 4.2.1. The packages seem to have updated normally. The tibble error reproduces on outputs from certain models, not all of them (e.g. CESM2-WACCM, but not CMCC-ESM2).

mjwoods commented 2 years ago

Hi @Tananaevs , to help us test the problem, could you please provide some links to datasets that are causing problems? Also, are you running R on Windows or something else?

mjwoods commented 2 years ago

Hi @mdsumner , before we invest too much time testing this problem further, I just want to check if you have published a new version of ncmeta since our previous attempt to fix the problem (#44).

Tananaevs commented 2 years ago

@mjwoods yes, I am running the latest RStudio v.2022.07.1 build 554 on Win10. The following link allows downloading - upon registration - a wget script that starts further download of the problematic datasets: https://esgf-node.llnl.gov/esg-search/wget/?distrib=false&dataset_id=CMIP6.CMIP.NCAR.CESM2-WACCM.historical.r1i1p1f1.day.tas.gn.v20190227|esgf-data.ucar.edu

mjwoods commented 2 years ago

Hi @Tananaevs , the tibble issue has been fixed in the ncmeta package source on github, but the new version has not yet been published on CRAN.

I tested the new ncmeta version successfully on Windows, as shown below. It would be helpful for us if you could test the new version across all the CMIP6 datasets.

install.packages("devtools")
devtools::install_github("hypertidy/ncmeta")

setwd(tempdir())
options(timeout=max(300, getOption("timeout")))
download.file("http://esgf-data.ucar.edu/thredds/fileServer/esg_dataroot/CMIP6/CMIP/NCAR/CESM2-WACCM/historical/r1i1p1f1/day/tas/gn/v20190227/tas_day_CESM2-WACCM_historical_r1i1p1f1_gn_18500101-18591231.nc",
  "test.nc", mode="wb")
tidync::tidync("test.nc")
file.remove("test.nc")

acrunyon commented 2 years ago

@mjwoods - thank you for publishing this solution! Do you have an estimate when it will be published to CRAN?

mdsumner commented 2 years ago

imminent 🙏

mdsumner commented 2 years ago

I believe this is fixed, tested on the cases reported and now on CRAN as ncmeta 0.3.5. Thanks!

mjwoods commented 2 years ago

Thanks @mdsumner !

lisaholsinger commented 1 year ago

Greetings. I am having similar issues with hyper_tibble().

I am using R 4.2.1, ncmeta 0.3.5, RNetCDF 2.6.1, ncdf4 1.19, tidync 0.3.0 in Windows 10

Here is the code below, which previously worked in R 3.6.2 but now hangs on the hyper_tibble command.
I sure would appreciate any advice! Thank you in advance.

library(RNetCDF)
library(ncdf4)
library(tidync)
library(raster)
library(tidyverse)
library(ncmeta)

nc.grid <- tidync('http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_met_tmmx_1979_CurrentYear_CONUS.nc')

grid.yr <- nc.grid %>%
  activate("daily_maximum_temperature") %>%
  hyper_filter(day = dplyr::between(index, 14784, 14898),
               lat = dplyr::between(index, 263, 319),
               lon = dplyr::between(index, 387, 462))

grid.yr <- grid.yr %>% hyper_tibble()