hainegroup / oceanspy

A Python package to facilitate ocean model data analysis and visualization.
https://oceanspy.readthedocs.io
MIT License
96 stars 32 forks

Arctic_Control does not open #368

Closed MaceKuailv closed 8 months ago

MaceKuailv commented 1 year ago

I was looking at the datasets that we have, and the one called Arctic_Control cannot be opened. Is this a known issue?

On a side note, how much of our data is in netCDF? Do we want to convert some of it to zarr (for example, the datasets Joan has been using heavily)?

Mikejmnez commented 1 year ago

I was looking at the datasets that we have, and the one called Arctic_Control cannot be opened. Is this a known issue?

Can you share the error message (if there is one)?

MaceKuailv commented 1 year ago

Sorry, I forgot to include that. Seems to be an error when combining the data.

Here is the full error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[2], line 1
----> 1 od = ospy.open_oceandataset.from_catalog("Arctic_Control")

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/oceanspy/open_oceandataset.py:141, in from_catalog(name, catalog_url)
    138     mtdt = cat[entry].metadata
    140     # Create ds
--> 141     ds = cat[entry].to_dask()
    142 else:
    143     # Pop args and metadata
    144     args = cat[entry].pop("args")

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
     67 def to_dask(self):
     68     """Return xarray object where variables are dask arrays"""
---> 69     return self.read_chunked()

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
     42 def read_chunked(self):
     43     """Return xarray object (which will have chunks)"""
---> 44     self._load_metadata()
     45     return self._ds

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake/source/base.py:279, in DataSourceBase._load_metadata(self)
    277 """load metadata only if needed"""
    278 if self._schema is None:
--> 279     self._schema = self._get_schema()
    280     self.dtype = self._schema.dtype
    281     self.shape = self._schema.shape

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:18, in DataSourceMixin._get_schema(self)
     15 self.urlpath = self._get_cache(self.urlpath)[0]
     17 if self._ds is None:
---> 18     self._open_dataset()
     20     metadata = {
     21         'dims': dict(self._ds.dims),
     22         'data_vars': {k: list(self._ds[k].coords)
     23                       for k in self._ds.data_vars.keys()},
     24         'coords': tuple(self._ds.coords.keys()),
     25     }
     26     if getattr(self, 'on_server', False):

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/netcdf.py:92, in NetCDFSource._open_dataset(self)
     88 else:
     89     # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
     90     url = fsspec.open(self.urlpath, **self.storage_options).open()
---> 92 self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/backends/api.py:1053, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
   1049 try:
   1050     if combine == "nested":
   1051         # Combined nested list by successive concat and merge operations
   1052         # along each dimension, using structure given by "ids"
-> 1053         combined = _nested_combine(
   1054             datasets,
   1055             concat_dims=concat_dim,
   1056             compat=compat,
   1057             data_vars=data_vars,
   1058             coords=coords,
   1059             ids=ids,
   1060             join=join,
   1061             combine_attrs=combine_attrs,
   1062         )
   1063     elif combine == "by_coords":
   1064         # Redo ordering from coordinates, ignoring how they were ordered
   1065         # previously
   1066         combined = combine_by_coords(
   1067             datasets,
   1068             compat=compat,
   (...)
   1072             combine_attrs=combine_attrs,
   1073         )

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:359, in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join, combine_attrs)
    356 _check_shape_tile_ids(combined_ids)
    358 # Apply series of concatenate or merge operations along each dimension
--> 359 combined = _combine_nd(
    360     combined_ids,
    361     concat_dims,
    362     compat=compat,
    363     data_vars=data_vars,
    364     coords=coords,
    365     fill_value=fill_value,
    366     join=join,
    367     combine_attrs=combine_attrs,
    368 )
    369 return combined

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:235, in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join, combine_attrs)
    231 # Each iteration of this loop reduces the length of the tile_ids tuples
    232 # by one. It always combines along the first dimension, removing the first
    233 # element of the tuple
    234 for concat_dim in concat_dims:
--> 235     combined_ids = _combine_all_along_first_dim(
    236         combined_ids,
    237         dim=concat_dim,
    238         data_vars=data_vars,
    239         coords=coords,
    240         compat=compat,
    241         fill_value=fill_value,
    242         join=join,
    243         combine_attrs=combine_attrs,
    244     )
    245 (combined_ds,) = combined_ids.values()
    246 return combined_ds

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:270, in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join, combine_attrs)
    268     combined_ids = dict(sorted(group))
    269     datasets = combined_ids.values()
--> 270     new_combined_ids[new_id] = _combine_1d(
    271         datasets, dim, compat, data_vars, coords, fill_value, join, combine_attrs
    272     )
    273 return new_combined_ids

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:293, in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
    291 if concat_dim is not None:
    292     try:
--> 293         combined = concat(
    294             datasets,
    295             dim=concat_dim,
    296             data_vars=data_vars,
    297             coords=coords,
    298             compat=compat,
    299             fill_value=fill_value,
    300             join=join,
    301             combine_attrs=combine_attrs,
    302         )
    303     except ValueError as err:
    304         if "encountered unexpected variable" in str(err):

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:251, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs)
    239     return _dataarray_concat(
    240         objs,
    241         dim=dim,
   (...)
    248         combine_attrs=combine_attrs,
    249     )
    250 elif isinstance(first_obj, Dataset):
--> 251     return _dataset_concat(
    252         objs,
    253         dim=dim,
    254         data_vars=data_vars,
    255         coords=coords,
    256         compat=compat,
    257         positions=positions,
    258         fill_value=fill_value,
    259         join=join,
    260         combine_attrs=combine_attrs,
    261     )
    262 else:
    263     raise TypeError(
    264         "can only concatenate xarray Dataset and DataArray "
    265         f"objects, got {type(first_obj)}"
    266     )

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:507, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs)
    504     datasets = [cast(T_Dataset, ds.expand_dims(dim)) for ds in datasets]
    506 # determine which variables to concatenate
--> 507 concat_over, equals, concat_dim_lengths = _calc_concat_over(
    508     datasets, dim, dim_names, data_vars, coords, compat
    509 )
    511 # determine which variables to merge, and then merge them according to compat
    512 variables_to_merge = (coord_names | data_names) - concat_over - unlabeled_dims

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:408, in _calc_concat_over(datasets, dim, dim_names, data_vars, coords, compat)
    405         concat_over.update(opt)
    407 process_subset_opt(data_vars, "data_vars")
--> 408 process_subset_opt(coords, "coords")
    409 return concat_over, equals, concat_dim_lengths

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:340, in _calc_concat_over.<locals>.process_subset_opt(opt, subset)
    338     break
    339 elif len(variables) != len(datasets) and opt == "different":
--> 340     raise ValueError(
    341         f"{k!r} not present in all datasets and coords='different'. "
    342         f"Either add {k!r} to datasets where it is missing or "
    343         "specify coords='minimal'."
    344     )
    346 # first check without comparing values i.e. no computes
    347 for var in variables[1:]:

ValueError: 'X' not present in all datasets and coords='different'. Either add 'X' to datasets where it is missing or specify coords='minimal'.

We may as well just convert it to a single zarr file.

Mikejmnez commented 1 year ago

ValueError: 'X' not present in all datasets and coords='different'. Either add 'X' to datasets where it is missing or specify coords='minimal'.

Perhaps there is a simpler solution to this issue.
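For what it's worth, the error message itself hints at one: passing coords='minimal'. Here is a minimal sketch with toy in-memory datasets (synthetic stand-ins, not the actual Arctic_Control files) that reproduces the error and the suggested workaround:

```python
import numpy as np
import xarray as xr

# Toy stand-ins: one dataset carries the coordinate 'X', the other does not,
# mimicking SIGMA0.nc vs. the other files.
with_x = xr.Dataset({'v': (('T', 'X'), np.zeros((1, 3)))},
                    coords={'T': [0], 'X': [0, 1, 2]})
without_x = xr.Dataset({'v': (('T', 'X'), np.zeros((1, 3)))},
                       coords={'T': [1]})

try:
    xr.concat([with_x, without_x], dim='T')  # default coords='different'
except ValueError as err:
    print(err)  # 'X' not present in all datasets and coords='different'. ...

# coords='minimal' only concatenates coordinates that contain the concat
# dimension, so the missing 'X' is merged instead of raising.
combined = xr.concat([with_x, without_x], dim='T', coords='minimal')
```

If this also works for the real files, the option could presumably be passed through the catalog's xarray keyword arguments rather than converting the data.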

ThomasHaine commented 11 months ago

@MaceKuailv please follow up with @Mikejmnez to figure out the simpler solution.

Mikejmnez commented 11 months ago

Before fully transforming the entire dataset, I really wanted us to diagnose what was going on. I spent some time looking into this, and here is what I found:

1) The file SIGMA0.nc in every directory has no coordinates.

path1 = '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/days10950-12045/DIAGS/'
ds = xr.open_dataset(path1+'SIGMA0.nc')
ds.coords
Coordinates:
    *empty*

2) Within a single directory, the following works:

ds = xr.open_mfdataset(path1+'*.nc', drop_variables=['diag_levels'])

3) When trying to read the entire dataset across multiple directories, xarray cannot figure out how best to concatenate variables to combine the datasets, because SIGMA0 has no coordinates.

Excluding SIGMA0 works, and there is actually no need to specify the arguments combine='nested' and concat_dim='T'. That is, the following works:

ds = xr.open_mfdataset(list_files1 + list_files2, engine='netcdf4', parallel=True,  drop_variables=['diag_levels', 'iter'])

where

list_files1 = []
for var in ds.data_vars:  # ds opened from path1, as in step 2 above
    if var != 'SIGMA0':
        list_files1.append(path1 + var + '.nc')

and similarly for list_files2 with a different directory path2.

Including SIGMA0.nc does not work because it has no coordinates. The issue with SIGMA0 is a bit frustrating because it is a variable that gets dropped anyway (see the intake catalog entry). However, it seems the xr.Dataset needs to be created before the variable can be dropped.

Solutions

1) Modify each SIGMA0.nc file across all directories so that it has consistent coordinates.

2) As Wenrui suggested, create a single zarr file containing all the data, with well-defined coordinates and dimensions.

3) Modify the intake catalog so that urlpath contains the paths for all variables excluding SIGMA0.nc, similar to list_files1 + list_files2.
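For reference, option 1 could look something like the sketch below: copy the coordinates from a sibling file that has them onto SIGMA0. The datasets here are synthetic stand-ins, and the variable names are only illustrative:

```python
import numpy as np
import xarray as xr

def copy_coords(target, template):
    """Attach the template's coordinates to a dataset that lacks them."""
    return target.assign_coords(
        {name: coord for name, coord in template.coords.items()
         if all(dim in target.dims for dim in coord.dims)}
    )

# Synthetic stand-ins for e.g. THETA.nc (with coords) and SIGMA0.nc (without).
theta = xr.Dataset({'THETA': (('Z', 'X'), np.zeros((2, 4)))},
                   coords={'Z': [0.0, -10.0], 'X': [0, 1, 2, 3]})
sigma0 = xr.Dataset({'SIGMA0': (('Z', 'X'), np.zeros((2, 4)))})

fixed = copy_coords(sigma0, theta)
# fixed.to_netcdf(...) would then replace each SIGMA0.nc on disk.
```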

ThomasHaine commented 11 months ago

I'm neutral about these solutions, although perhaps number 2 is best? I also think we should try to be consistent across all datasets, as much as possible.

Mikejmnez commented 10 months ago

I'm neutral about these solutions, although perhaps number 2 is best? I also think we should try to be consistent across all datasets, as much as possible.

sounds like a good plan.

ThomasHaine commented 10 months ago

@MaceKuailv can you implement this fix (number 2, single zarr file) above?

MaceKuailv commented 10 months ago

Yes, I will open the files with Mike's method, convert them to zarr, and create new files.
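For the record, the conversion amounts to something like the sketch below. Synthetic in-memory datasets stand in for the per-variable netCDF files, and the paths in the comments are placeholders:

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for the per-variable netCDF files (THETA.nc, SALT.nc, ...).
x = np.arange(4)
ds_theta = xr.Dataset({'THETA': ('X', np.random.rand(4))}, coords={'X': x})
ds_salt = xr.Dataset({'SALT': ('X', np.random.rand(4))}, coords={'X': x})

# In practice this step would be Mike's call from above:
#   ds = xr.open_mfdataset(list_files1 + list_files2, parallel=True,
#                          drop_variables=['diag_levels', 'iter'])
ds = xr.merge([ds_theta, ds_salt])

# A single write then produces the proposed one-store dataset:
#   ds.to_zarr('/path/to/Arctic_Control.zarr', consolidated=True)
```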

MaceKuailv commented 10 months ago
(py39) idies@fde1b55f22c5:~$ touch /home/idies/workspace/OceanCirculation/exp_Arctic_Control/hello_its_me.txt
touch: cannot touch '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/hello_its_me.txt': Read-only file system

I could not access the original data from datascope, and could not write to filedb or OceanCirculation with my sciserver account "wenrui".

I have the files sitting in my scratch directory, and I can move them over whenever I get permission. As the soon-to-be third most senior member of the group, I think I am trustworthy enough.

ThomasHaine commented 10 months ago

Ask Gerard for permission. I don't think I can give it to you!

MaceKuailv commented 8 months ago

#396 fixed this