Closed: MaceKuailv closed this issue 8 months ago.
I was looking at the datasets that we have, and the one called Arctic_Control cannot be opened. Is this something we know?
Can you share what the error is (if there is one)?
Sorry, I forgot to include that. Seems to be an error when combining the data.
Here is the full error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[2], line 1
----> 1 od = ospy.open_oceandataset.from_catalog("Arctic_Control")
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/oceanspy/open_oceandataset.py:141, in from_catalog(name, catalog_url)
138 mtdt = cat[entry].metadata
140 # Create ds
--> 141 ds = cat[entry].to_dask()
142 else:
143 # Pop args and metadata
144 args = cat[entry].pop("args")
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
67 def to_dask(self):
68 """Return xarray object where variables are dask arrays"""
---> 69 return self.read_chunked()
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
42 def read_chunked(self):
43 """Return xarray object (which will have chunks)"""
---> 44 self._load_metadata()
45 return self._ds
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake/source/base.py:279, in DataSourceBase._load_metadata(self)
277 """load metadata only if needed"""
278 if self._schema is None:
--> 279 self._schema = self._get_schema()
280 self.dtype = self._schema.dtype
281 self.shape = self._schema.shape
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:18, in DataSourceMixin._get_schema(self)
15 self.urlpath = self._get_cache(self.urlpath)[0]
17 if self._ds is None:
---> 18 self._open_dataset()
20 metadata = {
21 'dims': dict(self._ds.dims),
22 'data_vars': {k: list(self._ds[k].coords)
23 for k in self._ds.data_vars.keys()},
24 'coords': tuple(self._ds.coords.keys()),
25 }
26 if getattr(self, 'on_server', False):
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/netcdf.py:92, in NetCDFSource._open_dataset(self)
88 else:
89 # https://github.com/intake/filesystem_spec/issues/476#issuecomment-732372918
90 url = fsspec.open(self.urlpath, **self.storage_options).open()
---> 92 self._ds = _open_dataset(url, chunks=self.chunks, **kwargs)
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/backends/api.py:1053, in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, data_vars, coords, combine, parallel, join, attrs_file, combine_attrs, **kwargs)
1049 try:
1050 if combine == "nested":
1051 # Combined nested list by successive concat and merge operations
1052 # along each dimension, using structure given by "ids"
-> 1053 combined = _nested_combine(
1054 datasets,
1055 concat_dims=concat_dim,
1056 compat=compat,
1057 data_vars=data_vars,
1058 coords=coords,
1059 ids=ids,
1060 join=join,
1061 combine_attrs=combine_attrs,
1062 )
1063 elif combine == "by_coords":
1064 # Redo ordering from coordinates, ignoring how they were ordered
1065 # previously
1066 combined = combine_by_coords(
1067 datasets,
1068 compat=compat,
(...)
1072 combine_attrs=combine_attrs,
1073 )
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:359, in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join, combine_attrs)
356 _check_shape_tile_ids(combined_ids)
358 # Apply series of concatenate or merge operations along each dimension
--> 359 combined = _combine_nd(
360 combined_ids,
361 concat_dims,
362 compat=compat,
363 data_vars=data_vars,
364 coords=coords,
365 fill_value=fill_value,
366 join=join,
367 combine_attrs=combine_attrs,
368 )
369 return combined
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:235, in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join, combine_attrs)
231 # Each iteration of this loop reduces the length of the tile_ids tuples
232 # by one. It always combines along the first dimension, removing the first
233 # element of the tuple
234 for concat_dim in concat_dims:
--> 235 combined_ids = _combine_all_along_first_dim(
236 combined_ids,
237 dim=concat_dim,
238 data_vars=data_vars,
239 coords=coords,
240 compat=compat,
241 fill_value=fill_value,
242 join=join,
243 combine_attrs=combine_attrs,
244 )
245 (combined_ds,) = combined_ids.values()
246 return combined_ds
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:270, in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join, combine_attrs)
268 combined_ids = dict(sorted(group))
269 datasets = combined_ids.values()
--> 270 new_combined_ids[new_id] = _combine_1d(
271 datasets, dim, compat, data_vars, coords, fill_value, join, combine_attrs
272 )
273 return new_combined_ids
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/combine.py:293, in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
291 if concat_dim is not None:
292 try:
--> 293 combined = concat(
294 datasets,
295 dim=concat_dim,
296 data_vars=data_vars,
297 coords=coords,
298 compat=compat,
299 fill_value=fill_value,
300 join=join,
301 combine_attrs=combine_attrs,
302 )
303 except ValueError as err:
304 if "encountered unexpected variable" in str(err):
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:251, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs)
239 return _dataarray_concat(
240 objs,
241 dim=dim,
(...)
248 combine_attrs=combine_attrs,
249 )
250 elif isinstance(first_obj, Dataset):
--> 251 return _dataset_concat(
252 objs,
253 dim=dim,
254 data_vars=data_vars,
255 coords=coords,
256 compat=compat,
257 positions=positions,
258 fill_value=fill_value,
259 join=join,
260 combine_attrs=combine_attrs,
261 )
262 else:
263 raise TypeError(
264 "can only concatenate xarray Dataset and DataArray "
265 f"objects, got {type(first_obj)}"
266 )
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:507, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs)
504 datasets = [cast(T_Dataset, ds.expand_dims(dim)) for ds in datasets]
506 # determine which variables to concatenate
--> 507 concat_over, equals, concat_dim_lengths = _calc_concat_over(
508 datasets, dim, dim_names, data_vars, coords, compat
509 )
511 # determine which variables to merge, and then merge them according to compat
512 variables_to_merge = (coord_names | data_names) - concat_over - unlabeled_dims
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:408, in _calc_concat_over(datasets, dim, dim_names, data_vars, coords, compat)
405 concat_over.update(opt)
407 process_subset_opt(data_vars, "data_vars")
--> 408 process_subset_opt(coords, "coords")
409 return concat_over, equals, concat_dim_lengths
File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/xarray/core/concat.py:340, in _calc_concat_over.<locals>.process_subset_opt(opt, subset)
338 break
339 elif len(variables) != len(datasets) and opt == "different":
--> 340 raise ValueError(
341 f"{k!r} not present in all datasets and coords='different'. "
342 f"Either add {k!r} to datasets where it is missing or "
343 "specify coords='minimal'."
344 )
346 # first check without comparing values i.e. no computes
347 for var in variables[1:]:
ValueError: 'X' not present in all datasets and coords='different'. Either add 'X' to datasets where it is missing or specify coords='minimal'.
We may as well just convert it to a single zarr file.
> ValueError: 'X' not present in all datasets and coords='different'. Either add 'X' to datasets where it is missing or specify coords='minimal'.
Perhaps there is a simpler solution to this issue.
@MaceKuailv please follow up with @Mikejmnez to figure out the simpler solution.
Before fully transforming the entire dataset, I really wanted us to diagnose what was going on... I spent some time looking into this and this is what I found:
1) The file SIGMA0.nc in every directory has no coordinates:

import xarray as xr

path1 = '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/days10950-12045/DIAGS/'
ds = xr.open_dataset(path1 + 'SIGMA0.nc')
ds.coords

Coordinates:
    *empty*
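A quick way to confirm that SIGMA0.nc is the odd one out is to loop over every file in a directory and print its coordinate names. This is only a minimal sketch, assuming the same path1 as above and that one netCDF file per variable sits directly in the DIAGS directory:

import glob
import xarray as xr

path1 = '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/days10950-12045/DIAGS/'

# Print the coordinate names of every netCDF file in the directory;
# SIGMA0.nc should be the only one that comes back empty.
for f in sorted(glob.glob(path1 + '*.nc')):
    with xr.open_dataset(f) as ds:
        print(f.split('/')[-1], list(ds.coords))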
2) Within a single directory, the following works:
ds = xr.open_mfdataset(path1+'*.nc', drop_variables=['diag_levels'])
3) When trying to read the entire dataset across multiple directories, xarray cannot figure out how to concatenate the variables to combine the datasets, because SIGMA0 does not have coordinates.

Excluding SIGMA0 works, and there is actually no need to specify the arguments combine='nested', concat_dim='T'. That is, the following works:

ds = xr.open_mfdataset(list_files1 + list_files2, engine='netcdf4', parallel=True, drop_variables=['diag_levels', 'iter'])

where

list_files1 = []
for var in ds.data_vars:
    if var != 'SIGMA0':
        list_files1.append(path1 + var + '.nc')

and similarly for list_files2 with a different directory path2 (a self-contained version is sketched below).
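Putting the pieces together, the workaround might look like the following. This is only a sketch: the name of the second days*/DIAGS directory is not spelled out in this thread, so path2 below is a placeholder, and the variable list is taken from a single-directory open as in point 2.

import xarray as xr

path1 = '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/days10950-12045/DIAGS/'
# Placeholder: substitute the actual name of the second days*/DIAGS directory.
path2 = '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/<other_days_range>/DIAGS/'

# Get the variable names from a single directory (point 2 above).
ds1 = xr.open_mfdataset(path1 + '*.nc', drop_variables=['diag_levels'])

# One file per variable in each directory, skipping SIGMA0 (no coordinates).
list_files1 = [path1 + var + '.nc' for var in ds1.data_vars if var != 'SIGMA0']
list_files2 = [path2 + var + '.nc' for var in ds1.data_vars if var != 'SIGMA0']

ds = xr.open_mfdataset(
    list_files1 + list_files2,
    engine='netcdf4',
    parallel=True,
    drop_variables=['diag_levels', 'iter'],
)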
Including SIGMA0.nc no longer works, because it has no coordinates. The issue with SIGMA0 is a bit frustrating, because it is a variable that gets dropped anyway (see the intake catalog entry). However, it seems that the xarray Dataset needs to be created before the variable can be dropped...
Possible solutions:

1) Modify each SIGMA0.nc file across all directories so that it has consistent coordinates (see the sketch after this list).
2) As Wenrui suggested, create a single zarr file containing all the data, with well-defined coordinates and dimensions.
3) Modify the intake catalog so that url_path contains the paths for all variables excluding SIGMA0.nc, similar to list_files1 + list_files2.
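For reference, option 1 could look something like the sketch below. This is only an illustration, not a tested fix: it assumes SIGMA0.nc shares its grid with the other files in the same DIAGS directory, THETA.nc is used purely as an example of a file that does have coordinates, and the output location would need to be writable.

import xarray as xr

path1 = '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/days10950-12045/DIAGS/'

# Borrow the coordinates from a sibling file that has them (THETA.nc is an
# assumption; any variable file with coordinates would do).
ref = xr.open_dataset(path1 + 'THETA.nc')
sigma0 = xr.open_dataset(path1 + 'SIGMA0.nc')

# Only attach coordinates whose dimensions exist in SIGMA0.
sigma0 = sigma0.assign_coords(
    {name: ref[name] for name in ref.coords if set(ref[name].dims) <= set(sigma0.dims)}
)

# Write the repaired file (the output name is arbitrary, and the target
# directory must be writable, which the original data location is not).
sigma0.to_netcdf(path1 + 'SIGMA0_with_coords.nc')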
I'm neutral about these solutions, although perhaps number 2 is best? I also think we should try to be consistent across all datasets, as much as possible.
> I'm neutral about these solutions, although perhaps number 2 is best? I also think we should try to be consistent across all datasets, as much as possible.

Sounds like a good plan.
@MaceKuailv can you implement this fix (number 2, single zarr file) above?
Yes, I will open the files with Mike's method, convert them to zarr, and create the new files.
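For the record, the conversion itself could be as simple as the sketch below, reusing the list_files1 and list_files2 lists built as in the earlier sketch. The chunk sizes and the output path under the scratch directory are placeholders, not decisions.

import xarray as xr

# Open everything except SIGMA0 across both directories, as in Mike's diagnosis.
ds = xr.open_mfdataset(
    list_files1 + list_files2,
    engine='netcdf4',
    parallel=True,
    drop_variables=['diag_levels', 'iter'],
)

# Rechunk (placeholder values) and write a single zarr store to scratch space.
ds = ds.chunk({'T': 1})
ds.to_zarr('/path/to/scratch/exp_Arctic_Control.zarr', mode='w')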
(py39) idies@fde1b55f22c5:~$ touch /home/idies/workspace/OceanCirculation/exp_Arctic_Control/hello_its_me.txt
touch: cannot touch '/home/idies/workspace/OceanCirculation/exp_Arctic_Control/hello_its_me.txt': Read-only file system
I could not access the original data from datascope, and could not write to filedb or OceanCirculation with my sciserver account "wenrui".
I have the file sitting in my scratch directory; I can move it over whenever I get permission. As the soon-to-be third most senior member of the group, I think I am trustworthy enough.
Ask Gerard for permission. I don't think I can give it to you!
> I was looking at the datasets that we have, and the one called Arctic_Control cannot be opened. Is this something we know?
On a side note, how many of our datasets are netCDF? Do we want to convert some of them to zarr (for example, the ones Joan has been using heavily)?