hainegroup / oceanspy

A Python package to facilitate ocean model data analysis and visualization.
https://oceanspy.readthedocs.io
MIT License
101 stars 32 forks

SciServer-ceph intake issue #440

Open ThomasHaine opened 3 months ago

ThomasHaine commented 3 months ago

Folks, especially @Mikejmnez , I'm trying to get oceanspy to load the new datasets from SciServer-ceph. I've:

  1. Transferred the data to ceph.
  2. Mitya has provisioned new data volumes Poseidon-ceph and oceanography-ceph in the Grendel domain.
  3. Forked oceanspy to work on the updated intake catalog code. See my ceph-dev branch and sciserver_catalogs/catalog_xarray.yaml.
  4. Installed intake.
  5. Tried to open a ceph dataset and hit this:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [2], line 1
----> 1 od = ospy.open_oceandataset.from_catalog("get_started")

File ~/workspace/Storage/Thomas.Haine/persistent/Poseidon testing/ceph-dev/oceanspy/oceanspy/open_oceandataset.py:138, in from_catalog(name, catalog_url)
    133 for entry in entries:
    134     if intake_switch:
    135         # Use intake-xarray
    136 
    137         # Pop metadata
--> 138         mtdt = cat[entry].metadata
    140         # Create ds
    141         ds = cat[entry].to_dask()

File ~/mambaforge/envs/py39/lib/python3.9/site-packages/intake/catalog/base.py:472, in Catalog.__getitem__(self, key)
    463 """Return a catalog entry by name.
    464 
    465 Can also use attribute syntax, like ``cat.entry_name``, or
   (...)
    468 cat['name1', 'name2']
    469 """
    470 if not isinstance(key, list) and key in self:
    471     # triggers reload_on_change
--> 472     s = self._get_entry(key)
    473     if s.container == "catalog":
    474         s.name = key

File ~/mambaforge/envs/py39/lib/python3.9/site-packages/intake/catalog/utils.py:43, in reload_on_change.<locals>.wrapper(self, *args, **kwargs)
     40 @functools.wraps(f)
     41 def wrapper(self, *args, **kwargs):
     42     self.reload()
---> 43     return f(self, *args, **kwargs)

File ~/mambaforge/envs/py39/lib/python3.9/site-packages/intake/catalog/base.py:355, in Catalog._get_entry(self, name)
    353 ups = [up for name, up in self.user_parameters.items() if name not in up_names]
    354 entry._user_parameters = ups + (entry._user_parameters or [])
--> 355 return entry()

File ~/mambaforge/envs/py39/lib/python3.9/site-packages/intake/catalog/entry.py:60, in CatalogEntry.__call__(self, persist, **kwargs)
     58 def __call__(self, persist=None, **kwargs):
     59     """Instantiate DataSource with given user arguments"""
---> 60     s = self.get(**kwargs)
     61     s._entry = self
     62     s._passed_kwargs = list(kwargs)

File ~/mambaforge/envs/py39/lib/python3.9/site-packages/intake/catalog/local.py:312, in LocalCatalogEntry.get(self, **user_parameters)
    309 if not user_parameters and self._default_source is not None:
    310     return self._default_source
--> 312 plugin, open_args = self._create_open_args(user_parameters)
    313 data_source = plugin(**open_args)
    314 data_source.catalog_object = self._catalog

File ~/mambaforge/envs/py39/lib/python3.9/site-packages/intake/catalog/local.py:283, in LocalCatalogEntry._create_open_args(self, user_parameters)
    273 open_args = merge_pars(
    274     params,
    275     user_parameters,
   (...)
    279     client=False,
    280 )
    282 if len(self._plugin) == 0:
--> 283     raise ValueError(
    284         "No plugins loaded for this entry: %s\n"
    285         "A listing of installable plugins can be found "
    286         "at https://intake.readthedocs.io/en/latest/plugin"
    287         "-directory.html ." % self._driver
    288     )
    289 elif isinstance(self._plugin, list):
    290     plugin = self._plugin[0]

ValueError: No plugins loaded for this entry: netcdf
A listing of installable plugins can be found at https://intake.readthedocs.io/en/latest/plugin-directory.html .

I'm confused because netCDF4 is installed. Any ideas on how to fix/what to do next?

Mikejmnez commented 3 months ago

Hey @ThomasHaine - confusing. Some quick questions to help me understand:

  1. Did you install and activate the Oceanography environment before running the Jupyter notebook?
  2. Can you share the environment? Do something like:
mamba list

Intake and intake-xarray are both needed. For example, in my Oceanography env I have intake v2.0.3 and intake-xarray v0.7.0.
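As a quick sanity check from inside the notebook kernel (generic Python, nothing oceanspy-specific), you can confirm which of these packages the active environment can actually import. The "No plugins loaded for this entry: netcdf" error above is what intake raises when the netcdf driver, which intake-xarray provides, is absent:

```python
# Check importability of the packages intake needs for netcdf entries;
# a MISSING intake_xarray here matches the "No plugins loaded" error.
import importlib.util

for pkg in ("intake", "intake_xarray", "netCDF4"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'MISSING'}")
```

If intake_xarray comes up MISSING, the active kernel is not the environment you installed into.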

ThomasHaine commented 3 months ago

Thanks @Mikejmnez. Good point about the environment. It wasn't properly installed. To fix it, I ran:

conda env create -f oceanspy/sciserver_catalogs/environment.yml
conda activate Oceanography
pip install ipykernel
ipython kernel install --user --name=Oceanography-ceph

Then conda info --envs gives:

# conda environments:
#
base                     /home/idies/mambaforge
Oceanography          *  /home/idies/mambaforge/envs/Oceanography
py39                     /home/idies/mambaforge/envs/py39

Now I select the Oceanography-ceph kernel for my notebook. It still errors with:

FileNotFoundError: [Errno 2] No such file or directory: '/home/idies/workspace/OceanCirculation/exp_ASR/grid.nc'

This confuses me because this path has been replaced in sciserver_catalogs/catalog_xarray.yaml.

ThomasHaine commented 3 months ago

OK, some progress: The .yaml catalogs are hard-coded in open_oceandataset.py and by default read the main stable release. Override the default like this:

catalog_url = (
    "https://raw.githubusercontent.com/ThomasHaine/oceanspy/"
    "ceph-dev/sciserver_catalogs/catalog_xarray.yaml"
)
od = ospy.open_oceandataset.from_catalog("get_started", catalog_url)

Now it's reading the ceph directory.

Mikejmnez commented 3 months ago

Just catching up. That makes sense. An alternative is to create your own YAML catalog, pass it as catalog_url, and use that. I usually go this route, since there's no need to undo changes to oceanspy. Just make sure to reverse the change when you're ready to open a PR against the main branch.
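A minimal sketch of that personal-catalog route (the entry name, driver choice, and urlpath below are placeholders, not the real SciServer layout):

```python
# Write a small private intake catalog; pointing from_catalog at this
# file avoids editing the hard-coded URL in open_oceandataset.py.
catalog_yaml = """\
sources:
  get_started:
    driver: netcdf
    metadata: {}
    args:
      urlpath: /path/to/ceph/get_started.nc  # placeholder path
"""

with open("my_catalog.yaml", "w") as f:
    f.write(catalog_yaml)

# Then, assuming from_catalog accepts a local path as catalog_url:
# od = ospy.open_oceandataset.from_catalog("get_started", "my_catalog.yaml")
```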

ThomasHaine commented 3 months ago

Sounds good. Do you suggest I create (e.g.) catalog_xarray-ceph.yaml and catalog_xmitgcm-ceph.yaml and a new sciserver-ceph dataset in datasets_list.yaml? Then we can add the new data sources in open_oceandataset.py (I might need some help with this bit!).

Mikejmnez commented 3 months ago

No, I think the way you were doing it was appropriate. You are essentially migrating the data to ceph, and that requires updating the access pattern. Once you push your changes as a new PR, and before merging, we should restore how open_oceandataset.py reads from main. That is, replace ceph-dev with main below:

catalog_url = (
    "https://raw.githubusercontent.com/ThomasHaine/oceanspy/"
    "ceph-dev/sciserver_catalogs/catalog_xarray.yaml"
)

Were you able to read the datasets from ceph?

ThomasHaine commented 3 months ago

Sounds good. But we should maintain the original (filedb) functionality too, at least for a while. What's the easiest way to keep both access methods functional at the same time?
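One hedged sketch of keeping both selectable (this is not oceanspy's actual API, just a helper that picks the catalog URL by backend name, assuming catalog_url defaults to None in from_catalog):

```python
# Choose a catalog URL per storage backend during the transition;
# the ceph URL is the branch URL from this thread.
CEPH_CATALOG = (
    "https://raw.githubusercontent.com/ThomasHaine/oceanspy/"
    "ceph-dev/sciserver_catalogs/catalog_xarray.yaml"
)

def catalog_url_for(backend):
    """Return the catalog URL for 'ceph', or None for the filedb default."""
    if backend == "ceph":
        return CEPH_CATALOG
    if backend == "filedb":
        # None falls back to oceanspy's built-in catalogs (assumption).
        return None
    raise ValueError(f"unknown backend: {backend!r}")

# od = ospy.open_oceandataset.from_catalog("get_started", catalog_url_for("ceph"))
```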

Yes, I can read the datasets from ceph. I've copied several (no LLC4320 or DYAMOND yet), and will test in the next few days.

ThomasHaine commented 3 months ago

Actually, I can't read all the datasets. For IGPwinter, EGshelfIIseas2km_ASR_{crop,full}, and EGshelfIIseas2km_ERAI_{6H,1D} I get this error:

Opening EGshelfIIseas2km_ERAI_1D.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 23
      6 catalog_url2 = (
      7                  "https://raw.githubusercontent.com/ThomasHaine/oceanspy/"
      8                  "ceph-dev/sciserver_catalogs/catalog_xmitgcm.yaml"
      9              )
     11 # od = ospy.open_oceandataset.from_catalog("EGshelfIIseas2km_ASR_full",catalog_url1)
     12 # print(od.dataset)
     13 # print('\n')
   (...)
     20 # print(od.dataset)
     21 # print('\n')
---> 23 od = ospy.open_oceandataset.from_catalog("EGshelfIIseas2km_ERAI_1D",catalog_url1)
     24 print(od.dataset)
     25 print('\n')

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/oceanspy/open_oceandataset.py:141, in from_catalog(name, catalog_url)
    138     mtdt = cat[entry].metadata
    140     # Create ds
--> 141     ds = cat[entry].to_dask()
    142 else:
    143     # Pop args and metadata
    144     args = cat[entry].pop("args")

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:69, in DataSourceMixin.to_dask(self)
     67 def to_dask(self):
     68     """Return xarray object where variables are dask arrays"""
---> 69     return self.read_chunked()

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:44, in DataSourceMixin.read_chunked(self)
     42 def read_chunked(self):
     43     """Return xarray object (which will have chunks)"""
---> 44     self._load_metadata()
     45     return self._ds

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake/source/base.py:84, in DataSourceBase._load_metadata(self)
     82 """load metadata only if needed"""
     83 if self._schema is None:
---> 84     self._schema = self._get_schema()
     85     self.dtype = self._schema.dtype
     86     self.shape = self._schema.shape

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/base.py:18, in DataSourceMixin._get_schema(self)
     15 self.urlpath = self._get_cache(self.urlpath)[0]
     17 if self._ds is None:
---> 18     self._open_dataset()
     20     metadata = {
     21         'dims': dict(self._ds.dims),
     22         'data_vars': {k: list(self._ds[k].coords)
     23                       for k in self._ds.data_vars.keys()},
     24         'coords': tuple(self._ds.coords.keys()),
     25     }
     26     if getattr(self, 'on_server', False):

File ~/mambaforge/envs/Oceanography/lib/python3.10/site-packages/intake_xarray/netcdf.py:73, in NetCDFSource._open_dataset(self)
     71 if "*" in url or isinstance(url, list):
     72     _open_dataset = xr.open_mfdataset
---> 73     if self.pattern:
     74         kwargs.update(preprocess=self._add_path_to_ds)
     75     if self.combine is not None:

AttributeError: 'NetCDFSource' object has no attribute 'pattern'

Any ideas what's going on?
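One thing worth recording before debugging further (a guess, not a confirmed diagnosis): a version mismatch between intake and intake-xarray is a plausible cause of NetCDFSource lacking its pattern attribute, so it helps to capture the exact versions in the failing environment:

```python
# Print the installed versions of the packages involved in the
# AttributeError traceback, or note which ones are missing.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("intake", "intake-xarray", "xarray", "netCDF4"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```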