intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License

Open PID with Xarray #139

Open Marco-DKRZ opened 9 months ago

Marco-DKRZ commented 9 months ago

Would it be possible to implement a feature that enables intake-xarray to open a file based on its PID?

For example:

intake.open('hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df')

In this example, hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df is the PID handle of a CMIP6 precipitation data set.

martindurant commented 9 months ago

Could you please explain what a PID is, and how you map it to the actual asset/file?

Marco-DKRZ commented 9 months ago

A PID is a persistent identifier, a long-lasting reference to a digital object (https://en.wikipedia.org/wiki/Persistent_identifier). PIDs can be resolved via a handle server: https://hdl.handle.net/

In the example above, the request would look like this: https://hdl.handle.net/api/handles/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df, where 21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df is the PID.

The JSON response has an entry with the file location:

9:
    index:      10
    type:       "URL_ORIGINAL_DATA"
    data:
        format: "string"
        value:  '<locations><location href="http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-21T13:09:16.983+00:00" host="esgf3.dkrz.de" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'
    ttl:        86400
    timestamp:  "2021-12-21T13:09:20Z"

This file can be downloaded or opened directly with xarray. An example of this workflow can be found in the following notebook: https://gitlab.dkrz.de/data-infrastructure-services/fdo/-/blob/master/automated_data_access_improved.ipynb?ref_type=heads
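
For reference, the resolution step described above could be sketched roughly as follows. This is only a sketch under assumptions, not the notebook's code: the helper names (strip_scheme, extract_locations, resolve_pid) are made up for illustration, and it assumes the handle server wraps the entries in a top-level "values" list, with each entry shaped like the one shown above.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Resolver endpoint taken from the request URL shown above.
HANDLE_API = "https://hdl.handle.net/api/handles/"


def strip_scheme(pid):
    """Drop an optional 'hdl:' (or 'hdl:/') prefix from a PID string."""
    return pid[4:].lstrip("/") if pid.startswith("hdl:") else pid


def extract_locations(record, entry_type="URL_ORIGINAL_DATA"):
    """Collect the href of every <location> in entries of the given type.

    Assumption: `record` is the decoded JSON response and keeps its
    entries in a list under the key "values".
    """
    urls = []
    for entry in record.get("values", []):
        if entry.get("type") != entry_type:
            continue
        # The entry value is a small XML document: <locations><location .../></locations>
        root = ET.fromstring(entry["data"]["value"])
        urls.extend(
            loc.attrib["href"] for loc in root.iter("location") if "href" in loc.attrib
        )
    return urls


def resolve_pid(pid, timeout=30):
    """Ask the handle server for the record of `pid` and return its file URLs."""
    url = HANDLE_API + strip_scheme(pid)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        record = json.load(resp)
    return extract_locations(record)
```

Calling resolve_pid("hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df") needs network access and should return a list holding the esgf3.dkrz.de URL above, provided the record is unchanged. Passing entry_type="URL_REPLICA" would collect the replica locations instead.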

PIDs for this kind of climate data (CMIP6, https://en.wikipedia.org/wiki/Coupled_Model_Intercomparison_Project) are standardized and always use the same keywords.

My question is: is it possible to implement a function that allows xarray to open a file by simply passing its PID?

martindurant commented 9 months ago

Interesting! I see that the HDL server also knows about the "dataset" that this is part of (which links, in turn, to a DOI).

is it possible to implement a function that allows xarray to open a file by simply passing its PID

Certainly. It would be easy to add to intake-xarray, but I would like to add it to Intake Take2, as this process ("transform URL of known form to other URL of known type") is just the kind of thing it's designed for.

Question: since "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df" is essentially URL/file-like, would this actually be an fsspec-like operation rather than an intake one?

martindurant commented 9 months ago

Was this closed in error?

With scratch code in Intake 2, I have

In [1]: import intake

In [2]: h = intake.readers.datatypes.Handle("hdl:/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df")

In [3]: h.to_reader().read()
Out[3]: HDF5, {'url': 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc', 'storage_options': None, 'path': '', 'metadata': {'URL': {'format': 'string', 'value': 'https://handle-esgf.dkrz.de/lp/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df'}, 'AGGREGATION_LEVEL': {'format': 'string', 'value': 'FILE'}, 'FIXED_CONTENT': {'format': 'string', 'value': 'TRUE'}, 'FILE_NAME': {'format': 'string', 'value': 'pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc'}, 'FILE_SIZE': {'format': 'string', 'value': '374071932'}, 'IS_PART_OF': {'format': 'string', 'value': 'hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace'}, 'FILE_VERSION': {'format': 'string', 'value': '1'}, 'CHECKSUM': {'format': 'string', 'value': '76d11477fbb4acbd2d0db1595a9ef16309f53eb6c2874078bfb122167241d2f5'}, 'CHECKSUM_METHOD': {'format': 'string', 'value': 'SHA256'}, 'URL_ORIGINAL_DATA': {'format': 'string', 'value': '<locations><location href="http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-21T13:09:16.983+00:00" host="esgf3.dkrz.de" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'}, 'URL_REPLICA': {'format': 'string', 'value': '<locations><location href="http://esgf-data1.llnl.gov/thredds/fileServer/https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-27T06:23:20.561+00:00" host="esgf-data1.llnl.gov" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /><location 
href="http://eagle.alcf.anl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2023-11-15T20:37:07.249+00:00" host="eagle.alcf.anl.gov" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'}, 'PROBLEM': {'format': 'string', 'value': '500_R_N (2021-12-27T06:23:20.561+00:00);500_R_N (2023-11-15T20:37:07.249+00:00)'}, 'HS_ADMIN': {'format': 'admin', 'value': {'handle': '21.14100/ADMINLIST', 'index': 200, 'permissions': '111111111111'}}}}

In [4]: _.to_reader("xarray").read()
Out[4]:
<xarray.Dataset>
Dimensions:    (time: 7305, bnds: 2, lat: 96, lon: 192)
Coordinates:
  * time       (time) datetime64[ns] 1870-01-01T12:00:00 ... 1889-12-31T12:00:00
  * lat        (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
...

Marco-DKRZ commented 9 months ago

Was this closed in error?

Yes, it has been closed in error.

With scratch code in Intake 2, I have [...]

Perfect, this looks like the workflow I imagined.

Marco-DKRZ commented 9 months ago

Question: since "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df" is essentially URL/file-like, would this actually be an fsspec-like operation rather than an intake one?

That is actually a good question. The example provided was a single file. However, we also have dataset PIDs, e.g. 21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace. For those, intake might be the better choice.

martindurant commented 9 months ago

Using the HAS_PARTS value?

Marco-DKRZ commented 9 months ago

What exactly do you mean? Yes, the aggregated dataset PIDs (e.g. https://hdl.handle.net/api/handles/21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace) show the file PIDs under HAS_PARTS.
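
To illustrate, listing the file PIDs of an aggregated dataset record could look roughly like the sketch below. It rests on assumptions: the function name parts_from_record is hypothetical, the record is assumed to keep its entries in a top-level "values" list, and the HAS_PARTS value is assumed to be a single string of handles separated by semicolons and/or whitespace. The separator should be checked against a real record.

```python
import re


def parts_from_record(record):
    """Return the file PIDs listed under HAS_PARTS in a dataset handle record.

    Assumptions (not confirmed in this thread): entries live in
    record["values"], and the HAS_PARTS value is one string in which the
    handles are separated by semicolons and/or whitespace.
    """
    for entry in record.get("values", []):
        if entry.get("type") == "HAS_PARTS":
            raw = entry["data"]["value"]
            # Split on the assumed separators and drop empty fragments.
            return [pid for pid in re.split(r"[;\s]+", raw) if pid]
    # No HAS_PARTS entry: probably a file-level PID rather than a dataset PID.
    return []
```

Each returned PID could then be resolved to its file URL in the same way as a single-file PID.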

martindurant commented 9 months ago

OK, this class implements it for V2, although some questions remain. It could also be included in this repo for V1.

Marco-DKRZ commented 9 months ago

Thanks a lot for implementing it. :-) Which questions remain?

martindurant commented 9 months ago

There are some comments in the code.

It's a little awkward to return data instances, which you then have to do something with; so maybe it would be better to return Xarray readers or even the final xarray instances.

Marco-DKRZ commented 9 months ago

That is a valid point. Anyway, it is a start!