Open Marco-DKRZ opened 11 months ago
Could you please explain what a PID is, and how you map it to the actual asset/file?
A PID is a persistent identifier, which is a long-lasting reference to a digital object (https://en.wikipedia.org/wiki/Persistent_identifier). PIDs can be resolved via a handle server: https://hdl.handle.net/
In the example above, the request would look like this: https://hdl.handle.net/api/handles/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df, where 21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df is the PID.
The JSON response has an entry with the file location:
9:
  index: 10
  type: "URL_ORIGINAL_DATA"
  data:
    format: "string"
    value: '<locations><location href="http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-21T13:09:16.983+00:00" host="esgf3.dkrz.de" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'
  ttl: 86400
  timestamp: "2021-12-21T13:09:20Z"
This file can be downloaded or opened directly with xarray. An example of this workflow can be found in the following notebook: https://gitlab.dkrz.de/data-infrastructure-services/fdo/-/blob/master/automated_data_access_improved.ipynb?ref_type=heads
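As a rough illustration of the lookup described above, here is a minimal Python sketch using only the standard library. The function names `fetch_handle_record` and `file_locations` are hypothetical helpers, not part of xarray or intake, and the sketch assumes the record lists its entries under a top-level `values` key, each shaped like the entry shown above. The network call is defined but not executed here; the parsing is demonstrated offline on a copy of that entry.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

HANDLE_API = "https://hdl.handle.net/api/handles/"


def fetch_handle_record(pid):
    """Fetch the JSON record for a PID from the handle server (needs network)."""
    with urllib.request.urlopen(HANDLE_API + pid) as response:
        return json.load(response)


def file_locations(record, entry_type="URL_ORIGINAL_DATA"):
    """Return the href of every <location> found in entries of the given type."""
    hrefs = []
    for entry in record.get("values", []):
        if entry.get("type") != entry_type:
            continue
        # The value is a small XML document:
        # <locations><location href="..." ... /></locations>
        root = ET.fromstring(entry["data"]["value"])
        hrefs.extend(loc.get("href") for loc in root.iter("location"))
    return hrefs


# Offline illustration: a record containing only the entry quoted above.
sample_record = {
    "handle": "21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df",
    "values": [
        {
            "index": 10,
            "type": "URL_ORIGINAL_DATA",
            "data": {
                "format": "string",
                "value": '<locations><location href="http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-21T13:09:16.983+00:00" host="esgf3.dkrz.de" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>',
            },
            "ttl": 86400,
            "timestamp": "2021-12-21T13:09:20Z",
        }
    ],
}

urls = file_locations(sample_record)
print(urls[0])
```

With network access, `file_locations(fetch_handle_record(pid))` would yield the URL(s) to download or hand to xarray; passing `entry_type="URL_REPLICA"` would list replica locations in the same way.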
PIDs for this kind of climate data (CMIP6, https://en.wikipedia.org/wiki/Coupled_Model_Intercomparison_Project) are standardized and always use the same keywords.
My question would be, is it possible to implement a function that allows xarray to open a file by simply passing its PID?
Interesting! I see that the HDL server also knows about the "dataset" that this is part of (which links, in turn, to a DOI).
is it possible to implement a function that allows xarray to open a file by simply passing its PID
Certainly. It would be easy to add to intake-xarray, but I would like to add it to Intake Take2, as this process, "transform URL of known form to other URL of known type", is just the kind of thing it's designed for.
Question: since "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df" is essentially URL/file like, would this actually be an fsspec-like operation rather than intake?
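To make the "URL-like" point concrete, here is a small sketch of how an `hdl:` string could be mapped to the handle server URLs mentioned earlier in the thread. The helper names (`bare_handle`, `split_handle`, `api_url`) are hypothetical and do not exist in fsspec or intake; the sketch only assumes that the optional `hdl:` scheme (and a stray leading slash) can be dropped to obtain the bare `prefix/suffix` handle.

```python
HANDLE_RESOLVER = "https://hdl.handle.net/"


def bare_handle(pid):
    """Strip an optional 'hdl:' scheme and any leading slash from a PID string."""
    if pid.lower().startswith("hdl:"):
        pid = pid[4:]
    return pid.lstrip("/")


def split_handle(pid):
    """Split a handle into its (prefix, suffix) parts at the first slash."""
    prefix, _, suffix = bare_handle(pid).partition("/")
    return prefix, suffix


def api_url(pid):
    """Build the REST API URL that returns the handle record as JSON."""
    return HANDLE_RESOLVER + "api/handles/" + bare_handle(pid)


pid = "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df"
print(split_handle(pid))
print(api_url(pid))
```

Whether this mapping belongs in an fsspec-style filesystem or in an intake data type is exactly the open design question; the sketch only shows that the string-to-URL step itself is mechanical.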
Was this closed in error?
With scratch code in Intake 2, I have
In [1]: import intake
In [2]: h = intake.readers.datatypes.Handle("hdl:/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df")
In [3]: h.to_reader().read()
Out[3]: HDF5, {'url': 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc', 'storage_options': None, 'path': '', 'metadata': {'URL': {'format': 'string', 'value': 'https://handle-esgf.dkrz.de/lp/21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df'}, 'AGGREGATION_LEVEL': {'format': 'string', 'value': 'FILE'}, 'FIXED_CONTENT': {'format': 'string', 'value': 'TRUE'}, 'FILE_NAME': {'format': 'string', 'value': 'pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc'}, 'FILE_SIZE': {'format': 'string', 'value': '374071932'}, 'IS_PART_OF': {'format': 'string', 'value': 'hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace'}, 'FILE_VERSION': {'format': 'string', 'value': '1'}, 'CHECKSUM': {'format': 'string', 'value': '76d11477fbb4acbd2d0db1595a9ef16309f53eb6c2874078bfb122167241d2f5'}, 'CHECKSUM_METHOD': {'format': 'string', 'value': 'SHA256'}, 'URL_ORIGINAL_DATA': {'format': 'string', 'value': '<locations><location href="http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-21T13:09:16.983+00:00" host="esgf3.dkrz.de" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'}, 'URL_REPLICA': {'format': 'string', 'value': '<locations><location href="http://esgf-data1.llnl.gov/thredds/fileServer/https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2021-12-27T06:23:20.561+00:00" host="esgf-data1.llnl.gov" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /><location 
href="http://eagle.alcf.anl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r11i1p1f1/day/pr/gn/v20210901/pr_day_MPI-ESM1-2-LR_historical_r11i1p1f1_gn_18700101-18891231.nc" publishedOn="2023-11-15T20:37:07.249+00:00" host="eagle.alcf.anl.gov" dataset="hdl:21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace" /></locations>'}, 'PROBLEM': {'format': 'string', 'value': '500_R_N (2021-12-27T06:23:20.561+00:00);500_R_N (2023-11-15T20:37:07.249+00:00)'}, 'HS_ADMIN': {'format': 'admin', 'value': {'handle': '21.14100/ADMINLIST', 'index': 200, 'permissions': '111111111111'}}}}
In [4]: _.to_reader("xarray").read()
Out[4]:
<xarray.Dataset>
Dimensions:  (time: 7305, bnds: 2, lat: 96, lon: 192)
Coordinates:
  * time     (time) datetime64[ns] 1870-01-01T12:00:00 ... 1889-12-31T12:00:00
  * lat      (lat) float64 -88.57 -86.72 -84.86 -83.0 ... 84.86 86.72 88.57
  * lon      (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
...
Was this closed in error?
Yes, it has been closed in error.
Perfect, this looks like the workflow I imagined.
Question: since "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df" is essentially URL/file like, would this actually be an fsspec-like operation rather than intake?
That is actually a good question. The example provided was a single file. However, we also have dataset PIDs, e.g. 21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace. For those, intake might be the better choice.
Using the HAS_PARTS value?
What exactly do you mean?
Yes, the aggregated dataset PIDs (e.g. https://hdl.handle.net/api/handles/21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace) show the file PIDs under HAS_PARTS.
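A possible way to go from a dataset PID to its file PIDs is sketched below. This is only an illustration: `part_pids` is a hypothetical helper, and the sketch assumes that the HAS_PARTS value is a single string of `hdl:`-prefixed handles separated by semicolons, which should be checked against a real dataset record. The second handle in the sample is a made-up placeholder, not a real PID.

```python
def part_pids(record):
    """Return the bare file handles listed in a dataset record's HAS_PARTS entry.

    Assumption: the value is a ';'-separated string of 'hdl:<prefix>/<suffix>'.
    """
    parts = []
    for entry in record.get("values", []):
        if entry.get("type") != "HAS_PARTS":
            continue
        for item in entry["data"]["value"].split(";"):
            item = item.strip()
            if not item:
                continue  # tolerate trailing separators
            # Drop the optional 'hdl:' scheme to get the bare handle.
            parts.append(item[4:] if item.startswith("hdl:") else item)
    return parts


# Offline sample shaped like a dataset record; the second handle is a placeholder.
sample_dataset_record = {
    "handle": "21.14100/b857705a-4b21-3b43-b353-b6a2ac6d8ace",
    "values": [
        {
            "type": "HAS_PARTS",
            "data": {
                "format": "string",
                "value": "hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df;"
                         "hdl:21.14100/placeholder-second-file",
            },
        }
    ],
}

files = part_pids(sample_dataset_record)
print(files)
```

Each returned handle could then be resolved individually, as for the single-file case, to obtain the file URLs that make up the dataset.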
OK, this class implements it for V2, although some questions remain. It could also be included in this repo for V1.
Thanks a lot for implementing it. :-) Which questions remain?
There are some comments in the code.
It's a little awkward to return data instances, which you then have to do something with; so maybe it would be better to return Xarray readers or even the final xarray instances.
That is a valid point. Anyway, it is a start!
Is it possible to implement a feature that enables intake-xarray to open a file based on its PID?
For example:
In this example, hdl:21.14100/02c6b729-fff6-4f31-a8da-2cf590b544df is a PID handle of a CMIP6 precipitation data set.