jbusecke / pangeo-forge-esgf

Using queries to the ESGF API to generate URLs and keyword arguments for recipe generation in pangeo-forge
Apache License 2.0

Incomplete file listings #46

Open jbusecke opened 3 months ago

jbusecke commented 3 months ago

https://github.com/leap-stc/cmip6-leap-feedstock/issues/116#issuecomment-2101604165 describes a case where I get a nice-looking list of files back, but it is not complete! How do we detect this case before ingesting?

jbusecke commented 3 months ago

Yup, I just confirmed that the distributed search does not work properly 😡:

from pangeo_forge_esgf.client import ESGFClient

# Compare the file listing that each index node returns for the same iid
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
search_nodes = [
    "https://esgf-node.llnl.gov",
    "https://esgf-data.dkrz.de",
    "https://esgf.nci.org.au",
    "https://esgf-node.ornl.gov",
    "https://esgf-node.ipsl.upmc.fr",
    "https://esg-dn1.nsc.liu.se",
    "https://esgf.ceda.ac.uk",
]
for search_node in search_nodes:
    client = ESGFClient(search_node, distributed=True)
    dataset_id = client.get_instance_id_input([iid])[iid]['id']
    details = client._search_files_from_dataset_ids([dataset_id])
    print(f"{search_node=} {[i['id'] for i in details]}")

This means I will now have to query every index node separately and combine the results. What a pain.
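For illustration, here is a minimal sketch of that combination, reusing `search_nodes` and `ESGFClient` from the snippet above (the helper name and the union-by-file-id logic are my assumptions, not existing API):

def combined_file_ids(iid: str) -> set:
    # Hypothetical helper: union the file ids reported by every index node for one iid
    file_ids = set()
    for search_node in search_nodes:
        client = ESGFClient(search_node, distributed=True)
        dataset_id = client.get_instance_id_input([iid])[iid]["id"]
        details = client._search_files_from_dataset_ids([dataset_id])
        file_ids.update(f["id"] for f in details)
    return file_ids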

I'll stop here for now, but let me list the options I have going forward:

  1. Loop over multiple index nodes as part of my client class (seems slow and annoying).

  2. Use intake-esgf, which already does this (would reduce my maintenance burden, but is probably also very slow; needs testing).

  3. Rewrite the client AGAIN to do all the requesting async (probably the fastest, but also significant work; see the rough sketch below).

Option 1 seems useless, since I might as well go with 2. So I guess I'll time 2 and then decide whether it is worth embarking on 3.

😩
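For reference, a very rough sketch of what option 3 could look like: querying several index nodes concurrently with aiohttp against the ESGF search REST API (the endpoint and parameter names reflect my understanding of that API; paging, retries, and proper error handling are omitted):

import asyncio
import aiohttp

SEARCH_NODES = [
    "https://esgf-node.llnl.gov",
    "https://esgf-data.dkrz.de",
    "https://esgf.nci.org.au",
    "https://esgf-node.ornl.gov",
    "https://esgf-node.ipsl.upmc.fr",
    "https://esg-dn1.nsc.liu.se",
    "https://esgf.ceda.ac.uk",
]

async def fetch_file_ids(session: aiohttp.ClientSession, node: str, dataset_id: str) -> set:
    # One File-type query against a single index node
    params = {
        "type": "File",
        "dataset_id": dataset_id,
        "format": "application/solr+json",
        "limit": 1000,
    }
    async with session.get(f"{node}/esg-search/search", params=params) as resp:
        data = await resp.json(content_type=None)
        return {doc["id"] for doc in data["response"]["docs"]}

async def combined_file_ids_async(dataset_id: str) -> set:
    # Fire all node queries concurrently and union whatever succeeds
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            *(fetch_file_ids(session, node, dataset_id) for node in SEARCH_NODES),
            return_exceptions=True,
        )
    return set().union(*(r for r in results if isinstance(r, set)))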

jbusecke commented 3 months ago

Here is an example of how one might use intake-esgf:

!pip install git+https://github.com/jbusecke/intake-esgf.git@http-links

import intake_esgf
from intake_esgf import ESGFCatalog
from intake_esgf.base import NoSearchResults
from pangeo_forge_esgf.utils import facets_from_iid

# Enable all indices except esgf.nci.org.au
intake_esgf.conf.set(
    indices={
        "esgf-node.llnl.gov": True,
        "esg-dn1.nsc.liu.se": True,
        "esgf-data.dkrz.de": True,
        "esgf-node.ipsl.upmc.fr": True,
        "esgf-node.ornl.gov": True,
        "esgf.ceda.ac.uk": True,
        # "esgf.nci.org.au": True,
    }
)
cat = ESGFCatalog()

def get_urls_from_intake_esgf(iid: str, cat: ESGFCatalog):
    print(iid)
    facets = facets_from_iid(iid)
    # Stripping the 'v' shouldn't be necessary once
    # https://github.com/jbusecke/pangeo-forge-esgf/pull/41 is merged
    facets['version'] = facets['version'].replace('v', '')
    try:
        res = cat.search(**facets)
        return res.to_http_link_dict()
    except NoSearchResults:
        return None

iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
a = get_urls_from_intake_esgf(iid, cat)
[i['path'] for i in a]
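And to follow up on the plan of timing option 2 above, a quick-and-dirty measurement could look like this (the `iids` batch is a placeholder):

import time

iids = [iid]  # extend with a representative batch of iids

start = time.perf_counter()
results = {i: get_urls_from_intake_esgf(i, cat) for i in iids}
print(f"Resolved {len(iids)} iids in {time.perf_counter() - start:.1f}s")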
jbusecke commented 3 months ago

Ah, here is a way to detect and fail out these instances of incomplete file listings:

from pangeo_forge_esgf.client import ESGFClient
import json

# Inspect the full dataset-level metadata that the index returns for this iid
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
client = ESGFClient()
d = client.get_instance_id_input([iid])
print(json.dumps(d, indent=4))

This produces:

{ "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710": { "id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710|esgf.nci.org.au", "version": "20190710", "access": [ "HTTPServer", "GridFTP", "OPENDAP", "Globus" ], "activity_drs": [ "CMIP" ], "activity_id": [ "CMIP" ], "cf_standard_name": [ "air_temperature" ], "citation_url": [ "http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.json" ], "data_node": "esgf.nci.org.au", "data_specs_version": [ "01.00.30" ], "dataset_id_template_": [ "%(mip_era)s.%(activity_drs)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s" ], "datetime_start": "1975-01-16T12:00:00Z", "datetime_stop": "2014-12-16T12:00:00Z", "directory_format_template_": [ "%(root)s/%(mip_era)s/%(activity_drs)s/%(institution_id)s/%(source_id)s/%(experiment_id)s/%(member_id)s/%(table_id)s/%(variable_id)s/%(grid_label)s/%(version)s" ], "east_degrees": 359.0625, "experiment_id": [ "historical" ], "experiment_title": [ "all-forcing simulation of the recent past" ], "frequency": [ "mon" ], "further_info_url": [ "https://furtherinfo.es-doc.org/CMIP6.MPI-M.MPI-ESM1-2-HR.historical.none.r1i1p1f1" ], "geo": [ "ENVELOPE(-180.0, -0.9375, 89.284225, -89.284225)", "ENVELOPE(0.0, 180.0, 89.284225, -89.284225)" ], "geo_units": [ "degrees_east" ], "grid": [ "gn" ], "grid_label": [ "gn" ], "index_node": "esgf.nci.org.au", "instance_id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710", "institution_id": [ "MPI-M" ], "latest": true, "master_id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn", "member_id": [ "r1i1p1f1" ], "mip_era": [ "CMIP6" ], "model_cohort": [ "Registered" ], "nominal_resolution": [ "100 km" ], "north_degrees": 89.284225, "number_of_aggregations": 1, "number_of_files": 8, "pid": [ "hdl:21.14100/e7de3c1e-2c48-3470-ba5e-f97a62a1878c" ], "product": [ "model-output" ], "project": [ "CMIP6" ], "realm": [ "atmos" ], "replica": true, "size": 56793078, "source_id": [ "MPI-ESM1-2-HR" ], "source_type": [ "AOGCM" ], "south_degrees": -89.284225, "sub_experiment_id": [ "none" ], "table_id": [ "Amon" ], "title": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn", "type": "Dataset", "url": [ "http://esgf.nci.org.au/thredds/catalog/esgcet/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.xml#CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710|application/xml+thredds|THREDDS" ], "variable": [ "tas" ], "variable_id": [ "tas" ], "variable_long_name": [ "Near-Surface Air Temperature" ], "variable_units": [ "K" ], "variant_label": [ "r1i1p1f1" ], "west_degrees": 0.0, "xlink": [ "http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.json|Citation|citation", "http://hdl.handle.net/hdl:21.14100/e7de3c1e-2c48-3470-ba5e-f97a62a1878c|PID|pid" ], "_version_": 1689449850470400000, "retracted": false, "_timestamp": "2021-01-20T23:22:11.250Z", "score": 1.0 } }

My idea is to take "datetime_start": "1975-01-16T12:00:00Z" and "datetime_stop": "2014-12-16T12:00:00Z" from this payload, inject them as dataset attributes, and then run a check against the actual time data of the ingested dataset to confirm that it covers this range (or at least comes close).
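A minimal sketch of what that check could look like, assuming the two attributes have already been injected into an xarray dataset with a decoded datetime64 time coordinate (the helper name and the tolerance value are hypothetical):

import pandas as pd
import xarray as xr

def check_time_coverage(ds: xr.Dataset, tol: str = "31D") -> None:
    # Hypothetical check: raise if the actual time axis does not (roughly)
    # cover the range advertised in the ESGF metadata. Assumes
    # datetime_start/datetime_stop were injected into ds.attrs and that
    # ds.time is datetime64 (cftime calendars would need extra care).
    expected_start = pd.Timestamp(ds.attrs["datetime_start"].rstrip("Z"))
    expected_stop = pd.Timestamp(ds.attrs["datetime_stop"].rstrip("Z"))
    actual_start = pd.Timestamp(ds.time.values.min())
    actual_stop = pd.Timestamp(ds.time.values.max())
    tolerance = pd.Timedelta(tol)
    if actual_start > expected_start + tolerance or actual_stop < expected_stop - tolerance:
        raise ValueError(
            f"Incomplete time coverage: dataset spans {actual_start} to {actual_stop}, "
            f"but the ESGF metadata expects {expected_start} to {expected_stop}."
        )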