Investigating why intake-esgf has information about URLs that we don't!

Several times now I have been able to parse iids, only for the URL search to return nothing. I think I finally understand why. OK, first let's establish two lists of iids, one that fails and one that works with get_urls_from_esgf:

[!NOTE] All of the intake-esgf parts below run from a PR branch that modifies the code to output the file info without downloading any data. The details do not matter much here; what matters is that intake-esgf actually finds this information whereas pangeo-forge-esgf does not!

fail_iids = [
    "CMIP6.ScenarioMIP.NCAR.CESM2-WACCM.ssp245.r1i1p1f1.SImon.sifb.gn.v20190815",
    "CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r8i1p1f1.Omon.zmeso.gn.v20180803",
    "CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r24i1p1f1.SImon.sifb.gn.v20180803",
    "CMIP6.ScenarioMIP.MPI-M.MPI-ESM1-2-LR.ssp585.r47i1p1f1.SImon.sifb.gn.v20190815",
    "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.3hr.pr.gn.v20190710",
    "CMIP6.ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp245.r3i1p1f1.Omon.zmeso.gn.v20191121",
    "CMIP6.ScenarioMIP.MPI-M.MPI-ESM1-2-LR.ssp245.r43i1p1f1.SImon.sifb.gn.v20190815",
    "CMIP6.CMIP.IPSL.IPSL-CM6A-LR.historical.r29i1p1f1.SImon.siitdthick.gn.v20180803",
]

pass_iids = [
    "CMIP6.CMIP.THU.CIESM.historical.r3i1p1f1.Omon.tos.gn.v20200220",
    "CMIP6.ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp245.r15i1p1f2.day.pr.gr.v20201015",
    "CMIP6.ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp245.r111i1p1f1.day.psl.gr.v20210401",
    "CMIP6.ScenarioMIP.MIROC.MIROC6.ssp585.r31i1p1f1.day.pr.gn.v20200623",
    "CMIP6.CMIP.MIROC.MIROC6.historical.r37i1p1f1.day.pr.gn.v20200519",
]
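
# Top-level `await` works in a Jupyter/IPython session;
# in a plain script these calls would need to run inside asyncio.run().
from pangeo_forge_esgf.recipe_inputs import get_urls_from_esgf

url_dict_fail = await get_urls_from_esgf(fail_iids)
url_dict_pass = await get_urls_from_esgf(pass_iids)
assert len(url_dict_fail) == 0  # nothing found for the failing set
assert set(url_dict_pass) == set(pass_iids)  # every passing iid has an entry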

This confirms that we found NO info on any of the first set of iids, and info for all of the second set.
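
For reference, each iid above is just a dot-separated list of CMIP6 facet values. A minimal sketch of how such a string maps onto search facets, assuming the standard CMIP6 facet order. `split_iid` is a hypothetical helper for illustration only and is not necessarily how `facets_from_iid` in pangeo-forge-esgf is implemented:

```python
# Facet names in the order they appear in a CMIP6 iid (assumed: standard CMIP6 ordering).
CMIP6_FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "variant_label", "table_id", "variable_id", "grid_label", "version",
]


def split_iid(iid: str) -> dict:
    """Hypothetical helper: map a dot-separated CMIP6 iid onto named facets."""
    parts = iid.split(".")
    if len(parts) != len(CMIP6_FACETS):
        raise ValueError(f"expected {len(CMIP6_FACETS)} facets, got {len(parts)}: {iid}")
    return dict(zip(CMIP6_FACETS, parts))


facets = split_iid("CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.3hr.pr.gn.v20190710")
print(facets["source_id"], facets["variable_id"], facets["version"])
```

Note that the version facet still carries its leading 'v' here, which is why the snippets in this issue strip it before searching.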

Now let's test this with intake-esgf:

import intake_esgf
from intake_esgf import ESGFCatalog
from intake_esgf.base import NoSearchResults
from pangeo_forge_esgf.utils import facets_from_iid

intake_esgf.conf.set(indices={
    "esgf-node.llnl.gov":True,
    "esg-dn1.nsc.liu.se":True,
    "esgf-data.dkrz.de":True,
    "esgf-node.ipsl.upmc.fr":True,
    "esgf-node.ornl.gov":True,
    "esgf.ceda.ac.uk":True,
    # "esgf.nci.org.au":True,
})
cat = ESGFCatalog()
def get_urls_from_intake_esgf(iid: str, cat: ESGFCatalog):
    """Return the HTTP link dict for `iid`, or None if the search finds nothing."""
    print(iid)  # progress output
    facets = facets_from_iid(iid)
    # Strip the leading 'v' from the version; shouldn't be necessary once
    # https://github.com/jbusecke/pangeo-forge-esgf/pull/41 is merged
    facets['version'] = facets['version'].replace('v', '')
    try:
        res = cat.search(**facets)
        return res.to_http_link_dict()
    except NoSearchResults:
        return None

intake_esgf_dict_fail = {iid: get_urls_from_intake_esgf(iid, cat) for iid in fail_iids}
intake_esgf_dict_pass = {iid: get_urls_from_intake_esgf(iid, cat) for iid in pass_iids}
assert len([k for k,v in intake_esgf_dict_fail.items() if v is None]) == 0
assert len([k for k,v in intake_esgf_dict_pass.items() if v is None]) == 0

intake-esgf finds info for ALL of the iids in either set!

So what the heck am I doing wrong here? Digging further into the intake-esgf code, I am getting a suspicion:

The general pattern in intake-esgf is to make two kinds of queries to the ESGF REST API:

1. A 'search' query, which takes facets as input and populates the catalog with facets and, importantly, an 'id' field formatted as "<iid>|<data_node>".
2. A 'get dataset info' query, which takes these 'id' values from above as input. The important part is that this query DOES NOT USE the full set of facets (it only uses 'variable', but if I read this correctly this is mainly to ensure compatibility with collections other than CMIP6?)
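
To make the 'id' field concrete, here is a small sketch of splitting such an id back into its two parts, assuming it has the form `<iid>|<data_node>`. The helper name and the example data node are made up for illustration:

```python
def split_dataset_id(dataset_id: str) -> tuple:
    """Hypothetical helper: split an ESGF dataset id into (instance_id, data_node)."""
    instance_id, sep, data_node = dataset_id.partition("|")
    if not sep:
        raise ValueError(f"no '|' separator in {dataset_id!r}")
    return instance_id, data_node


# The data node below is a placeholder, not a real search result.
example_id = (
    "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.3hr.pr.gn.v20190710"
    "|data.node.example"
)
instance_id, data_node = split_dataset_id(example_id)
print(instance_id)
print(data_node)
```

The same iid can therefore appear under several ids, one per data node that hosts a replica.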

So this represents some sort of 'nested' query. If we try that approach with vanilla requests, we see that it works!

import requests


def get_ids(iid, search_url):
    """Step 1: find Dataset records matching the facets of `iid` and return their ids."""
    facets = facets_from_iid(iid)
    facets['version'] = facets['version'].replace('v', '')
    params = {
        "type": "Dataset",
        "format": "application/solr+json",
        "distrib": "true",
        "limit": 20,
    }
    params.update(facets)
    resp = requests.get(url=search_url, params=params)
    resp.raise_for_status()  # fail loudly on HTTP errors
    return [d['id'] for d in resp.json()['response']['docs']]


def get_files(dataset_ids, search_url):
    """Step 2: find File records using only the dataset ids (no facets)."""
    params = {
        "type": "File",
        "format": "application/solr+json",
        "distrib": "true",
        "limit": 20,
        "dataset_id": dataset_ids,  # a list is sent as repeated dataset_id parameters
    }
    resp = requests.get(url=search_url, params=params)
    resp.raise_for_status()
    return [{f: d[f] for f in ['id', 'url']} for d in resp.json()['response']['docs']]

search_url = "https://esgf-node.llnl.gov/esg-search/search"
fix_fail_iids = {iid: get_files(get_ids(iid, search_url), search_url) for iid in fail_iids}
fix_pass_iids = {iid: get_files(get_ids(iid, search_url), search_url) for iid in pass_iids}

assert len([k for k,v in fix_fail_iids.items() if len(v) == 0]) == 0
assert len([k for k,v in fix_pass_iids.items() if len(v) == 0]) == 0

This is honestly pretty damn frustrating, since as far as I can tell nothing about this is mentioned in the API docs. In fact, they state that the 'type' input defines which kind of 'record' (File or Dataset) you will get back, then show examples of faceted search and say this:

The “type” facet must be always specified as part of any request to the ESGF search services, so that the appropriate records can be searched and returned. If not specified explicitly, the default value is type=Dataset.

All of this led me to believe that if I specified the identical set of facets and only switched the 'type', I would get back the matching set of files or iids, depending on the value I provided. I guess I was wrong 😩.
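
To spell out the difference, here is a sketch of the parameter sets for the two approaches. The helper names are hypothetical and nothing is sent over the network; the dicts simply mirror the `requests` parameters used above:

```python
# Parameters shared by every query in this issue.
BASE_PARAMS = {"format": "application/solr+json", "distrib": "true", "limit": 20}


def single_step_file_params(facets: dict) -> dict:
    """What I assumed would work: the full facet set combined with type=File."""
    return {**BASE_PARAMS, "type": "File", **facets}


def nested_dataset_params(facets: dict) -> dict:
    """Step 1 of the nested approach: the full facet set with type=Dataset."""
    return {**BASE_PARAMS, "type": "Dataset", **facets}


def nested_file_params(dataset_ids: list) -> dict:
    """Step 2 of the nested approach: only the dataset ids, no facets."""
    return {**BASE_PARAMS, "type": "File", "dataset_id": list(dataset_ids)}


facets = {"source_id": "MPI-ESM1-2-HR", "variable_id": "pr", "version": "20190710"}
print(single_step_file_params(facets))
print(nested_dataset_params(facets))
print(nested_file_params(["<dataset id returned by step 1>"]))
```

The only File query that carries facets is the single-step one; in the nested approach the File query is keyed purely on `dataset_id`.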

The most disturbing thing is that some entries clearly work as I thought (otherwise I would have never gotten any results)...

Well at least I have a clue how to progress on this for now. Big thanks to @nocollier for all the work on intake-esgf. I would be curious where you learned that these 'nested' requests are needed to get all the data (I might just have missed something important).

I am fairly confident that with this knowledge I would be able to refactor large parts of pangeo-forge-esgf.

It might, however, be more practical to add a dependency on intake-esgf, even though our async requests might still be a bit faster.
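
If we keep our own async implementation, the nested pattern could look roughly like the sketch below. Everything here is hypothetical: `fetch_json` stands in for whatever async HTTP call we end up using (it takes a params dict and returns the parsed JSON response), and `fake_fetch` only imitates a simplified response shape so the example runs without network access.

```python
import asyncio


async def nested_file_lookup(facets: dict, fetch_json) -> list:
    """Two-step lookup: facets -> dataset ids -> file urls."""
    # Step 1: facet search for Dataset records, collect their ids.
    datasets = await fetch_json({"type": "Dataset", **facets})
    dataset_ids = [d["id"] for d in datasets["response"]["docs"]]
    if not dataset_ids:
        return []
    # Step 2: File search keyed on the dataset ids only (no facets).
    files = await fetch_json({"type": "File", "dataset_id": dataset_ids})
    return [d["url"] for d in files["response"]["docs"]]


async def lookup_many(facets_by_iid: dict, fetch_json) -> dict:
    """Run the nested lookup for all iids concurrently."""
    lookups = (nested_file_lookup(f, fetch_json) for f in facets_by_iid.values())
    results = await asyncio.gather(*lookups)
    return dict(zip(facets_by_iid, results))


async def fake_fetch(params: dict) -> dict:
    """Stand-in for a real HTTP call; returns made-up records."""
    if params["type"] == "Dataset":
        return {"response": {"docs": [{"id": f"ds-{params['variable_id']}"}]}}
    docs = [{"url": f"https://node.example/{i}.nc"} for i in params["dataset_id"]]
    return {"response": {"docs": docs}}


# In a notebook, use `await lookup_many(...)` instead of asyncio.run().
result = asyncio.run(
    lookup_many({"iid_a": {"variable_id": "pr"}, "iid_b": {"variable_id": "tos"}}, fake_fetch)
)
print(result)
```

Each iid still needs its two requests in sequence, but different iids do not have to wait on each other.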

jbusecke / pangeo-forge-esgf

Investigation why intake-esgf has information about urls that we dont! #42