jbusecke / pangeo-forge-esgf

Using queries to the ESGF API to generate URLs and keyword arguments for recipe generation in pangeo-forge
Apache License 2.0

Add the ability to specify a different ESGF node #20

Closed rsignell closed 4 months ago

rsignell commented 5 months ago

I searched using pangeo-forge-esgf and told a colleague that daily files for HADGEM did not exist, and he showed me they do exist, but he used the UK ESGF node:

[image: image001, screenshot of search results on the UK ESGF node]

jbusecke commented 5 months ago

Did you use parse_instance_ids or get_urls_from_esgf? (The latter has an input argument, search_nodes.)

The fact that even though we are making a 'distributed request' the results differ across search nodes is sketchy though...

rsignell commented 4 months ago

@jbusecke I'm confused. I'm doing:

from pangeo_forge_esgf.parsing import parse_instance_ids

parse_iids = [
    'CMIP6.HighResMIP.*.HadGEM3-GC31-HH.*.*.*.so.gn.*',
    'CMIP6.HighResMIP.*.HadGEM3-GC31-HH.*.*.*.thetao.gn.*'
]
iids = []
for piid in parse_iids:
    iids.extend(parse_instance_ids(piid))
iids

which results in only monthly data being shown:

['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.so.gn.v20200514',
 'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
 'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
 'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514']

but that's because it's using the LLNL node: https://github.com/jbusecke/pangeo-forge-esgf/blob/8036e365a16fdb38f3d04613da8507660fcd1312/pangeo_forge_esgf/parsing.py#L42-L45

How would I make the same request with get_urls_from_esgf?

rsignell commented 4 months ago

ping @jbusecke in case this slid off the radar screen... 🐱

jbusecke commented 4 months ago

Oh I got confused. So two things:

First, this is how you would use get_urls_from_esgf:

import asyncio
from pangeo_forge_esgf.recipe_inputs import get_urls_from_esgf

iids = ['']

# when running from a script
url_dict = asyncio.run(
    get_urls_from_esgf(
        iids, search_nodes=["http://esgf-node.llnl.gov/esg-search/search"]
    )
)

# in a Jupyter notebook (top-level await is supported there)
url_dict = await get_urls_from_esgf(
    iids, search_nodes=["http://esgf-node.llnl.gov/esg-search/search"]
)

Unlike parse_instance_ids, this will not expand wildcards. We could easily make the search node a keyword argument of parse_instance_ids, but the bigger question is WHY the behavior differs from node to node. Setting 'distrib=True' here should query all ESGF nodes... ughhh, this is annoying.

Either way, give #21 a try with the node of your choice?

jbusecke commented 4 months ago

I have seen some of this behavior in several issues: I am not getting the same results from the API as from that search interface, which is upsetting. I will have to investigate more at some point, but I am short on time today.

jbusecke commented 4 months ago

Ok apparently the LLNL node is totally down? Not even their examples work...

rsignell commented 4 months ago

Interesting. The DKRZ node API seems to be working: https://esgf-data.dkrz.de/esg-search/search/?cf_standard_name=air_temperature&project=obs4MIPs
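For readers who want to poke at a node directly, the URL above is just the search endpoint followed by URL-encoded facet constraints. The following is a minimal sketch of assembling such a URL; `build_search_url` is a hypothetical helper written for illustration and is not part of pangeo-forge-esgf.

```python
from urllib.parse import urlencode


def build_search_url(search_node: str, **facets: str) -> str:
    """Join a search endpoint and facet constraints into one query URL.

    Hypothetical helper for illustration; not part of pangeo-forge-esgf.
    """
    return f"{search_node.rstrip('/')}/?{urlencode(facets)}"


# Rebuild the DKRZ query from the comment above.
url = build_search_url(
    "https://esgf-data.dkrz.de/esg-search/search",
    cf_standard_name="air_temperature",
    project="obs4MIPs",
)
print(url)
```

Swapping the first argument for another node's endpoint is a quick way to check whether that node responds at all.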

jbusecke commented 4 months ago

Just running this from a dev branch locally (will push in a sec) and it seems that the node does not make a difference:

from pangeo_forge_esgf.parsing import parse_instance_ids

parse_iids = [
    'CMIP6.HighResMIP.*.HadGEM3-GC31-HH.*.*.*.thetao.gn.*',
]
iids_dict = {}
search_nodes = [
    "https://esgf-node.ipsl.upmc.fr/esg-search/search",
    "https://esgf-index1.ceda.ac.uk/esg-search/search",
    "https://esgf-data.dkrz.de/esg-search/search",
    "https://esg-dn1.nsc.liu.se/esg-search/search",
    "https://esgf-node.llnl.gov/esg-search/search",
    "https://esgf.nci.org.au/esg-search/search",
    "https://esgf-node.ornl.gov/esg-search/search",
]
for node in search_nodes:
    iids = []
    for piid in parse_iids:
        # NOTE: search_node is hard-coded to the IPSL node as posted, so `node`
        # only labels the result; pass search_node=node to query each node in turn.
        iids.extend(parse_instance_ids(piid, search_node="https://esgf-node.ipsl.upmc.fr/esg-search/search"))
    iids_dict[node] = iids
iids_dict

gives:

{'https://esgf-node.ipsl.upmc.fr/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514'],
 'https://esgf-index1.ceda.ac.uk/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514'],
 'https://esgf-data.dkrz.de/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514'],
 'https://esg-dn1.nsc.liu.se/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514'],
 'https://esgf-node.llnl.gov/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514'],
 'https://esgf.nci.org.au/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514'],
 'https://esgf-node.ornl.gov/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514']}

So the good news is the returns from the API are consistent.

The bad news is: WHY ON EARTH DOES THE WEB INTERFACE SHOW DIFFERENT DATASETS THAN THE API?
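Rather than comparing the per-node lists by eye, the consistency claim can be checked programmatically. This is a minimal sketch: `nodes_agree` is a hypothetical helper, and the dictionary is an abbreviated stand-in for the result above (two of the seven nodes), not a fresh query.

```python
def nodes_agree(iids_by_node: dict[str, list[str]]) -> bool:
    """Return True if every node returned the same set of instance ids.

    Order within each list is ignored. Hypothetical helper for illustration.
    """
    unique_results = {frozenset(iids) for iids in iids_by_node.values()}
    return len(unique_results) <= 1


# Abbreviated copy of the result shown above.
thetao_iids = [
    "CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Omon.thetao.gn.v20200514",
    "CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514",
    "CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Omon.thetao.gn.v20200514",
]
iids_dict = {
    "https://esgf-node.ipsl.upmc.fr/esg-search/search": list(thetao_iids),
    "https://esgf-index1.ceda.ac.uk/esg-search/search": list(thetao_iids),
}
print(nodes_agree(iids_dict))
```

Comparing sets rather than lists avoids false alarms when two nodes return the same datasets in a different order.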

jbusecke commented 4 months ago

Actually! Looking at the screenshot again, I do not see the variables thetao and so in your query.

jbusecke commented 4 months ago

@rsignell are you sure that daily files exist for the ocean variables? Just trying pr (which is one of the high time res datasets shown above) yields results!

from pangeo_forge_esgf.parsing import parse_instance_ids

parse_iids = [
    'CMIP6.HighResMIP.*.HadGEM3-GC31-HH.*.*.*.pr.gn.*',
]
iids_dict = {}
search_nodes = [
    "https://esgf-node.ipsl.upmc.fr/esg-search/search",
    "https://esgf-index1.ceda.ac.uk/esg-search/search",
    "https://esgf-data.dkrz.de/esg-search/search",
    "https://esg-dn1.nsc.liu.se/esg-search/search",
    "https://esgf-node.llnl.gov/esg-search/search",
    "https://esgf.nci.org.au/esg-search/search",
    "https://esgf-node.ornl.gov/esg-search/search",
]
for node in search_nodes:
    iids = []
    for piid in parse_iids:
        # NOTE: search_node is hard-coded to the IPSL node as posted, so `node`
        # only labels the result; pass search_node=node to query each node in turn.
        iids.extend(parse_instance_ids(piid, search_node="https://esgf-node.ipsl.upmc.fr/esg-search/search"))
    iids_dict[node] = iids
iids_dict

gives this (which is what I expect to see from your screenshot above)

{'https://esgf-node.ipsl.upmc.fr/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.day.pr.gn.v20191105',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Amon.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.3hr.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.day.pr.gn.v20180927'],
 'https://esgf-index1.ceda.ac.uk/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.day.pr.gn.v20191105',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Amon.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.3hr.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.day.pr.gn.v20180927'],
 'https://esgf-data.dkrz.de/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.day.pr.gn.v20191105',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Amon.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.3hr.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.day.pr.gn.v20180927'],
 'https://esg-dn1.nsc.liu.se/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.day.pr.gn.v20191105',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Amon.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.3hr.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.day.pr.gn.v20180927'],
 'https://esgf-node.llnl.gov/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.day.pr.gn.v20191105',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Amon.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.3hr.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.day.pr.gn.v20180927'],
 'https://esgf.nci.org.au/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.day.pr.gn.v20191105',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Amon.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.3hr.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.day.pr.gn.v20180927'],
 'https://esgf-node.ornl.gov/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.day.pr.gn.v20191105',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Amon.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.3hr.pr.gn.v20180927',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.Amon.pr.gn.v20180927',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.3hr.pr.gn.v20191105',
  'CMIP6.HighResMIP.NERC.HadGEM3-GC31-HH.hist-1950.r1i1p1f1.day.pr.gn.v20180927']}

jbusecke commented 4 months ago

So my current conclusion is that thetao and so (the full depth outputs) do not exist as daily outputs.

What exists as daily output are surface temperature and surface salinity (and a bunch of atmos variables that you showed above):

from pangeo_forge_esgf.parsing import parse_instance_ids

parse_iids = [
    'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.*.*.Oday.*.gn.*',
]
iids_dict = {}
search_nodes = [
    "https://esgf-node.ipsl.upmc.fr/esg-search/search",
    "https://esgf-index1.ceda.ac.uk/esg-search/search",
]
for node in search_nodes:
    iids = []
    for piid in parse_iids:
        # NOTE: search_node is hard-coded to the IPSL node as posted, so `node`
        # only labels the result; pass search_node=node to query each node in turn.
        iids.extend(parse_instance_ids(piid, search_node="https://esgf-node.ipsl.upmc.fr/esg-search/search"))
    iids_dict[node] = iids
iids_dict
{'https://esgf-node.ipsl.upmc.fr/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Oday.tossq.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Oday.tos.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Oday.tossq.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Oday.sos.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Oday.tos.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Oday.sos.gn.v20200514'],
 'https://esgf-index1.ceda.ac.uk/esg-search/search': ['CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Oday.tossq.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Oday.tos.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Oday.tossq.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Oday.sos.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Oday.tos.gn.v20200514',
  'CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.highres-future.r1i1p1f1.Oday.sos.gn.v20200514']}
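
A quick way to see which output frequencies exist for each variable is to split the instance ids on dots: the seventh facet is the table id (Omon, Oday, day, 3hr, Amon in the lists above) and the eighth is the variable. The sketch below uses a hand-picked sample of ids copied from this thread; `tables_by_variable` is a hypothetical helper, not part of pangeo-forge-esgf.

```python
from collections import defaultdict

# Facet order of a CMIP6 instance id.
FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def tables_by_variable(iids):
    """Map each variable_id to the sorted list of table_ids it appears in."""
    tables = defaultdict(set)
    for iid in iids:
        facets = dict(zip(FACETS, iid.split(".")))
        tables[facets["variable_id"]].add(facets["table_id"])
    return {variable: sorted(found) for variable, found in tables.items()}


# Sample of ids taken from the results in this thread.
iids = [
    "CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Omon.thetao.gn.v20200514",
    "CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Oday.tos.gn.v20200514",
    "CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.day.pr.gn.v20180927",
    "CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.3hr.pr.gn.v20180927",
    "CMIP6.HighResMIP.MOHC.HadGEM3-GC31-HH.control-1950.r1i1p1f1.Amon.pr.gn.v20180927",
]
for variable, found in tables_by_variable(iids).items():
    print(variable, found)
```

With this sample, thetao appears only in the monthly Omon table, while pr appears in 3hr, Amon, and day.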

rsignell commented 4 months ago

@jbusecke, OMG. I'm so sorry for this wild goose chase. You are correct: if I search the CEDA node for so, there are only monthly files: [image: screenshot]

Why isn't there an egg-on-face emoticon?

jbusecke commented 4 months ago

🥚🙈

jbusecke commented 4 months ago

FWIW the goose chase shamed me into adding some more CI and actually fixing the square bracket extension. So you can now search for a subset of facet values (or several) in a single string like this:

from pangeo_forge_esgf.parsing import parse_instance_ids

parse_iids = [
    'CMIP6.HighResMIP.*.HadGEM3-GC31-HH.*.*.*.[so, thetao, some_other_stuff].gn.*',
]
iids = []
for piid in parse_iids:
    iids.extend(parse_instance_ids(piid))
iids
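
To make the bracket syntax concrete, here is a minimal sketch of how a bracketed facet could be expanded into one plain pattern per value. It is an assumption about the semantics only: `expand_brackets` is a hypothetical helper, and the library may handle brackets differently internally (for example, by passing several values to the search API in one request).

```python
import itertools
import re


def expand_brackets(pattern: str) -> list[str]:
    """Expand every [a, b, c] group into one pattern per combination.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # re.split with a capturing group alternates literal text and group bodies.
    parts = re.split(r"\[([^\]]*)\]", pattern)
    literals = parts[0::2]
    choices = [[value.strip() for value in group.split(",")] for group in parts[1::2]]

    expanded = []
    for combo in itertools.product(*choices):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.extend([value, literal])
        expanded.append("".join(pieces))
    return expanded


patterns = expand_brackets("CMIP6.HighResMIP.*.HadGEM3-GC31-HH.*.*.*.[so, thetao].gn.*")
for p in patterns:
    print(p)
```

A pattern without brackets comes back unchanged as a single-element list, so the helper can be applied to every entry of parse_iids.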

jbusecke commented 4 months ago

I'll close this now.

rsignell commented 4 months ago

Thanks for being understanding and kind here @jbusecke !!