pangeo-data / pangeo-cmip6-cloud

Documentation for Pangeo CMIP6 data stored in GCP/AWS cloud
https://pangeo-data.github.io/pangeo-cmip6-cloud/

CIL/Rhodium CMIP6 Dataset Requests #38

Closed: cisaacstern closed this issue 2 years ago

cisaacstern commented 2 years ago

@jbusecke @kemccusker @rfofrich @delgadom @dgergel

Let's start by creating a list of DATASET_IDs which can be run with https://github.com/pangeo-data/pangeo-cmip6-cloud/blob/master/zarr_from_esgf.py.

DATASET_ID, as defined in #31, is:

activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version
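
As a rough illustration of that layout, here is a minimal sketch that splits a DATASET_ID into named facets. It is not part of zarr_from_esgf.py. The `DatasetID` class and `parse_dataset_id` function are hypothetical names, and the leading `mip_era` field is an assumption: the IDs used in this thread carry a `CMIP6` prefix ahead of the facets listed above.

```python
from typing import NamedTuple


class DatasetID(NamedTuple):
    """Facets of a dot-separated DATASET_ID (hypothetical helper type)."""

    mip_era: str  # assumed name for the leading "CMIP6" segment
    activity_id: str
    institution_id: str
    source_id: str
    experiment_id: str
    variant_label: str
    table_id: str
    variable_id: str
    grid_label: str
    version: str


def parse_dataset_id(dataset_id: str) -> DatasetID:
    """Split a DATASET_ID on dots and check that every facet is present."""
    parts = dataset_id.split(".")
    if len(parts) != len(DatasetID._fields):
        raise ValueError(
            f"expected {len(DatasetID._fields)} dot-separated facets, "
            f"got {len(parts)}: {dataset_id!r}"
        )
    return DatasetID(*parts)


facets = parse_dataset_id(
    "CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Omon.so.gn.v20190429"
)
print(facets.source_id, facets.variable_id)
```

Checking the facet count up front gives a readable error for a malformed ID before any network request is made.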

When Pangeo Forge Cloud is ready, I will ping this thread with ideas for migrating this work there.

cisaacstern commented 2 years ago

If CIL/Rhodium team can provide @jbusecke with one DATASET_ID to start with, he can run the script to see if it will work for these datasets.

cisaacstern commented 2 years ago

Here is a link to the tutorial for running recipes locally:

https://pangeo-forge.readthedocs.io/en/latest/introduction_tutorial/intro_tutorial_part2.html#create-the-recipe-object

Please let me know if anything is unclear.

delgadom commented 2 years ago

I got the zarr_from_esgf.py script to run with the suggested DATASET_ID "CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Omon.so.gn.v20190429".

I just grabbed a CMIP/historical sim we've worked with: CMIP6.CMIP.CMCC.CMCC-CM2-SR5.historical.r1i1p1f1.day.tasmax.gn.v20200616. This is already in the pangeo cloud store here: gs://cmip6/CMIP6/CMIP/CMCC/CMCC-CM2-SR5/historical/r1i1p1f1/day/tasmax/gn/v20200616
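
For reference, the store path above appears to be the DATASET_ID with dots swapped for slashes under the bucket. A minimal sketch of that mapping follows. The function name `gcs_store_path` is hypothetical, and the layout is inferred from this single example rather than from any documented rule.

```python
def gcs_store_path(dataset_id: str, bucket: str = "gs://cmip6") -> str:
    """Map a DATASET_ID to the store path layout seen in the example above.

    Assumption: each dot-separated facet becomes one path segment.
    """
    parts = dataset_id.split(".")
    if len(parts) != 10:
        raise ValueError(f"expected 10 facets, got {len(parts)}: {dataset_id!r}")
    return "/".join([bucket, *parts])


print(gcs_store_path(
    "CMIP6.CMIP.CMCC.CMCC-CM2-SR5.historical.r1i1p1f1.day.tasmax.gn.v20200616"
))
```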

I don't have one of the no-anthro forcing specs handy, but this should be similar to the ones we'd want to use. When I run this, I get errors parsing the ESGF API response:

$ python zarr_from_esgf.py CMIP6.CMIP.CMCC.CMCC-CM2-SR5.historical.r1i1p1f1.day.tasmax.gn.v20200616
empty search response
Traceback (most recent call last):
  File "/Users/mikedelgado/opt/miniconda3/envs/rhodium-env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'url'

As far as I can tell, ESGF returns no results for the query that this call builds:

https://esgf-node.llnl.gov/esg-search/search/?activity_id=CMIP&institution_id=CMCC&source_id=CMCC-CM2-SR5&experiment_id=historical&member_id=r1i1p1f1&table_id=day&variable_id=tasmax&grid_label=gn&project=CMIP6&type=File&distrib=false&format=application%2Fsolr%2Bjson&limit=500&offset=0
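
To make the query easier to tweak while debugging, here is a minimal sketch that rebuilds the URL above from a DATASET_ID using only the standard library. It is not the code in mysearch.py. The function name `build_search_url` and its `result_type` parameter are hypothetical, and the parameter names and order are copied from the URL above. Note that the version facet is not part of the query.

```python
from urllib.parse import urlencode

ESGF_SEARCH = "https://esgf-node.llnl.gov/esg-search/search/"


def build_search_url(dataset_id: str, result_type: str = "File") -> str:
    """Build an ESGF search URL shaped like the one quoted above."""
    (project, activity_id, institution_id, source_id, experiment_id,
     member_id, table_id, variable_id, grid_label, _version) = dataset_id.split(".")
    params = {
        "activity_id": activity_id,
        "institution_id": institution_id,
        "source_id": source_id,
        "experiment_id": experiment_id,
        "member_id": member_id,
        "table_id": table_id,
        "variable_id": variable_id,
        "grid_label": grid_label,
        "project": project,
        "type": result_type,  # "File" in the failing query; "Dataset" is the alternative tried
        "distrib": "false",
        "format": "application/solr+json",  # urlencode percent-escapes "/" and "+"
        "limit": 500,
        "offset": 0,
    }
    return ESGF_SEARCH + "?" + urlencode(params)


print(build_search_url(
    "CMIP6.CMIP.CMCC.CMCC-CM2-SR5.historical.r1i1p1f1.day.tasmax.gn.v20200616"
))
```

Passing `result_type="Dataset"` produces the alternative query with `type=Dataset`.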

I've just done a tiny bit of poking around... as far as I can tell the culprit is the type=File keyword on mysearch.py#L30. If you change this to type=Dataset, the query does return a listing which looks right to me, but the parser then chokes on mysearch.py#L57 because there isn't a dataset_id in the response.

Wondering if digging into this is productive or if you've already solved this issue a different way?
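
One way to surface this failure earlier is to check for an empty result before looking up 'url'. The sketch below is not the parser in mysearch.py. `extract_file_urls` is a hypothetical helper, and it assumes the Solr JSON layout (`response` containing `docs`) with each file document holding a `url` list whose entries look like `<url>|<mime type>|<service>`.

```python
def extract_file_urls(payload: dict) -> list[str]:
    """Return download URLs from a parsed ESGF search response.

    Raises a descriptive error on an empty result instead of failing
    later with KeyError: 'url'.
    """
    docs = payload.get("response", {}).get("docs", [])
    if not docs:
        raise ValueError(
            "ESGF search returned no documents; check the query facets and type"
        )
    urls = []
    for doc in docs:
        for entry in doc.get("url", []):
            # Assumed entry format: "<url>|<mime type>|<service>"
            urls.append(entry.split("|")[0])
    return urls


# Made-up response used only to exercise the helper.
sample = {
    "response": {
        "numFound": 1,
        "docs": [{"url": ["http://example.org/tasmax.nc|application/netcdf|HTTPServer"]}],
    }
}
print(extract_file_urls(sample))
```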

delgadom commented 2 years ago

huh... ok well @rfofrich just sent me a sample ID from the DAMIP experiments we're trying to use and this one worked! I'm still confused about the above example but it's not a current pain point....

Here's the DATASET_ID: CMIP6.DAMIP.CSIRO-ARCCSS.ACCESS-CM2.hist-nat.r1i1p1f1.day.tas.gn.v20201120

This ran the workflow locally for me (using the patch in #39) when testing with:

recipe.copy_pruned().to_function()()

@jbusecke does this give you enough to start with? or should we work on a full list?

delgadom commented 2 years ago

For @rfofrich (and anyone else wanting to run this package) - here's my quickstart for testing a recipe:

  1. clone this repo

  2. [until #39 is merged] delete lines 96-110 from zarr_from_esgf.py - you should remove all of the following:

    
    fs_local = LocalFileSystem()
    
    target_dir = tempfile.TemporaryDirectory().name + ".zarr"
    target = FSSpecTarget(fs_local, target_dir)
    
    cache_dir = tempfile.TemporaryDirectory()
    cache_target = CacheFSSpecTarget(fs_local, cache_dir.name)
    
    meta_dir = tempfile.TemporaryDirectory()
    meta_store = MetadataTarget(fs_local, meta_dir.name)
    
    recipe.target = target
    recipe.input_cache = cache_target
    recipe.metadata_cache = meta_store

    Also change the print from print(target_dir) to print(recipe.target)

  3. change the execution line so it only runs the sample "pruned" workflow:

    # recipe.to_function()()
    recipe.copy_pruned().to_function()()

  4. install the dependencies. One of the conda environments on pangeo-forge-recipes seems like a good place to catch 'em all. You'll also need pangeo-forge-recipes itself: pip install pangeo-forge-recipes

  5. Finally, run zarr_from_esgf.py, passing in your DATASET_ID as a positional argument, e.g.:

    python zarr_from_esgf.py CMIP6.DAMIP.CSIRO-ARCCSS.ACCESS-CM2.hist-nat.r1i1p1f1.day.tas.gn.v20201120

rfofrich commented 2 years ago

@cisaacstern @jbusecke I think we have what we need to move forward with this. I'm attaching an Excel file with all the DAMIP models/simulations needed for the project. Each column of the sheet holds the information needed to construct a DATASET_ID for that model/ensemble member. Let me know if you have any questions or concerns, or if any simulation gives you issues.

Attachment: CMIP6_DAMIP_hist_nat_temp.xlsx
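
As an illustration of how such a table could be turned into DATASET_IDs, here is a minimal sketch that reads rows from CSV text and joins the facets with dots. The column names are hypothetical and are not taken from the attached spreadsheet. The sample row reuses the DAMIP ID quoted earlier in this thread.

```python
import csv
import io

# Hypothetical column names; the real spreadsheet headers may differ.
FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "variant_label", "table_id", "variable_id", "grid_label", "version",
]


def dataset_ids_from_csv(text: str) -> list[str]:
    """Build one dot-separated DATASET_ID per CSV row."""
    reader = csv.DictReader(io.StringIO(text))
    return [".".join(row[name].strip() for name in FACETS) for row in reader]


sample_csv = (
    ",".join(FACETS) + "\n"
    "CMIP6,DAMIP,CSIRO-ARCCSS,ACCESS-CM2,hist-nat,r1i1p1f1,day,tas,gn,v20201120\n"
)
print("\n".join(dataset_ids_from_csv(sample_csv)))
```

The output is plain text with one ID per line, which is easy to paste into a Gist or pass to the script one at a time.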

cisaacstern commented 2 years ago

Thanks @rfofrich. Julius and I have some time scheduled to look at this together on Friday. We'll update you here once we've been able to make some headway.

rfofrich commented 2 years ago

Sounds great! Thank you both.

rfofrich commented 2 years ago

@cisaacstern Hello, thanks again for helping with this. Just wanted to circle back and see if there were any updates.

cisaacstern commented 2 years ago

@rfofrich, thanks for checking in and apologies for the delayed reply. @jbusecke and I have migrated this work to https://github.com/pangeo-forge/cmip6-feedstock. I realize it's a bit redundant, but just so we have everything in one place, could I ask you to open a new issue on that repository requesting we work on the list of IDs you provided in https://github.com/pangeo-data/pangeo-cmip6-cloud/issues/38#issuecomment-1103206197?

A small point, but when you do so could you link the list of requested IDs as a GitHub Gist or similar form which is readable in-browser without download? (It will just be a bit easier to work with that way.)

I admit I'm not clear on what your preferred timeline is for this, so perhaps you could make a note of that in the new issue as well. Whether or not we, as a small team with a lot of other work on our plates, will be able to achieve that timeline is another matter of course, but once I know what it is, I'll certainly give you an honest assessment of that.

cisaacstern commented 2 years ago

> I realize it's a bit redundant, but just so we have everything in one place, could I ask you to open a new issue on that repository requesting we work on the list of IDs you provided in https://github.com/pangeo-data/pangeo-cmip6-cloud/issues/38#issuecomment-1103206197?

@rfofrich, I'm working on this today, so I just went ahead and created this new tracker issue: https://github.com/pangeo-forge/cmip6-feedstock/issues/5

To everyone following this thread: I'm going to close this Issue now. @jbusecke and I will provide future updates on this topic on the new issue linked above. Thanks so much for your engagement and enthusiasm. I expect we'll have some progress to share within another week or so.