pangeo-forge / cmip6-pipeline

Pipeline for cloud-based CMIP6 data ingestion
Apache License 2.0

Test request for CMIP6 data processing and uploading to Cloud #4

Open naomi-henderson opened 4 years ago

naomi-henderson commented 4 years ago

@dgergel , I am going to start documenting our work together here so others can follow if interested.

Diana and I are working through a particular CMIP6 data request in order to rewrite/extend/fix the code started in CMIP6-collection.

Our test request is described in the last entries of the request sheet.

This represents a project of mutual interest to obtain the following datasets for as many models as possible:

table_id = 'day'
variable_id = ['pr', 'tasmin', 'tasmax']
experiment_id = ['historical', 'ssp126', 'ssp245', 'ssp370']

We started with member_id = 'r1i1p1f1'. Many models do not have a member_id matching this value, so I have now extended our wish list to at least one ensemble member from each experiment_id/source_id combination (the same member for all variables, so that 'pr' is not from one run and 'tasmax' from another).

Note that there are separate rows for each variable - that is just so that I can process them in parallel.
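For reference, a minimal sketch of how this request could be checked against what is already in the cloud, using an intake-esm search (the catalog URL and query below are just my assumptions for illustration, not the pipeline code):

```python
# Hypothetical sketch: query the Pangeo CMIP6 cloud catalog for the test request.
import intake

col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"  # assumed catalog URL
)

subset = col.search(
    table_id="day",
    variable_id=["pr", "tasmin", "tasmax"],
    experiment_id=["historical", "ssp126", "ssp245", "ssp370"],
)

# how many members each model already has per experiment
print(subset.df.groupby(["source_id", "experiment_id"]).member_id.nunique())
```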

naomi-henderson commented 4 years ago

I will just copy one of my original emails to Diana here to get us started.

Hi Diana,

It would be fantastic to get your help with the data processing steps. Let me first try to answer a few of your questions.

  • How long do the daily surface datasets take to download, convert, and upload to GC? This is hard to say in general, but let me give you two examples that I tried today:

downloading: http://esgf-data3.ceda.ac.uk/thredds/fileServer/esg_cmip6/CMIP6/ScenarioMIP/CSIRO-ARCCSS/ACCESS-CM2/ssp126/r1i1p1f1/day/tasmax/gn/v20191108/tasmax_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_*

2 netcdf file(s) at 40MB/s took 64s. (total size = 2.1G)

http://esg.lasg.ac.cn/thredds/fileServer/esg_dataroot/CMIP6/ScenarioMIP/CAS/FGOALS-g3/ssp370/r1i1p1f1/day/pr/gn/v20190820/pr_day_FGOALS-g3_ssp370_r1i1p1f1_gn_*

86 netcdf file(s) at 50KB/s took 10.2 hours. (total size = 1.8G)

processing: My computers are fast (multithreaded and new), so the concatenation and to_zarr steps on your datasets of 2-50G will take less than a few minutes per dataset. The historical CNRM-CM6-1-HR datasets are the largest, at 53G per variable, and the GISS-E2-1-G datasets are the smallest (though they might not be worth downscaling). Most are only about 10G. Uploading to GC from Columbia is also extremely fast: again, a few minutes to at most 15 minutes per dataset.
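For concreteness, the concatenate-and-upload step is conceptually something like the sketch below (the file pattern, chunking, and bucket name are placeholders, not the actual pipeline code):

```python
# Rough sketch of the concatenation / to_zarr / upload step.
import xarray as xr
import gcsfs

# open all netcdf files for one dataset and concatenate along time
ds = xr.open_mfdataset(
    "pr_day_ACCESS-CM2_ssp126_r1i1p1f1_gn_*.nc",
    combine="by_coords",
    use_cftime=True,
)

# rechunk so each zarr chunk is a reasonable size for cloud access
ds = ds.chunk({"time": 600})

# write directly to a Google Cloud Storage bucket (bucket name is a placeholder)
fs = gcsfs.GCSFileSystem(token="google_default")
store = fs.get_mapper("gs://some-test-bucket/CMIP6/pr_day_ACCESS-CM2_ssp126_r1i1p1f1_gn")
ds.to_zarr(store, consolidated=True, mode="w")
```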

cleaning and troubleshooting: Timewise, this is by far the most expensive step. The daily datasets you see in the cloud were never actually requested by anyone; they were just for demonstration purposes before the CMIP6 Hackathon last fall - we were really focusing on the large ocean datasets. There was a request for some of the daily pressure-level data a few months ago, and those were a bit of a bother. Most of them are not mirrored to the faster ESGF-CoG nodes, so fewer people have worked with them and bugs exist which have not yet been addressed.

So the bottleneck is really solving these issues on a model-by-model basis. For some models, there are issues that the code will not be able to handle (e.g., downloading very large files using Globus, or contacting the data providers to ask for missing netcdf files). But there are many issues that could be handled automatically and much more efficiently. Chiara Lepore and I worked together to get the 6-hourly atmosphere data on model levels (these datasets are 1-2 TB each), and together we figured out how to fix most of the issues (some of them did not emerge until she actually tried to use the zarr stores) - but I never managed to transfer the fixes back into the processing code. All I did was (cringe) write special notebooks for each case.

What I will suggest is that I try to automatically process as many of your datasets as possible with my current python code, clunky as it is, and I can keep careful notes of the issues. I assume you are a much more proficient pythonista than I, so we could then try to re-work the processing part of the code, to properly handle the exceptions that occur as well as figure out how to automatically diagnose potential problems with the remaining datasets. Are you willing?

Cheers, Naomi

naomi-henderson commented 4 years ago

And here is a response to her latest email - this should be the last one in this format.

Hi Diana,

First, a response to your latest email, since I was almost ready to send what is below.

I don't use the nb3-DataCleaning notebook. I combined its functionality with the DataRequest notebook (to process the fixing 'codes'). Only a very few people have submitted data-cleaning issues; see CMIP6_DataExceptions (Responses). Those have been added to the issues in https://github.com/naomi-henderson/cmip6collect/blob/master/notebooks/csv/exceptions.csv, which is the file parsed for codes indicating how to fix the data. The daily data no longer has many issues, since the bad files have been replaced by newer versions.
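Conceptually, the lookup against that file is just something like this (the column names below are illustrative placeholders: the real layout is whatever is in csv/exceptions.csv):

```python
# Illustrative only: column names are guesses, check csv/exceptions.csv for the real ones.
import pandas as pd

exceptions = pd.read_csv("csv/exceptions.csv")

def get_fix_code(source_id, experiment_id, member_id, variable_id):
    """Return the fix 'code' for a dataset, or None if no exception is listed."""
    match = exceptions[
        (exceptions.source_id == source_id)
        & (exceptions.experiment_id == experiment_id)
        & (exceptions.member_id == member_id)
        & (exceptions.variable_id == variable_id)
    ]
    return match.code.iloc[0] if len(match) else None
```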

What is of more interest are the notebooks that make the catalogs - that is where we look at the ES-DOC Errata issues and report them in the 'noQC' catalog. The data with serious issues are then removed from the official catalog. I have been working on new code to compare versions and delete old versions when new data is available - I will put it in the repo as soon as I get something working.

Cheers, Naomi

naomi-henderson commented 4 years ago

What follows is the inventory of what we have after blindly running the scripts.

I have created a google sheet listing the available datasets from our [ESGF API search at DKRZ](https://docs.google.com/spreadsheets/d/1crbwP9uAyw4vyDgz3FcbHE69eVZq12PO8E5zfym-SNs/edit?usp=sharing).

I have commented on the processing issues for each model. Light blue means the dataset is already in GC. Light green means there is a dataset available, but it was not processed in the first sweep.

Here is what I have found so far:

DOWNLOADING ISSUES:

  1. The daily data is not so terribly hopeless - it is mostly a matter of choosing the ensemble member_id correctly. Just getting 'r1i1p1f1' is overly restrictive.
  2. The FGOALS* models are still a problem; I guess I will let the scripts run in the background for however long it takes.
  3. I don't know how to automate the download for IITM-ESM (bad certificate); one possible (untested) workaround is sketched below.
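The workaround I have in mind, purely as a sketch, is to skip certificate verification for just that one data node (the URL below is a placeholder, and this obviously bypasses the TLS check):

```python
# Untested sketch: download from a data node with a broken certificate.
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://some-esgf-data-node/thredds/fileServer/CMIP6/.../some_file.nc"  # placeholder
with requests.get(url, stream=True, verify=False, timeout=60) as r:
    r.raise_for_status()
    with open(url.split("/")[-1], "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```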

PROCESSING ISSUES (from the 'csv/exceptions.csv' file, except for item 4):

  1. The CESM2 r1i1p1f1 netcdf files had trouble with the concatenation, but the datasets are now in the official ES-DOC ERRATA pages and have been replaced by new runs.
  2. The GFDL-CM4,historical,r1i1p1f1,day datasets had the following troubles: 'one (last) ncfile contains an extra height dimension to indicate it is at 10m'
  3. The EC-Earth3,historical,r20i1p1f1,day,pr netcdf files have some sort of HDF error (also r24i1p1f1), but r1i1p1f1 worked fine.
  4. NorESM2-LM,ssp126,r1i1p1f1: time concatenation results in some duplicated times (need join=exact?)

Here are a few possible solutions I see for picking a particular member_id:

Solution 1 (2626 datasets needed): Download all ensemble members.

Solution 2 (371 datasets): Have a clever algorithm to find one of the ensemble members with the most datasets for each experiment; see, e.g., CESM2-WACCM for an example of the complexity (a rough sketch of such an algorithm is below).

Solution 3: Download the 'r1i1p1f[1-3]' members and make special requests for the CESM2* models, specifying the desired ensemble member.

This just leaves the CESM2* datasets. The initial CESM2 r1i1p1f1 runs had errors, so their new 'default' runs are r10i1p1f1 and r11i1p1f1. CESM2-WACCM does not have an obvious choice, since we prefer using all three variables from the same run.
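Here is a rough sketch of what the 'clever algorithm' in Solution 2 could look like, assuming the ESGF search results are in a pandas DataFrame with one row per dataset (the column names follow the usual CMIP6 facets; this is illustrative, not the pipeline code):

```python
# For each model/experiment, pick the member_id with the most requested variables.
import pandas as pd

def pick_members(df: pd.DataFrame) -> pd.DataFrame:
    counts = (
        df.groupby(["source_id", "experiment_id", "member_id"])
          .variable_id.nunique()
          .reset_index(name="nvars")
    )
    # keep the member with the most variables; ties broken by member_id sort order
    best = (
        counts.sort_values(["nvars", "member_id"], ascending=[False, True])
              .groupby(["source_id", "experiment_id"])
              .head(1)
    )
    return best
```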

naomi-henderson commented 4 years ago

The duplicate times problem in the NorESM2-LM daily data for the ssp experiments is due to extra, erroneous netcdf files that sit in the same directory. 3 of the 12 netcdf files are near duplicates - the 3 bad files end in '...1230.nc' and the 9 good files end in '...1231.nc'. The bad files should be removed prior to concatenation. For example, these two files are both in NorESM2-LM/ssp126/r1i1p1f1/day/pr/gn/v20191108:

http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/ScenarioMIP/NCC/NorESM2-LM/ssp126/r1i1p1f1/day/pr/gn/v20191108/pr_day_NorESM2-LM_ssp126_r1i1p1f1_gn_20310101-20401230.nc

http://noresg.nird.sigma2.no/thredds/fileServer/esg_dataroot/cmor/CMIP6/ScenarioMIP/NCC/NorESM2-LM/ssp126/r1i1p1f1/day/pr/gn/v20191108/pr_day_NorESM2-LM_ssp126_r1i1p1f1_gn_20310101-20401231.nc

@dgergel , it would be great to have a test which checks for these duplicated (overlapping) times in the future. For now I will add this to exceptions.csv with code='remove_files', which should flag my processing code to remove the extra files prior to concatenation.
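A first stab at such a test, just as a sketch, could work directly from the date ranges in the CMIP6 filenames (another option would be to check that the concatenated time index is unique and monotonic):

```python
# Sketch: flag netcdf files in one directory whose date ranges overlap,
# e.g. the NorESM2-LM '...1230.nc' / '...1231.nc' near duplicates.
import re

def find_overlaps(filenames):
    spans = []
    for f in sorted(filenames):
        m = re.search(r"_(\d{8})-(\d{8})\.nc$", f)
        if m:
            spans.append((int(m.group(1)), int(m.group(2)), f))
    overlaps = []
    for (s1, e1, f1), (s2, e2, f2) in zip(spans, spans[1:]):
        if s2 <= e1:  # next file starts before the previous one ends
            overlaps.append((f1, f2))
    return overlaps
```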

dgergel commented 4 years ago

@naomi-henderson this is all fantastic, thank you!! More soon, and my apologies for the slow reply, have had a lot going on this week.

naomi-henderson commented 4 years ago

@dgergel - no apologies needed, just wanted to do a brain dump here while I had some time! I still have one more notebook (nb2-UpdateVersions.ipynb) that needs some work before you spend too much time on it.

dgergel commented 4 years ago

hi @naomi-henderson - I am back to working on this, any updates I should be aware of since our conversation? Looks like you made some updates to the nb2-UpdateVersions.ipynb notebook since your last comment?

rabernat commented 4 years ago

I thought I would let you all know that @jhamman and I have been at work on https://github.com/pangeo-forge/pangeo-forge. This is in very early stages, but the goal is to have a general-purpose framework for doing these sorts of ETL tasks. Eventually we will want to try to get the CMIP6 workflow to use pangeo-forge. But for now, I would recommend that you carry on as you are.

naomi-henderson commented 4 years ago

hi @dgergel - Yes, there are a few changes, and since I am continually updating the CMIP6 GC repo using these notebooks and python functions, there will be ongoing changes in my own working directory. If I were more proficient, I would fork (or branch) the version you are looking at and then submit pull requests (or merge) whenever the code is more stable ... sorry about that. How would you prefer to work on this? My guess is that, once you start re-writing and cleaning up the code, my original code (and the small changes) will not be important. Ideally I would start testing your new version and add to it any of the recent changes (such as the code for the special cases needed for pre-processing datasets with problems).

I am not really sure what @rabernat meant by us carrying on - so I am basically ignoring his comment, but delighted that they are attempting a proper ETL framework!

rabernat commented 4 years ago

> I am not really sure what @rabernat meant by us carrying on

What I meant is that our pangeo-forge efforts are not mature enough to try to integrate with the CMIP6 pipeline at this stage. We are starting with much simpler cases, like making a pipeline for NOAA AVHRR OISST. However, we will be watching what you do here carefully in order to guide our design.

dgergel commented 4 years ago

@rabernat thanks for sharing the pangeo-forge progress, looks exciting. I took a look so I'd have some idea of where you are headed.

@naomi-henderson I think that since I am not adding to your notebooks or repo, but rather refactoring your code and pushing it to this repo, it might be best if you just keep pushing updates to your notebooks as you make them. If the code isn't entirely stable, that is ok for my purposes. Does that seem reasonable to you?

naomi-henderson commented 4 years ago

@dgergel - perfect, thanks

dgergel commented 4 years ago

@naomi-henderson I am struggling a little to figure out which versions of your notebooks you are currently using, since some have been renamed but are very similar in structure. I started a paper doc with an outline of the current workflow, including notebooks, modules (just script names for now, with groups of functions), and classes of exceptions. The list of notebooks, though, is outdated - it's from when I first started going through the workflow a month ago - and I'm in the process of updating it.

So that I'm able to tell which notebooks are currently being used, I suggest you either update the names to make it clear or make a note of it in the paper doc I shared above. Does that seem reasonable?

naomi-henderson commented 4 years ago

@dgergel , yes, I see some confusion here. My current workflow is very different from what you have outlined. I will modify the document, but perhaps our terminology is different, so I think we had better agree on what we mean by ESGF, etc.

dgergel commented 4 years ago

@naomi-henderson I see you are already editing the paper doc, thank you! Yes most of my work on this was from about a month ago, and it was a fairly rough outline.

What did you see as the problem with terminology? I definitely agree we should make sure we're on the same page in that regard.

naomi-henderson commented 4 years ago

Okay, I agree my naming convention is odd - mostly a result of the evolution from proof-of-concept to actually-happening, and of the need to run multiple requests simultaneously on multiple machines. But basically there are three categories - nb1, nb2 and nb3. An 'e' means extra, and nb3 happens in two stages, nb3a and nb3b.

Hopefully you can make sense out of this.

The nomenclature issues are mostly due to the ambiguity of 'ESGF', which can refer both to the 4 ESGF search nodes and to the 30 ESGF data nodes. The ES-DOC ERRATA, our Google Request Form, and our Google Errata Form add to the confusion.