ua-snap / cmip6-utils

Pipelines and utilities for working with CMIP6 data

Transfers pipeline improvements #54

Closed Joshdpaul closed 5 months ago

Joshdpaul commented 5 months ago

This PR closes #39, closes #38, addresses #32, and closes #3.

Changes to model ensemble

The following models were added to the transfer pipeline:

and the following model was removed:

All files were successfully transferred by running the pipeline as described in the README, except the E3SM files. Though the E3SM models appear in the holdings, only the historical period is available on the LLNL ESGF2 globus node. Therefore the E3SM models did not have a variant chosen for the config.prod_variant_lu table and were not included in the batch files. We will need to keep #32 open for now and decide what to do about the incomplete data for the E3SM models.

Retry in ls utility

The utils.operation_ls() function was improved by adding retry logic: it attempts up to 5 retries if the operation receives an HTTP error. This seemed to fix the intermittent errors from issue #3 and resulted in a clean transfers run. By printing error statements from the holdings audit (see changes to the esgf_holdings.get_filenames() function) and capturing the output in a text file (see the new esgf_holdings_output.txt), it should be possible to parse any persistent errors and potentially retry those specific paths in a new transfer session. (This is somewhat related to issue #53 in that we may want to include the ability to audit/transfer smaller batches of files.) That output text file also records errors for empty directories, which could be retried at a later date to check whether more data has been uploaded to the globus node.
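
For reviewers, here is a minimal sketch of the retry pattern described above. It is not the code in utils.operation_ls(). The names `ls_with_retry`, `ls_func`, and `TransientHTTPError` are hypothetical stand-ins, and the exception type the real Globus client raises is not shown here.

```python
import time


class TransientHTTPError(Exception):
    """Hypothetical stand-in for the HTTP error raised by the real client."""


def ls_with_retry(ls_func, path, max_retries=5, delay_seconds=0.0):
    """Call ls_func(path), retrying up to max_retries times on an HTTP error.

    ls_func is any callable that lists a path (an assumption for this sketch).
    The error is re-raised once the retries are exhausted, so the caller can
    log the failing path for a later transfer session.
    """
    attempts = 0
    while True:
        try:
            return ls_func(path)
        except TransientHTTPError:
            if attempts >= max_retries:
                raise
            attempts += 1
            time.sleep(delay_seconds)
```

With `max_retries=5`, a path is tried at most six times in total (one initial call plus five retries) before the error propagates.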

Allow multiple institutions & multiple grid types

The code was refactored to allow lists of institutions and grid types in config.py. Some functions were revised to return a list of dicts instead of a single dict, so that if multiple directories are encountered during ls operations, the contents of all of them can be audited. (Since there is no recursive ls function available in the globus SDK, this was necessary in cases where we don't know which models have multiple grid types.) Using the new select_grids.ipynb notebook, we see that only two models (CNRM-CM6-1-HR and EC-Earth3-Veg) require more than one grid type in order to capture all frequency/variable combinations. Otherwise, models with more than one grid type have equal representation of frequency/variable combinations regardless of which grid type is chosen; in those cases, the native grid gn is always specified in config.py.
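
To make the "list of dicts" idea concrete, here is a rough sketch, assuming an ls callable that returns entries shaped like `{"name": ..., "type": "dir" | "file"}`. The function name `list_grid_dirs`, the result keys, and the fake directory tree are all hypothetical and are not taken from esgf_holdings.py.

```python
def list_grid_dirs(ls_func, parent_path, grid_types):
    """Return one dict per matching grid directory found under parent_path.

    A list is returned instead of a single dict so that every grid
    directory is audited, not only the first one encountered.
    """
    results = []
    for entry in ls_func(parent_path):
        if entry["type"] == "dir" and entry["name"] in grid_types:
            child_path = f"{parent_path}/{entry['name']}"
            files = [e["name"] for e in ls_func(child_path) if e["type"] == "file"]
            # Empty directories are kept (files == []) so they can be reported.
            results.append({"grid": entry["name"], "path": child_path, "files": files})
    return results


# Hypothetical in-memory stand-in for a remote directory tree.
FAKE_TREE = {
    "/model/tas": [{"name": "gn", "type": "dir"}, {"name": "gr", "type": "dir"}],
    "/model/tas/gn": [{"name": "tas_gn.nc", "type": "file"}],
    "/model/tas/gr": [],
}


def fake_ls(path):
    return FAKE_TREE.get(path, [])


audit = list_grid_dirs(fake_ls, "/model/tas", ["gn", "gr"])
```

Here `audit` holds two dicts, one for `gn` with a single file and one for `gr` with an empty file list, which is the kind of case that would be flagged as an empty directory.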

In all cases where the config.model_inst_lu table is used, a conditional statement was added in order to allow for multiple institutions. This only applies to the MPI-ESM1-2-HR model, but the code now supports this situation in general without flagging this model explicitly. From what I can tell, these revisions to the transfers config.py should not affect the regridding or indicators pipelines since those use their own config.py files.
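
The conditional described above could look roughly like the sketch below: a lookup value is normalized so that callers can always loop over institutions, whether the table holds a single string or a list. The helper name `institutions_for` and the table contents are illustrative assumptions, not copied from config.py.

```python
def institutions_for(model, model_inst_lu):
    """Return the institutions for a model as a list.

    Accepts either a single institution string or a list/tuple of them,
    so no model needs to be special-cased by name.
    """
    inst = model_inst_lu[model]
    if isinstance(inst, (list, tuple)):
        return list(inst)
    return [inst]


# Illustrative lookup table (assumed contents).
example_model_inst_lu = {
    "GFDL-ESM4": "NOAA-GFDL",
    "MPI-ESM1-2-HR": ["MPI-M", "DKRZ"],
}

for model in example_model_inst_lu:
    for institution in institutions_for(model, example_model_inst_lu):
        print(model, institution)
```

Normalizing in one helper keeps the branching out of each call site, which may make the conditionals in the manifest and batch-file scripts easier to review.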

TO TEST:

I don't think you need to run the entire transfer, but you can if you really want to! What I'd really appreciate is a sanity check on the list logic in the esgf_holdings.py functions, and the conditional statements in generate_manifest.py, generate_batch_files.py, and tests/test_mirror.py. I think my testing confirms that this works for the cases we have identified, but are there other cases that could cause this to break or behave unexpectedly? Are there any obvious cases that are not captured by the conditional statements in esgf_holdings.get_filenames()?

Check out the new select_grids.ipynb to see the analysis. This notebook uses very similar logic and borrows some code from select_variants.ipynb.

Also, if you have any ideas for improving the printed error statements that end up in esgf_holdings_output.txt, let me know. The goal is to make that output more useful in subsequent transfer attempts.