metno / pyaerocom

Python tools for climate and air quality model evaluation
https://pyaerocom.readthedocs.io/
GNU General Public License v3.0

Pangeo/CMIP6 reader in Pyaerocom #603

Closed Ovewh closed 8 months ago

Ovewh commented 2 years ago

The pangeo cloud service has archived a lot of CMIP6 data, which is accessible through an intake catalog. There is an example of how this intake catalog works here. This intake catalog is built using the intake-esm plugin.

The proposed new feature is a new reader that takes a catalog search (the catalog can be browsed using the intake library) and retrieves the matching data from the Pangeo cloud service, also through the intake library. The xarray dataset retrieved from the Pangeo service then has to be converted from ESGF DKRZ format to AeroCom format and then into a GriddedData object.

jgriesfeller commented 2 years ago

Some comments / thoughts:

avaldebe commented 2 years ago
  • do we want to cache the data or just read it as we need it? The former needs a writer class we don't have

intake caches the data, and can be configured to use a common cache pool (e.g. on lustre)

avaldebe commented 2 years ago

@Ovewh please look at your ~/.intake/cache and let us know what is cached
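A minimal sketch of how the cache directory could be inspected, assuming the cache (if any) lives under ~/.intake/cache as mentioned above. The helper name `summarize_cache` is made up for illustration and is not part of intake or pyaerocom.

```python
from pathlib import Path


def summarize_cache(cache_dir):
    """Return (file_count, total_bytes) for cache_dir, or None if it does not exist."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        return None
    # walk the whole tree and keep regular files only
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


if __name__ == "__main__":
    summary = summarize_cache("~/.intake/cache")
    if summary is None:
        print("no cache directory found")
    else:
        count, size = summary
        print(f"{count} files, {size / 1e6:.1f} MB")
```

A `None` result means the directory does not exist at all, which is a different answer from an existing but empty cache.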

avaldebe commented 2 years ago

@MichaelSchulzMETNO and @jgriesfeller can you check if we can mount betzy over sshfs?

Ovewh commented 2 years ago

@Ovewh please look at your ~/.intake/cache and let us know what is cached

I have been checking out the example notebooks provided by intake-esm and [pangeo](http://gallery.pangeo.io/repos/pangeo-gallery/cmip6/global_mean_surface_temp.html). There is no caching of the data: the ~/.intake/cache folder does not exist, and the analysis takes the same amount of time on every run.

avaldebe commented 2 years ago

@Ovewh please look at your ~/.intake/cache and let us know what is cached

I have been checking out the examples notebooks provided by [...]. There is no caching of the data, so in other words, the ~/.intake/cache folder does not exist and the analysis takes the same amount of time to run each time.

Did you run the notebooks on your machine or on some remote notebook server?

Ovewh commented 2 years ago

@Ovewh please look at your ~/.intake/cache and let us know what is cached

I have been checking out the examples notebooks provided by [...]. There is no caching of the data, so in other words, the ~/.intake/cache folder does not exist and the analysis takes the same amount of time to run each time.

Did you run the notebooks on your machine or on some remote notebook server?

I ran the notebooks locally on my own computer.

Ovewh commented 2 years ago

Also, regarding performance, the pangeo team has published an interesting comparison of the speed of retrieving data from the different data formats and services: http://gallery.pangeo.io/repos/earthcube2020/ec20_abernathey_etal/cloud_storage.html
We won't achieve the same speeds because our connection is much less optimal, but it is still impressive that read speeds of 5000 MB/s are possible.

MichaelSchulzMETNO commented 2 years ago

I agree - no intake cache folder found on my Mac when executing this google intake_esm notebook: https://intake-esm.readthedocs.io/en/stable/user-guide/cmip6-tutorial.html

MichaelSchulzMETNO commented 2 years ago

@MichaelSchulzMETNO and @jgriesfeller can you check if we can mount betzy over sshfs?

I tried - we can sshfs mount the CMIP6 folder on betzy via nird. This means the CMIP6 files on betzy don't need to be copied to lustre-metno to be accessed. The question remains whether this is quicker than the intake method via zar... needs to be tested.
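One way the comparison could be tested is a small timing harness like the sketch below. It only assumes that each access method can be wrapped in a zero-argument callable; the names `time_calls` and `compare` are hypothetical and the reader callables are left to the user.

```python
import time


def time_calls(func, repeats=3):
    """Call func() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def compare(readers, repeats=3):
    """Return {name: best duration} for a dict of zero-argument reader callables."""
    return {name: min(time_calls(func, repeats)) for name, func in readers.items()}


# Hypothetical usage: each callable must force the data to be read,
# because xarray opens datasets lazily.
# compare({"sshfs": read_from_sshfs_mount, "pangeo": read_from_pangeo})
```

Looking at the individual durations from `time_calls` also helps with the caching question: a first call that is much slower than the following ones would hint that something is being cached.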

jgriesfeller commented 2 years ago

This will work only on the laptops. There's no sshfs on PPI (I tried only rhel8 machines since the others will be gone in less than a year)

Ovewh commented 2 years ago

I did a small proof of concept using pangeo and pyaerocom. For two models it is no slower than using an sshfs mount to access the data. It is relatively straightforward to convert the xarray objects into iris objects using the to_iris() function. I had some issues deciding what metadata to pass along to the iris cube object and translating some of the CMIP6 attributes into pyaerocom attributes. Searching the database can be done using the intake catalog, and the search can be filtered using a dictionary.

#!/usr/bin/env python
# coding: utf-8

# additional dependencies:
# intake
# intake-esm
# gcsfs

import intake
import matplotlib.pyplot as plt
import pyaerocom as pya
from pyaerocom.colocation import colocate_gridded_ungridded
from pyaerocom.griddeddata import GriddedData
from pyaerocom.io.iris_io import check_and_regrid_lons_cube

# open the Pangeo CMIP6 catalog
col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")

# monthly od550aer from the historical runs of three models
cat = col.search(experiment_id=['historical'], table_id='AERmon', variable_id='od550aer',
                 grid_label='gn', source_id=['MPI-ESM-1-2-HAM', 'NorESM2-LM', 'EC-Earth3-AerChem'])

# load the matching zarr stores as xarray datasets
# (newer intake-esm releases take xarray_open_kwargs instead of zarr_kwargs)
dset_dict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True})
print(list(dset_dict.keys()))
ds_MPI = dset_dict['CMIP.HAMMOZ-Consortium.MPI-ESM-1-2-HAM.historical.AERmon.gn']
ds_NorESM = dset_dict['CMIP.NCC.NorESM2-LM.historical.AERmon.gn']

# take the first ensemble member and convert it to an iris cube
cube_mpi = ds_MPI['od550aer'].isel(member_id=0).to_iris()
cube_mpi = check_and_regrid_lons_cube(cube_mpi)

cube_NorESM = ds_NorESM['od550aer'].isel(member_id=0).to_iris()
cube_NorESM = check_and_regrid_lons_cube(cube_NorESM)

# read the AERONET observations and keep the East Asia region
aeronet_reader = pya.io.ReadUngridded('AeronetSunV3Lev2.daily')
aeronet_sun = aeronet_reader.read(vars_to_retrieve=['od550aer'])
east_asia = aeronet_sun.filter_region('EAS')

# wrap the cubes in GriddedData and colocate them with the observations
gd_mpi = GriddedData(cube_mpi, ts_type='monthly', data_id=ds_MPI.source_id)
gd_NorESM = GriddedData(cube_NorESM, ts_type='monthly', data_id=ds_NorESM.source_id)
colocated_mpi = colocate_gridded_ungridded(gd_mpi, east_asia, ts_type='monthly', start=2005)
colocated_NorESM = colocate_gridded_ungridded(gd_NorESM, east_asia, ts_type='monthly', start=2005)

# scatter plots of model vs. observations, one panel per model
fig, ax = plt.subplots(ncols=2, figsize=(14.2 * 1.5, 6 * 1.5))
colocated_mpi.plot_scatter(ax=ax[0])
colocated_NorESM.plot_scatter(ax=ax[1])
plt.savefig('mpi_noresm_pangeo_validation.png', bbox_inches='tight')
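
The metadata translation mentioned above could start from the dataset keys themselves. The sketch below assumes the keys always follow the six-facet, dot-separated pattern seen in the script (e.g. 'CMIP.NCC.NorESM2-LM.historical.AERmon.gn') and that no facet contains a dot. The `TABLE_TO_TS_TYPE` lookup and both function names are hypothetical and not part of pyaerocom or intake-esm.

```python
# Facet order assumed from the dataset keys used in the script.
KEY_FACETS = ("activity_id", "institution_id", "source_id",
              "experiment_id", "table_id", "grid_label")

# Hypothetical lookup from CMIP6 table_id to a pyaerocom ts_type string.
TABLE_TO_TS_TYPE = {
    "AERmon": "monthly",
    "Amon": "monthly",
    "AERday": "daily",
    "day": "daily",
}


def parse_key(key):
    """Split a dataset key into a {facet: value} dict."""
    parts = key.split(".")
    if len(parts) != len(KEY_FACETS):
        raise ValueError(f"expected {len(KEY_FACETS)} facets, got {len(parts)}: {key!r}")
    return dict(zip(KEY_FACETS, parts))


def griddeddata_kwargs(key):
    """Derive the data_id and ts_type keyword arguments from a dataset key."""
    facets = parse_key(key)
    table_id = facets["table_id"]
    if table_id not in TABLE_TO_TS_TYPE:
        raise ValueError(f"no ts_type known for table_id {table_id!r}")
    return {"data_id": facets["source_id"], "ts_type": TABLE_TO_TS_TYPE[table_id]}
```

With such a helper, the hard-coded `ts_type='monthly'` and `data_id=...` arguments in the script could be replaced by `GriddedData(cube, **griddeddata_kwargs(key))`, though a real reader would likely need to map more attributes than these two.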

github-actions[bot] commented 9 months ago

This issue is stale because it has been open for 365 days with no activity. This issue will be closed in 14 days if no action is taken.

github-actions[bot] commented 8 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.