leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

Add MODIS-COSP recipe #68

Open jbusecke opened 11 months ago

jbusecke commented 11 months ago

pre-commit.ci autofix

jbusecke commented 11 months ago

Ok got the recipe deployed and cached the files. So the auth part works 🎉

But I am running into issues on dataflow now:

apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 907, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 908, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1419, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1507, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1417, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 623, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1581, in apache_beam.runners.common._OutputHandler.handle_process_outputs
  File "apache_beam/runners/common.py", line 1694, in apache_beam.runners.common._OutputHandler._write_value_to_tag
  File "apache_beam/runners/worker/operations.py", line 240, in apache_beam.runners.worker.operations.SingletonElementConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 1238, in apache_beam.runners.worker.operations.PGBKCVOperation.process
  File "apache_beam/runners/worker/operations.py", line 1267, in apache_beam.runners.worker.operations.PGBKCVOperation.process
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/combiners.py", line 33, in add_input
    accumulator.add_input(schema, position)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/pangeo_forge_recipes/aggregation.py", line 69, in add_input
    s["chunks"][self.concat_dim] = {position: s["dims"][self.concat_dim]}
KeyError: "time [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|StoreToZarr/StoreToZarr/DetermineSchema/CombineGlobally(CombineXarraySchemas)/KeyWithVoid-ptransform-63']"

I am fairly sure this is because the files do not actually have a time dimension (each file represents one time step, but contains only 2D lat/lon arrays). I am wondering what the best path forward is.
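The failure can be reproduced in miniature. The sketch below is a simplified stand-in for the schema aggregation step, not the pangeo-forge-recipes implementation: the `add_input` helper and the `schema_2d` / `schema_3d` dictionaries are hypothetical and modelled only on the failing line shown in the traceback.

```python
def add_input(schema: dict, position: int, concat_dim: str) -> dict:
    """Record the length one file contributes along concat_dim."""
    s = {"dims": dict(schema["dims"]), "chunks": {}}
    # Same lookup as the failing line in the traceback: the concat
    # dimension must already exist in the per-file schema.
    s["chunks"][concat_dim] = {position: s["dims"][concat_dim]}
    return s


# A file holding a single 2D lat/lon slice: no "time" entry in dims.
schema_2d = {"dims": {"latitude": 180, "longitude": 360}}
# The same file once a length-1 time dimension has been added.
schema_3d = {"dims": {"time": 1, "latitude": 180, "longitude": 360}}

try:
    add_input(schema_2d, position=0, concat_dim="time")
    error = None
except KeyError as err:
    error = err

with_time = add_input(schema_3d, position=0, concat_dim="time")
```

Under these assumptions the 2D schema raises the same `KeyError` on `"time"`, while the schema with a length-1 time dimension passes, which is why adding the dimension per file is a plausible fix.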

cisaacstern commented 11 months ago

Write a per-file preprocessor that adds a scalar coordinate based on the file metadata (I'll try that in a sec)

Yes, this is what I recommend, and is how I have solved this type of problem myself, for example:

https://github.com/pangeo-forge/aqua-modis-feedstock/blob/9b6a543b3e79bf7a11f9d8926a4bac7e8a671929/feedstock/recipe.py#L70-L72
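As a rough illustration of where such a per-file time value could come from, the sketch below parses it from the file name. It assumes the names follow the usual MODIS `AYYYYDDD` convention (year plus day of year); the `time_from_filename` helper and the shortened example file name are hypothetical and not taken from the linked feedstock.

```python
import re
from datetime import datetime

# Assumed pattern: ".A" followed by a four-digit year and a three-digit
# day of year, terminated by another dot.
_DATE_PATTERN = re.compile(r"\.A(\d{4})(\d{3})\.")


def time_from_filename(filename: str) -> datetime:
    """Return the date encoded as AYYYYDDD in a MODIS-style file name."""
    match = _DATE_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"no AYYYYDDD date found in {filename!r}")
    year, day_of_year = match.groups()
    return datetime.strptime(f"{year}{day_of_year}", "%Y%j")


# Hypothetical, shortened file name used only for illustration.
timestamp = time_from_filename("MCD06COSP_D3_MODIS.A2008336.062.nc")

# Inside a per-file preprocessor the value could then become a length-1
# time dimension, e.g. with xarray: ds = ds.expand_dims(time=[timestamp])
```

Reading the date from a global attribute instead would work the same way, as long as each file yields exactly one timestamp.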

jbusecke commented 11 months ago

Oh crap, this is an even bigger issue. They use netcdf groups 😩. When I naively load the file with xarray I get:

image

Paging @TomNicholas in hopes there is some datatree/xarray wizardry that might help us here!

TomNicholas commented 11 months ago

Hey, happy to have a look, but I'm missing a lot of context! Obviously you guys know you can use datatree to open a netCDF file with groups, then look at the groups / extract each as a dataset. What is the problem?

jbusecke commented 11 months ago

All groups share the lon/lat coordinates, but they are weirdly stored at the root node, and then each node could have additional dimensions:

bunch of crappy ncdump screenshots

💡

hold on let me just use datatree to get a better repr:

```python DataTree('None', parent=None) │ Dimensions: (latitude: 180, longitude: 360) │ Coordinates: │ * latitude (latitude) float64 -89.5 -88.5 -87.5 -86.5 ... 87.5 88.5 89.5 │ * longitude (longitude) float64 -179.5 -178.5 -177.5 ... 177.5 178.5 179.5 │ Data variables: │ *empty* │ Attributes: (12/51) │ YAML_config: grid_settings:\n gridsize: 1\n proje... │ Yori_version: 1.5.0 │ input_files: MCD06COSP_D3_MODIS.A2008336.062.202212... │ daily_defn_of_day_adjustment: False │ history: │ source: idl 8.4, mcd06cosp_preyori 20220218-1,... │ ... ... │ longitude_resolution: 1.0 │ license: http://science.nasa.gov/earth-science/... │ stdname_vocabulary: NetCDF Climate and Forecast (CF) Metad... │ keywords_vocabulary: NASA Global Change Master Directory (G... │ keywords: EARTH SCIENCE > ATMOSPHERE > CLOUDS > ... │ naming_authority: gov.nasa.gsfc.sci.atmos ├── DataTree('Solar_Zenith') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Solar Zenith Angle (Cell to Sun) for Daytime Scenes │ units: degrees │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 180.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Solar_Azimuth') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... 
│ Attributes: │ long_name: Solar Azimuth Angle (Cell to Sun) for Daytime Scenes │ units: degrees │ _FillValue: -999.0 │ valid_min: -180.0 │ valid_max: 180.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Sensor_Zenith') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Sensor Zenith Angle (Cell to Sensor) for Daytime Scenes │ units: degrees │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 180.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Sensor_Azimuth') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Sensor Azimuth Angle (Cell to Sensor) for Daytime Scenes │ units: degrees │ _FillValue: -999.0 │ valid_min: -180.0 │ valid_max: 180.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Top_Pressure') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... 
│ Attributes: │ long_name: Cloud Top Pressure for Daytime Scenes │ units: mb │ _FillValue: -999.0 │ valid_min: 1.0 │ valid_max: 1100.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Mask_Fraction') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Fraction from Cloud Mask for Daytime Scenes │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Mask_Fraction_Low') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Fraction from Cloud Mask (Low Clouds, CTP GE 680 hPa... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Mask_Fraction_Mid') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Fraction from Cloud Mask (Mid Clouds, CTP GE 440 hPa... 
│ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Mask_Fraction_High') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Fraction from Cloud Mask (High Clouds, CTP LT 440 hP... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_Liquid') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_optical_thickness_liquid_7: 7, │ jhisto_cloud_particle_size_liquid_6: 6, │ jhisto_cloud_top_pressure_7: 7) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_optical_thickness_liquid_7, │ jhisto_cloud_particle_size_liquid_6, │ jhisto_cloud_top_pressure_7 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Particle_Size_Liquid (longitude, latitude, jhisto_cloud_optical_thickness_liquid_7, jhisto_cloud_particle_size_liquid_6) float64 ... │ JHisto_vs_Cloud_Top_Pressure (longitude, latitude, jhisto_cloud_optical_thickness_liquid_7, jhisto_cloud_top_pressure_7) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness for Liquid Water Clouds (3.7 micro... 
│ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 150.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_Ice') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_optical_thickness_ice_7: 7, │ jhisto_cloud_particle_size_ice_6: 6, │ jhisto_cloud_top_pressure_7: 7) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_optical_thickness_ice_7, │ jhisto_cloud_particle_size_ice_6, │ jhisto_cloud_top_pressure_7 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Particle_Size_Ice (longitude, latitude, jhisto_cloud_optical_thickness_ice_7, jhisto_cloud_particle_size_ice_6) float64 ... │ JHisto_vs_Cloud_Top_Pressure (longitude, latitude, jhisto_cloud_optical_thickness_ice_7, jhisto_cloud_top_pressure_7) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness for Ice Clouds (3.7 micron Retriev... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 150.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_Total') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_optical_thickness_total_7: 7, │ jhisto_cloud_top_pressure_7: 7) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_optical_thickness_total_7, │ jhisto_cloud_top_pressure_7 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Top_Pressure (longitude, latitude, jhisto_cloud_optical_thickness_total_7, jhisto_cloud_top_pressure_7) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness for Combined (LiquidWater+Ice+Unde... 
│ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 150.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_PCL_Liquid') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_optical_thickness_pcl_liquid_7: 7, │ jhisto_cloud_particle_size_pcl_liquid_6: 6, │ jhisto_cloud_top_pressure_7: 7) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_optical_thickness_pcl_liquid_7, │ jhisto_cloud_particle_size_pcl_liquid_6, │ jhisto_cloud_top_pressure_7 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Particle_Size_PCL_Liquid (longitude, latitude, jhisto_cloud_optical_thickness_pcl_liquid_7, jhisto_cloud_particle_size_pcl_liquid_6) float64 ... │ JHisto_vs_Cloud_Top_Pressure (longitude, latitude, jhisto_cloud_optical_thickness_pcl_liquid_7, jhisto_cloud_top_pressure_7) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness for Liquid Water Phase Clouds (3.7... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 150.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_PCL_Ice') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_optical_thickness_pcl_ice_7: 7, │ jhisto_cloud_particle_size_pcl_ice_6: 6, │ jhisto_cloud_top_pressure_7: 7) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_optical_thickness_pcl_ice_7, │ jhisto_cloud_particle_size_pcl_ice_6, │ jhisto_cloud_top_pressure_7 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... 
│ JHisto_vs_Cloud_Particle_Size_PCL_Ice (longitude, latitude, jhisto_cloud_optical_thickness_pcl_ice_7, jhisto_cloud_particle_size_pcl_ice_6) float64 ... │ JHisto_vs_Cloud_Top_Pressure (longitude, latitude, jhisto_cloud_optical_thickness_pcl_ice_7, jhisto_cloud_top_pressure_7) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness for Ice Phase Clouds (3.7 micron R... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 150.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_PCL_Total') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_optical_thickness_pcl_total_7: 7, │ jhisto_cloud_top_pressure_7: 7) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_optical_thickness_pcl_total_7, │ jhisto_cloud_top_pressure_7 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Top_Pressure (longitude, latitude, jhisto_cloud_optical_thickness_pcl_total_7, jhisto_cloud_top_pressure_7) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness for Combined (LiquidWater+Ice+Unde... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 150.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_Log10_Liquid') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness Log10 for Liquid Water Clouds (3.7... 
│ units: none │ _FillValue: -999.0 │ valid_min: -2.0 │ valid_max: 2.176 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_Log10_Ice') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness Log10 for Ice Clouds (3.7 micron R... │ units: none │ _FillValue: -999.0 │ valid_min: -2.0 │ valid_max: 2.176 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Optical_Thickness_Log10_Total') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Thickness Log10 for Combined (LiquidWater+Ic... │ units: none │ _FillValue: -999.0 │ valid_min: -2.0 │ valid_max: 2.176 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Particle_Size_Liquid') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Effective Radius for Liquid Water Clouds (3.7 micron... 
│ units: microns │ _FillValue: -999.0 │ valid_min: 4.0 │ valid_max: 30.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Particle_Size_Ice') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Effective Radius for Ice Clouds (3.7 micron Retrieva... │ units: microns │ _FillValue: -999.0 │ valid_min: 5.0 │ valid_max: 60.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Particle_Size_PCL_Liquid') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Effective Radius for Liquid Water Clouds (3.7 micron... │ units: microns │ _FillValue: -999.0 │ valid_min: 4.0 │ valid_max: 30.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Particle_Size_PCL_Ice') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Effective Radius for Ice Clouds (3.7 micron Retrieva... 
│ units: microns │ _FillValue: -999.0 │ valid_min: 5.0 │ valid_max: 60.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Water_Path_Liquid') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_water_path_liquid_7: 7, │ jhisto_cloud_particle_size_liquid_6: 6) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_water_path_liquid_7, │ jhisto_cloud_particle_size_liquid_6 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Particle_Size_Liquid (longitude, latitude, jhisto_cloud_water_path_liquid_7, jhisto_cloud_particle_size_liquid_6) float64 ... │ Attributes: │ long_name: Cloud Water Path for Liquid Water Clouds (3.7 micron Retri... │ units: g/m^2 │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 3000.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Water_Path_Ice') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_water_path_ice_7: 7, │ jhisto_cloud_particle_size_ice_6: 6) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_water_path_ice_7, │ jhisto_cloud_particle_size_ice_6 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Particle_Size_Ice (longitude, latitude, jhisto_cloud_water_path_ice_7, jhisto_cloud_particle_size_ice_6) float64 ... │ Attributes: │ long_name: Cloud Water Path for Ice Clouds (3.7 micron Retrieval for ... 
│ units: g/m^2 │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 6000.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Water_Path_PCL_Liquid') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_water_path_pcl_liquid_7: 7, │ jhisto_cloud_particle_size_pcl_liquid_6: 6) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_water_path_pcl_liquid_7, │ jhisto_cloud_particle_size_pcl_liquid_6 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Particle_Size_PCL_Liquid (longitude, latitude, jhisto_cloud_water_path_pcl_liquid_7, jhisto_cloud_particle_size_pcl_liquid_6) float64 ... │ Attributes: │ long_name: Cloud Water Path for Liquid Water Clouds (3.7 micron Retri... │ units: g/m^2 │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 3000.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Water_Path_PCL_Ice') │ Dimensions: (longitude: 360, latitude: 180, │ jhisto_cloud_water_path_pcl_ice_7: 7, │ jhisto_cloud_particle_size_pcl_ice_6: 6) │ Dimensions without coordinates: longitude, latitude, │ jhisto_cloud_water_path_pcl_ice_7, │ jhisto_cloud_particle_size_pcl_ice_6 │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ JHisto_vs_Cloud_Particle_Size_PCL_Ice (longitude, latitude, jhisto_cloud_water_path_pcl_ice_7, jhisto_cloud_particle_size_pcl_ice_6) float64 ... │ Attributes: │ long_name: Cloud Water Path for Ice Clouds (3.7 micron Retrieval for ... 
│ units: g/m^2 │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 6000.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Retrieval_Fraction_Liquid') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Properties Retrieval Fraction (Liquid Water ... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Retrieval_Fraction_Ice') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Properties Retrieval Fraction (Ice Clouds) │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Retrieval_Fraction_Total') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Properties Retrieval Fraction (Combined (Liq... 
│ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Retrieval_Fraction_PCL_Liquid') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Properties Retrieval Fraction (Liquid Water ... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 ├── DataTree('Cloud_Retrieval_Fraction_PCL_Ice') │ Dimensions: (longitude: 360, latitude: 180) │ Dimensions without coordinates: longitude, latitude │ Data variables: │ Mean (longitude, latitude) float64 ... │ Standard_Deviation (longitude, latitude) float64 ... │ Sum (longitude, latitude) float64 ... │ Pixel_Counts (longitude, latitude) float64 ... │ Sum_Squares (longitude, latitude) float64 ... │ Attributes: │ long_name: Cloud Optical Properties Retrieval Fraction (Ice Clouds) f... │ units: none │ _FillValue: -999.0 │ valid_min: 0.0 │ valid_max: 1.0 │ scale_factor: 1.0 │ add_offset: 0.0 └── DataTree('Cloud_Retrieval_Fraction_PCL_Total') Dimensions: (longitude: 360, latitude: 180) Dimensions without coordinates: longitude, latitude Data variables: Mean (longitude, latitude) float64 ... Standard_Deviation (longitude, latitude) float64 ... Sum (longitude, latitude) float64 ... Pixel_Counts (longitude, latitude) float64 ... Sum_Squares (longitude, latitude) float64 ... Attributes: long_name: Cloud Optical Properties Retrieval Fraction (Combined Clou... units: none _FillValue: -999.0 valid_min: 0.0 valid_max: 1.0 scale_factor: 1.0 add_offset: 0.0 ```

cisaacstern commented 11 months ago

@jbusecke I previously did a deep dive on this dataset using the pre-beam code, and ultimately came up with the code at the bottom of https://github.com/pangeo-forge/staged-recipes/issues/125#issuecomment-1077053600 as a semi-workable solution, which as you'll see creates a zarr store for each group. In Beam, there should be a better way of doing this, which may or may not benefit from Datatree.

jbusecke commented 11 months ago

Update for @TomNicholas: I initially thought all groups have the same dimensions and was wondering if we can brute force them into a single dataset.

The broader question here is how we deal with datatrees/groups in pangeo-forge I guess, and I thought this would intersect your interest at the moment?

But back to the discussion of this dataset: I am wondering if this could/should be compressed to a dataset instead of a tree?

cisaacstern commented 11 months ago

But back to the discussion of this dataset: I am wondering if this could/should be compressed to a dataset instead of a tree?

If so, we could do something like:


class OpenWithDatatree(beam.PTransform):
    ...

class DatatreeToDataset(beam.PTransform):

    def expand(self, pcoll: PCollection[Datatree]):
        # combine the data tree nodes into a single dataset here
        ds = ...
        return ds

recipe = (
    ...
    | OpenURLWithFSSpec()
    | OpenWithDatatree()
    | DatatreeToDataset()
    | StoreToZarr()
)

If not, I'd say just make one zarr store per group.
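One detail any "combine the nodes into a single dataset" step would have to handle is that every group reuses the same variable names (`Mean`, `Sum`, and so on). The sketch below shows one possible renaming scheme using plain dictionaries as stand-ins for datatree nodes; `flatten_groups` and the toy `tree` are hypothetical, and a real implementation would operate on xarray datasets instead.

```python
def flatten_groups(tree: dict) -> dict:
    """Merge {group: {variable: data}} into one flat {name: data} mapping."""
    flat = {}
    for group, variables in tree.items():
        for name, data in variables.items():
            # Prefix with the group name, since every group reuses
            # variable names such as "Mean" and "Sum".
            flat[f"{group}_{name}"] = data
    return flat


# Toy stand-in for two groups of the file; the values are placeholders.
tree = {
    "Solar_Zenith": {"Mean": [[1.0]], "Sum": [[2.0]]},
    "Cloud_Top_Pressure": {"Mean": [[3.0]], "Sum": [[4.0]]},
}
flat = flatten_groups(tree)
```

The prefixing keeps names unique at the cost of a very wide dataset, which is one argument for the one-store-per-group alternative.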

jbusecke commented 11 months ago

@jbusecke I previously did a deep dive on this dataset using the pre-beam code, and ultimately came up with the code at the bottom of pangeo-forge/staged-recipes#125 (comment) as a semi-workable solution, which as you'll see creates a zarr store for each group. In Beam, there should be a better way of doing this, which may or may not benefit from Datatree.

Ughhh, I now realize that there is more history to this. Should have looked before diving in. Sorry about that.

creates a zarr store for each group

But that is not strictly necessary, right? Is it worth thinking about a datatree->nested zarr pipeline? Maybe that is overkill in the end. It might, however, raise some interesting edge cases for datatree (can we do a dt.concat([dt1, dt2, ...], dim='time')?).

TomNicholas commented 11 months ago

If you wanted to add support for datatree to pangeo-forge, I would start with simple unambiguous io functions and only do combining of datatree objects dataset-by-dataset.

can we do a dt.concat([dt1, dt2, ...], dim='time')?

This kind of operation is not yet implemented in datatree because it's too ambiguous as written.

jbusecke commented 11 months ago

If not, I'd say just make one zarr store per group.

Probably the most viable method for now. Thinking about how to achieve that.

I guess we have two choices here:

Repeated read and filter by group name:

class OpenWithDatatree(beam.PTransform):
    ...

@dataclass
class DatatreeGroupToDataset(beam.PTransform):
    var: str  # name of the group to extract

    def expand(self, pcoll: PCollection[Datatree]):
        # select a single group from each datatree and emit it as a dataset
        return pcoll | beam.Map(select_single_group_from_datatree, var=self.var)

recipe_a = (
    ...
    | OpenURLWithFSSpec()
    | OpenWithDatatree()
    | DatatreeGroupToDataset(var="a")
    | StoreToZarr()
)

recipe_b = (
    ...
    | OpenURLWithFSSpec()
    | OpenWithDatatree()
    | DatatreeGroupToDataset(var="b")
    | StoreToZarr()
)

Seems like a pain in the butt to maintain...but might be more straightforward to implement (we could use the dictobj work from the CMIP6 feedstock to loop over groupnames and generate a recipe dict?).

Emit multiple datasets from the datatree and group the results before storing. We will somehow have to keep track of the store name to write each dataset to (with some sort of key?). Seems much harder but also kind of interesting!

class OpenWithDatatree(beam.PTransform):
    ...

class SplitGroupsToDatasets(beam.PTransform):
    def expand(self, pcoll: PCollection[Datatree]):
        datasets = split_dt()  # this would be a list of ('var', ds_var) tuples maybe?
        return datasets  # Not sure how to properly emit multiple outputs per input here

recipe = (
    ...
    | OpenURLWithFSSpec()
    | OpenWithDatatree()
    | SplitGroupsToDatasets()
    | GroupByVar() # this has to group all datasets that belong to each group/store (there will be multiple time steps).
    | StoreToZarr(target_store='somehow generated from the grouped keys?')
)
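The split-then-group idea can be illustrated without Beam. The sketch below uses plain Python as a stand-in: `split_groups` plays the role of a step that emits several keyed elements per input (in Beam this would roughly correspond to `beam.FlatMap`), and `group_by_key` mimics `beam.GroupByKey`. The helper names, the toy `files` list, and the string placeholders for datasets are all hypothetical.

```python
from collections import defaultdict


def split_groups(index: int, tree: dict):
    """Yield one (group_name, (index, data)) pair per group in a file."""
    for group, data in tree.items():
        yield group, (index, data)


def group_by_key(pairs):
    """Collect all values that share a key, preserving arrival order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Two time steps, each represented as {group name: placeholder dataset}.
files = [
    {"Solar_Zenith": "sz_t0", "Cloud_Top_Pressure": "ctp_t0"},
    {"Solar_Zenith": "sz_t1", "Cloud_Top_Pressure": "ctp_t1"},
]
pairs = [
    pair
    for index, tree in enumerate(files)
    for pair in split_groups(index, tree)
]
per_store = group_by_key(pairs)
```

Each key of `per_store` then holds every time step for one group. The sketch does not address how those keys would be turned into separate target stores, which is the open part of this option.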

jbusecke commented 11 months ago

This kind of operation is not yet implemented in datatree because it's too ambiguous as written.

Yup that makes sense in general. I guess we can add 'time slices of identical tree structures' to the list of subclasses where these operations would actually be non-ambiguous because they have certain properties (similarly to the 'hollow' CMIP6 trees)...but that is a tangent.

I think that at the moment we do not really need any new features from datatree to achieve a workable solution.
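To make the "time slices of identical tree structures" case concrete, the sketch below concatenates trees group by group and refuses anything whose groups differ. Plain dictionaries of lists stand in for datatree nodes, and `concat_trees` is a hypothetical helper, not a datatree function.

```python
def concat_trees(trees: list) -> dict:
    """Concatenate trees group by group, requiring identical group names."""
    if not trees:
        raise ValueError("need at least one tree")
    expected = set(trees[0])
    for tree in trees[1:]:
        # Only structurally identical trees are unambiguous to combine.
        if set(tree) != expected:
            raise ValueError("trees do not share the same groups")
    return {
        group: [step for tree in trees for step in tree[group]]
        for group in trees[0]
    }


# Each tree maps a group name to a list of placeholder time steps.
t0 = {"Solar_Zenith": ["sz_t0"], "Cloud_Top_Pressure": ["ctp_t0"]}
t1 = {"Solar_Zenith": ["sz_t1"], "Cloud_Top_Pressure": ["ctp_t1"]}
combined = concat_trees([t0, t1])
```

The structural check is what removes the ambiguity: every group is combined with its counterpart of the same name, dataset by dataset.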

cisaacstern commented 11 months ago

| StoreToZarr(target_store='somehow generated from the grouped keys?')

The somehow generated part of this is not workable based on anything I have seen/tried yet (and I've gone down this 🐰 🕳️ a bit!).

Adapting your first option a bit more concisely, I would suggest:


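from dataclasses import dataclass

import apache_beam as beam
from apache_beam.pvalue import PCollection
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, StoreToZarr

# OpenWithDatatree and DatatreeGroupToDataset are the transforms sketched above.


@dataclass
class ModisCospRecipe(beam.PTransform):
    var: str  # name of the netCDF group this recipe writes

    def expand(self, pattern: PCollection):
        return (
            pattern
            | OpenURLWithFSSpec()
            | OpenWithDatatree()
            | DatatreeGroupToDataset(var=self.var)
            | StoreToZarr()
        )

pattern = ... # same pattern for all recipes
recipe_a = beam.Create(pattern.items()) | ModisCospRecipe(var="var_a")
recipe_b = beam.Create(pattern.items()) | ModisCospRecipe(var="var_b")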
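With one recipe per group, the `recipe_a`, `recipe_b`, ... assignments could be generated in a loop rather than written out by hand. The sketch below shows only the looping pattern: `RecipeSpec` is a hypothetical stand-in for `ModisCospRecipe`, and `group_names` lists just three of the groups shown in the datatree repr.

```python
from dataclasses import dataclass


@dataclass
class RecipeSpec:
    """Hypothetical stand-in for ModisCospRecipe(var=...)."""

    var: str


# A few of the group names from the file; the full list would come
# from inspecting one input file.
group_names = ["Solar_Zenith", "Cloud_Top_Pressure", "Cloud_Mask_Fraction"]

# One entry per group; in the real module each value would instead be
# beam.Create(pattern.items()) | ModisCospRecipe(var=name).
recipes = {name: RecipeSpec(var=name) for name in group_names}
```

Keying the dictionary by group name keeps each recipe addressable individually, which may help when only some groups need to be rerun.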

cisaacstern commented 11 months ago

xref https://github.com/pangeo-forge/pangeo-forge-recipes/issues/498