Open RobertPincus opened 2 years ago
Thanks for this proposal, @RobertPincus.
If I understand correctly, there are at least two related (but distinct) goals outlined here:
Generally, Pangeo Forge is focused on being very good at goal 1: producing optimized mirrors of archival datasets (in complete form). Once this is accomplished, goal 2 becomes much easier, as the data will be staged in a manner conducive to scalable parallel computation.
In terms of goal 1 (producing the cloud-optimized mirror), I note that programmatic distribution of this data is available via OPeNDAP. This suggests to me that producing a Zarr dataset may be our best option. Kerchunk would be a more efficient option if the data were already staged as netCDFs. Given that is not the case, producing a kerchunk index would entail first storing the dataset as netCDFs on the cloud and then producing the kerchunk indexes, a two-step process that is less efficient than creating a Zarr store directly from the OPeNDAP endpoint.
If creating a Zarr copy of this dataset is an acceptable starting place, we can use XarrayZarrRecipe to accomplish this. This recipe class supports OPeNDAP inputs.
Is working on this recipe (a few dozen lines of Python code) something you or someone in your group is interested in? If so, I can point you to the relevant documentation for getting started. If not, we can open this up to others (myself included, perhaps) to collaborate on this development, though note that this latter option may take a bit longer to get spun up.
Looking forward to bringing this vision to life!
@cisaacstern, thanks very much for this feedback.
As a point of clarification, the data already contains the means, joint histograms, etc. that we want - they are just accessed via netCDF groups.
One wrinkle in the ointment is that the file names contain the date of production. Since we don't know this date a priori it amounts to a quasi-random string. Do you know if there's a way in OPeNDAP to specify opening files that match a certain pattern including a wildcard?
I'm open to outputting Zarr; if people want to recycle the recipe to make local netCDF mirrors that'll be easy enough. I don't yet understand if, say, one Zarr object is roughly equivalent to a netCDF file, or if a single object could include many variables.
For a first try you can certainly point me to documentation and I can see how far I can get.
Thanks a lot.
One wrinkle in the ointment is that the file names contain the date of production. Since we don't know this date a priori it amounts to a quasi-random string.
This is a really annoying feature of many datasets. Do we know if the Hyrax server exposes a TDS catalog or any other catalog? If so, we could crawl it to populate the FilePattern.
I'll see if I can find out about a TDS catalog. JSON files are provided, at least (top level).
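Since the production timestamp can't be predicted, one possible workaround is to do the wildcard matching on the client side, against a directory listing, rather than inside the OPeNDAP request. Here is a minimal sketch of that idea. It assumes the listing is a JSON array of objects that each carry a `name` field; `granule_glob`, `pick_granule`, and the sample listing are hypothetical names made up for illustration, not part of any LAADS or Pangeo Forge API.

```python
import fnmatch
import json
from datetime import date

PRODUCT = "MCD06COSP_M3_MODIS"
COLLECTION = "061"


def granule_glob(day: date) -> str:
    """Build a glob with a wildcard where the production timestamp would be."""
    # %Y%j is the year followed by the zero-padded day of year, e.g. 2002182.
    return f"{PRODUCT}.A{day:%Y%j}.{COLLECTION}.*.nc"


def pick_granule(listing_json: str, day: date) -> str:
    """Return the single file name in a JSON listing that matches the date."""
    names = [entry["name"] for entry in json.loads(listing_json)]
    matches = fnmatch.filter(names, granule_glob(day))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match for {day}, found {len(matches)}")
    return matches[0]


# Hypothetical listing, shaped like the assumption described above.
sample_listing = json.dumps(
    [
        {"name": "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"},
        {"name": "README.txt"},
    ]
)
print(pick_granule(sample_listing, date(2002, 7, 1)))
```

Raising when there is not exactly one match is a deliberate choice here: a reprocessed month with two production dates would then surface as an error instead of silently picking one.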
@RobertPincus thanks for the clarification. Here is the documentation on recipe contribution. (This was published just this morning, so if anything doesn't make sense, that's my fault! Please let me know if so and I will amend.)
Re: your question about what a Zarr store can represent, a single Zarr store can include as many variables as we want, so long as they exist on the same time dimension.
As you'll see in the linked docs, you'll want to define a Recipe Object (in this case, an XarrayZarrRecipe), which requires a FilePattern as input. The FilePattern itself requires a url format function as input, which is a Python function that can create a valid url path to the source data based on, e.g., a date input.
I've worked out a start for this format function based on the (very helpful!) JSON catalog link you provided:
import pandas as pd
import requests

BASE_URL = "http://ladsweb.modaps.eosdis.nasa.gov"
DATASET_ID = "61/MCD06COSP_M3_MODIS"

dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")  # "MS" for "month start"


def make_url(date):
    """Make an OPeNDAP url for NASA MODIS-COSP data based on an input date.

    :param date: A member of the ``pandas.core.indexes.datetimes.DatetimeIndex``
        created with ``dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")``.
    """
    # Zero-pad the day of year to three digits (e.g. "001" for January 1),
    # which is how the archive's day directories appear to be named.
    day_of_year = f"{date.timetuple().tm_yday:03d}"
    # Look up the file name in the JSON listing for this day's directory.
    response = requests.get(
        f"{BASE_URL}/archive/allData/{DATASET_ID}/{date.year}/{day_of_year}.json"
    )
    response.raise_for_status()  # fail loudly if the listing request did not succeed
    filename = response.json()[0]["name"]  # assumes the first entry is the data file
    return f"{BASE_URL}/opendap/hyrax/allData/{DATASET_ID}/{date.year}/{day_of_year}/{filename}"
This function faithfully reproduces the example url you provided in your first comment on this thread:
url = make_url(dates[0])
print(url)
http://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc
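As an aside, the file name at the end of this URL looks like it follows a `<product>.A<year><day of year>.<collection>.<production timestamp>.nc` layout. That is inferred from this single example, not from any LAADS documentation, so treat it as an assumption. Under that assumption, a small parser (`GranuleName` and `parse_granule_name` are hypothetical names) could recover both the data date and the production date, which might be handy for sanity-checking whatever a format function returns:

```python
import re
from datetime import datetime
from typing import NamedTuple


class GranuleName(NamedTuple):
    product: str
    data_start: datetime
    collection: str
    produced: datetime


# Layout inferred from a single example file name; treat it as an assumption.
_NAME_RE = re.compile(
    r"^(?P<product>\w+)\.A(?P<acq>\d{7})\.(?P<collection>\d{3})\.(?P<prod>\d{13})\.nc$"
)


def parse_granule_name(name: str) -> GranuleName:
    """Split a file name into product, data date, collection, and production time."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    return GranuleName(
        product=match["product"],
        # %j parses a day-of-year number, so "2002182" is day 182 of 2002.
        data_start=datetime.strptime(match["acq"], "%Y%j"),
        collection=match["collection"],
        produced=datetime.strptime(match["prod"], "%Y%j%H%M%S"),
    )


example_url = (
    "http://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/"
    "MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"
)
info = parse_granule_name(example_url.rsplit("/", 1)[-1])
print(info.data_start.date(), info.produced)
```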
However I get an error when trying to open this URL with xarray:
import xarray as xr
ds = xr.open_dataset(url)
syntax error, unexpected WORD_WORD, expecting ';' or ','
context: Attributes { latitude { Float64 _FillValue -999.0000000000000; String units "degrees_north"; } longitude { Float64 _FillValue -999.0000000000000; String units "degrees_east"; } NC_GLOBAL { String YAML_config "grid_settings: gridsize: 1 projection: conformal lat_in: Latitude lon_in: Longitude lat_out: Latitude lon_out: Longitude fill_value: -999variable_settings: - name_in: Solar_Zenith name_out: Solar_Zenith attributes: - name: long_name value: Solar Zenith Angle (Cell to Sun) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Solar_Azimuth name_out: Solar_Azimuth attributes: - name: long_name value: Solar Azimuth Angle (Cell to Sun) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: -180.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Sensor_Zenith name_out: Sensor_Zenith attributes: - name: long_name value: Sensor Zenith Angle (Cell to Sensor) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Sensor_Azimuth name_out: Sensor_Azimuth attributes: - name: long_name value: Sensor Azimuth Angle (Cell to Sensor) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: -180.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Cloud_Top_Pressure name_out: Cloud_Top_Pressure attributes: - name: long_name value: Cloud Top Pressure for Daytime Scenes - name: units value: mb - name: _FillValue value: -999.0 - name: valid_min value: 1.0 - name: valid_max value: 
1100.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction attributes: - name: long_name value: Cloud Fraction from Cloud Mask for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction_Low attributes: - name: long_name value: Cloud Fraction from Cloud Mask (Low, CTP GE 680 hPa) for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - Mask_Low - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction_Mid attributes: - name: long_name value: Cloud Fraction from Cloud Mask (Mid, 680 hPa GT CTP GE 440 hPa) for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - Mask_Middle - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction_High attributes: - name: long_name value: Cloud Fraction from Cloud Mask (High, CTP LT 440 hPa) for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - Mask_High - name_in: Cloud_Optical_Thickness name_out: Cloud_Optical_Thickness_Liquid attributes: - name: long_name value: Cloud Optical Thickness for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 2D_histograms: - name_out: 
JHisto_vs_Cloud_Particle_Size_Liquid primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Effective_Radius edges: [4.0, 8.0, 10.0, 13.0, 15.0, 20.0, 30.0] masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Optical_Thickness name_out: Cloud_Optical_Thickness_Ice attributes: - name: long_name value: Cloud Optical Thickness for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 2D_histograms: - name_out: JHisto_vs_Cloud_Particle_Size_Ice primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Effective_Radius edges: [5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0] masks: - Mask_Ice_Phase_Clouds - name_in: Cloud_Optical_Thickness name_out: Cloud_Optical_Thickness_Total attributes: - name: long_name value: Cloud Optical Thickness for Combined (LiquidWater+Ice+Undetermined) Phase Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 2D_histograms: - name_out: JHisto_vs_Cloud_Top_Pressure primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Top_Pressure edges: [0.0, 180.0, 310.0, 440.0, 560.0, 680.0, 800.0, 10000.0] masks: - Mask_Valid_Range_CER - Mask_Combined_Phase_Clouds - name_in: Cloud_Optical_Thickness_PCL name_out: Cloud_Optical_Thickness_PCL_Total only_histograms: attributes: - name: long_name value: Cloud Optical Thickness for Combined (LiquidWater+Ice+Undetermined) Phase Clouds (3.7 micron Retrieval for Partly Cloudy (PCL) Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - 
name: add_offset value: 0.0 2D_histograms: - name_out: JHisto_vs_Cloud_Top_Pressure primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Top_Pressure edges: [0.0, 180.0, 310.0, 440.0, 560.0, 680.0, 800.0, 10000.0] masks: - Mask_Valid_Range_CERPCL - Mask_Combined_Phase_Clouds - name_in: Cloud_Optical_Thickness_Log name_out: Cloud_Optical_Thickness_Log10_Liquid attributes: - name: long_name value: Cloud Optical Thickness Log10 for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: -2.0 - name: valid_max value: 2.176 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Optical_Thickness_Log name_out: Cloud_Optical_Thickness_Log10_Ice attributes: - name: long_name value: Cloud Optical Thickness Log10 for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: -2.0 - name: valid_max value: 2.176 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Ice_Phase_Clouds - name_in: Cloud_Optical_Thickness_Log name_out: Cloud_Optical_Thickness_Log10_Total attributes: - name: long_name value: Cloud Optical Thickness Log10 for Combined (LiquidWater+Ice+Undetermined) Phase Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: -2.0 - name: valid_max value: 2.176 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Combined_Phase_Clouds - name_in: Cloud_Effective_Radius name_out: Cloud_Particle_Size_Liquid attributes: - name: long_name value: Cloud Effective Radius for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: microns - name: _FillValue value: -999.0 - name: valid_min value: 4.0 - name: valid_max value: 
30.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Effective_Radius name_out: Cloud_Particle_Size_Ice attributes: - name: long_name value: Cloud Effective Radius for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: microns - name: _FillValue value: -999.0 - name: valid_min value: 5.0 - name: valid_max value: 60.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Ice_Phase_Clouds - name_in: Cloud_Water_Path name_out: Cloud_Water_Path_Liquid attributes: - name: long_name value: Cloud Water Path for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: g/m^2 - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 3000.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Water_Path name_out: Cloud_Water_Path_Ice attributes: - name: long_name value: Cloud Water Path for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: g/m^2 - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 6000.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Ice_Phase_Clouds - name_in: COPR_Liquid name_out: Cloud_Retrieval_Fraction_Liquid attributes: - name: long_name value: Cloud Optical Properties Retrieval Fraction (Liquid Water Clouds) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 - name_in: COPR_Ice name_out: Cloud_Retrieval_Fraction_Ice attributes: - name: long_name value: Cloud Optical Properties Retrieval Fraction (Ice Clouds) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: 
add_offset value: 0.0 - name_in: COPR_Combined name_out: Cloud_Retrieval_Fraction_Total attributes: - name: long_name value: Cloud Optical Properties Retrieval Fraction (Combined (LiquidWater+Ice+Undetermined) Phase Clouds) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0"; String Yori_version "1.3.16"; String daily_defn_of_day_adjustment "False"; String input_files "MCD06COSP_D3_MODIS.A2002185.061.2020179074148.nc,MCD06COSP_D3_MODIS.A2002186.061.2020179074020.nc,MCD06COSP_D3_MODIS.A2002187.061.2020179080105.nc,MCD06COSP_D3_MODIS.A2002188.061.2020179073800.nc,MCD06COSP_D3_MODIS.A2002189.061.2020179075527.nc,MCD06COSP_D3_MODIS.A2002190.061.2020181140712.nc,MCD06COSP_D3_MODIS.A2002191.061.2020179073354.nc,MCD06COSP_D3_MODIS.A2002192.061.2020181140657.nc,MCD06COSP_D3_MODIS.A2002193.061.2020181140639.nc,MCD06COSP_D3_MODIS.A2002194.061.2020181140633.nc,MCD06COSP_D3_MODIS.A2002195.061.2020179073600.nc,MCD06COSP_D3_MODIS.A2002196.061.2020179071759.nc,MCD06COSP_D3_MODIS.A2002197.061.2020179073136.nc,MCD06COSP_D3_MODIS.A2002198.061.2020181140638.nc,MCD06COSP_D3_MODIS.A2002199.061.2020179073626.nc,MCD06COSP_D3_MODIS.A2002200.061.2020181140632.nc,MCD06COSP_D3_MODIS.A2002201.061.2020181140623.nc,MCD06COSP_D3_MODIS.A2002202.061.2020179073345.nc,MCD06COSP_D3_MODIS.A2002203.061.2020179072223.nc,MCD06COSP_D3_MODIS.A2002204.061.2020179072036.nc,MCD06COSP_D3_MODIS.A2002205.061.2020179074935.nc,MCD06COSP_D3_MODIS.A2002206.061.2020179072758.nc,MCD06COSP_D3_MODIS.A2002207.061.2020179074751.nc,MCD06COSP_D3_MODIS.A2002208.061.2020179074110.nc,MCD06COSP_D3_MODIS.A2002209.061.2020179073958.nc,MCD06COSP_D3_MODIS.A2002210.061.2020181140608.nc,MCD06COSP_D3_MODIS.A2002211.061.2020181140441.nc,MCD06COSP_D3_MODIS.A2002212.061.2020181140457.nc"; String history ""; String source "idl 8.4, mcd06cosp_preyori 20191204-1, yori 1.3.16"; String date_created 
"2020-06-29T14:58:03Z"; String product_name "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"; String LocalGranuleID "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"; String Conventions "CF-1.6, ACDD-1.3"; String ShortName "MCD06COSP_M3_MODIS"; String product_version "6.1.2"; String AlgorithmType "OPS"; String identifier_product_doi "10.5067/MODIS/MCD06COSP_M3_MODIS.061"; String identifier_product_doi_authority "http://dx.doi.org/"; String ancillary_files ""; String DataCenterId "UWI-MAD/SSEC/ASIPS"; String project "NASA VIIRS Atmosphere SIPS"; String creator_name "NASA VIIRS Atmosphere SIPS"; String creator_url "https://sips.ssec.wisc.edu/"; String creator_email "sips.support@ssec.wisc.edu"; String creator_institution "Space Science & Engineering Center, University of Wisconsin - Madison"; String publisher_name "LAADS"; String publisher_url "https://ladsweb.modaps.eosdis.nasa.gov/"; String publisher_email "modis-ops@lists.nasa.gov"; String publisher_institution "NASA Level-1 and Atmosphere Archive & Distribution System"; String time_coverage_start "2002-07-01T00:00:00.000000"; String time_coverage_end "2002-07-31T23:59:59.000000"; String xmlmetadata "<?xml version="1.0"^?><!DOCTYPE GranuleMetaDataFile SYSTEM "http://ecsinfo.gsfc.nasa.gov/ECSInfo/ecsmetadata/dtds/DPL/ECS/ScienceGranuleMetadata.dtd"><GranuleMetaDataFile> <DTDVersion>1.0</DTDVersion> <DataCenterId>UWI-MAD/SSEC/ASIPS</DataCenterId> <GranuleURMetaData> <CollectionMetaData> <ShortName>MCD06COSP_M3_MODIS</ShortName> <VersionID>61</VersionID> </CollectionMetaData> <ECSDataGranule> <ReprocessingPlanned>no further reprocessing anticipated</ReprocessingPlanned> <LocalGranuleID>MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc</LocalGranuleID> <ProductionDateTime>2020-06-29 14:58:49.491586</ProductionDateTime> <LocalVersionID>61</LocalVersionID> </ECSDataGranule> <PGEVersionClass> <PGEVersion>6.1.2</PGEVersion> </PGEVersionClass> <RangeDateTime> <RangeEndingTime>23:59:59.000000</RangeEndingTime> 
<RangeEndingDate>2002-07-31</RangeEndingDate> <RangeBeginningTime>00:00:00.000000</RangeBeginningTime> <RangeBeginningDate>2002-07-01</RangeBeginningDate> </RangeDateTime> <SpatialDomainContainer> <HorizontalSpatialDomainContainer> <BoundingRectangle> <WestBoundingCoordinate>-180</WestBoundingCoordinate> <NorthBoundingCoordinate>90</NorthBoundingCoordinate> <EastBoundingCoordinate>180</EastBoundingCoordinate> <SouthBoundingCoordinate>-90</SouthBoundingCoordinate> </BoundingRectangle> </HorizontalSpatialDomainContainer> </SpatialDomainContainer> <Platform> <PlatformShortName>Suomi NPP</PlatformShortName> <Instrument> <InstrumentShortName>VIIRS</InstrumentShortName> <Sensor> <SensorShortName>VIIRS</SensorShortName> </Sensor> </Instrument> </Platform> <InputGranule> <InputPointer>MCD06COSP_D3_MODIS.A2002185.061.2020179074148.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002186.061.2020179074020.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002187.061.2020179080105.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002188.061.2020179073800.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002189.061.2020179075527.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002190.061.2020181140712.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002191.061.2020179073354.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002192.061.2020181140657.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002193.061.2020181140639.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002194.061.2020181140633.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002195.061.2020179073600.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002196.061.2020179071759.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002197.061.2020179073136.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002198.061.2020181140638.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002199.061.2020179073626.nc</InputPointer> 
<InputPointer>MCD06COSP_D3_MODIS.A2002200.061.2020181140632.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002201.061.2020181140623.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002202.061.2020179073345.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002203.061.2020179072223.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002204.061.2020179072036.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002205.061.2020179074935.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002206.061.2020179072758.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002207.061.2020179074751.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002208.061.2020179074110.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002209.061.2020179073958.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002210.061.2020181140608.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002211.061.2020181140441.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002212.061.2020181140457.nc</InputPointer> </InputGranule> <AncillaryInputGranules> </AncillaryInputGranules> </GranuleURMetaData></GranuleMetaDataFile>"; String platform "Aqua, Terra"; String instrument "MODIS"; String processing_level "L3"; String format "NetCDF4"; String title "Aqua/Terra MODIS Cloud Properties Level 3 monthly, 1x1 degree grid (MCD06COSP_M3_MODIS)"; String long_name "MODIS (Aqua/Terra) Cloud Properties Level 3 monthly, 1x1 degree grid"; String version_id "061"; Float64 geospatial_lat_max 90.00000000000000; Float64 geospatial_lat_min -90.00000000000000; Float64 geospatial_lon_min 180.0000000000000; Float64 geospatial_lon_max -180.0000000000000; Float64 NorthBoundingCoordinate 90.00000000000000; Float64 SouthBoundingCoordinate -90.00000000000000; Float64 EastBoundingCoordinate 180.0000000000000; Float64 WestBoundingCoordinate -180.0000000000000; Float64 latitude_resolution 1.000000000000000; Float64 longitude_resolution 1.000000000000000; String license 
"http://science.nasa.gov/earth-science/earth-science-data/data-information-policy/"; String stdname_vocabulary "NetCDF Climate and Forecast (CF) Metadata Convention"; String keywords_vocabulary "NASA Global Change Master Directory (GCMD) Science Keywords"; String keywords "EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD MICROPHYSICS > CLOUD OPTICAL DEPTH/THICKNESS, EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD TOP HEIGHT, EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD FRACTION"; String naming_authority "gov.nasa.gsfc.sci.atmos"; }}
Illegal attribute
context: Attributes { latitude { Float64 _FillValue -999.0000000000000; String units "degrees_north"; } longitude { Float64 _FillValue -999.0000000000000; String units "degrees_east"; } NC_GLOBAL { String YAML_config "grid_settings: gridsize: 1 projection: conformal lat_in: Latitude lon_in: Longitude lat_out: Latitude lon_out: Longitude fill_value: -999variable_settings: - name_in: Solar_Zenith name_out: Solar_Zenith attributes: - name: long_name value: Solar Zenith Angle (Cell to Sun) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Solar_Azimuth name_out: Solar_Azimuth attributes: - name: long_name value: Solar Azimuth Angle (Cell to Sun) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: -180.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Sensor_Zenith name_out: Sensor_Zenith attributes: - name: long_name value: Sensor Zenith Angle (Cell to Sensor) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Sensor_Azimuth name_out: Sensor_Azimuth attributes: - name: long_name value: Sensor Azimuth Angle (Cell to Sensor) for Daytime Scenes - name: units value: degrees - name: _FillValue value: -999.0 - name: valid_min value: -180.0 - name: valid_max value: 180.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Cloud_Top_Pressure name_out: Cloud_Top_Pressure attributes: - name: long_name value: Cloud Top Pressure for Daytime Scenes - name: units value: mb - name: _FillValue value: -999.0 - name: valid_min value: 1.0 - name: valid_max value: 
1100.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction attributes: - name: long_name value: Cloud Fraction from Cloud Mask for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction_Low attributes: - name: long_name value: Cloud Fraction from Cloud Mask (Low, CTP GE 680 hPa) for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - Mask_Low - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction_Mid attributes: - name: long_name value: Cloud Fraction from Cloud Mask (Mid, 680 hPa GT CTP GE 440 hPa) for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - Mask_Middle - name_in: Cloud_Fraction name_out: Cloud_Mask_Fraction_High attributes: - name: long_name value: Cloud Fraction from Cloud Mask (High, CTP LT 440 hPa) for Daytime Scenes - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Day - Mask_High - name_in: Cloud_Optical_Thickness name_out: Cloud_Optical_Thickness_Liquid attributes: - name: long_name value: Cloud Optical Thickness for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 2D_histograms: - name_out: 
JHisto_vs_Cloud_Particle_Size_Liquid primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Effective_Radius edges: [4.0, 8.0, 10.0, 13.0, 15.0, 20.0, 30.0] masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Optical_Thickness name_out: Cloud_Optical_Thickness_Ice attributes: - name: long_name value: Cloud Optical Thickness for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 2D_histograms: - name_out: JHisto_vs_Cloud_Particle_Size_Ice primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Effective_Radius edges: [5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0] masks: - Mask_Ice_Phase_Clouds - name_in: Cloud_Optical_Thickness name_out: Cloud_Optical_Thickness_Total attributes: - name: long_name value: Cloud Optical Thickness for Combined (LiquidWater+Ice+Undetermined) Phase Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 2D_histograms: - name_out: JHisto_vs_Cloud_Top_Pressure primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Top_Pressure edges: [0.0, 180.0, 310.0, 440.0, 560.0, 680.0, 800.0, 10000.0] masks: - Mask_Valid_Range_CER - Mask_Combined_Phase_Clouds - name_in: Cloud_Optical_Thickness_PCL name_out: Cloud_Optical_Thickness_PCL_Total only_histograms: attributes: - name: long_name value: Cloud Optical Thickness for Combined (LiquidWater+Ice+Undetermined) Phase Clouds (3.7 micron Retrieval for Partly Cloudy (PCL) Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 150.0 - name: scale_factor value: 1.0 - 
name: add_offset value: 0.0 2D_histograms: - name_out: JHisto_vs_Cloud_Top_Pressure primary_var: edges: [0.0, 0.3, 1.3, 3.6, 9.4, 23.0, 60.0, 150.0] joint_var: name_in: Cloud_Top_Pressure edges: [0.0, 180.0, 310.0, 440.0, 560.0, 680.0, 800.0, 10000.0] masks: - Mask_Valid_Range_CERPCL - Mask_Combined_Phase_Clouds - name_in: Cloud_Optical_Thickness_Log name_out: Cloud_Optical_Thickness_Log10_Liquid attributes: - name: long_name value: Cloud Optical Thickness Log10 for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: -2.0 - name: valid_max value: 2.176 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Optical_Thickness_Log name_out: Cloud_Optical_Thickness_Log10_Ice attributes: - name: long_name value: Cloud Optical Thickness Log10 for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: -2.0 - name: valid_max value: 2.176 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Ice_Phase_Clouds - name_in: Cloud_Optical_Thickness_Log name_out: Cloud_Optical_Thickness_Log10_Total attributes: - name: long_name value: Cloud Optical Thickness Log10 for Combined (LiquidWater+Ice+Undetermined) Phase Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: -2.0 - name: valid_max value: 2.176 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Combined_Phase_Clouds - name_in: Cloud_Effective_Radius name_out: Cloud_Particle_Size_Liquid attributes: - name: long_name value: Cloud Effective Radius for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: microns - name: _FillValue value: -999.0 - name: valid_min value: 4.0 - name: valid_max value: 
30.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Effective_Radius name_out: Cloud_Particle_Size_Ice attributes: - name: long_name value: Cloud Effective Radius for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: microns - name: _FillValue value: -999.0 - name: valid_min value: 5.0 - name: valid_max value: 60.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Ice_Phase_Clouds - name_in: Cloud_Water_Path name_out: Cloud_Water_Path_Liquid attributes: - name: long_name value: Cloud Water Path for Liquid Water Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: g/m^2 - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 3000.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Valid_Range_CER - Mask_Liquid_Water_Phase_Clouds - name_in: Cloud_Water_Path name_out: Cloud_Water_Path_Ice attributes: - name: long_name value: Cloud Water Path for Ice Clouds (3.7 micron Retrieval for Cloudy Scenes) - name: units value: g/m^2 - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 6000.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 masks: - Mask_Ice_Phase_Clouds - name_in: COPR_Liquid name_out: Cloud_Retrieval_Fraction_Liquid attributes: - name: long_name value: Cloud Optical Properties Retrieval Fraction (Liquid Water Clouds) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0 - name_in: COPR_Ice name_out: Cloud_Retrieval_Fraction_Ice attributes: - name: long_name value: Cloud Optical Properties Retrieval Fraction (Ice Clouds) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: 
add_offset value: 0.0 - name_in: COPR_Combined name_out: Cloud_Retrieval_Fraction_Total attributes: - name: long_name value: Cloud Optical Properties Retrieval Fraction (Combined (LiquidWater+Ice+Undetermined) Phase Clouds) - name: units value: none - name: _FillValue value: -999.0 - name: valid_min value: 0.0 - name: valid_max value: 1.0 - name: scale_factor value: 1.0 - name: add_offset value: 0.0"; String Yori_version "1.3.16"; String daily_defn_of_day_adjustment "False"; String input_files "MCD06COSP_D3_MODIS.A2002185.061.2020179074148.nc,MCD06COSP_D3_MODIS.A2002186.061.2020179074020.nc,MCD06COSP_D3_MODIS.A2002187.061.2020179080105.nc,MCD06COSP_D3_MODIS.A2002188.061.2020179073800.nc,MCD06COSP_D3_MODIS.A2002189.061.2020179075527.nc,MCD06COSP_D3_MODIS.A2002190.061.2020181140712.nc,MCD06COSP_D3_MODIS.A2002191.061.2020179073354.nc,MCD06COSP_D3_MODIS.A2002192.061.2020181140657.nc,MCD06COSP_D3_MODIS.A2002193.061.2020181140639.nc,MCD06COSP_D3_MODIS.A2002194.061.2020181140633.nc,MCD06COSP_D3_MODIS.A2002195.061.2020179073600.nc,MCD06COSP_D3_MODIS.A2002196.061.2020179071759.nc,MCD06COSP_D3_MODIS.A2002197.061.2020179073136.nc,MCD06COSP_D3_MODIS.A2002198.061.2020181140638.nc,MCD06COSP_D3_MODIS.A2002199.061.2020179073626.nc,MCD06COSP_D3_MODIS.A2002200.061.2020181140632.nc,MCD06COSP_D3_MODIS.A2002201.061.2020181140623.nc,MCD06COSP_D3_MODIS.A2002202.061.2020179073345.nc,MCD06COSP_D3_MODIS.A2002203.061.2020179072223.nc,MCD06COSP_D3_MODIS.A2002204.061.2020179072036.nc,MCD06COSP_D3_MODIS.A2002205.061.2020179074935.nc,MCD06COSP_D3_MODIS.A2002206.061.2020179072758.nc,MCD06COSP_D3_MODIS.A2002207.061.2020179074751.nc,MCD06COSP_D3_MODIS.A2002208.061.2020179074110.nc,MCD06COSP_D3_MODIS.A2002209.061.2020179073958.nc,MCD06COSP_D3_MODIS.A2002210.061.2020181140608.nc,MCD06COSP_D3_MODIS.A2002211.061.2020181140441.nc,MCD06COSP_D3_MODIS.A2002212.061.2020181140457.nc"; String history ""; String source "idl 8.4, mcd06cosp_preyori 20191204-1, yori 1.3.16"; String date_created 
"2020-06-29T14:58:03Z"; String product_name "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"; String LocalGranuleID "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"; String Conventions "CF-1.6, ACDD-1.3"; String ShortName "MCD06COSP_M3_MODIS"; String product_version "6.1.2"; String AlgorithmType "OPS"; String identifier_product_doi "10.5067/MODIS/MCD06COSP_M3_MODIS.061"; String identifier_product_doi_authority "http://dx.doi.org/"; String ancillary_files ""; String DataCenterId "UWI-MAD/SSEC/ASIPS"; String project "NASA VIIRS Atmosphere SIPS"; String creator_name "NASA VIIRS Atmosphere SIPS"; String creator_url "https://sips.ssec.wisc.edu/"; String creator_email "sips.support@ssec.wisc.edu"; String creator_institution "Space Science & Engineering Center, University of Wisconsin - Madison"; String publisher_name "LAADS"; String publisher_url "https://ladsweb.modaps.eosdis.nasa.gov/"; String publisher_email "modis-ops@lists.nasa.gov"; String publisher_institution "NASA Level-1 and Atmosphere Archive & Distribution System"; String time_coverage_start "2002-07-01T00:00:00.000000"; String time_coverage_end "2002-07-31T23:59:59.000000"; String xmlmetadata "<?xml version="1.0"^?><!DOCTYPE GranuleMetaDataFile SYSTEM "http://ecsinfo.gsfc.nasa.gov/ECSInfo/ecsmetadata/dtds/DPL/ECS/ScienceGranuleMetadata.dtd"><GranuleMetaDataFile> <DTDVersion>1.0</DTDVersion> <DataCenterId>UWI-MAD/SSEC/ASIPS</DataCenterId> <GranuleURMetaData> <CollectionMetaData> <ShortName>MCD06COSP_M3_MODIS</ShortName> <VersionID>61</VersionID> </CollectionMetaData> <ECSDataGranule> <ReprocessingPlanned>no further reprocessing anticipated</ReprocessingPlanned> <LocalGranuleID>MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc</LocalGranuleID> <ProductionDateTime>2020-06-29 14:58:49.491586</ProductionDateTime> <LocalVersionID>61</LocalVersionID> </ECSDataGranule> <PGEVersionClass> <PGEVersion>6.1.2</PGEVersion> </PGEVersionClass> <RangeDateTime> <RangeEndingTime>23:59:59.000000</RangeEndingTime> 
<RangeEndingDate>2002-07-31</RangeEndingDate> <RangeBeginningTime>00:00:00.000000</RangeBeginningTime> <RangeBeginningDate>2002-07-01</RangeBeginningDate> </RangeDateTime> <SpatialDomainContainer> <HorizontalSpatialDomainContainer> <BoundingRectangle> <WestBoundingCoordinate>-180</WestBoundingCoordinate> <NorthBoundingCoordinate>90</NorthBoundingCoordinate> <EastBoundingCoordinate>180</EastBoundingCoordinate> <SouthBoundingCoordinate>-90</SouthBoundingCoordinate> </BoundingRectangle> </HorizontalSpatialDomainContainer> </SpatialDomainContainer> <Platform> <PlatformShortName>Suomi NPP</PlatformShortName> <Instrument> <InstrumentShortName>VIIRS</InstrumentShortName> <Sensor> <SensorShortName>VIIRS</SensorShortName> </Sensor> </Instrument> </Platform> <InputGranule> <InputPointer>MCD06COSP_D3_MODIS.A2002185.061.2020179074148.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002186.061.2020179074020.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002187.061.2020179080105.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002188.061.2020179073800.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002189.061.2020179075527.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002190.061.2020181140712.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002191.061.2020179073354.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002192.061.2020181140657.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002193.061.2020181140639.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002194.061.2020181140633.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002195.061.2020179073600.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002196.061.2020179071759.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002197.061.2020179073136.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002198.061.2020181140638.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002199.061.2020179073626.nc</InputPointer> 
<InputPointer>MCD06COSP_D3_MODIS.A2002200.061.2020181140632.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002201.061.2020181140623.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002202.061.2020179073345.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002203.061.2020179072223.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002204.061.2020179072036.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002205.061.2020179074935.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002206.061.2020179072758.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002207.061.2020179074751.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002208.061.2020179074110.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002209.061.2020179073958.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002210.061.2020181140608.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002211.061.2020181140441.nc</InputPointer> <InputPointer>MCD06COSP_D3_MODIS.A2002212.061.2020181140457.nc</InputPointer> </InputGranule> <AncillaryInputGranules> </AncillaryInputGranules> </GranuleURMetaData></GranuleMetaDataFile>"; String platform "Aqua, Terra"; String instrument "MODIS"; String processing_level "L3"; String format "NetCDF4"; String title "Aqua/Terra MODIS Cloud Properties Level 3 monthly, 1x1 degree grid (MCD06COSP_M3_MODIS)"; String long_name "MODIS (Aqua/Terra) Cloud Properties Level 3 monthly, 1x1 degree grid"; String version_id "061"; Float64 geospatial_lat_max 90.00000000000000; Float64 geospatial_lat_min -90.00000000000000; Float64 geospatial_lon_min 180.0000000000000; Float64 geospatial_lon_max -180.0000000000000; Float64 NorthBoundingCoordinate 90.00000000000000; Float64 SouthBoundingCoordinate -90.00000000000000; Float64 EastBoundingCoordinate 180.0000000000000; Float64 WestBoundingCoordinate -180.0000000000000; Float64 latitude_resolution 1.000000000000000; Float64 longitude_resolution 1.000000000000000; String license 
"http://science.nasa.gov/earth-science/earth-science-data/data-information-policy/"; String stdname_vocabulary "NetCDF Climate and Forecast (CF) Metadata Convention"; String keywords_vocabulary "NASA Global Change Master Directory (GCMD) Science Keywords"; String keywords "EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD MICROPHYSICS > CLOUD OPTICAL DEPTH/THICKNESS, EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD TOP HEIGHT, EARTH SCIENCE > ATMOSPHERE > CLOUDS > CLOUD PROPERTIES > CLOUD FRACTION"; String naming_authority "gov.nasa.gsfc.sci.atmos"; }}
And the resulting dataset has no variables:
print(ds)
<xarray.Dataset>
Dimensions: (latitude: 180, longitude: 360)
Coordinates:
* latitude (latitude) float64 -89.5 -88.5 -87.5 -86.5 ... 87.5 88.5 89.5
* longitude (longitude) float64 -179.5 -178.5 -177.5 ... 177.5 178.5 179.5
Data variables:
*empty*
Perhaps I am missing some essential keyword argument(s) for xr.open_dataset
?
@cisaacstern Thanks for this. I'll catch up later this week, but meanwhile, perhaps you can try passing the engine='netcdf4'
keyword to xr.open_dataset
?
This error message is coming from the netCDF4 C library.
import netCDF4
url = 'http://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc'
ds = netCDF4.Dataset(url, "r")
This means that the Hyrax server is emitting data that cannot be properly parsed by the official Unidata netCDF4 library. This is a problem with the server and needs to be brought to the attention of the NASA system administrator.
Is there a direct link to netCDF file download (rather than OPeNDAP endpoint)?
One can access the files through a GUI by appending .dmr.html
. That provides a button where one can download the data in several formats, but I haven't been able to see the underlying URLs yet.
That website - https://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc.dmr.html - does not show any data variables either, just lon and lat.
To get variables one has to open a group, e.g. Cloud_Optical_Thickness_Liquid
I cannot discover any groups from that opendap url.
import netCDF4
url = 'http://ladsweb.modaps.eosdis.nasa.gov/opendap/hyrax/allData/61/MCD06COSP_M3_MODIS/2002/182/MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc'
ds = netCDF4.Dataset(url, "r")
print(ds.groups) # --> {}
Where are you inputting the group information when you access the data?
My attention is a little split these days, sorry. Like you both, I have been unable to open the files remotely via OPeNDAP. I will see what I can learn from NASA, but they have not been very responsive. I will also see if I can sleuth out direct download links, which I have not been able to find anywhere obvious.
Once the file is downloaded, I've been able to see the data with e.g.
import xarray as xr
file = 'MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc'
f = xr.open_dataset(file, engine='netcdf4', group='Cloud_Mask_Fraction')
Now the server is returning a 500 server error
The html page is back up. Inspecting the source there reveals that appending .dap.nc4
is the path to direct download:
<input type="button" value="Get as NetCDF 4" onclick="getAs_button_action('NetCDF-4 Data', '.dap.nc4')">
Amending the earlier make_url
function accordingly, I can now download the source files. What appear to be the group names ('Cloud_Mask_Fraction'
, etc.) are discoverable in the dataset's ds.YAML_config
attribute, but none of these names are openable as groups using the syntax provided in https://github.com/pangeo-forge/staged-recipes/issues/125#issuecomment-1075838942:
import fsspec
import pandas as pd
import requests
import xarray as xr
import yaml
BASE_URL = "http://ladsweb.modaps.eosdis.nasa.gov"
DATASET_ID = "61/MCD06COSP_M3_MODIS"
dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS") # "MS" for "month start"
def make_url(date):
    """Make a NetCDF4 download url for NASA MODIS-COSP data based on an input date.

    :param date: A member of the ``pandas.core.indexes.datetimes.DatetimeIndex``
        created with ``dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")``.
    """
    day_of_year = date.timetuple().tm_yday
    response = requests.get(
        f"{BASE_URL}/archive/allData/{DATASET_ID}/{date.year}/{day_of_year}.json"
    )
    filename = [r["name"] for r in response.json()].pop(0)
    return f"{BASE_URL}/opendap/hyrax/allData/{DATASET_ID}/{date.year}/{day_of_year}/{filename}.dap.nc4"
test_filename = "test.nc"
with fsspec.open(make_url(dates[0])) as src:
    with open(test_filename, mode="wb") as dst:
        dst.write(src.read())
ds = xr.open_dataset(test_filename, engine='netcdf4')
yaml_config = yaml.safe_load(ds.YAML_config)
group_name_pairs = [(v["name_in"], v["name_out"]) for v in yaml_config["variable_settings"]]
for pair in group_name_pairs:
    for group in pair:
        try:
            ds = xr.open_dataset(test_filename, engine='netcdf4', group=group)
        except OSError as e:
            print(e)
[Errno group not found: Solar_Zenith] 'Solar_Zenith'
[Errno group not found: Solar_Zenith] 'Solar_Zenith'
[Errno group not found: Solar_Azimuth] 'Solar_Azimuth'
[Errno group not found: Solar_Azimuth] 'Solar_Azimuth'
[Errno group not found: Sensor_Zenith] 'Sensor_Zenith'
[Errno group not found: Sensor_Zenith] 'Sensor_Zenith'
[Errno group not found: Sensor_Azimuth] 'Sensor_Azimuth'
[Errno group not found: Sensor_Azimuth] 'Sensor_Azimuth'
[Errno group not found: Cloud_Top_Pressure] 'Cloud_Top_Pressure'
[Errno group not found: Cloud_Top_Pressure] 'Cloud_Top_Pressure'
[Errno group not found: Cloud_Fraction] 'Cloud_Fraction'
[Errno group not found: Cloud_Mask_Fraction] 'Cloud_Mask_Fraction'
[Errno group not found: Cloud_Fraction] 'Cloud_Fraction'
[Errno group not found: Cloud_Mask_Fraction_Low] 'Cloud_Mask_Fraction_Low'
[Errno group not found: Cloud_Fraction] 'Cloud_Fraction'
[Errno group not found: Cloud_Mask_Fraction_Mid] 'Cloud_Mask_Fraction_Mid'
[Errno group not found: Cloud_Fraction] 'Cloud_Fraction'
[Errno group not found: Cloud_Mask_Fraction_High] 'Cloud_Mask_Fraction_High'
[Errno group not found: Cloud_Optical_Thickness] 'Cloud_Optical_Thickness'
[Errno group not found: Cloud_Optical_Thickness_Liquid] 'Cloud_Optical_Thickness_Liquid'
[Errno group not found: Cloud_Optical_Thickness] 'Cloud_Optical_Thickness'
[Errno group not found: Cloud_Optical_Thickness_Ice] 'Cloud_Optical_Thickness_Ice'
[Errno group not found: Cloud_Optical_Thickness] 'Cloud_Optical_Thickness'
[Errno group not found: Cloud_Optical_Thickness_Total] 'Cloud_Optical_Thickness_Total'
[Errno group not found: Cloud_Optical_Thickness_PCL] 'Cloud_Optical_Thickness_PCL'
[Errno group not found: Cloud_Optical_Thickness_PCL_Total] 'Cloud_Optical_Thickness_PCL_Total'
[Errno group not found: Cloud_Optical_Thickness_Log] 'Cloud_Optical_Thickness_Log'
[Errno group not found: Cloud_Optical_Thickness_Log10_Liquid] 'Cloud_Optical_Thickness_Log10_Liquid'
[Errno group not found: Cloud_Optical_Thickness_Log] 'Cloud_Optical_Thickness_Log'
[Errno group not found: Cloud_Optical_Thickness_Log10_Ice] 'Cloud_Optical_Thickness_Log10_Ice'
[Errno group not found: Cloud_Optical_Thickness_Log] 'Cloud_Optical_Thickness_Log'
[Errno group not found: Cloud_Optical_Thickness_Log10_Total] 'Cloud_Optical_Thickness_Log10_Total'
[Errno group not found: Cloud_Effective_Radius] 'Cloud_Effective_Radius'
[Errno group not found: Cloud_Particle_Size_Liquid] 'Cloud_Particle_Size_Liquid'
[Errno group not found: Cloud_Effective_Radius] 'Cloud_Effective_Radius'
[Errno group not found: Cloud_Particle_Size_Ice] 'Cloud_Particle_Size_Ice'
[Errno group not found: Cloud_Water_Path] 'Cloud_Water_Path'
[Errno group not found: Cloud_Water_Path_Liquid] 'Cloud_Water_Path_Liquid'
[Errno group not found: Cloud_Water_Path] 'Cloud_Water_Path'
[Errno group not found: Cloud_Water_Path_Ice] 'Cloud_Water_Path_Ice'
[Errno group not found: COPR_Liquid] 'COPR_Liquid'
[Errno group not found: Cloud_Retrieval_Fraction_Liquid] 'Cloud_Retrieval_Fraction_Liquid'
[Errno group not found: COPR_Ice] 'COPR_Ice'
[Errno group not found: Cloud_Retrieval_Fraction_Ice] 'Cloud_Retrieval_Fraction_Ice'
[Errno group not found: COPR_Combined] 'COPR_Combined'
[Errno group not found: Cloud_Retrieval_Fraction_Total] 'Cloud_Retrieval_Fraction_Total'
Here's the full YAML config:
@cisaacstern This part of the code is supposed to create a copy?
with fsspec.open(make_url(dates[0])) as src:
    with open(test_filename, mode="wb") as dst:
        dst.write(src.read())
Because the file created is much smaller than the original:
% ls -lt *.nc
-rw-r--r--@ 1 robert staff 47668 Mar 23 15:32 test.nc
-rw-r--r--@ 1 robert staff 40091481 Sep 21 2021 MCD06COSP_M3_MODIS.A2021182.061.2021250210032.nc
Yes, that's the code block which aims to download the file.
How did you get this 40 MB MCD06COSP_M3_MODIS.A2021182.061.2021250210032.nc
?
When I navigate to the GUI at
and click Get as NetCDF 4 the file my web browser downloads is 47668 bytes
➜ Downloads ls -lt *.nc4
-rw-r--r--@ 1 charlesstern staff 47668 Mar 23 12:56 MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc.nc4
which is the same size as the test.nc
retrieved by that code block.
Looks like your 40 MB MCD06COSP_M3_MODIS.A2021182.061.2021250210032.nc
was downloaded last September? Perhaps this Hyrax server is truly just not working right now, as Ryan previously hypothesized?
... hmm on closer reading your file has an updated_at
slug of 2021250210032
whereas somehow I'm pointing at 2020181145824
which is an older version... I'm going to look into that now.
The better comparison would be to
-rw-r--r--@ 1 robert staff 68513011 Mar 1 14:00 MCD06COSP_M3_MODIS.A2021182.061.2022052174444.nc
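Since several production versions of the same month can evidently exist (the last field of each file name is a production timestamp), it may help to parse the file name before deciding which file to fetch. The following is a minimal sketch, assuming the naming pattern seen in this thread (product, then A plus year and day of year, then collection, then production year, day of year and HHMMSS). The names parse_granule_name and latest_version are hypothetical helpers, not part of any library.

```python
from datetime import date, datetime
from typing import NamedTuple


class GranuleName(NamedTuple):
    product: str
    data_date: date      # first day covered by the file
    collection: str
    produced: datetime   # production timestamp (the hard-to-predict part)


def parse_granule_name(filename: str) -> GranuleName:
    """Split a name like MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc."""
    product, acquired, collection, produced, _ext = filename.split(".")
    return GranuleName(
        product=product,
        # "A2002182" -> year 2002, day of year 182
        data_date=datetime.strptime(acquired[1:], "%Y%j").date(),
        collection=collection,
        # "2020181145824" -> year, day of year, HHMMSS
        produced=datetime.strptime(produced, "%Y%j%H%M%S"),
    )


def latest_version(filenames):
    """Return the file name with the most recent production timestamp."""
    return max(filenames, key=lambda name: parse_granule_name(name).produced)


# The two versions of the July 2021 file mentioned in this thread.
names = [
    "MCD06COSP_M3_MODIS.A2021182.061.2021250210032.nc",
    "MCD06COSP_M3_MODIS.A2021182.061.2022052174444.nc",
]
print(parse_granule_name(names[0]))
print(latest_version(names))
```

If a directory listing returns more than one file for a month, selecting by production timestamp in this way would avoid depending on the order of the listing.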
Thanks for these helpful clarifications re: expected data size, Robert. I've made considerable headway with both file retrieval and a draft of the recipes themselves. Buckle up for a longish but hopefully useful post.
Exploring the LAADS DAAC website a bit turned up the HTTP file service, e.g.
https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MCD06COSP_M3_MODIS/2002/182/
which demonstrates a wget
example using the authentication option
wget ... --header "Authorization: Bearer INSERT_DOWNLOAD_TOKEN_HERE"
After generating a token according to these instructions and exporting it as the EARTHDATA_TOKEN
env variable, this authentication style can be adapted to download a complete file via fsspec
as follows
import os
import fsspec
base_url = (
"https://ladsweb.modaps.eosdis.nasa.gov/"
"archive/allData/61/MCD06COSP_M3_MODIS/2002/182"
)
filename = "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"
with fsspec.open(
    f"{base_url}/{filename}",
    client_kwargs=dict(headers=dict(Authorization=f"Bearer {os.environ['EARTHDATA_TOKEN']}")),
) as src:
    with open(filename, mode="wb") as dst:
        dst.write(src.read())
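If the EARTHDATA_TOKEN variable is unset, the download snippet fails with a bare KeyError. A small helper can fail earlier with a clearer message. This is only a sketch: earthdata_headers is a hypothetical name, and the dictionary it returns is meant to be passed as client_kwargs=dict(headers=...) to fsspec.open.

```python
import os


def earthdata_headers(env_var: str = "EARTHDATA_TOKEN") -> dict:
    """Build the Authorization header from a token stored in the environment."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to a LAADS DAAC download token."
        )
    return {"Authorization": f"Bearer {token}"}


# Demonstration with a placeholder value; a real token would come from the shell.
os.environ.setdefault("EARTHDATA_TOKEN", "INSERT_DOWNLOAD_TOKEN_HERE")
print(earthdata_headers())
```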
The resulting file has an openable group for each of the names provided in its ds.YAML_config
import xarray as xr
import yaml
ds = xr.open_dataset(filename)
yaml_config = yaml.safe_load(ds.YAML_config)
group_names = [v["name_out"] for v in yaml_config["variable_settings"]]
has_groups = []
for group in group_names:
    try:
        ds = xr.open_dataset(filename, group=group)
    except OSError as e:
        print(e)
    else:
        has_groups.append(group)
print(has_groups)
['Solar_Zenith', 'Solar_Azimuth', 'Sensor_Zenith', 'Sensor_Azimuth', 'Cloud_Top_Pressure', 'Cloud_Mask_Fraction', 'Cloud_Mask_Fraction_Low', 'Cloud_Mask_Fraction_Mid', 'Cloud_Mask_Fraction_High', 'Cloud_Optical_Thickness_Liquid', 'Cloud_Optical_Thickness_Ice', 'Cloud_Optical_Thickness_Total', 'Cloud_Optical_Thickness_PCL_Total', 'Cloud_Optical_Thickness_Log10_Liquid', 'Cloud_Optical_Thickness_Log10_Ice', 'Cloud_Optical_Thickness_Log10_Total', 'Cloud_Particle_Size_Liquid', 'Cloud_Particle_Size_Ice', 'Cloud_Water_Path_Liquid', 'Cloud_Water_Path_Ice', 'Cloud_Retrieval_Fraction_Liquid', 'Cloud_Retrieval_Fraction_Ice', 'Cloud_Retrieval_Fraction_Total']
With this file access knowledge in hand, we can write a dictionary containing a naive XarrayZarrRecipe
for each group as follows.
Note: Each of these recipes concatenates the given group into a time series spanning all months covered in the
dates
sequence. To make this possible, I define a process_input
function which adds the "date"
dimension to each group, because as provided by LAADS DAAC the groups do not have any temporal dimension along which to concatenate.
import os
import pandas as pd
import requests
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern
from pangeo_forge_recipes.recipes import XarrayZarrRecipe
GROUPS = [
'Solar_Zenith',
'Solar_Azimuth',
'Sensor_Zenith',
'Sensor_Azimuth',
'Cloud_Top_Pressure',
'Cloud_Mask_Fraction',
'Cloud_Mask_Fraction_Low',
'Cloud_Mask_Fraction_Mid',
'Cloud_Mask_Fraction_High',
'Cloud_Optical_Thickness_Liquid',
'Cloud_Optical_Thickness_Ice',
'Cloud_Optical_Thickness_Total',
'Cloud_Optical_Thickness_PCL_Total',
'Cloud_Optical_Thickness_Log10_Liquid',
'Cloud_Optical_Thickness_Log10_Ice',
'Cloud_Optical_Thickness_Log10_Total',
'Cloud_Particle_Size_Liquid',
'Cloud_Particle_Size_Ice',
'Cloud_Water_Path_Liquid',
'Cloud_Water_Path_Ice',
'Cloud_Retrieval_Fraction_Liquid',
'Cloud_Retrieval_Fraction_Ice',
'Cloud_Retrieval_Fraction_Total',
]
BASE_URL = "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MCD06COSP_M3_MODIS"
dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS") # "MS" for "month start"
concat_dim = ConcatDim("date", keys=dates, nitems_per_file=1)
def make_url(date):
    """Make a NetCDF4 download url for NASA MODIS-COSP data based on an input date.

    :param date: A member of the ``pandas.core.indexes.datetimes.DatetimeIndex``
        created with ``dates = pd.date_range("2002-07-01", "2021-07-01", freq="MS")``.
    """
    day_of_year = date.timetuple().tm_yday
    response = requests.get(f"{BASE_URL}/{date.year}/{day_of_year}.json")
    filename = [r["name"] for r in response.json()].pop(0)
    return f"{BASE_URL}/{date.year}/{day_of_year}/{filename}"
pattern = FilePattern(
make_url,
concat_dim,
fsspec_open_kwargs={
"client_kwargs": dict(headers=dict(Authorization=f"Bearer {os.environ['EARTHDATA_TOKEN']}"))
},
)
def process_input(ds, filename):
    """Add missing "date" dimension to dataset to facilitate concatenation.
    """
    import xarray as xr
    return xr.concat([ds], dim="date")
per_group_recipes = {
group: XarrayZarrRecipe(
pattern,
xarray_open_kwargs=dict(group=group),
process_input=process_input,
)
for group in GROUPS
}
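One detail of make_url may deserve a check before running the full date range: tm_yday is an unpadded integer, so 1 January yields 1 rather than 001. The months exercised so far in this thread (July and August 2002) have three-digit days of year, so padding made no difference there. The sketch below assumes, based on the three-digit day of year in file names such as A2002182, that the archive directories are zero-padded too. That assumption has not been verified here, and month_starts and archive_dir are hypothetical helper names.

```python
from datetime import date


def month_starts(first: date, last: date):
    """Yield the first day of each month from `first` to `last`, inclusive."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def archive_dir(day: date) -> str:
    """Return '<year>/<zero-padded day of year>' for a given date."""
    return f"{day.year}/{day.timetuple().tm_yday:03d}"


dates = list(month_starts(date(2002, 7, 1), date(2021, 7, 1)))
print(len(dates))
print([archive_dir(d) for d in dates[:8]])
```

If the server turns out to accept unpadded directory names as well, the original make_url needs no change; otherwise the :03d format specifier is the only modification required.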
We cannot execute these recipes on Pangeo Forge Cloud yet, because we don't yet have a mechanism to securely manage credentials (xref https://github.com/pangeo-forge/roadmap/pull/36). However, I did execute a 2-month temporal subset of each of these recipes locally (and anyone else can too) with the following code:
NOTE: Running the code below will create 23 new subdirectories (i.e. Zarr stores, which are directories) within the current working directory.
from fsspec.implementations.local import LocalFileSystem
from pangeo_forge_recipes.recipes import setup_logging
from pangeo_forge_recipes.storage import CacheFSSpecTarget, FSSpecTarget
fs_local = LocalFileSystem()
setup_logging("DEBUG")
for group_name, recipe in per_group_recipes.items():
    print(f"\n\n Building {group_name} onto local storage...")
    recipe.storage_config.cache = CacheFSSpecTarget(fs_local, "cache")
    recipe.storage_config.target = FSSpecTarget(fs_local, group_name + ".zarr")
    recipe_pruned = recipe.copy_pruned()
    recipe_pruned.to_function()()
and the resulting Zarr stores (one for each group) can be accessed with
import xarray as xr
ds = xr.open_zarr(f"{group_name}.zarr", consolidated=True)
By way of conclusion for now: based on this test, I'd estimate that the full temporal scope of each of these recipes will build Zarr stores of between ~1.1 and 12.8 GB per group, with a total dataset size (a full temporal run for each of the 23 groups) of about 69 GB:
all_groups_full_size = 0
for group in GROUPS:
    ds = xr.open_zarr(f"{group}.zarr", consolidated=True)
    group_pruned_size = round(ds.nbytes/1e6)
    group_full_size = group_pruned_size * len(dates)
    print(f"{group} {group_pruned_size} MB -> {group_full_size/1e3} GB")
    all_groups_full_size += group_full_size
print(f"\n{all_groups_full_size/1e3} GB")
Solar_Zenith 5 MB -> 1.145 GB
Solar_Azimuth 5 MB -> 1.145 GB
Sensor_Zenith 5 MB -> 1.145 GB
Sensor_Azimuth 5 MB -> 1.145 GB
Cloud_Top_Pressure 5 MB -> 1.145 GB
Cloud_Mask_Fraction 5 MB -> 1.145 GB
Cloud_Mask_Fraction_Low 5 MB -> 1.145 GB
Cloud_Mask_Fraction_Mid 5 MB -> 1.145 GB
Cloud_Mask_Fraction_High 5 MB -> 1.145 GB
Cloud_Optical_Thickness_Liquid 49 MB -> 11.221 GB
Cloud_Optical_Thickness_Ice 49 MB -> 11.221 GB
Cloud_Optical_Thickness_Total 56 MB -> 12.824 GB
Cloud_Optical_Thickness_PCL_Total 51 MB -> 11.679 GB
Cloud_Optical_Thickness_Log10_Liquid 5 MB -> 1.145 GB
Cloud_Optical_Thickness_Log10_Ice 5 MB -> 1.145 GB
Cloud_Optical_Thickness_Log10_Total 5 MB -> 1.145 GB
Cloud_Particle_Size_Liquid 5 MB -> 1.145 GB
Cloud_Particle_Size_Ice 5 MB -> 1.145 GB
Cloud_Water_Path_Liquid 5 MB -> 1.145 GB
Cloud_Water_Path_Ice 5 MB -> 1.145 GB
Cloud_Retrieval_Fraction_Liquid 5 MB -> 1.145 GB
Cloud_Retrieval_Fraction_Ice 5 MB -> 1.145 GB
Cloud_Retrieval_Fraction_Total 5 MB -> 1.145 GB
68.7 GB
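The arithmetic behind these figures can be reproduced from the printed per-group sizes alone. The sketch below hard-codes the rounded MB values from the output and applies the same scaling as the loop that produced them (pruned size times number of months). The names PRUNED_MB, N_GROUPS and DEFAULT_MB are hypothetical. Since the pruned stores were built from a 2-month subset, the true per-month size may be closer to half of each pruned figure, in which case these totals would be upper bounds.

```python
# Rounded sizes (MB) of the pruned Zarr stores, copied from the printed output.
# Every group not listed here was reported as 5 MB.
PRUNED_MB = {
    "Cloud_Optical_Thickness_Liquid": 49,
    "Cloud_Optical_Thickness_Ice": 49,
    "Cloud_Optical_Thickness_Total": 56,
    "Cloud_Optical_Thickness_PCL_Total": 51,
}
N_GROUPS = 23
DEFAULT_MB = 5

# Monthly files from 2002-07 through 2021-07, inclusive.
n_months = (2021 - 2002) * 12 + 1

sizes_mb = list(PRUNED_MB.values()) + [DEFAULT_MB] * (N_GROUPS - len(PRUNED_MB))
total_gb = sum(size * n_months for size in sizes_mb) / 1e3

print(n_months)
print(round(total_gb, 1))
```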
@cisaacstern Thanks so much for continuing to work on this; it's spectacular.
I'm not sure how y'all think of things at Pangeo-forge but, from a science user's perspective, there's a lot to be gained by more targeted processing. (By way of background, for some groups we want to extract only one field of four; for other groups we want to do some arithmetic on existing fields.)
My understanding is that I should create a dictionary containing a set of XarrayZarrRecipes
, where each process_input
keyword points to the appropriate function? For example, I might have extract_selected_fields
which creates a dataset from the Mean
variable from a set of groups (renamed to the group name, so Cloud_Top_Pressure.Mean
becomes Cloud_Top_Pressure
)? And the recipes that share input files will not download the files over and over?
Is there a way to handle appending new data as it is produced, month by month?
Question: do these groups contain variables with the same dimensions / coordinates? If so, it would make sense logically to merge them into a single dataset. (That is not possible today but would become possible with the Opener refactor.)
All variables share location and time coordinates. I would package all the scalar fields together in a single dataset. There are also some joint histograms with the same location and time coordinates but different histogram bins. Because they don't share bin definitions, and because they're large, I had thought to create separate datasets for each unique set of bin definitions.
There is no inherent size limit to the Zarr group, because it is not a single file. It's all about doing whatever is most convenient for the person analyzing the data. In this case, it sounds like we want just one big dataset.
As long as the dimensions use distinct names, we should be fine to merge into a single dataset. I.e. bins: 50
and bins: 70
would cause merge errors, but Cloud_Water_Path_Liquid_bins: 50
and Cloud_Retrieval_Fraction_Ice_bins: 70
would be fine.
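A quick pre-merge check along these lines can flag clashing dimension names before any data is written. This is a minimal sketch in plain Python: the {dimension: size} dictionaries stand in for each group's ds.sizes mapping, find_dim_conflicts is a hypothetical helper, and the bin counts are illustrative rather than taken from the data. It compares sizes only, so two dimensions with equal length but different bin edges would not be caught.

```python
def find_dim_conflicts(group_dims):
    """Return {dim: {size: [groups]}} for dimensions whose size differs between groups."""
    seen = {}
    for group, dims in group_dims.items():
        for dim, size in dims.items():
            seen.setdefault(dim, {}).setdefault(size, []).append(group)
    return {dim: sizes for dim, sizes in seen.items() if len(sizes) > 1}


# Both groups call their histogram dimension "bins", but with different lengths.
shared_name = {
    "Cloud_Water_Path_Liquid": {"latitude": 180, "longitude": 360, "bins": 50},
    "Cloud_Retrieval_Fraction_Ice": {"latitude": 180, "longitude": 360, "bins": 70},
}
# Each group uses its own dimension name, so nothing clashes.
distinct_names = {
    "Cloud_Water_Path_Liquid": {
        "latitude": 180, "longitude": 360, "Cloud_Water_Path_Liquid_bins": 50,
    },
    "Cloud_Retrieval_Fraction_Ice": {
        "latitude": 180, "longitude": 360, "Cloud_Retrieval_Fraction_Ice_bins": 70,
    },
}

print(find_dim_conflicts(shared_name))
print(find_dim_conflicts(distinct_names))
```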
We cannot execute these recipes on Pangeo Forge Cloud yet, because we don't yet have a mechanism to securely manage credentials
Charles, I wonder if it is worthwhile to just special case earthdata login and inject some earthdata login credentials directly into our environments. This would allow us to move forward with some of these recipes before we solve the general secrets problem.
Yes, merging is definitely the way to go. As Ryan said, we'll need https://github.com/pangeo-forge/pangeo-forge-recipes/pull/245 to do this in a single recipe, but we can do it today in two steps, which I've done to complete the end-to-end demonstration.
I exported the outputs of each of the recipes in my last comment with ds.to_netcdf
and cached those files to our OSN bucket at these publicly accessible paths:
I wrote a second recipe to merge these inputs into a single Zarr store:
from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, MergeDim
from pangeo_forge_recipes.recipes import XarrayZarrRecipe
concat_dim = ConcatDim("date", keys=[0,], nitems_per_file=2)
# Here `GROUPS` is the list defined in:
# https://github.com/pangeo-forge/staged-recipes/issues/125#issuecomment-1077053600
merge_dim = MergeDim("group", keys=GROUPS)
def make_url(date, group):
    base_url = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge"
    return f"{base_url}/modis-cosp/cache/{group}.nc"

def process_input(ds, filename):
    """Add a group name abbreviation to each data variable name.
    """
    group = filename.split("/modis-cosp/cache/")[-1].replace(".nc", "")
    abbreviation = (
        "".join([word[0] for word in group.split("_")])  # e.g. 'Cloud_Top_Pressure' -> 'CTP'
        if not group.startswith("S")  # special casing to disambiguate 'Solar_*' & 'Sensor_*'
        else group[:3] + group.split("_")[-1][0]  # e.g. 'Solar_Zenith' -> 'SolZ'; 'Sensor_Zenith' -> 'SenZ'
    )
    return ds.rename_vars({v: f"{abbreviation}_{v}" for v in ds.data_vars})
pattern = FilePattern(make_url, concat_dim, merge_dim)
recipe = XarrayZarrRecipe(pattern, process_input=process_input)
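To see which prefixes the renaming step produces, the abbreviation logic can be exercised on its own, without opening any data. The sketch below lifts the expression from process_input into a standalone function (the name abbreviate is hypothetical) and checks that the prefixes stay unique across a few of the group names.

```python
def abbreviate(group: str) -> str:
    """Return the prefix used for a group's variables, e.g. 'Cloud_Top_Pressure' -> 'CTP'."""
    words = group.split("_")
    if group.startswith("S"):
        # 'Solar_*' and 'Sensor_*' share their initials, so keep three letters.
        return group[:3] + words[-1][0]
    return "".join(word[0] for word in words)


groups = [
    "Solar_Zenith",
    "Sensor_Zenith",
    "Cloud_Top_Pressure",
    "Cloud_Optical_Thickness_Liquid",
    "Cloud_Optical_Thickness_Log10_Liquid",
]
prefixes = {group: abbreviate(group) for group in groups}
print(prefixes)

# A collision here would make two groups write identically named variables.
assert len(set(prefixes.values())) == len(groups)
```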
I ran this recipe locally and then manually copied the output to our OSN bucket. The resulting Zarr store (2 time steps, 114 data variables) can be opened with:
import fsspec
import xarray as xr
base_url = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge"
dataset_public_url = f"{base_url}/modis-cosp/modis-cosp-demo.zarr"
mapper = fsspec.get_mapper(dataset_public_url)
ds = xr.open_zarr(mapper, consolidated=True)
print(ds)
I'll respond to the other questions/comments in another comment.
there's a lot to be gained by more targeted processing. ... My understanding is that I should create a dictionary containing a set of XarrayZarrRecipes, where each process_input keyword points to the appropriate function?
Correct. As described in the API Reference, process_input
functions must have the signature
def process_input(ds: xr.Dataset, filename: str) -> xr.Dataset
so to use the group name (for renaming variables, etc.) within process_input
, you'll need to get it either from the filename (as I did above) or perhaps ds.attrs["long_name"]
. And yes, you can apply any arithmetic, etc. within this function as well, and then just return the ds
as you'd like it to appear in the recipe's output dataset.
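One way to give each recipe its own processing while keeping the required two-argument signature is to build the function with a closure. The following is a sketch under assumptions: make_process_input is a hypothetical factory, and a plain dictionary stands in for the xr.Dataset so that the example runs without xarray. With a real dataset, the body would select and rename variables with xarray methods instead.

```python
from typing import Callable, Dict


def make_process_input(group: str, keep: str = "Mean") -> Callable[[Dict, str], Dict]:
    """Return a process_input-style function that keeps one field, renamed to the group name."""

    def process_input(ds: Dict, filename: str) -> Dict:
        # `group` and `keep` are captured from the enclosing scope, so the
        # function still takes only (ds, filename), as the recipe expects.
        return {group: ds[keep]}

    return process_input


# Dictionary stand-in for the fields of one group (field names are illustrative).
fake_group = {"Mean": [1.0, 2.0], "Standard_Deviation": [0.1, 0.2]}
process_ctp = make_process_input("Cloud_Top_Pressure")
print(process_ctp(fake_group, "example.nc"))
```

Each entry in a per-group dictionary of recipes could then receive process_input=make_process_input(group), which avoids parsing the group name out of the filename.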
I agree that a great next step would be for you to refine the per-group recipes I prototyped in my earlier comment so that the per-group Zarr stores they output look as you'd like them to. (Merging all these together will become a lot simpler once the above-referenced refactor is complete, so we don't need to worry about that for now.)
As you go along, you can run local tests of your recipes as described in the Running a Recipe Locally docs. Once you hit a point where you have questions, rather than posting your code in comments as I've done here, I'd recommend Making a PR, which will make it easier for me to clone and work with your code.
And the recipes that share input files will not download the files over and over?
Once we've put everything together into one recipe, yes this will be true. In the interim, while we still have a single recipe for each group, that won't happen automatically, because each recipe maintains its own cache. If you get to a point where this becomes a barrier to recipe development, just let me know and I can show you some advanced config to point all of the recipes to a single cache. I'd recommend trying to execute a few recipes first before we get into that, though.
Is there a way to handle appending new data as it is produced, month by month?
This is on the roadmap (xref https://github.com/pangeo-forge/pangeo-forge-recipes/issues/37), but for now the solution is simply to overwrite the original dataset with an updated date range once new data is released. For this particular dataset, that does not concern me too much: the entire dataset is less than 100 GB, which is on the low end of what our infrastructure is designed to handle, so rewriting the whole thing should be relatively fast (a few hours, maybe).
I wonder if it is worthwhile to just special case earthdata login and inject some earthdata login credentials directly into our environments.
Yes, that's a good idea. And we may end up wanting to do the same for other commonly used portals.
I'm not sure how y'all think of things at Pangeo-forge but, from a science user's perspective, there's a lot to be gained by more targeted processing.
Last comment for now but wanted to add this because I realized I did not answer the aesthetic dimension of this question. The aim of Pangeo Forge is to produce analysis-ready, cloud-optimized (ARCO) datasets. The XarrayZarrRecipe
will take care of the cloud-optimized part, but we defer to you, as the domain expert, for the analysis-ready part. You should absolutely apply whatever preprocessing will make this data a dream to work with, and which will help you and other scientists minimize, or even eliminate, the latency between opening this dataset and getting started on your/their science. Our ideal world is one in which you open this dataset and breathe a sigh of relief, "Ah, what a relief, this dataset is ready to go!"
@cisaacstern I've cloned this repo and started work on my recipe, building on your generous help. A couple questions arising:
As you note the signature for process_inputs
is process_input(ds: xr.Dataset, filename: str) -> ds: xr.Dataset
. My understanding is that ds
is the result of ds = xr.open_dataset(filename, **client_kwargs)
. Is that correct? If so, I guess it's ok to make other calls to xr.open_dataset()
with different arguments within the body of process_inputs()
?
What is the preferred way at present to loop over a collection of recipes, as you do here, in the current environment?
Related: is it ok to have a recipe repo contain several recipes?
In general I would not recommend calling open_dataset
from within the preprocessing function. Although I can see how that would be a useful hack for us to get around the fact that we cannot distinguish between different groups at the FilePattern
level. So perhaps we do it for now and then refactor later once https://github.com/pangeo-forge/pangeo-forge-recipes/pull/245 is done.
- is it ok to have a recipe repo contain several recipes?
Yes. They just have to be enumerated in meta.yaml.
Thanks for your patience with this Rob. It's very helpful for us to have willing guinea pigs. 🐹
What is the preferred way at present to loop over a collection of recipes, as you do https://github.com/pangeo-forge/staged-recipes/issues/125#issuecomment-1077053600, in the current environment?
Everything in that linked comment should work as-is with the current release of pangeo-forge-recipes
.
As I show there, generally I've found the most concise way to define a number of recipes with some overlapping kwargs and some unique kwargs is with a dictionary comprehension. But you can also just write them out, "long-hand", one at a time, which is more verbose but has the benefit of being more easily (human) readable.
For test execution of a collection of recipes, the code in that same linked comment should also work as-is, but certainly let me know if you find otherwise.
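As a rough illustration of the dictionary-comprehension style, here is a self-contained sketch. `StandInRecipe` is a hypothetical placeholder that only records its keyword arguments, so that the snippet runs without pangeo-forge-recipes installed; in a real recipe.py it would be the actual recipe class. The keyword names, values, and the second group name are illustrative assumptions, not copied from the linked comment.

```python
from dataclasses import dataclass, field


@dataclass
class StandInRecipe:
    """Hypothetical placeholder for a real recipe class; it only stores kwargs."""

    kwargs: dict = field(default_factory=dict)


# Keyword arguments shared by every recipe (illustrative values).
shared_kwargs = {"target_chunks": {"date": 1}}

# Keyword arguments unique to each netCDF group (group names are examples).
groups = ["Solar_Zenith", "Cloud_Top_Pressure"]
unique_kwargs = {group: {"xarray_open_kwargs": {"group": group}} for group in groups}

# One recipe per group: shared kwargs first, then the group-specific ones.
per_group_recipes = {
    group: StandInRecipe(kwargs={**shared_kwargs, **extra})
    for group, extra in unique_kwargs.items()
}

for group, recipe in per_group_recipes.items():
    print(group, recipe.kwargs)
```

Writing the same recipes out long-hand would mean repeating `shared_kwargs` in every definition, which is the verbosity trade-off mentioned above.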
@cisaacstern I'm coming back to this project and now have a conda environment that includes the pangeo-forge package. I'm a little unclear how the pieces of code are supposed to fit together. Looking at the other examples in this repo, it seems that recipe.py
defines a single recipe that will eventually be executed with recipe.to_function()()
. Your comment above goes beyond this, to define a dictionary per_group_recipes
with each item being a recipe. You then execute in a loop over the dictionary elements. How would I arrange e.g. recipe.py
to do this loop? I realize I could create a separate Python recipe file for each group but that seems like the long way round.
@RobertPincus, glad you're working on it! And thanks for the question.
Your recipe.py
can start from a copy-and-paste of the code block which defines the per_group_recipes
in the comment you link. In the same comment, immediately below the block which defines the per_group_recipes
, there are another two code blocks which show how to run a looped local test execution of the per_group_recipes
dictionary and then open the resulting zarr stores with xarray. These three blocks in the linked comment, which respectively define the per_group_recipes dictionary, execute it in a test loop, and open the resulting zarr stores, should all work as-is in a new conda environment with pangeo-forge-recipes
installed from conda, which it sounds like you have. If you find that's not the case, certainly let me know.
Once you've replicated those three steps with copy-and-paste, you can then start editing the recipes defined in the per_group_recipes
dictionary and re-running the execution test loop to get the resulting zarr stores to look the way you want them to.
Dear @cisaacstern What you describe is how I'm doing the development - I did copy/paste your code, and now I'm modifying it as needed (and making mistakes, which I might ask you for help with).
But once I have things working in this test environment, how will I configure the repo to work as a feedstock, which I understand iterates over the recipe
defined in meta.yaml
?
A per-recipe issue: as you've seen, most data in the source files is organized within netCDF groups, i.e. group "Solar_Zenith" has variables "Mean", "Pixel_Counts" etc. Datasets produced from the recipe(s) would ideally contain data from within a group (the mean value, for example) and data that doesn't belong to any group (latitude and longitude, maybe also attributes). I was hoping to address this by opening the dataset referred to by the filename
argument to process_inputs
:
def extract_mean(ds, filename):
    """Add missing "date" dimension to dataset to facilitate concatenation.
    Extract mean values and transpose latitude, longitude dimension
    """
    import xarray as xr
    #
    # ds contains all the variables in the group including joint histograms
    # for this recipe we want only the variable Mean, transposed, and renamed to the group name
    #
    # New dataset with group attributes
    newds = xr.Dataset(attrs = ds.attrs)
    # Add global attributes - needed?
    newds.attrs.update(xr.open_dataset(filename, engine="netcdf4").attrs)
    # Discover the name of the netcdf group that ds contains.
    # There might be more robust ways to do this
    groupname = ds.Mean.attrs["title"].replace(": Mean","")
    newds[groupname] = ds.Mean.T.rename(groupname)
    #
    # When accessing a group the lat and lon variables are indexes, not numerical values
    #
    newds["latitude"] = xr.open_dataset(filename, engine="netcdf4").latitude
    newds["longitude"] = xr.open_dataset(filename, engine="netcdf4").longitude
    return xr.concat([ds], dim="date")
But it seems like the filename
being passed to the process_input
function points to the remote file (not the local cache), and the files aren't being served correctly by the OPeNDAP server as you know.
Is there a way to point to the locally-cached copy of the data instead? How else might I accomplish what I'm after?
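On the side note in the function above that there might be more robust ways to discover the group name: one possible sketch, assuming the `title` attribute always has the form `<group>: <variable>` (as in `Solar_Zenith: Mean`), strips the suffix only from the end of the string and fails loudly otherwise. `group_name_from_title` is a hypothetical helper name.

```python
def group_name_from_title(title: str, variable: str = "Mean") -> str:
    """Return the group name from a title such as "Solar_Zenith: Mean"."""
    suffix = f": {variable}"
    if not title.endswith(suffix):
        # Unlike str.replace, refuse titles that do not match the expected form.
        raise ValueError(f"title {title!r} does not end with {suffix!r}")
    return title[: -len(suffix)]


print(group_name_from_title("Solar_Zenith: Mean"))
# Solar_Zenith
```

Compared with `str.replace`, this leaves any earlier occurrence of `": Mean"` inside the group name untouched and raises an error when the title is not what the recipe expects.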
But once I have things working in this test environment, how will I configure the repo to work as a feedstock, which I understand iterates over the recipe defined in meta.yaml?
We certainly need to improve our documentation of the meta.yaml
recipes section, which is currently limited to a brief discussion in the docs here as well as some inline commentary in the Pangeo Forge Sandbox template meta.yaml
here.
As you may have noted, neither of those sources currently document the dictionary option, which I'll address below, but before doing so will first briefly suggest how you could use the conventional style for your case, in the event it's of interest.
I've copied the relevant section from the linked template into this comment for ease of reference. As noted here, the recipes section of meta.yaml
is not just a pointer to the recipe file, but the specific object name within that file:
recipes:
  # User chosen name for recipe. Likely similar to dataset name, ~25 characters in length
  - id: identifier-for-your-recipe
    # The `object` below tells Pangeo Cloud specifically where your recipe instance(s) are located and uses the format <filename>:<object_name>
    # <filename> is name of .py file where the Python recipe object is defined.
    # For example, if <filename> is given as "recipe", Pangeo Cloud will expect a file named `recipe.py` to exist in your PR.
    # <object_name> is the name of the recipe object (i.e. Python class instance) _within_ the specified file.
    # For example, if you have defined `recipe = XarrayZarrRecipe(...)` within a file named `recipe.py`, then your `object` below would be `"recipe:recipe"`
    object: "recipe:recipe"
So if you were to not use the dictionary approach I've previously encouraged, and instead just define a number of recipes in a single recipe.py
such as:
# imports etc.
# define file pattern etc.
cloud_top_pressure_recipe = XarrayZarrRecipe(...)  # arguments elided
cloud_mask_fraction_recipe = XarrayZarrRecipe(...)  # arguments elided
cloud_optical_thickness_liquid_recipe = XarrayZarrRecipe(...)  # arguments elided
# etc.
Then following the model described in the template above, the recipes section of your meta.yaml
could be:
recipes:
  - id: cloud-top-pressure
    object: "recipe:cloud_top_pressure_recipe"
  - id: cloud-mask-fraction
    object: "recipe:cloud_mask_fraction_recipe"
  - id: cloud-optical-thickness
    object: "recipe:cloud_optical_thickness_liquid_recipe"
  # etc. as many of these as you want
This then tells Pangeo Forge Cloud that there is a recipe with ID cloud-top-pressure
which exists as a Python object named cloud_top_pressure_recipe
that is defined in a file named recipe.py
, etc.
If you don't mind writing the recipes out "long hand" so to speak (i.e. assigning each one to a distinct variable name in recipe.py
), this style will work well.
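To make the `<filename>:<object_name>` convention concrete, here is a small sketch of how such a string splits into a file to look for and an object name to look up. `split_object_spec` is a hypothetical helper written for illustration; it is not how Pangeo Forge Cloud itself is implemented.

```python
from typing import Tuple


def split_object_spec(spec: str) -> Tuple[str, str]:
    """Split "<filename>:<object_name>" into (python_file, object_name)."""
    filename, sep, object_name = spec.partition(":")
    if not (filename and sep and object_name):
        raise ValueError(f"expected '<filename>:<object_name>', got {spec!r}")
    # The filename part is given without its .py extension.
    return f"{filename}.py", object_name


for spec in ("recipe:recipe", "recipe:cloud_top_pressure_recipe"):
    print(split_object_spec(spec))
# ('recipe.py', 'recipe')
# ('recipe.py', 'cloud_top_pressure_recipe')
```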
The dictionary approach can be a bit more concise to write, which is why I previously demonstrated it. If you find you prefer that style, as mentioned above the meta.yaml
description for it actually has not made it into the user-facing docs yet, but is documented in our design roadmap here. Copying from that source, if you define a dictionary of recipes called per_group_recipes
in a file named recipe.py
, your recipes section can read:
recipes:
  - dict_object: "recipe:per_group_recipes"
Thanks again for your patient questions as we bring the documentation up to speed.
Being able to process a dictionary of recipes is excellent. Thanks, I'll do that.
Is there a way to point to the locally-cached copy of the data instead? How else might I accomplish what I'm after?
Not sure if @rabernat will have another suggestion but to start, one option would be to adapt the local download code first posted in this comment to cache a single file at the top of your recipe (outside the process_input
function), open it just once, and use it to assign whatever variables/attributes you'd like within process_input
:
import os
import fsspec
import xarray as xr  # needed at module level to open the reference file below
# pangeo_forge_recipes imports here

base_url = (
    "https://ladsweb.modaps.eosdis.nasa.gov/"
    "archive/allData/61/MCD06COSP_M3_MODIS/2002/182"
)
# define FilePattern here

# you could also get this directly from the FilePattern, instead of hardcoding it
reference_filename = "MCD06COSP_M3_MODIS.A2002182.061.2020181145824.nc"

# download the reference file once, authenticating with an Earthdata token
with fsspec.open(
    f"{base_url}/{reference_filename}",
    client_kwargs=dict(headers=dict(Authorization=f"Bearer {os.environ['EARTHDATA_TOKEN']}")),
) as src:
    with open(reference_filename, mode="wb") as dst:
        dst.write(src.read())

# open the reference file just once; process_input reads from it below
reference_ds = xr.open_dataset(reference_filename, engine="netcdf4")

def extract_mean(ds, filename):
    """Add missing "date" dimension to dataset to facilitate concatenation.
    Extract mean values and transpose latitude, longitude dimension
    """
    import xarray as xr
    # New dataset with group attributes
    newds = xr.Dataset(attrs = ds.attrs)
    # Add global attributes - needed?
    newds.attrs.update(reference_ds.attrs)
    # Discover the name of the netcdf group that ds contains.
    # There might be more robust ways to do this
    groupname = ds.Mean.attrs["title"].replace(": Mean","")
    newds[groupname] = ds.Mean.T.rename(groupname)
    #
    # When accessing a group the lat and lon variables are indexes, not numerical values
    #
    newds["latitude"] = reference_ds.latitude
    newds["longitude"] = reference_ds.longitude
    return xr.concat([ds], dim="date")
I think this is more reliable than trying to access the files cached by the executor dynamically, but assumes that a file representing just one time step will have the correct latitude, longitude, and attributes for all time steps. Is that a fair assumption?
@cisaacstern It is a fair assumption that "one time step will have the correct latitude, longitude, and attributes for all time steps". Thanks for this idea - I've also been exploring the idea of adding latitude and longitude after the fact, extracting the fields outside of the groups and adding them later.
As a technical matter, does code written in "recipe.py" get executed outside the executor?
Gotcha.
As a technical matter, does code written in "recipe.py" get executed outside the executor?
No, all code in recipe.py
will be executed within the executor, which will likely be the function executor for your local testing (unless you decide to experiment outside what I proposed above).
Just thinking aloud, the difficulty with accessing the recipe cache at runtime is that process_input
is a pure function that does not have any awareness of the runtime attributes (i.e. state) of the recipe. But the cache location is a dynamically-assigned attribute of the recipe. So there is no way to know the recipe's cache location before the recipe is executed.
How can I extract an arbitrary file from the pattern
for caching?
I thought I would try to extract the minimal amount of data from each group and add the coordinates later:
def extract_mean(ds, filename):
    """Add missing "date" dimension to dataset to facilitate concatenation.
    Extract mean values and transpose latitude, longitude dimension
    Second iteration - don't construct datasets, simply return renamed values
    """
    import xarray as xr
    #
    # ds contains all the variables in the group including joint histograms
    # for this recipe we want only the variable Mean, transposed, and renamed to the group name
    #
    # Discover the name of the netcdf group that ds contains.
    # There might be more robust ways to do this
    groupname = ds.Mean.attrs["title"].replace(": Mean","")
    return xr.concat([xr.Dataset(data_vars = {groupname:ds.Mean.T})], dim="date")
This fails for reasons I don't understand
dask.array<xarray-<this-array>, shape=(1, 180, 360), dtype=float64, chunksize=(1, 180, 360), chunktype=numpy.ndarray>
Dimensions without coordinates: date, latitude, longitude
Attributes:
title: Solar_Zenith: Mean
units: degrees
Traceback (most recent call last):
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/mapping.py", line 135, in __getitem__
result = self.fs.cat(k)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/spec.py", line 739, in cat
return self.cat_file(paths[0], **kwargs)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/spec.py", line 649, in cat_file
with self.open(path, "rb", **kwargs) as f:
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/spec.py", line 1009, in open
f = self._open(
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/implementations/local.py", line 155, in _open
return LocalFileOpener(path, mode, fs=self, **kwargs)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/implementations/local.py", line 250, in __init__
self._open()
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/implementations/local.py", line 255, in _open
self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/data/robertp/MODIS-COSP/Solar_Zenith.zarr/.zmetadata'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py", line 348, in open_group
zarr_group = zarr.open_consolidated(store, **open_kwargs)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/zarr/convenience.py", line 1187, in open_consolidated
meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/zarr/storage.py", line 2644, in __init__
meta = json_loads(store[metadata_key])
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/zarr/storage.py", line 545, in __getitem__
return self._mutable_mapping[key]
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/fsspec/mapping.py", line 139, in __getitem__
raise KeyError(key)
KeyError: '.zmetadata'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 442, in prepare_target
ds = open_target(config.storage_config.target)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 111, in open_target
return xr.open_zarr(target.get_mapper())
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py", line 752, in open_zarr
ds = open_dataset(
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/xarray/backends/api.py", line 495, in open_dataset
backend_ds = backend.open_dataset(
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py", line 800, in open_dataset
store = ZarrStore.open_group(
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/xarray/backends/zarr.py", line 365, in open_group
zarr_group = zarr.open_group(store, **open_kwargs)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/zarr/hierarchy.py", line 1182, in open_group
raise GroupNotFoundError(path)
zarr.errors.GroupNotFoundError: group not found at path ''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 6, in <module>
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/pangeo_forge_recipes/executors/python.py", line 46, in function
stage.function(config=pipeline.config)
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 468, in prepare_target
for v in ds.variables:
File "/home/robertp/.conda/envs/conda-forge-recipes/lib/python3.9/site-packages/xarray/core/common.py", line 239, in __getattr__
raise AttributeError(
AttributeError: 'DataArray' object has no attribute 'variables'
Any ideas? extract_mean
specifically returns an xr.Dataset
How can I extract an arbitrary file from the pattern for caching?
With the caveat that this is definitely a bit hacky, using the pattern
defined in https://github.com/pangeo-forge/staged-recipes/issues/125#issuecomment-1077053600 you could do:
def get_url_for_nth_input(n):
    input_key = [key for key in pattern if tuple(key)[0].index == n].pop(0)
    url = pattern[input_key]
    return url
Where the argument n
is the index number in the concatenation dimension of the input, so
get_url_for_nth_input(n=2)
'https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MCD06COSP_M3_MODIS/2002/244/MCD06COSP_M3_MODIS.A2002244.061.2020181145835.nc'
and
get_url_for_nth_input(n=50)
'https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MCD06COSP_M3_MODIS/2006/244/MCD06COSP_M3_MODIS.A2006244.061.2020181150014.nc'
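For orientation, the two URLs above suggest how the index n maps onto the archive layout. The sketch below reproduces only the `<year>/<day-of-year>` directory part, under the assumptions that inputs are monthly and that n=0 is July 2002 (day 182); the filename itself cannot be computed this way because it contains the production timestamp. `directory_for_nth_input` is a hypothetical helper, independent of the FilePattern.

```python
from datetime import date

ARCHIVE_ROOT = (
    "https://ladsweb.modaps.eosdis.nasa.gov/"
    "archive/allData/61/MCD06COSP_M3_MODIS"
)
FIRST_MONTH = date(2002, 7, 1)  # assumption: n=0 corresponds to July 2002


def directory_for_nth_input(n: int) -> str:
    """Return the <year>/<day-of-year> archive directory for the n-th monthly input."""
    months = (FIRST_MONTH.month - 1) + n  # zero-based months since January 2002
    month_start = date(FIRST_MONTH.year + months // 12, months % 12 + 1, 1)
    # %j is the zero-padded day of the year, which names the subdirectory.
    return f"{ARCHIVE_ROOT}/{month_start:%Y}/{month_start:%j}"


print(directory_for_nth_input(2))
# https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MCD06COSP_M3_MODIS/2002/244
print(directory_for_nth_input(50))
# https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/61/MCD06COSP_M3_MODIS/2006/244
```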
This fails for reasons I don't understand ... Any ideas? extract_mean specifically returns an xr.Dataset
Perhaps more verbose than what you've tried, but rather than building up a new xr.Dataset
from ds
, perhaps try dropping everything unwanted from the original ds
?
Edited: Had a typo in function (and example outputs) when I first posted this comment. Fixed now.
@cisaacstern Thanks for the feedback. I fixed the problem of returning a Dataset rather than a DataArray by only renaming the field once :-).
So how to extract the coordinates? In the comment above I asked about extracting a single file name from the pattern, but that's maybe too specific a question. Really, what I want to do is to identify any single instance of the files defined by my pattern,
open this, and extract the date-independent coordinate variables. In reading the tutorials it seems like this isn't quite consistent with the pangeo-forge FilePattern
because we don't have a Combine Dimension
but maybe I miss something?
It's getting a bit hard for me to follow what particular code we are talking about at this point. Could you open a PR against this repo with your latest recipe.py
including in-line comments explaining where you want to extract coordinates? (I'm assuming this will be in process_input
but this is getting hard for me to picture without looking at the exact code.)
You can just omit meta.yaml
from your PR for now, which will confuse @pangeo-forge-bot, but that's okay.
@cisaacstern My code is very much a work in progress and indeed, the data I'm planning to apply this to won't be ready for several weeks. Does it help to point you to my fork, where the recipes live in recipes/nasa-modis-cosp
?
At this stage it seems like my process is going to be
A few questions as I iterate on the recipes:
Can FilePattern use a local store? In step 3 above I tried to point to the local results from step 1 but this failed (and I see you moved everything to OSN to do the merge).
In extract_jhist_vs_pc() I tried to create data variables from attributes of the joint histogram variable but these are appearing as coordinates, although no other variable uses them as coordinates. Any ideas why that might be?
Can FilePattern use a local store? In step 3 above I tried to point to the local results from step 1 but this failed (and I see you moved everything to OSN to do the merge).
Yes, FilePattern can point at local files. I only moved everything to OSN so that you (or others) could interact with the output of my prototyped recipe, not as a requirement to make it run. Can you provide a link to the code which threw the error you reference, along with the full error traceback?
In extract_jhist_vs_pc() I tried to create data variables from attributes of the joint histogram variable but these are appearing as coordinates, although no other variable uses them as coordinates. Any ideas why that might be?
A minimal reproducible example would be helpful for debugging this. Could you provide a self-contained code snippet (presumably using only xarray
and maybe fsspec
, but not pangeo-forge-recipes
) which reproduces the problem you describe here?
@cisaacstern Sorry for the radio silence. In the interests of having file-organizing code I can publish with a paper describing the data I've written a stand-alone Python script that manages the opening of different netCDF groups as needed.
I'd still like to make the data available through Pangeo Forge, though. I was speaking to @tomnicholas earlier today; it sounds like I could do this easily if the recipes allowed me to use datatree instead of xarray datasets. Could I do this?
@RobertPincus, thanks for the ongoing attention to this. As Tom has just posted, I believe the linked Opener refactor will be necessary to make this a reality: https://github.com/pangeo-forge/pangeo-forge-recipes/pull/245#issuecomment-1109075727
@cisaacstern Thanks, I had seen the note from @TomNicholas. I will be delighted to use the revised Opener when it becomes available.
Source Dataset
This data provides satellite observations targeted at the evaluation of global models, facilitated by the use of synthetic observations ("satellite simulators"). The data are a re-packaging of standard "Level-3" (gridded, aggregated) cloud products produced by NASA's MODIS satellites; data from both instruments (morning/Terra and afternoon/Aqua) are combined. The fields conform to output from the "MODIS simulator," one of several used in the CFMIP Observation Simulator Package (COSP, paper1, paper2). Output from COSP and the MODIS simulator is requested as part of the Cloud Feedbacks Model Intercomparison Project (CFMIP), part of CMIP.
Transformation / Alignment / Merging
Ideally we would provide several related datasets. One would contain the mean values for many or even all scalar fields. This means extracting the mean value from each group from each file and concatenating the fields in time. A second would be the joint histograms, which need to be extracted and normalized, metadata refined, and also concatenated in time. Since the joint histograms are roughly 50 times as large as each scalar field it may be best to create one dataset per joint histogram (there are about a dozen).
Output Dataset
Given the user community it would be useful to produce netCDF output, perhaps with a kerchunk index. It would be fine to also produce a Zarr or related dataset, perhaps in addition. Whatever the format, the data should be structured so that it's easy to append new data as it is produced.