stactools-packages / goes-glm

stactools package for the Geostationary Lightning Mapper (GLM) on the GOES satellites

netCDF files (and conversion to a cloud-optimized format) #3

Closed m-mohr closed 2 years ago

m-mohr commented 2 years ago

This product does not conform to the netCDF classic data model because it makes use of multiple unlimited dimensions.

Describes unlimited dimensions: https://docs.unidata.ucar.edu/netcdf-c/current/unlimited_dims.html
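For illustration, a minimal sketch of how to list the dimensions of a GLM granule with the netCDF4 library (the file name below is only a placeholder):

```python
import netCDF4

# Placeholder path to a local GLM L2 LCFA granule
ds = netCDF4.Dataset("OR_GLM-L2-LCFA_G16_example.nc")

# GLM products use several unlimited (record) dimensions, which is why they
# do not fit the netCDF classic data model.
for name, dim in ds.dimensions.items():
    print(name, len(dim), "unlimited" if dim.isunlimited() else "fixed")
```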

Global attributes such as id, featureType, datetimes, etc. can be used for collection and item metadata. See #4 for details.
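A rough sketch of reading those global attributes with xarray (the file path is a placeholder, and the attribute names beyond id and featureType are assumptions based on typical GOES metadata):

```python
import xarray as xr

ds = xr.open_dataset("OR_GLM-L2-LCFA_G16_example.nc")  # placeholder path

# Global attributes that could feed STAC Collection/Item metadata; the exact
# set of names beyond "id" and "featureType" is an assumption here.
for key in ("id", "featureType", "time_coverage_start", "time_coverage_end", "platform_ID"):
    print(key, "=", ds.attrs.get(key))
```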

It has a lot of variables (see Table 5.26.6-2), but each variable has only one dimension. They are grouped as follows:

Additional variables with a scalar value, but with additional metadata assigned:

Examples in brackets are from the G16 example file.

Potential solution:

The main question here is what people actually use. I'm not a domain expert and I'm unsure how people would use this data.

Direct conversion is not possible; we need to rewrite this into a columnar format:

TomAugspurger commented 2 years ago

A few brief thoughts, using this snippet:

```python
import planetary_computer
import fsspec
import adlfs
import xarray as xr
import geopandas
import pyproj
import pandas as pd
from typing import Tuple


def to_dataframes(
    ds: xr.Dataset,
) -> Tuple[geopandas.GeoDataFrame, geopandas.GeoDataFrame, geopandas.GeoDataFrame]:
    flash_vars = [k for k, v in ds.data_vars.items() if "number_of_flashes" in v.dims]
    groups_vars = [k for k, v in ds.data_vars.items() if "number_of_groups" in v.dims]
    events_vars = [k for k, v in ds.data_vars.items() if "number_of_events" in v.dims]

    crs = pyproj.CRS.from_cf(ds.goes_lat_lon_projection.attrs)

    flashes = ds[flash_vars].to_dataframe()
    flashes = geopandas.GeoDataFrame(
        flashes,
        geometry=geopandas.array.points_from_xy(flashes.flash_lon, flashes.flash_lat),
        crs=crs,
    )

    groups = ds[groups_vars].to_dataframe()
    groups = geopandas.GeoDataFrame(
        groups,
        geometry=geopandas.array.points_from_xy(groups.group_lon, groups.group_lat),
        crs=crs,
    )

    events = ds[events_vars].to_dataframe()
    events = geopandas.GeoDataFrame(
        events,
        geometry=geopandas.array.points_from_xy(events.event_lon, events.event_lat),
        crs=crs,
    )
    return flashes, groups, events


credential = planetary_computer.sas.get_token("goeseuwest", "noaa-goes16").token
fs = adlfs.AzureBlobFileSystem("goeseuwest", credential=credential)
files = fs.find("noaa-goes16/GLM-L2-LCFA/2022/001/")

dataframes = []

# first 5 for testing.
# Need to figure out a partitioning / row-group strategy
for file in files[:5]:
    f = fs.open(file)
    ds = xr.open_dataset(f, engine="h5netcdf")
    dataframes.append(to_dataframes(ds))

a, b, c = zip(*dataframes)

# ignore_index drops the *_count variable, which I think is OK
# If we want to keep that (for mapping back to the original file)
# we should reset_index() and include the file path as a new column
flashes = pd.concat(a, ignore_index=True)
groups = pd.concat(b, ignore_index=True)
events = pd.concat(c, ignore_index=True)
```

One note: we would need to use parquet version 2.6 when writing the parquet files, to write the timestamps with nanosecond precision (https://arrow.apache.org/docs/python/parquet.html#storing-timestamps).
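For illustration, a tiny standalone sketch of that behaviour (the file name and data are made up):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up data: a single nanosecond-precision timestamp.
df = pd.DataFrame({"t": pd.to_datetime(["2022-01-01 00:00:00.123456789"])})
table = pa.Table.from_pandas(df)

# With format version "2.6" nanosecond timestamps are preserved; older format
# versions coerce them down to millisecond/microsecond precision.
pq.write_table(table, "nanos.parquet", version="2.6")
print(pq.read_table("nanos.parquet").schema)  # t: timestamp[ns]
```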

m-mohr commented 2 years ago

Thanks for the thoughts, @TomAugspurger.

> I think we have some work to do figuring out how to partition the files (by time, but how frequently? do we use row groups? etc.).

The netCDF files are already partitioned into intervals, I think 20 seconds or so. If we want to expose netCDF and Parquet files in the same STAC Item, the natural and STAC-compliant way would be to simply keep that grouping and generate three Parquet files for a single netCDF file.

If there's a desire to re-group multiple netCDF files into a single set of three Parquet files, we can't really expose the netCDF files alongside them in a way that makes sense in STAC, as you'd then have assets that don't cover the full temporal extent of the Item. It would be somewhat possible, but it doesn't feel STACish. Alternatively, you either don't expose the netCDF files at all, or you generate two STAC Items, one for the netCDF file and one for the Parquet files.

Not sure what the best way forward is. The first option would certainly align best with existing implementations, I think.
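For illustration, a rough pystac sketch of the first option (IDs, hrefs, datetimes, and media types below are placeholders, not what this package actually produces):

```python
from datetime import datetime, timezone

import pystac

# One Item per ~20 s granule: the source netCDF plus three Parquet assets.
item = pystac.Item(
    id="OR_GLM-L2-LCFA_G16_example",  # placeholder ID
    geometry=None,
    bbox=None,
    datetime=datetime(2022, 1, 1, tzinfo=timezone.utc),  # placeholder datetime
    properties={},
)
item.add_asset(
    "netcdf",
    pystac.Asset(
        href="./OR_GLM-L2-LCFA_G16_example.nc",
        media_type="application/netcdf",
        roles=["data"],
    ),
)
for name in ("flashes", "groups", "events"):
    item.add_asset(
        name,
        pystac.Asset(
            href=f"./{name}.parquet",
            media_type="application/x-parquet",
            roles=["data"],
        ),
    )
```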

> My snippet just sets the geometry as a series of Points, based on the lon / lat. I don't know if we can use the "area" column to make a polygon (probably not).

My implementation does the same, and I think that is correct. Computing a polygon from the area seems like a wild guess. How many edges would that polygon have? Someone could also argue to use a circle or so...

> One note: we would need to use parquet version 2.6 when writing the parquet files, to write the timestamps with nanosecond precision (https://arrow.apache.org/docs/python/parquet.html#storing-timestamps).

Good catch. Right now there are no timestamps involved, because all temporal values seem to be seconds since 2020-12-31 23:59:40.000, so basically normal floats. The question is whether we should convert them to something more commonly used: either a UNIX timestamp (i.e. seconds since 1970) or ISO timestamps?
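For illustration, a minimal sketch of such a conversion with pandas (the epoch is the one quoted above; the offset values are made up):

```python
import pandas as pd

# Epoch quoted above; the offsets below are made-up example values.
epoch = pd.Timestamp("2020-12-31T23:59:40.000", tz="UTC")
seconds = pd.Series([0.1503, 2.0481, 19.8742])

# "seconds since epoch" floats -> proper timestamps
timestamps = epoch + pd.to_timedelta(seconds, unit="s")
print(timestamps)
```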

TomAugspurger commented 2 years ago

> If we want to expose netCDF and Parquet files in the same STAC Item, the natural and STAC-compliant way would be to simply keep that grouping and generate three Parquet files for a single netCDF file.

I completely forgot to think through that, and I agree that from a STAC point of view it makes sense to have a single Item pointing to the NetCDF and three parquet files.

This will make timeseries analysis a bit harder since you'll have to work with very many files, but I think that can be overcome with some effort.
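For example (a rough sketch; the directory layout is hypothetical), many small per-granule Parquet files can still be read back as one logical table:

```python
import pyarrow.dataset as pads

# Hypothetical layout: one "flashes" Parquet file per ~20 s granule,
# all collected under a single directory.
dataset = pads.dataset("flashes/", format="parquet")
df = dataset.to_table().to_pandas()
print(len(df))
```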

> Right now there are no timestamps involved, because all temporal values seem to be seconds since 2020-12-31 23:59:40.000, so basically normal floats.

I see. I was misled by xarray decoding those integers as datetimes. If it isn't too much additional work then yes, I think it would be useful to convert them to timestamps.

m-mohr commented 2 years ago

@TomAugspurger Do you have a clue how I can set the version to 2.6? I'm using geopandas.GeoDataFrame.to_parquet, but as it already has a version parameter for the GeoParquet version, I can't use kwargs to pass the Parquet version through to Arrow:

kwargs = {"version":"2.6"}
dataframe.to_parquet(file, **kwargs)
# => ValueError: version must be one of: 0.1.0, 0.4.0
TomAugspurger commented 2 years ago

Ah that's unfortunate. GeoDataFrame.to_parquet's version is used for the geoparquet version, and clashes with the version argument used by Table.to_parquet: https://github.com/geopandas/geopandas/blob/28462f8e7fd893e01be52f444e142a1a00c36f8e/geopandas/io/arrow.py#L284

I'll open an issue.

m-mohr commented 2 years ago

Thank you, I was about to do the same, but please go ahead, @TomAugspurger.

How do we proceed? I have code ready that exports timestamps; the only issue now is that they are not precise enough, and we'd likely need to wait for a fix?

TomAugspurger commented 2 years ago

https://github.com/geopandas/geopandas/issues/2495.

I included a workaround there that uses a private geopandas API. Not ideal, but it might be the best option for now.

```python
import pyarrow.parquet as pq

table = geopandas.io.arrow._geopandas_to_arrow(df)
pq.write_table(table, "out.parquet", version="2.6")
```

m-mohr commented 2 years ago

Thanks, seems to work.