stactools-packages / goes-glm

stactools package for the Geostationary Lightning Mapper (GLM) on the GOES satellites

netCDF files (and conversion to a cloud-optimized format) #3

Closed m-mohr closed 2 years ago

m-mohr commented 2 years ago

This product does not conform to the netCDF classic data model because it makes use of multiple unlimited dimensions.

Describes unlimited dimensions: https://docs.unidata.ucar.edu/netcdf-c/current/unlimited_dims.html
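For illustration, a minimal sketch of how to list the dimensions of a GLM granule with the netCDF4 library (the file name below is only a placeholder):

```python
import netCDF4

# Placeholder path to a local GLM L2 LCFA granule
ds = netCDF4.Dataset("OR_GLM-L2-LCFA_G16_example.nc")

# GLM products use several unlimited (record) dimensions, which is why they
# do not fit the netCDF classic data model.
for name, dim in ds.dimensions.items():
    print(name, len(dim), "unlimited" if dim.isunlimited() else "fixed")
```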

Global attributes such as id, featureType, datetimes, etc. can be used for collection and item metadata. See #4 for details.
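A rough sketch of reading those global attributes with xarray (the file path is a placeholder, and the attribute names beyond id and featureType are assumptions based on typical GOES metadata):

```python
import xarray as xr

ds = xr.open_dataset("OR_GLM-L2-LCFA_G16_example.nc")  # placeholder path

# Global attributes that could feed STAC Collection/Item metadata; the exact
# set of names beyond "id" and "featureType" is an assumption here.
for key in ("id", "featureType", "time_coverage_start", "time_coverage_end", "platform_ID"):
    print(key, "=", ds.attrs.get(key))
```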

It has a lot of variables (see Table 5.26.6-2), but each variable has only one dimension. They are grouped as follows:

Additional variables with a scalar value, but with additional metadata assigned:

Examples in brackets are from the G16 example file.

Potential solution:

The main question here is what people actually use. I'm not a domain expert and I'm unsure how people would use this data.

Direct conversion is not possible; we need to rewrite this into a columnar format:

TomAugspurger commented 2 years ago

A few brief thoughts, using this snippet:

```python
import planetary_computer
import fsspec
import adlfs
import xarray as xr
import geopandas
import pyproj
import pandas as pd
from typing import Tuple


def to_dataframes(
    ds: xr.Dataset,
) -> Tuple[geopandas.GeoDataFrame, geopandas.GeoDataFrame, geopandas.GeoDataFrame]:
    flash_vars = [k for k, v in ds.data_vars.items() if "number_of_flashes" in v.dims]
    groups_vars = [k for k, v in ds.data_vars.items() if "number_of_groups" in v.dims]
    events_vars = [k for k, v in ds.data_vars.items() if "number_of_events" in v.dims]

    crs = pyproj.CRS.from_cf(ds.goes_lat_lon_projection.attrs)

    flashes = ds[flash_vars].to_dataframe()
    flashes = geopandas.GeoDataFrame(
        flashes,
        geometry=geopandas.array.points_from_xy(flashes.flash_lon, flashes.flash_lat),
        crs=crs,
    )

    groups = ds[groups_vars].to_dataframe()
    groups = geopandas.GeoDataFrame(
        groups,
        geometry=geopandas.array.points_from_xy(groups.group_lon, groups.group_lat),
        crs=crs,
    )

    events = ds[events_vars].to_dataframe()
    events = geopandas.GeoDataFrame(
        events,
        geometry=geopandas.array.points_from_xy(events.event_lon, events.event_lat),
        crs=crs,
    )
    return flashes, groups, events


credential = planetary_computer.sas.get_token("goeseuwest", "noaa-goes16").token
fs = adlfs.AzureBlobFileSystem("goeseuwest", credential=credential)
files = fs.find("noaa-goes16/GLM-L2-LCFA/2022/001/")

dataframes = []

# first 5 for testing.
# Need to figure out a partitioning / row-group strategy
for file in files[:5]:
    f = fs.open(file)
    ds = xr.open_dataset(f, engine="h5netcdf")
    dataframes.append(to_dataframes(ds))

a, b, c = zip(*dataframes)

# ignore_index drops the *_count variable, which I think is OK
# If we want to keep that (for mapping back to the original file)
# we should reset_index() and include the file path as a new column
flashes = pd.concat(a, ignore_index=True)
groups = pd.concat(b, ignore_index=True)
events = pd.concat(c, ignore_index=True)
```

One note: we would need to use parquet version 2.6 when writing the parquet files, to write the timestamps with nanosecond precision (https://arrow.apache.org/docs/python/parquet.html#storing-timestamps).
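For illustration, a tiny standalone sketch of that behaviour (the file name and data are made up):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up data: a single nanosecond-precision timestamp.
df = pd.DataFrame({"t": pd.to_datetime(["2022-01-01 00:00:00.123456789"])})
table = pa.Table.from_pandas(df)

# With format version "2.6" nanosecond timestamps are preserved; older format
# versions coerce them down to millisecond/microsecond precision.
pq.write_table(table, "nanos.parquet", version="2.6")
print(pq.read_table("nanos.parquet").schema)  # t: timestamp[ns]
```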

m-mohr commented 2 years ago

Thanks for the thoughts, @TomAugspurger.

> I think we have some work to do figuring out how to partition the files (by time, but how frequently? do we use row groups? etc.).

The netCDF files are already partitioned into intervals, I think 20 seconds or so. If we want to expose netCDF and Parquet files in the same STAC Item, the natural and STAC-compliant way would be to simply keep that grouping and generate three Parquet files for a single netCDF file.

If there's a desire to re-group multiple netCDF files into a single set of three Parquet files, we can't really expose the netCDF files alongside them in a way that makes sense in STAC, as you'd then have assets that don't cover the full temporal extent of the Item. It would be somewhat possible, but it doesn't feel STACish. Alternatively, you either don't expose the netCDF files at all, or you generate two STAC Items, one for the netCDF file and one for the Parquet files.

Not sure what the best way forward is. The first option would certainly align best with existing implementations, I think.
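For illustration, a rough pystac sketch of the first option (IDs, hrefs, datetimes, and media types below are placeholders, not what this package actually produces):

```python
from datetime import datetime, timezone

import pystac

# One Item per ~20 s granule: the source netCDF plus three Parquet assets.
item = pystac.Item(
    id="OR_GLM-L2-LCFA_G16_example",  # placeholder ID
    geometry=None,
    bbox=None,
    datetime=datetime(2022, 1, 1, tzinfo=timezone.utc),  # placeholder datetime
    properties={},
)
item.add_asset(
    "netcdf",
    pystac.Asset(
        href="./OR_GLM-L2-LCFA_G16_example.nc",
        media_type="application/netcdf",
        roles=["data"],
    ),
)
for name in ("flashes", "groups", "events"):
    item.add_asset(
        name,
        pystac.Asset(
            href=f"./{name}.parquet",
            media_type="application/x-parquet",
            roles=["data"],
        ),
    )
```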

> My snippet just sets the geometry as a series of Points, based on the lon / lat. I don't know if we can use the "area" column to make a polygon (probably not).

My implementation does the same, and I think that is correct. Computing a polygon from the area seems like a wild guess. How many edges would that polygon have? Someone could also argue to use a circle or so...

> One note: we would need to use parquet version 2.6 when writing the parquet files, to write the timestamps with nanosecond precision (https://arrow.apache.org/docs/python/parquet.html#storing-timestamps).

Good catch. Right now there are no timestamps involved, because all temporal values seem to be seconds since 2020-12-31 23:59:40.000, so basically normal floats. The question is whether we should convert them to something more commonly used: either a UNIX timestamp (i.e. seconds since 1970) or ISO timestamps?
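For illustration, a minimal sketch of such a conversion with pandas (the epoch is the one quoted above; the offset values are made up):

```python
import pandas as pd

# Epoch quoted above; the offsets below are made-up example values.
epoch = pd.Timestamp("2020-12-31T23:59:40.000", tz="UTC")
seconds = pd.Series([0.1503, 2.0481, 19.8742])

# "seconds since epoch" floats -> proper timestamps
timestamps = epoch + pd.to_timedelta(seconds, unit="s")
print(timestamps)
```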

TomAugspurger commented 2 years ago

> If we want to expose netCDF and Parquet files in the same STAC Item, the natural and STAC-compliant way would be to simply keep that grouping and generate three Parquet files for a single netCDF file.

I completely forgot to think through that, and I agree that from a STAC point of view it makes sense to have a single Item pointing to the NetCDF and three parquet files.

This will make timeseries analysis a bit harder since you'll have to work with very many files, but I think that can be overcome with some effort.
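For example (a rough sketch; the directory layout is hypothetical), many small per-granule Parquet files can still be read back as one logical table:

```python
import pyarrow.dataset as pads

# Hypothetical layout: one "flashes" Parquet file per ~20 s granule,
# all collected under a single directory.
dataset = pads.dataset("flashes/", format="parquet")
df = dataset.to_table().to_pandas()
print(len(df))
```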

> Right now there are no timestamps involved, because all temporal values seem to be seconds since 2020-12-31 23:59:40.000, so basically normal floats.

I see. I was misled by xarray decoding those integers as datetimes. If it isn't too much additional work then yes, I think it would be useful to convert them to timestamps.

m-mohr commented 2 years ago

@TomAugspurger Do you have a clue how I can set the version to 2.6? I'm using geopandas.GeoDataFrame.to_parquet, but as it already has a version parameter for the GeoParquet version, I can't use kwargs to pass the Parquet version through to Arrow:

kwargs = {"version":"2.6"}
dataframe.to_parquet(file, **kwargs)
# => ValueError: version must be one of: 0.1.0, 0.4.0
TomAugspurger commented 2 years ago

Ah that's unfortunate. GeoDataFrame.to_parquet's version is used for the geoparquet version, and clashes with the version argument used by Table.to_parquet: https://github.com/geopandas/geopandas/blob/28462f8e7fd893e01be52f444e142a1a00c36f8e/geopandas/io/arrow.py#L284

I'll open an issue.

m-mohr commented 2 years ago

Thank you, I was about to do the same, but please go ahead, @TomAugspurger.

How do we proceed? I have code ready that exports timestamps; the only issue now is that they are not precise enough, and we'd likely need to wait for a fix?

TomAugspurger commented 2 years ago

https://github.com/geopandas/geopandas/issues/2495.

I included a workaround there that uses a private geopandas API. Not ideal, but it might be the best option for now.

```python
import pyarrow.parquet as pq

table = geopandas.io.arrow._geopandas_to_arrow(df)
pq.write_table(table, "out.parquet", version="2.6")
```

m-mohr commented 2 years ago

Thanks, seems to work.