Closed m-mohr closed 2 years ago
A few brief thoughts, using this snippet:
One note: we would need to use parquet version 2.6 when writing the parquet files, to write the timestamps with nanosecond precision (https://arrow.apache.org/docs/python/parquet.html#storing-timestamps).
Thanks for the thoughts, @TomAugspurger.
> I think we have some work to do figuring out how to partition the files (by time, but how frequently? do we use row groups? etc.).
So the netCDF files are partitioned into intervals already, I think 20 seconds or so. If we want to expose netCDF and Parquet files in the same STAC item, the natural and STAC-compliant way would be to simply keep that grouping and generate 3 parquet files for a single netCDF.
If there's a desire to re-group multiple netCDF files into a single set of 3 Parquet files, we can't really expose the netCDF files alongside them in a way that makes sense in STAC, because each Item would then reference netCDF files that don't cover the full temporal extent that is available. It would be somewhat possible, but doesn't feel STACish. Alternatively, we either don't expose the netCDF files at all, or we generate two STAC Items: one for the netCDF file and one for the Parquet files.
I'm not sure what the best way forward is. The first option would certainly align best with existing implementations, I think.
> My snippet just sets the geometry as a series of Points, based on the lon / lat. I don't know if we can use the "area" column to make a polygon (probably not).
My implementation does the same, and I think that is correct. Computing a polygon from the area seems like a wild guess: how many edges would that polygon have? Someone could also argue for using a circle or so...
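For the record, building the Point geometries is a one-liner in geopandas. A minimal sketch, where the column names `lon`/`lat` and the sample values are placeholders, not the actual GLM variable names:

```python
import geopandas
import pandas as pd

# Toy stand-in for the per-flash/group/event table extracted from the netCDF.
df = pd.DataFrame({"lon": [-75.2, -74.9], "lat": [3.1, 3.4], "energy": [10, 12]})

# points_from_xy builds one Point per row from the lon/lat columns.
gdf = geopandas.GeoDataFrame(
    df, geometry=geopandas.points_from_xy(df["lon"], df["lat"]), crs="EPSG:4326"
)
```

The resulting GeoDataFrame can then be written with `to_parquet` as discussed below.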
> One note: we would need to use parquet version 2.6 when writing the parquet files, to write the timestamps with nanosecond precision (https://arrow.apache.org/docs/python/parquet.html#storing-timestamps).
Good catch. Right now there are no timestamps involved, because all temporal values seem to be seconds since 2020-12-31 23:59:40.000, so effectively plain floats. The question here is whether we should convert them to something that is more commonly used: either UNIX timestamps (i.e. seconds since 1970) or ISO timestamps?
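If we do convert, pandas can turn those offsets into proper timestamps directly via the `origin` parameter of `to_datetime`; a small sketch, with the sample values made up:

```python
import pandas as pd

# Raw values: seconds since 2020-12-31 23:59:40 (the file's own epoch).
seconds = pd.Series([0.0, 1.5, 19.999])

# origin anchors the offsets to the epoch; unit="s" interprets them as seconds.
timestamps = pd.to_datetime(seconds, unit="s", origin="2020-12-31 23:59:40")
```

This yields absolute timestamps (e.g. the second value becomes 2020-12-31 23:59:41.500), which could then be written with the parquet 2.6 format as discussed above.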
> If we want to expose netCDF and Parquet files in the same STAC item, the natural and STAC-compliant way would be to simply keep that grouping and generate 3 parquet files for a single netCDF.
I completely forgot to think through that, and I agree that from a STAC point of view it makes sense to have a single Item pointing to the NetCDF and three parquet files.
This will make timeseries analysis a bit harder since you'll have to work with very many files, but I think that can be overcome with some effort.
> Right now there are no timestamps involved, because all temporal values seem to be seconds since 2020-12-31 23:59:40.000, so basically normal floats.
I see. I was misled by xarray decoding those integers as datetimes. If it isn't too much additional work then yes, I think it would be useful to convert them to timestamps.
@TomAugspurger Do you have a clue how I can set the version to 2.6? I'm using geopandas.GeoDataFrame.to_parquet, but as it already has a version parameter for the GeoParquet version, I can't use kwargs to pass the version through to Arrow:

```python
kwargs = {"version": "2.6"}
dataframe.to_parquet(file, **kwargs)
# => ValueError: version must be one of: 0.1.0, 0.4.0
```
Ah, that's unfortunate. GeoDataFrame.to_parquet's version argument is used for the GeoParquet version, and clashes with the version argument used by pyarrow: https://github.com/geopandas/geopandas/blob/28462f8e7fd893e01be52f444e142a1a00c36f8e/geopandas/io/arrow.py#L284

I'll open an issue.
Thank you, I was about to do the same, but please go ahead, @TomAugspurger.
How do we proceed? I have code ready that exports timestamps, the only issue now is that it is not precise enough and we'd likely need to wait for a fix?
https://github.com/geopandas/geopandas/issues/2495.
I included a workaround there that uses a private geopandas API. Not ideal, but it might be the best option for now:

```python
import pyarrow.parquet as pq

# _geopandas_to_arrow is private API and may change between geopandas releases.
table = geopandas.io.arrow._geopandas_to_arrow(df)
pq.write_table(table, "out.parquet", version="2.6")
```
Thanks, seems to work.
Describes unlimited dimensions: https://docs.unidata.ucar.edu/netcdf-c/current/unlimited_dims.html
Global attributes such as id, featureType, datetimes, etc. can be used for Collection and Item metadata. See #4 for details.
It has a lot of variables, see Table 5.26.6-2. Each variable has only one dimension though. They are grouped as follows:
- number_of_flashes (flash_count: 179)
- number_of_groups (group_count: 3706)
- number_of_events (event_count: 11236)
- number_of_time_bounds
- number_of_field_of_view_bounds -> metadata, i.e. the bounding box
- number_of_wavelength_bounds
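Since each variable has exactly one dimension, the grouping above suggests splitting a file into per-dimension tables by selecting variables on their dimension. A sketch of that idea, using a tiny synthetic xarray Dataset in place of a real GLM file; the variable names and values here are made up:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a GLM netCDF file; real files have many more variables.
ds = xr.Dataset({
    "flash_lat": ("number_of_flashes", np.array([3.1, 3.4])),
    "flash_lon": ("number_of_flashes", np.array([-75.2, -74.9])),
    "group_lat": ("number_of_groups", np.array([3.1, 3.2, 3.3])),
    "event_energy": ("number_of_events", np.array([1.0, 2.0, 3.0, 4.0])),
})

def table_for(ds, dim):
    # Keep only the variables whose single dimension matches `dim`.
    names = [name for name in ds.data_vars if ds[name].dims == (dim,)]
    return ds[names].to_dataframe()

flashes = table_for(ds, "number_of_flashes")
groups = table_for(ds, "number_of_groups")
events = table_for(ds, "number_of_events")
```

Each of the three DataFrames could then be written as one of the three Parquet files per netCDF discussed above.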
Additional variables with a scalar value, but additional metadata assigned:
Examples in brackets are from the G16 example file.
Potential solution:
The main question here is what people actually use. I'm not a domain expert and I'm unsure how people would use this.
Direct conversion is not possible; we need to rewrite this into a columnar format: