output file structure and format

HeatherSavoy-USDA commented 2 years ago

Right now, we output the results as one file per variable per time increment. That file is a GeoTIFF if it's raster output or a CSV if it's point extraction output. It would be good to have the option to combine variables and/or time increments in the output. This is particularly useful for point output, but would also be a handy option once we support NetCDF output for rasters.

Issue #31 is closely related: if we end up processing stacks of rasters, that makes combining the results more straightforward.

HeatherSavoy-USDA commented 2 years ago

This is also related to #32 for describing input file structures.

HeatherSavoy-USDA commented 2 years ago

And related to #4 for types of output files. Some file types will be better suited to combining variables/time.

HeatherSavoy-USDA commented 2 years ago

Quick update: we will be discussing with the GeoCDL community this week which output file formats are desired.

HeatherSavoy-USDA commented 2 years ago

Requested output formats from GeoCDL monthly meeting:

Raster:

geotiff per variable and time increment (already available)
NetCDF with time dimension - mixed votes on whether variables should be combined into one file as well. Any implementation worries for either option besides file size?

Point:

csv - wide format to combine variables (no mention of time)
shapefile
NetCDF

HeatherSavoy-USDA commented 2 years ago

Current progress:

I added the optional input parameter output_format for both subset endpoints in fdc55f99b71980846268c3619246264608b26b87
Drafted support for point results to append time to existing files. At the moment, each variable gets its own file. 8652d065f1a8a7d97bbce00905b12b49bcaf9420

HeatherSavoy-USDA commented 2 years ago

Left to do:

Combine variables in shapefiles - fields are stored wide so can't easily append
Once the above is figured out, let csv output be long format?
netcdf doesn't look to be supported by geopandas's to_file - translate to xarray first
netcdf for raster: need to work on concatenating time

HeatherSavoy-USDA commented 2 years ago

OK, so: I'm considering how (or whether) to modify the approach in fulfillRequestSynchronous() that accumulates fout_paths from _getRasterLayer() or _getPointLayer(). It's currently written to 1. loop over all datasets, variables, and time steps; 2. on each iteration, call one of those _get... functions, which writes data to file and returns that filename; and 3. zip up all of those files at the end. This makes sense for our current geotiff output case, but all other output options will combine variables and/or time.
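The three steps described above can be sketched roughly as follows. This is only an illustration, not the actual gcdl code: `write_layer` and `fulfill_request` are hypothetical stand-ins for `_getRasterLayer()`/`_getPointLayer()` and `fulfillRequestSynchronous()`, the file-naming scheme and the `datasets` layout are assumptions, and the files written are empty placeholders.

```python
import tempfile
import zipfile
from pathlib import Path


def write_layer(out_dir, dataset, variable, date):
    """Hypothetical stand-in for _getRasterLayer()/_getPointLayer():
    writes one file for a single dataset/variable/date and returns its path."""
    fout_path = Path(out_dir) / f"{dataset}_{variable}_{date}.tif"
    fout_path.write_bytes(b"")  # placeholder; real code would write the data
    return fout_path


def fulfill_request(out_dir, datasets, dates):
    """datasets maps a dataset name to its list of variables (assumed layout)."""
    fout_paths = []
    # 1. Loop over datasets, variables, and time steps (dates innermost).
    for dataset, variables in datasets.items():
        for variable in variables:
            for date in dates:
                # 2. Each iteration writes one file and returns its name.
                fout_paths.append(write_layer(out_dir, dataset, variable, date))

    # 3. Zip up all of those files at the end.
    zip_path = Path(out_dir) / "results.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        for path in fout_paths:
            zf.write(path, arcname=path.name)
    return zip_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        zp = fulfill_request(tmp, {"ds1": ["v1", "v2"]}, ["2020-01", "2020-02"])
        with zipfile.ZipFile(zp) as zf:
            print(sorted(zf.namelist()))
```

Because each iteration commits its data to a separate file, any format that needs to combine variables or time has no natural place to do so in this structure.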

Point data: I had started to modify the current approach to just append data to existing files, but that only works well for csv. For shapefiles and netcdf, more complicated merging seems necessary. Should we modify _getPointLayer() to add data to a larger request-level geodataframe? Then at the end write that geodataframe to file? Would xarray DataArray have any advantages over geodataframe beyond making writing to netcdfs easier?

Raster data: When we write netcdfs, we will at least concatenate time, so yay for the date_list being the inner-most loop. xarray's concat() or merge() can help combine data here. How do we best handle one case that needs to combine data across iterations (netcdf) and one case that doesn't (geotiffs)? Do as with point data and accumulate data either way, then split as necessary when writing to file? Or keep the current approach and combine data from temporary output files into another file? The latter doesn't work in the current scheme either, though, since _getRasterLayer() returns those temporary file names. But xarray.open_mfdataset() could help.
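One way to picture the "accumulate either way, then split as necessary when writing" option is below. This is a sketch under assumptions, not a proposal for the actual implementation: `plan_output_files`, the `"geotiff"`/`"netcdf"` format strings, and the file names are all hypothetical, and it only plans which dates end up in which file rather than handling any raster data.

```python
from collections import defaultdict


def plan_output_files(layers, output_format):
    """Group accumulated layers into output files.

    layers: list of (dataset, variable, date) tuples in processing order.
    Returns a dict mapping a (hypothetical) file name to the dates it holds.
    """
    files = defaultdict(list)
    for dataset, variable, date in layers:
        if output_format == "geotiff":
            # No combining: one file per variable per time increment.
            name = f"{dataset}_{variable}_{date}.tif"
        elif output_format == "netcdf":
            # Combine along time: one file per dataset/variable.
            name = f"{dataset}_{variable}.nc"
        else:
            raise ValueError(f"unsupported output format: {output_format}")
        files[name].append(date)
    return dict(files)


if __name__ == "__main__":
    layers = [("ds1", "v1", "2020-01"), ("ds1", "v1", "2020-02")]
    print(plan_output_files(layers, "geotiff"))
    print(plan_output_files(layers, "netcdf"))
```

With dates as the inner-most loop, the dates for each netcdf file arrive consecutively and in order, which is what makes concatenating along time straightforward.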

stuckyb commented 2 years ago

For what it's worth, the solution I've had in mind here is to separate output production from DataRequestHandler. That was the design I had sketched out when I did the big refactoring a while back, but I didn't get that part done at that time. I think that would be the cleanest design, though. Basically, DataRequestHandler shouldn't need to know any details about how to generate various outputs; it only passes raw output data on to the proper components, which handle the actual output generation.

This would be sort of similar to how DataRequest is a generic abstraction of a data query, which means the request fulfillment part of the system doesn't have to know anything about how the request is received/initially processed (i.e., the front interface is independent of request processing). Same idea but with outputs instead of inputs, if that makes sense.
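A minimal sketch of that separation, to make the idea concrete. Every name here is hypothetical (`OutputWriter`, `CsvWriter`, `NetcdfWriter`, `WRITERS`, and `RequestHandler` as a stand-in for DataRequestHandler), and the writers return placeholder strings instead of producing real files.

```python
from abc import ABC, abstractmethod


class OutputWriter(ABC):
    """Hypothetical interface: turns raw result records into output."""

    @abstractmethod
    def write(self, records):
        """Produce output from a list of raw result records."""


class CsvWriter(OutputWriter):
    def write(self, records):
        # Placeholder: real code would write a CSV file here.
        return f"csv output with {len(records)} records"


class NetcdfWriter(OutputWriter):
    def write(self, records):
        # Placeholder: real code would write a netCDF file here.
        return f"netcdf output with {len(records)} records"


# Assumed mapping from an output_format value to the component that handles it.
WRITERS = {"csv": CsvWriter, "netcdf": NetcdfWriter}


class RequestHandler:
    """Stand-in for DataRequestHandler: it gathers raw data and hands it off,
    without knowing how any particular output format is generated."""

    def __init__(self, output_format):
        self.writer = WRITERS[output_format]()

    def fulfill(self, records):
        # Real code would compute `records` from the request; here they are given.
        return self.writer.write(records)


if __name__ == "__main__":
    print(RequestHandler("csv").fulfill([1, 2, 3]))
```

Supporting a new format then means adding a writer and a registry entry, with no change to the handler.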

HeatherSavoy-USDA commented 2 years ago

Based on @stuckyb's comment above, I have split all file output from DataRequestHandler into a new DataRequestOutput. For all output formats, time is recorded based on RequestDate and date granularity, e.g. 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'. Any date formatting in native data files is overwritten (need to note it in metadata though, new issue coming soon).
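For illustration, the granularity-dependent time strings could be produced along these lines. This is an assumption-laden sketch rather than the actual DataRequestOutput code: the grain names (`"annual"`, `"monthly"`, `"daily"`), `GRAIN_FORMATS`, and `format_request_date` are hypothetical, and a plain `datetime.date` stands in for RequestDate.

```python
from datetime import date

# Hypothetical grain names mapped to the 'YYYY', 'YYYY-MM', 'YYYY-MM-DD' patterns.
GRAIN_FORMATS = {
    "annual": "%Y",
    "monthly": "%Y-%m",
    "daily": "%Y-%m-%d",
}


def format_request_date(d, grain):
    """Return the time string for a date at the given granularity."""
    return d.strftime(GRAIN_FORMATS[grain])


if __name__ == "__main__":
    d = date(2020, 3, 5)
    for grain in GRAIN_FORMATS:
        print(grain, format_request_date(d, grain))
```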

Point data:

For CSV output, one file is written combining datasets. If date granularities are mixed, each date is written in its own granularity, so the time column contains mixed formats. The file is in long format with columns time, dataset, variable, value, x, and y.
For shapefile output, same format as CSV, except it has a geometry definition instead of x and y columns.
For netCDF output, there are 'time', 'x', and 'y' coordinates and one data variable per requested dataset+variable. Mixed date grains are combined in the time dimension.
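As an illustration of the long CSV layout described above, here is a small standard-library sketch. The column order follows the description; the helper `write_long_csv`, the dataset and variable names, and the values are made up for the example, and the real output is presumably written through geopandas rather than the `csv` module.

```python
import csv
import io

COLUMNS = ["time", "dataset", "variable", "value", "x", "y"]


def write_long_csv(rows):
    """Serialize rows (dicts keyed by COLUMNS) to long-format CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row[c] for c in COLUMNS])
    return buf.getvalue()


if __name__ == "__main__":
    # Two datasets with different date granularities share one time column.
    rows = [
        {"time": "2020", "dataset": "ds1", "variable": "v1",
         "value": 1.5, "x": -105.0, "y": 40.0},
        {"time": "2020-01", "dataset": "ds2", "variable": "v2",
         "value": 3.0, "x": -105.0, "y": 40.0},
    ]
    print(write_long_csv(rows))
```

Long format lets datasets and variables be added as extra rows, so files can be combined or appended without changing the set of columns.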

Raster data:

For geotiff output, there aren't (or shouldn't be) any changes.
For netcdf output, same as point data above except datasets are not combined (even when spatially harmonized) to avoid large file sizes.

stuckyb / gcdl

output file structure and format #33