stuckyb / gcdl


output file structure and format #33

Open HeatherSavoy-USDA opened 2 years ago

HeatherSavoy-USDA commented 2 years ago

Right now, we output the results as one file per variable per time increment. That file is a GeoTIFF if it's raster output or a CSV if it's point extraction output. It would be good if we could have the option to combine variables and/or time increments in the output. This would be particularly useful for point output, but it would also be a handy option once we support NetCDF output for rasters.

Issue #31 is closely related: if we end up processing stacks of rasters, that makes combining the results more straightforward.

HeatherSavoy-USDA commented 2 years ago

This is also related to #32 for describing input file structures.

HeatherSavoy-USDA commented 2 years ago

And related to #4 for types of output files. Some file types will be better suited to combining variables/time.

HeatherSavoy-USDA commented 2 years ago

Quick update: we will be discussing with the GeoCDL community this week which output file formats are desired.

HeatherSavoy-USDA commented 2 years ago

Requested output formats from GeoCDL monthly meeting:

Raster:

Point:

HeatherSavoy-USDA commented 2 years ago

Current progress:

HeatherSavoy-USDA commented 2 years ago

Left to do:

HeatherSavoy-USDA commented 2 years ago

Ok, so: considering how/if to modify approach in fulfillRequestSynchronous() that accumulates fout_paths from _getRasterLayer() or _getPointLayer(). It's currently written to 1. loop over all datasets, variables, and time steps; 2. call one of those _get... functions that will write data to file and return that filename for each iteration; and 3. zip up all of those files at the end. This makes sense for our current geotiff output case, but all other output options will combine variables/time.
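For readers following along, the three steps described above can be sketched roughly as follows. This is only an illustration of the loop-then-zip pattern, not the actual gcdl code: `write_layer()`, `fulfill_request()`, the file naming, and the placeholder CSV content are all hypothetical stand-ins for `_getRasterLayer()` / `_getPointLayer()` and `fulfillRequestSynchronous()`.

```python
import tempfile
import zipfile
from pathlib import Path


def write_layer(out_dir, dataset, variable, date):
    """Hypothetical stand-in for _getRasterLayer()/_getPointLayer():
    writes one file for one dataset/variable/date and returns its path."""
    fout_path = Path(out_dir) / f"{dataset}_{variable}_{date}.csv"
    fout_path.write_text("x,y,value\n")  # placeholder content only
    return fout_path


def fulfill_request(out_dir, datasets, date_list):
    """Loop over datasets, variables, and dates, then zip every file."""
    fout_paths = []
    for dataset, variables in datasets.items():
        for variable in variables:
            for date in date_list:  # dates are the innermost loop
                fout_paths.append(
                    write_layer(out_dir, dataset, variable, date)
                )

    # One output file per iteration, all bundled at the end.
    zip_path = Path(out_dir) / "request.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        for path in fout_paths:
            zf.write(path, arcname=path.name)
    return zip_path, fout_paths
```

With two variables and two dates this produces four files in the archive, which is the behavior that stops making sense once an output format needs variables or time steps combined in a single file.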

Point data: I had started to modify the current approach to just append data to existing files, but that only works well for csv. For shapefiles and netcdf, more complicated merging seems necessary. Should we modify _getPointLayer() to add data to a larger request-level geodataframe? Then at the end write that geodataframe to file? Would xarray DataArray have any advantages over geodataframe beyond making writing to netcdfs easier?
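One possible shape for the request-level accumulation asked about above, sketched with plain Python records rather than a real geodataframe so it stays self-contained. `PointAccumulator`, its method names, and the long-format column layout are assumptions made for illustration; a geopandas or xarray version would hold the same information in a different container.

```python
import csv
import io


class PointAccumulator:
    """Hypothetical request-level container standing in for a geodataframe."""

    FIELDS = ["x", "y", "time", "dataset", "variable", "value"]

    def __init__(self):
        self.rows = []

    def add(self, dataset, variable, date, points, values):
        # Store one long-format row per point, variable, and date.
        for (x, y), value in zip(points, values):
            self.rows.append(
                {"x": x, "y": y, "time": date, "dataset": dataset,
                 "variable": variable, "value": value}
            )

    def write_csv(self, fileobj):
        # Writing happens once, after every iteration has added its data.
        writer = csv.DictWriter(fileobj, fieldnames=self.FIELDS)
        writer.writeheader()
        writer.writerows(self.rows)
```

The point of the sketch is that each loop iteration only adds rows, and the choice of file format is deferred until the whole request has been collected.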

Raster data: When we write netcdfs, we will at least concatenate time, so yay for the date_list being the inner-most loop. xarray's concat() or merge() can help combine data here. How do we best handle one case that needs to combine data across iterations (netcdf) and one case that does not (geotiffs)? Do like point data and accumulate data either way, then split as necessary when writing to file? Or keep the current approach and combine data from temporary output files into another file? Although the latter doesn't work in the current scheme either, with _getRasterLayer() returning those temporary file names. But xarray.open_mfdataset() could help.

stuckyb commented 2 years ago

For what it's worth, the solution I've had in mind here is to separate output production from DataRequestHandler. That was the design I had sketched out when I did the big refactoring a while back, but I didn't get that part done at that time. I think that would be the cleanest design, though. Basically, DataRequestHandler shouldn't need to know any details about how to generate various outputs; it only passes raw output data on to the proper components which handle the actual output generation.

This would be sort of similar to how DataRequest is a generic abstraction of a data query, which means the request fulfillment part of the system doesn't have to know anything about how the request is received/initially processed (i.e., the front interface is independent of request processing). Same idea but with outputs instead of inputs, if that makes sense.
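A minimal sketch of that separation, assuming a small writer interface that the handler calls without knowing the format. All class names here (`OutputWriter`, `CsvWriter`, `JsonWriter`, `RequestHandler`) are hypothetical and are not the classes in gcdl; the formats are reduced to strings to keep the example short.

```python
import json
from abc import ABC, abstractmethod


class OutputWriter(ABC):
    """Hypothetical interface: turns raw records into one output format."""

    @abstractmethod
    def write(self, records):
        ...


class CsvWriter(OutputWriter):
    def write(self, records):
        fields = list(records[0])
        lines = [",".join(fields)]
        for rec in records:
            lines.append(",".join(str(rec[f]) for f in fields))
        return "\n".join(lines) + "\n"


class JsonWriter(OutputWriter):
    def write(self, records):
        return json.dumps(records)


class RequestHandler:
    """Hypothetical handler: gathers raw data, knows nothing about formats."""

    def __init__(self, writer):
        self.writer = writer

    def fulfill(self, raw_records):
        # All format-specific work is delegated to the writer.
        return self.writer.write(raw_records)
```

Adding a new output format under this arrangement means adding a writer class, with no change to the handler.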

HeatherSavoy-USDA commented 2 years ago

Based on @stuckyb's comment above, I have split all file output from DataRequestHandler into a new DataRequestOutput. For all output formats, time is recorded based on RequestDate and date granularity, e.g. 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'. Any date formatting in native data files is overwritten (need to note it in metadata though, new issue coming soon).
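To make the time labeling concrete, here is a rough sketch of deriving the label from the date granularity. The function `format_request_date()` and its `year`/`month`/`day` arguments are assumptions for illustration; the actual RequestDate attributes may differ.

```python
def format_request_date(year, month=None, day=None):
    """Return 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD' depending on granularity."""
    if month is None:
        return f"{year:04d}"          # annual data
    if day is None:
        return f"{year:04d}-{month:02d}"  # monthly data
    return f"{year:04d}-{month:02d}-{day:02d}"  # daily data
```

For example, `format_request_date(2000, 7)` gives `'2000-07'`, regardless of how dates are written in the native data file.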

Point data:

Raster data: