Open HeatherSavoy-USDA opened 2 years ago
This is also related to #32 for describing input file structures.
And related to #4 for types of output files. Some file types will be better suited to combining variables/time.
Quick update: we will be discussing with the GeoCDL community this week which output file formats are desired.
Requested output formats from GeoCDL monthly meeting:
Raster:
Point:
Current progress:
`output_format` for both subset endpoints in fdc55f99b71980846268c3619246264608b26b87
Left to do:
geopandas's `to_file`
- translate to xarray first
Ok, so: considering how/if to modify the approach in `fulfillRequestSynchronous()` that accumulates `fout_paths` from `_getRasterLayer()` or `_getPointLayer()`. It's currently written to (1) loop over all datasets, variables, and time steps; (2) call one of those `_get...` functions, which writes data to file and returns that filename for each iteration; and (3) zip up all of those files at the end. This makes sense for our current geotiff output case, but all other output options will combine variables/time.
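The loop-write-zip flow described above can be sketched roughly as follows. This is a simplified stand-in, not the actual GeoCDL code: `write_layer` and `fulfill_request` are hypothetical names playing the roles of the `_get...` functions and `fulfillRequestSynchronous()`, and the dataset, variable, and date values are made up.

```python
import tempfile
import zipfile
from pathlib import Path


def write_layer(out_dir, dataset, variable, date):
    """Hypothetical stand-in for _getRasterLayer()/_getPointLayer():
    writes one file for a single dataset/variable/date and returns its path."""
    fout_path = Path(out_dir) / f"{dataset}_{variable}_{date}.txt"
    fout_path.write_text(f"placeholder data for {dataset} {variable} {date}\n")
    return fout_path


def fulfill_request(out_dir, datasets, date_list):
    """Loop over datasets, variables, and dates; collect one file per
    iteration; zip all of the files at the end."""
    fout_paths = []
    for dataset, variables in datasets.items():
        for variable in variables:
            for date in date_list:  # dates are the innermost loop
                fout_paths.append(write_layer(out_dir, dataset, variable, date))

    zip_path = Path(out_dir) / "request_output.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        for path in fout_paths:
            zf.write(path, arcname=path.name)
    return zip_path, fout_paths


with tempfile.TemporaryDirectory() as tmp:
    zip_path, paths = fulfill_request(
        tmp, {"example_ds": ["ppt", "tmax"]}, ["2020-01", "2020-02"]
    )
    with zipfile.ZipFile(zip_path) as zf:
        zipped_names = sorted(zf.namelist())
```

Because each iteration hands back only a file name, anything that needs to combine variables or dates has to reopen those files, which is the limitation raised in this comment.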
Point data: I had started to modify the current approach to just append data to existing files, but that only works well for csv. For shapefiles and netcdf, more complicated merging seems necessary. Should we modify `_getPointLayer()` to add data to a larger request-level geodataframe, and then write that geodataframe to file at the end? Would an xarray `DataArray` have any advantages over a `geodataframe` beyond making writing to netcdfs easier?
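One way to picture the request-level accumulation idea is the sketch below. It uses plain dictionaries and the standard `csv` module as a stand-in for a geodataframe, so it only illustrates the "collect everything, write once" pattern. The class name `PointRequestAccumulator` and the sample values are hypothetical; the column names follow the long format described later in this thread.

```python
import csv
import io


class PointRequestAccumulator:
    """Hypothetical request-level container that collects point values
    from every dataset/variable/date iteration in long format."""

    COLUMNS = ["time", "dataset", "variable", "value", "x", "y"]

    def __init__(self):
        self.rows = []

    def add_layer(self, time, dataset, variable, points):
        """Append one row per point; `points` is a list of (x, y, value)."""
        for x, y, value in points:
            self.rows.append(
                {"time": time, "dataset": dataset, "variable": variable,
                 "value": value, "x": x, "y": y}
            )

    def to_csv(self, fout):
        """Write all accumulated rows once, at the end of the request."""
        writer = csv.DictWriter(fout, fieldnames=self.COLUMNS)
        writer.writeheader()
        writer.writerows(self.rows)


acc = PointRequestAccumulator()
acc.add_layer("2020-01", "example_ds", "ppt", [(-105.0, 40.0, 12.5)])
acc.add_layer("2020-02", "example_ds", "ppt", [(-105.0, 40.0, 8.1)])

buffer = io.StringIO()
acc.to_csv(buffer)
csv_text = buffer.getvalue()
```

With the data held in one place until the end, the choice of csv, shapefile, or netcdf becomes a decision made only at write time.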
Raster data: When we write netcdfs, we will at least concatenate time, so yay for the `date_list` being the innermost loop. xarray's `concat()` or `merge()` can help combine data here. How do we best handle one case that needs to combine data across iterations (netcdf) and one that does not (geotiffs)? Do like point data and accumulate data either way, then split as necessary when writing to file? Or keep the current approach and combine data from temporary output files into another file? Although the latter doesn't work in the current scheme either with `_getRasterLayer()` returning those temporary file names. But `xarray.open_mfdataset()` could help.
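A minimal sketch of the "accumulate either way, then split as necessary" option, assuming `numpy` and `xarray` are available. The helper `make_layer`, the variable name `ppt`, and all coordinate values are made up for illustration; only `xr.DataArray`, `xr.concat`, and `.sel` are real xarray calls.

```python
import numpy as np
import xarray as xr


def make_layer(date, fill_value):
    """Build a tiny 2x2 raster for one time step (made-up values)."""
    return xr.DataArray(
        np.full((1, 2, 2), fill_value, dtype="float32"),
        dims=("time", "y", "x"),
        coords={"time": [date], "y": [40.0, 39.0], "x": [-105.0, -104.0]},
        name="ppt",
    )


# Accumulate one DataArray per date, as the inner date loop would.
layers = [make_layer(d, v) for d, v in [("2020-01", 1.0), ("2020-02", 2.0)]]

# netcdf-style output: combine all time steps into one array.
combined = xr.concat(layers, dim="time")

# geotiff-style output: split the combined array back into per-date slices.
per_date = {str(t): combined.sel(time=t) for t in combined["time"].values}
```

The combined array could then go to a single netcdf file (for example via `combined.to_netcdf(...)`), while the per-date slices map onto the existing one-file-per-time-step geotiff output.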
For what it's worth, the solution I've had in mind here is to separate output production from `DataRequestHandler`. That was the design I had sketched out when I did the big refactoring a while back, but I didn't get that part done at that time. I think that would be the cleanest design, though. Basically, `DataRequestHandler` shouldn't need to know any details about how to generate various outputs; it only passes raw output data on to the proper components, which handle the actual output generation.
This would be sort of similar to how `DataRequest` is a generic abstraction of a data query, which means the request fulfillment part of the system doesn't have to know anything about how the request is received/initially processed (i.e., the front interface is independent of request processing). Same idea but with outputs instead of inputs, if that makes sense.
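As a rough illustration of that separation (not the actual GeoCDL design), the sketch below has a handler that only gathers raw layers and hands them to an interchangeable writer. All class names here (`OutputWriter`, `PerLayerWriter`, `CombinedWriter`, `RequestHandler`) are hypothetical, and the writers only return file names rather than writing real files.

```python
from abc import ABC, abstractmethod


class OutputWriter(ABC):
    """Hypothetical interface: each output format implements write()."""

    @abstractmethod
    def write(self, layers):
        """Take raw (dataset, variable, date, data) tuples; return file names."""


class PerLayerWriter(OutputWriter):
    """One output file per layer, like the current geotiff case."""

    def write(self, layers):
        return [f"{ds}_{var}_{date}.tif" for ds, var, date, _ in layers]


class CombinedWriter(OutputWriter):
    """One output file per dataset, combining variables and dates,
    as a netcdf writer might."""

    def write(self, layers):
        datasets = sorted({ds for ds, _, _, _ in layers})
        return [f"{ds}.nc" for ds in datasets]


class RequestHandler:
    """Gathers raw data and delegates all output details to the writer."""

    def __init__(self, writer):
        self.writer = writer

    def fulfill(self, datasets, date_list):
        layers = [
            (ds, var, date, None)  # None stands in for the real data
            for ds, variables in datasets.items()
            for var in variables
            for date in date_list
        ]
        return self.writer.write(layers)


request = ({"example_ds": ["ppt", "tmax"]}, ["2020-01", "2020-02"])
per_layer_files = RequestHandler(PerLayerWriter()).fulfill(*request)
combined_files = RequestHandler(CombinedWriter()).fulfill(*request)
```

The handler code is identical for both formats; only the writer passed in decides whether variables and dates end up in separate files or one combined file.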
Based on @stuckyb's comment above, I have split all file output from `DataRequestHandler` into a new `DataRequestOutput`. For all output formats, time is recorded based on `RequestDate` and date granularity, e.g. 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'. Any date formatting in native data files is overwritten (need to note it in metadata though, new issue coming soon).
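The granularity-dependent time labels could look something like the sketch below. It uses the standard library's `datetime.date` as a stand-in for `RequestDate`, whose real interface is not shown in this thread; the granularity names in `GRAIN_FORMATS` and the function `format_request_date` are assumptions for illustration.

```python
from datetime import date

# Hypothetical mapping from granularity name to a strftime pattern.
GRAIN_FORMATS = {"annual": "%Y", "monthly": "%Y-%m", "daily": "%Y-%m-%d"}


def format_request_date(d, grain):
    """Return the time label for a date at the given granularity."""
    return d.strftime(GRAIN_FORMATS[grain])


d = date(2020, 3, 7)
labels = {grain: format_request_date(d, grain) for grain in GRAIN_FORMATS}
```

Deriving the label from the request rather than from the native files gives every output format the same time strings.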
Point data:
`time` column. The file is in long format with columns `time`, `dataset`, `variable`, `value`, `x`, and `y`.
`x` and `y` columns.
`time` dimension.
Raster data:
Right now, we output the results as one file per variable per time increment. That file is a GeoTIFF if it's raster output or a csv if it's point extraction output. It would be good if we could have the option to combine variables and/or time increments in the output. This is particularly so for point output, but it is also a handy option for when we support NetCDF output for rasters as well.
Issue #31 is closely related: if we end up processing stacks of rasters, that makes combining the results more straightforward.