terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Define pipeline for converting bin files to NetCDF/HDF5 data products and transferring from MAC to NCSA #38

Closed (dlebauer closed this issue 8 years ago)

dlebauer commented 8 years ago

Description

The goal is to convert, compress, and efficiently write data from the imaging spectrometers to NetCDF / HDF5 data products. The data product format is specified in terraref/reference-data#14, and I posted a longer discussion of related issues, including data volume, transfer, and bottlenecks, on the terraref website.

Implementation of this feature should be coordinated with, and done in support of, the .bil->.nc and .bil->.hdf BrownDog DTS (Data Tilling Services) conversions described in BrownDog issue 852, assigned to Eugene Roeder.

Based on the discussion to date, this is the current draft of the pipeline:

  1. Data is collected by the spectrometer sensors as .bil + .hdr files
  2. Data is written to a gantry-mounted computer
    • 2 x 1 TB solid-state drives (fast read/write)
    • writes to one drive while reading / sending data from the other
  3. Data goes over a 10 Gigabit line to the MAC server
    • 70 TB cache server installed January 2016
    • 2-week cache + anything needed to support compression
  4. Data goes over a dedicated 1 Gigabit line (available mid-January 2016) to U. Arizona
  5. Globus (?) transfer to Roger
  6. Clowder pipeline triggered

Output file should look like terraref/reference-data#14 (see 'TODO', below)

Example raw data

These are from the HEADWALL sensors; the format is what we expect, but the content is very different from what we will observe.
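
These headers are plain-text ENVI files (key = value pairs, with {}-delimited lists for fields such as wavelength). For orientation only, here is a minimal sketch of reading one into a Python dict; the file name is hypothetical, and production code would more likely go through GDAL or the spectral package:

```python
def read_envi_header(path):
    """Parse an ENVI .hdr file into a dict (sketch; not a full ENVI parser)."""
    fields = {}
    with open(path) as f:
        lines = iter(f.read().splitlines())
    for line in lines:
        if "=" not in line:
            continue  # skip the leading "ENVI" magic line and blanks
        key, _, value = line.partition("=")
        key, value = key.strip().lower(), value.strip()
        if value.startswith("{"):
            # Accumulate a {}-delimited, possibly multi-line, list value.
            while "}" not in value:
                value += " " + next(lines).strip()
            value = [v.strip() for v in value.strip("{} ").split(",") if v.strip()]
        fields[key] = value
    return fields

hdr = read_envi_header("swir_sample.hdr")   # hypothetical file name
print(hdr.get("samples"), hdr.get("lines"), hdr.get("bands"))
```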

dlebauer commented 8 years ago

(from #2)

My inclination is to parse all of this JSON metadata into an attribute tree in the netCDF4/HDF file. The file's level-0 (root) group would contain a level-1 group called "lemnatec_measurement_metadata", which would contain six level-2 groups, "user_given_data"..."measurement_additional_data", and each of those groups would contain group attributes for the fields listed above. We will use the appropriate atomic data type for each of the values encountered, e.g., String for most text, float for 32-bit reals, unsigned byte for boolean, ... Some of the "gantry variable data" (like x,y,z location) will need to be variables, not (or as well as) attributes, so that their time-varying values can be easily manipulated by data processing tools. They may become record variables with time as the unlimited dimension.
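
For concreteness, a minimal sketch of that mapping using the netCDF4 Python library (file and key names are hypothetical; everything is written as a string attribute here, whereas the real product would use the appropriate atomic types):

```python
import json
import netCDF4

def json_to_groups(group, obj):
    """Recursively copy a parsed-JSON object into netCDF groups and attributes."""
    for key, value in obj.items():
        if isinstance(value, dict):
            json_to_groups(group.createGroup(key), value)  # nested object -> subgroup
        else:
            group.setncattr(key, str(value))               # leaf -> group attribute

# Hypothetical file names for illustration only.
with open("lemnatec_metadata.json") as f:
    meta = json.load(f)

nc = netCDF4.Dataset("swir_sample.nc", "w")
# The top-level "lemnatec_measurement_metadata" object becomes the level-1 group,
# and its sub-objects become level-2 groups, as described above.
json_to_groups(nc, meta)
nc.close()
```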

I think you have the right idea in parsing this to attributes, but I will note that the .json files are not designed to meet a standard metadata convention; presumably a CF-compliant file will? Ultimately, we will want them to be compliant with, or interoperable with, an FGDC-endorsed ISO standard (https://www.fgdc.gov/metadata/geospatial-metadata-standards). Does that sound reasonable?

Regarding gantry variable data like x,y,z location and time, I think it would be useful to store these as metadata attributes in addition to either dimensions or variables. When you say 'variables', do you mean to store the single value of x,y,z in the metadata as a set of variables? Ultimately these will be used to define the coordinates of each pixel. This is something that I don't understand well, and I don't know if there is an easy answer. As I understand it, we could transform the images to a flat x,y plane that would allow gridded dimensions, but if we map to x,y,z then they would be treated as variables. I'd appreciate your thoughts on this; if you want to chat offline, let me know.

dlebauer commented 8 years ago

Sorry, the details of the hyperspectral data formats are specified in terraref/reference-data#14.

czender commented 8 years ago

I will treat this (issue #38) as the correct place to discuss development of the pipeline. IMHO, aim to produce a CF-compliant file now. Later map that to whatever ISO flavor you want.

My understanding is that we will receive much of the metadata in JSON format and need to store it in the final product. Once the JSON is in the netCDF file, the JSON file will be redundant. JSON is in no way the final product; it is "just" a useful way of transmitting structured information in key/value syntax.

Putting something (like x, y, z) in both data and metadata is prone to error, because people often manipulate one but not the other. If you want a spatial grid, then the x,y,z information must be present as variables to facilitate spatial hyperslabbing. The same goes for time; it needs to be a variable for the sake of hyperslabbing.
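
A minimal sketch of that layout with the netCDF4 Python library (names and values are illustrative, not the agreed product schema): time is the unlimited record dimension, the gantry x, y, z track is stored as variables along it, and a hyperslab is then an ordinary slice:

```python
import numpy as np
import netCDF4

nc = netCDF4.Dataset("gantry_track.nc", "w")     # hypothetical file name
nc.createDimension("time", None)                 # unlimited (record) dimension
t = nc.createVariable("time", "f8", ("time",))
t.units = "seconds since 2016-01-01 00:00:00"
for name in ("x", "y", "z"):
    nc.createVariable(name, "f4", ("time",)).units = "m"

# Append a few records (values made up for illustration).
t[:] = [0.0, 0.5, 1.0]
nc.variables["x"][:] = [2.10, 2.35, 2.60]
nc.variables["y"][:] = [0.00, 0.00, 0.00]
nc.variables["z"][:] = [1.80, 1.80, 1.80]

# Hyperslab: positions within the first second.
sel = t[:] <= 1.0
xyz = np.column_stack([nc.variables[n][:][sel] for n in ("x", "y", "z")])
nc.close()
```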

dlebauer commented 8 years ago

I see your point, but maybe I wasn't clear and/or am not familiar enough with the technical aspects. I think we need to save the x,y,z location of the camera at the time of capture, which is distinct from the x,y,z of each pixel. The camera position seems like immutable metadata required for 'level 0' products, while downstream projection and calibration will be required to assign coordinates to pixels (as dimensions or variables).

As you suggest, we will likely revise the projection algorithms, and thus the dimensions, over time. So the key will be having enough metadata to support this.

Getting a prototype out for feedback is probably the best way forward.

dlebauer commented 8 years ago

@czender

A few notes

  1. To clarify: above, you proposed passing the JSON tree into netCDF attributes in the same structure of nested key-value pairs. That should make it easy, and flexible as we refine the format, correct?
  2. The basic structure of the sensor metadata files has a good chance of being stable; major changes should mostly be additions of fields (terraref/reference-data#2).
  3. In particular, location data will be passed using GeoJSON, as described by @max-zilla in #2 (see the sketch below).
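
For item 3, a minimal sketch of pulling a GeoJSON point out of the metadata and recording it as ACDD-style global attributes; the fragment below is hypothetical, and the real structure is whatever #2 settles on:

```python
import json
import netCDF4

# Hypothetical metadata fragment; the actual GeoJSON comes from issue #2.
meta = json.loads('{"location": {"type": "Point", "coordinates": [-111.9749, 33.0745]}}')

lon, lat = meta["location"]["coordinates"]
nc = netCDF4.Dataset("swir_sample.nc", "a")      # hypothetical existing file
nc.setncattr("geospatial_lon_min", lon)
nc.setncattr("geospatial_lat_min", lat)
nc.close()
```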

Let me know if you have any questions or want to discuss via phone etc.

ghost commented 8 years ago

@czender Can you use GDAL? Talk to @robkooper.

robkooper commented 8 years ago

@ch1eroe1 has done some work on this as well and is working on some code to add it to BrownDog. I believe we just used gdal_translate to convert the .hdr file to a netCDF file.

dlebauer commented 8 years ago

@ch1eroe1 do you have a link to the code that you wrote? (Or, if it is a one-liner, paste it here.)

@czender It is not necessary to use GDAL, but it will be useful to coordinate with @ch1eroe1, @robkooper, and the BrownDog team. From what I understand, the GDAL tool converts the file type but does not address optimization, handling of metadata, or developing data products.

czender commented 8 years ago

@dlebauer on the clarification

  1. Yes. If we have JSON as input then we will transfer its structure straight into metadata. Changes in structure can be made upstream (to the JSON by Lemnatec). We're assuming they've already grouped the metadata logically.
  2. That's all fine.
  3. Getting the location into a standard form that is simultaneously useful for analysis will require care, and probably some iteration.
czender commented 8 years ago

@robkooper and @ch1eroe1: yes, I just looked, and we can try to use gdal_translate. If you already have a command to convert .hdr files (or similar), please post it with a link to a sample file that it works on, and we will modify that to work on the reference images above. Thanks!

robkooper commented 8 years ago
gdal_translate -of netCDF test_envi_class.envi test_envi_class.nc

This converts a BIL file to netCDF. Specify the BIL file as the first argument; the assumption is that there is a companion file with a .hdr extension. The second argument is the output. The -of netCDF option makes the output netCDF instead of the default GeoTIFF.
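
The same conversion can also be scripted through the GDAL Python bindings (GDAL >= 2.1); a sketch using the same illustrative file names:

```python
from osgeo import gdal

gdal.UseExceptions()
# Equivalent of: gdal_translate -of netCDF test_envi_class.envi test_envi_class.nc
# GDAL finds the companion .hdr file next to the ENVI/BIL input automatically.
gdal.Translate("test_envi_class.nc", "test_envi_class.envi", format="netCDF")
```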

czender commented 8 years ago

Thanks Rob. I have this working now.

dlebauer commented 8 years ago

@czender Sorry about the trouble with the Box link. I've put the SWIR sample files (~600MB) here: http://file-server.igb.illinois.edu/~dlebauer/terraref/

dlebauer commented 8 years ago

@czender: please create additional issues, add documentation / links to scripts in github.com/terraref/documentation (make a new file called hyperspectral_data_pipeline.md or similar), and then close this.

  1. How does this tie into the bigger picture?
    • Where should files land, and where should the outputs go?
    • inputs: /projects/arpae/terraref/raw_data/lemnatec_field/
    • outputs: /projects/arpae/terraref/outputs/lemnatec_field/
  2. How can we speed up and compress? (see the sketch after this list)
  3. Anything else?
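
On item 2, one option worth benchmarking is netCDF4's built-in deflate compression plus chunking along the band dimension; a minimal sketch with made-up dimension sizes:

```python
import numpy as np
import netCDF4

nc = netCDF4.Dataset("swir_compressed.nc", "w")  # hypothetical output file
nc.createDimension("band", 272)                  # sizes are illustrative only
nc.createDimension("y", 1024)
nc.createDimension("x", 384)

# zlib deflate + one chunk per band; complevel trades speed against size.
rad = nc.createVariable("radiance", "u2", ("band", "y", "x"),
                        zlib=True, complevel=1,
                        chunksizes=(1, 1024, 384))
rad[:] = np.zeros((272, 1024, 384), dtype="uint16")   # placeholder data
nc.close()
```
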
czender commented 8 years ago

An alpha version of the pipeline now exists. As requested, I will close this issue and open a new one at terraref/documentation#6.