Solution for creation of new data products that depend on multiple inputs

craig-willis commented 7 years ago

As discussed in the Spring 2017 planning meeting, the current Clowder extractor architecture does not support creating data products that rely on multiple sensor inputs, specifically the sensor fusion output or Roman's machine learning models.

We've discussed a few options:

Collection-level extractors (create a collection from multiple sensors)
External process running on Roger (cron task). This raises the question of whether Clowder offers anything over a set of cron jobs.

Completion Criteria

[x] Design documentation
[x] Stakeholder review
[ ] New implementation tickets

ghost commented 7 years ago

@dlebauer - should the sensor fusion meeting be separate from the machine learning model meeting?

Who is doing sensor fusion besides Solmaz?

ghost commented 7 years ago

David, Max, Craig, Rob and Jeff will meet to talk about this.

dlebauer commented 7 years ago

@craig-willis @max-zilla could you please summarize conclusions from the last meeting and identify next steps?

max-zilla commented 7 years ago

@dlebauer we have not had a formal meeting on this specific topic yet.

The approach I am anticipating is the second one: External process running on Roger (cron task)

...but only using this for complex modules that require, e.g., an entire day of data or several sensors' worth of data. In those cases we would be working against Clowder's capabilities by trying to squeeze that functionality into the existing framework, but that does not compel me to create ALL extractors as cron jobs like this. On-demand triggering for high-dataset-volume sensors like stereoTop is more efficient than 20 different threads of bulk processing triggered on the clock throughout the day, and it preserves users' ability to run or re-run extractors on demand without needing someone "on the inside" to alter job schedules and whatnot.

Still worth a discussion with @craig-willis , @jterstriep , @robkooper about the best way to approach this.

Current candidates for this "Cron Pipeline" flavor:

[ ] Clipping of GeoTIFFs into plot-level images - stitch a day of scans into one large canvas of the field and clip by the plot geometries, perhaps truncating timestamps to "minute-level" precision to combine 2+ images within a plot that might have been taken several seconds apart.
[ ] Sensor fusion combining FLIR, stereoTop, scanner3DTop datasets - will need to associate those either by time, or implement plot-level clipping and use the plot file for each sensor+day.
[x] Hyperspectral extractor, for high memory requirements
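
To make the "minute-level" grouping idea from the clipping candidate concrete, here is a minimal sketch. It is not the pipeline's actual code: the function name, the (plot_id, timestamp, path) tuple layout, and the sample plot IDs and file names are all hypothetical, and it assumes each image has already been assigned to a plot.

```python
from collections import defaultdict
from datetime import datetime


def group_by_plot_and_minute(images):
    """Group image paths by (plot_id, timestamp truncated to the minute).

    `images` is an iterable of (plot_id, iso_timestamp, path) tuples.
    This layout is an assumption made for illustration only.
    """
    groups = defaultdict(list)
    for plot_id, timestamp, path in images:
        # Drop seconds and microseconds so captures a few seconds apart
        # share the same key.
        minute = datetime.fromisoformat(timestamp).replace(second=0, microsecond=0)
        groups[(plot_id, minute)].append(path)
    return dict(groups)


# Hypothetical sample data: three captures over plot-12, one over plot-13.
images = [
    ("plot-12", "2017-04-27T10:15:03", "a.tif"),
    ("plot-12", "2017-04-27T10:15:41", "b.tif"),
    ("plot-12", "2017-04-27T10:16:02", "c.tif"),
    ("plot-13", "2017-04-27T10:15:05", "d.tif"),
]
groups = group_by_plot_and_minute(images)
```

One limitation of plain truncation: two captures taken at 10:15:59 and 10:16:01 land in different groups even though they are two seconds apart, so a real implementation might prefer a time-window comparison instead.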

Based on discussions with @solmazhajmohammadi I'm inclined to accomplish the plot-level clipping first and make our lives easier on the sensor fusion. In issue #265 this will be happening and I'm going to push us to get at least one sample day clipped into plots for each sensor as a starting target.

max-zilla commented 7 years ago

I am going to write a new rule-checker extractor that will delegate incoming files to the appropriate extractors based on whatever rules apply to each extractor. The delegates can be normal extractors, these complex scripts, etc. Rules can be more comprehensive than those of the typical Clowder extraction pipeline. https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/extractors-rulechecker/browse
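
As a rough illustration of the rule-checking idea (not the code in the repository linked above), the sketch below tracks incoming files per day and sensor and reports which delegates should fire. The class, the rule names, the file names, and the two-file threshold are all hypothetical; only the sensor names come from this thread.

```python
from collections import defaultdict

# Sensors named in this thread as inputs to sensor fusion.
FUSION_SENSORS = {"FLIR", "stereoTop", "scanner3DTop"}


class RuleChecker:
    """Hypothetical sketch: decide which delegates to trigger per incoming file."""

    def __init__(self):
        self.rules = {}  # delegate name -> predicate(files_by_sensor)
        # date -> sensor -> set of file names seen so far
        self.files = defaultdict(lambda: defaultdict(set))
        self.fired = set()  # (delegate, date) pairs already triggered

    def add_rule(self, delegate, predicate):
        self.rules[delegate] = predicate

    def on_file(self, sensor, date, filename):
        """Record a file and return the delegates whose rule is newly satisfied."""
        self.files[date][sensor].add(filename)
        triggered = []
        for delegate, predicate in self.rules.items():
            if (delegate, date) not in self.fired and predicate(self.files[date]):
                self.fired.add((delegate, date))  # fire each delegate once per day
                triggered.append(delegate)
        return triggered


def all_fusion_sensors_present(files_by_sensor):
    # True once every fusion sensor has at least one file for the day.
    return FUSION_SENSORS <= set(files_by_sensor)


checker = RuleChecker()
checker.add_rule("sensor_fusion", all_fusion_sensors_present)
# Assumed threshold of 2 files, standing in for "a full day of data".
checker.add_rule(
    "stitch_and_clip",
    lambda files_by_sensor: len(files_by_sensor.get("stereoTop", ())) >= 2,
)

day = "2017-04-27"
results = [
    checker.on_file("FLIR", day, "flir_001.bin"),
    checker.on_file("stereoTop", day, "left.bin"),
    checker.on_file("scanner3DTop", day, "scan.ply"),
    checker.on_file("stereoTop", day, "right.bin"),
]
```

Keeping the rules as plain predicates over the accumulated file sets means a delegate can be a normal per-file extractor or a whole-day script without changing the checker itself.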

max-zilla commented 7 years ago

This extractor is written and basically ready - once we have our stitching & clipping script ready from #265 we will use this to trigger those scripts on full days of data.

terraref / computing-pipeline

Solution for creation of new data products that depend on multiple inputs #248
