terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Complete PlantCV extractor #119

Closed by max-zilla 7 years ago

max-zilla commented 8 years ago

This is a continuation of this discussion: https://github.com/terraref/computing-pipeline/pull/108

Current plan:

This diagram is intended to illustrate the plan: [image]

max-zilla commented 8 years ago

@fwang29 and I tested out her extractor code today, my branch is here: https://github.com/terraref/computing-pipeline/tree/plantcv-clowder-extractor

Should be ready to try deploying at this point.

This requires my PyClowder branch that includes dataset message handling. https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/browse?at=refs%2Fheads%2Fbugfix%2FCATS-554-add-pyclowder-support-for-dataset

max-zilla commented 7 years ago

@fwang29 has a version of this that includes @nfahlgren's average traits.

This morning I am looking at the small number of changes we made to PlantcvIndoorAnalysis. Noah, I'm going to propose a change to your script based on these. Primarily, it will move some command-line-specific components (such as fetching files from Clowder with GET and POSTing results) outside process_sv_images(), so we get generic processing functions that can be called from both the extractor and your command-line script. After that, if the outputs look good, I think v1 of the extractor is ready.

The idea is that we end up with a script that works for both cases but is also essentially the same as what you wrote, so you can update the actual base logic as necessary.
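A minimal sketch of the separation being proposed: the core processing takes local paths and returns results, while Clowder I/O is injected by each entry point. All names here (process_sv_images aside, which is mentioned above) are illustrative, not the actual PlantcvIndoorAnalysis API.

```python
def process_sv_images(vis_path, nir_path):
    """Pure processing: takes local file paths, returns a traits dict.
    No HTTP calls, so both the CLI script and the extractor can reuse it."""
    traits = {}
    # ... PlantCV side-view analysis would happen here ...
    return traits

def run_from_cli(vis_path, nir_path):
    # Command-line entry point: files are already on disk.
    return process_sv_images(vis_path, nir_path)

def run_from_extractor(download, upload, vis_id, nir_id):
    # Extractor entry point: Clowder GET/POST are passed in as callables,
    # keeping the core logic identical across both paths.
    vis_path = download(vis_id)
    nir_path = download(nir_id)
    traits = process_sv_images(vis_path, nir_path)
    upload(vis_id, traits)
    return traits
```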

nfahlgren commented 7 years ago

@max-zilla sounds good to me.

max-zilla commented 7 years ago

@nfahlgren @yanliu-chn OK, just pushed a big-ish update. @fwang29 and I just tested this both by:

Both paths generate the same outputs:

If you want to schedule a 20-minute call to go over the changes, I'm happy to do so. None of the logic in your code has changed:

The only question is: previously the extractor added .TAB files to the dataset for each NIR/VIS image pair. Following your script, Fengling now just writes that content to the file metadata of the respective images, and the only new file is average_traits.csv. Is this acceptable, or should the TAB files be kept?
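A minimal sketch of what attaching traits as file-level metadata (rather than uploading a .TAB file) could look like, assuming a Clowder host and API key; the endpoint path and payload shape are illustrative of Clowder's REST metadata API, not taken from the extractor:

```python
import json
import urllib.request

def metadata_url(host, file_id, key):
    # Clowder-style endpoint for file-level metadata (illustrative).
    return "%s/api/files/%s/metadata?key=%s" % (host.rstrip("/"), file_id, key)

def attach_traits_metadata(host, key, file_id, traits):
    """POST a traits dict as JSON metadata on an existing Clowder file,
    instead of adding a separate .TAB file to the dataset."""
    req = urllib.request.Request(
        metadata_url(host, file_id, key),
        data=json.dumps(traits).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.status
```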

Otherwise testing seems fine. Noah, let me know if you want to discuss; otherwise, if you're willing to confirm it still works the same from the command line, I think this can be ready as version 1.

The main idea here was to make it easy for you, Noah, to update the actual PlantCV logic as needed in your script; the extractor pulls all relevant logic from it rather than copying code or replicating behavior.

Actual extractor code was Fengling's work, I just reorganized things a bit.

nfahlgren commented 7 years ago

@max-zilla @yanliu-chn @fwang29 Looks good, thanks for getting this up and running. The only little thing I see is line 103 in the extractor, where the list looks hard-coded for 10 images. What if this line were something like:

```python
for i in [x for x in range(0, len(file_objs)) if x % 2 == 0]:
```
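An equivalent, slightly simpler form uses range's step argument to visit every even index directly. The sample file_objs list below is hypothetical; it assumes the extractor's list alternates VIS and NIR entries, as the pairing logic implies:

```python
# Hypothetical alternating VIS/NIR list, as the extractor's pairing assumes.
file_objs = ["vis0", "nir0", "vis1", "nir1", "vis2", "nir2"]

# range(start, stop, step) with step=2 replaces the filtering comprehension.
pairs = []
for i in range(0, len(file_objs), 2):
    pairs.append((file_objs[i], file_objs[i + 1]))
```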
max-zilla commented 7 years ago

@nfahlgren I like that, an elegant solution.

That reminded me of one last thing I wanted to do: improve the check_message() logic in the extractor. This is the piece that flags a "file added to dataset" message as relevant or not. I just pushed a small update.
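A hypothetical sketch of what such a check_message() gate might look like; the filename conventions and the "wait until a full snapshot is present" rule are illustrative, not the actual pushed update:

```python
def check_message(filenames):
    """Decide whether a 'file added to dataset' event is worth processing.
    Only fire once the dataset holds at least one VIS and one NIR side-view
    image plus the snapshot info file; otherwise skip and wait for uploads."""
    names = [f.lower() for f in filenames]
    has_vis = any("vis_sv" in n for n in names)
    has_nir = any("nir_sv" in n for n in names)
    has_info = any(n.endswith("snapshotinfo.csv") for n in names)
    return has_vis and has_nir and has_info
```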

I might ask @robkooper to skim this if he has time, but otherwise I think we're ready.

max-zilla commented 7 years ago

Also created a pull request: https://github.com/terraref/computing-pipeline/pull/131

yanliu-chn commented 7 years ago

Sorry for the delay; I was on two consecutive, intense business trips.

I'm trying to deploy the extractor on Nebula. It looks like the newest PlantCV extractor is in the master branch. Two more questions:

Thanks!

nfahlgren commented 7 years ago

@yanliu-chn I think any OpenCV version in the 2.4.x series will work. I know that the 3.x series currently will not.
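Given that constraint, the extractor could fail fast on an unsupported OpenCV at startup. A small sketch (the helper name is hypothetical; the 2.4-only requirement comes from the comment above):

```python
def opencv_supported(version_string):
    """Return True only for OpenCV 2.4.x, which this PlantCV code requires."""
    parts = version_string.split(".")
    return parts[:2] == ["2", "4"]

# At extractor startup one might then do (requires cv2 to be installed):
# import cv2
# if not opencv_supported(cv2.__version__):
#     raise RuntimeError("OpenCV 2.4.x required, found " + cv2.__version__)
```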

yanliu-chn commented 7 years ago

@nfahlgren thanks!

max-zilla commented 7 years ago

@yanliu-chn yes that is the correct PyClowder branch to use.

yanliu-chn commented 7 years ago

@max-zilla @fwang29 did you test the extractor using @nfahlgren's upload script? Could you provide a test snapshot of raw data with the JSON metadata and SnapshotInfo.csv in it? Noah once told me the location of one, but I couldn't find it in the GitHub issues.

fwang29 commented 7 years ago

Yes. And this link should have what you need: https://drive.google.com/file/d/0B-_VMDWHLIgCemwtUl9QTnN2OW8/view


yanliu-chn commented 7 years ago

I have deployed the extractor on Nebula. Refer to #68.

max-zilla commented 7 years ago

I am going to close this issue. The BETYdb piece remains: https://github.com/terraref/computing-pipeline/issues/33

And getting Globus pipeline from Danforth in place: https://github.com/terraref/computing-pipeline/issues/136