terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Panicle detection extractor deployment #485

Open ZongyangLi opened 6 years ago

ZongyangLi commented 6 years ago

We are going to deploy our panicle detection extractor on the Bioinformatics core facility at the Donald Danforth Plant Science Center (HTCondor). I am creating this issue to track the current status and the bugs that are blocking the deployment. The associated code has been uploaded here: https://github.com/terraref/extractors-3dscanner/tree/ZongyangLi-patch-1/panicle_detection

My initial plan (I am happy to make updates if there are better plans) for the pipeline workflow on HTCondor is:

  1. Automatically download 3D laser data from Globus using globus_python_sdk (see the download sketch after this list).
  2. Run a full-day process to generate useful plot-level data.
  3. Submit the data to BETYdb or Geostreams.
  4. Delete the downloaded 3D laser data from HTCondor storage.
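
A minimal sketch of step 1 using the Globus Python SDK's native-app flow. The client ID, endpoint UUIDs, and paths below are placeholders rather than the actual TERRA-REF endpoints, and the scratch destination is an assumption about the HTCondor storage layout:

```python
import globus_sdk

CLIENT_ID = "<native-app-client-id>"           # placeholder
SOURCE_ENDPOINT = "<terraref-endpoint-uuid>"   # placeholder
DEST_ENDPOINT = "<htcondor-endpoint-uuid>"     # placeholder

# Interactive native-app login to obtain a transfer access token.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Login at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Request one full day of 3D scanner data (paths are illustrative).
tdata = globus_sdk.TransferData(tc, SOURCE_ENDPOINT, DEST_ENDPOINT,
                                label="scanner3DTop 2017-06-20")
tdata.add_item("/ua-mac/raw_data/scanner3DTop/2017-06-20/",
               "/scratch/scanner3DTop/2017-06-20/", recursive=True)
task = tc.submit_transfer(tdata)

# Block until the day's data has landed on HTCondor storage.
tc.task_wait(task["task_id"], timeout=3600)
```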

Current status:

  1. The Faster-RCNN framework has been set up correctly on HTCondor, and we have generated example output on the server.
  2. Automatic download of 3D laser data from Globus using globus_python_sdk has been tested on my local desktop.

What is blocking the deployment?

  1. We would like to access plot boundaries from BETYdb using terrautils.betydb.py instead of using hard-coded boundaries for different seasons, but there are still inconsistencies from season to season, including:

    • Subplot boundaries are missing; for example, season 2 does not have any subplot boundaries.
    • KSU data and MAC data are mixed: the query get_site_boundaries(str_date, city="Maricopa") still returns KSU data.
    • Plot boundaries are missing: there should be 864 plots or 1728 subplots, but sometimes only 350 plots are returned (2016-09-27).
    • For the key attribute 'coordinates', the nesting depth differs from date to date. For example, on 2016-09-27 the lat/lng data are stored in ['coordinates'][0][i] (i from 0 to 3), but on 2017-04-27 they are stored in ['coordinates'][0][0][i], so we need to go one more layer into the attribute (a normalization sketch follows this list).
  2. We need to determine what output data is needed. We are now able to generate plot-level panicle counts, average or median panicle volume, density, and 2D area.
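
A small sketch (not the extractor's actual parsing code) of how the varying 'coordinates' nesting depth could be normalized, so that the 2016-09-27-style and 2017-04-27-style boundaries are handled the same way:

```python
def outer_ring(boundary_geojson):
    """Return the list of [lng, lat] vertices of a plot's outer ring."""
    ring = boundary_geojson['coordinates'][0]
    # Descend one extra level whenever the first element is still a list of
    # points rather than a single [lng, lat] pair (the deeper 2017-04-27 case).
    while isinstance(ring[0][0], (list, tuple)):
        ring = ring[0]
    return ring
```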

@dlebauer @max-zilla I remember we discussed the inconsistent data from the BETYdb boundaries, and I tried to work around it in my parsing code, but it keeps producing more and more unexpected values. I think we really need to figure it out this time.

dlebauer commented 6 years ago

@ZongyangLi Last time we looked into the boundary issues the problem was in terrautils, so I will defer to @max-zilla first ... I wouldn’t be surprised if it is in BETYdb, but it would be easier to debug if we start with either the specific API calls or the analogous SQL queries that are generating the unexpected data.

With respect to not having subplots in season 2, that is because they were never generated... you can see some discussion here: https://github.com/terraref/reference-data/issues/194#issuecomment-33368094. Are there different numbers on other dates? It is good to point out this inconsistency, though I don’t think it is an error; rather, it is an artifact of working out how we were analyzing data in the first season. Therefore, I wouldn’t worry that there are only 350 plots on 2016-09-27 unless there is a very different number of plots on an adjacent date or within the season.

With respect to subplots, they are primarily useful when training or validation data are collected at the sub-plot level. However, I think it would be sensible (though not necessary) to exclude them from the automated pipeline by default.

ZongyangLi commented 6 years ago

@dlebauer I did another test of the boundaries after season 2, and I am happy to report that they are consistent; all of them have subplot information. I will continue the deployment and skip season 2.

With respect to trait type, I am going to generate panicle counts in the first run. The planned trait fields would be: 'local_datetime', 'panicle_counting', 'access_level', 'species', 'site', 'citation_author', 'citation_year', 'citation_title', 'method'. Please let me know if this makes sense.
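
An illustrative record using the proposed fields; all values below are made up for the example (plot name, count, citation details), and should be checked against what BETYdb actually expects before submission:

```python
example_trait = {
    'local_datetime': '2017-06-20T12:00:00',
    'panicle_counting': 42,                     # plot-level panicle count (example value)
    'access_level': '2',
    'species': 'Sorghum bicolor',
    'site': 'MAC Field Scanner Season 4 Range 10 Column 5',  # placeholder plot name
    'citation_author': 'ZongyangLi',
    'citation_year': '2018',
    'citation_title': 'Maricopa Field Station Data and Metadata',
    'method': 'Scanner 3D ply data to panicle count',
}
```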

Another question is about target dates. Since the part of the sorghum season with panicles is quite short (compared to the whole year), I think we don't have to run the pipeline for the whole year. Maybe the middle and late sorghum season would be enough, for example season 4 from 2017-06-20 to 2017-09-11, and the current season from 2018-06-20 to the end of the season? Could someone confirm this is a reasonable time range for sorghum to have panicles? @NewcombMaria
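
A hypothetical filter for restricting processing to those windows; the dates come from this comment and still need confirmation, and the end of the current season is left open (None) because it is not known yet:

```python
from datetime import date

PANICLE_WINDOWS = [
    (date(2017, 6, 20), date(2017, 9, 11)),  # season 4 (proposed)
    (date(2018, 6, 20), None),               # current season, end date TBD
]

def in_panicle_window(scan_date):
    """True if scan_date falls inside any proposed processing window."""
    return any(start <= scan_date and (end is None or scan_date <= end)
               for start, end in PANICLE_WINDOWS)
```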

The last question is about the Danforth HTCondor cluster: I would like to make sure that the download-analyze-delete workflow won't cause any damage to the storage or the service. @nfahlgren

nfahlgren commented 6 years ago

@ZongyangLi there's no harm in temporarily downloading the data to analyze it and then deleting it. I'm happy to look over your job setup when you are ready.

dlebauer commented 6 years ago

> there's no harm in temporarily downloading the data to analyze it and then deleting it

This is the way we will handle the new distributed workflow for cases like this where our host system doesn’t have the resources (e.g. for reprocessing or in this case GPUs).

Development of the pipeline is currently a work in progress (see https://github.com/terraref/computing-pipeline/issues/473), but it is worth checking out and feedback is welcome.

nfahlgren commented 6 years ago

@dlebauer nice, I haven't used Pegasus workflows myself yet, just DAG with condor. I was looking into Parsl (http://parsl-project.org/) and Nextflow (https://www.nextflow.io/) recently for use with PlantCV. Just to throw those out there.

ZongyangLi commented 6 years ago

@dlebauer Here is an example output csv file from the extractor: https://drive.google.com/open?id=1SLQeyhlMeaFou5_BHXkljmwNHkveUFV9

Could you please take a look and check whether all the records in the file are suitable?

@nfahlgren Last time we talked about the automated Globus transfer on the server; I may still need your help to finish this, since my current code is based on a desktop implementation.

dlebauer commented 6 years ago

@ZongyangLi yes, that should work once you have added the new variable and method to the appropriate tables; then I would test it out against the database.

And I would use a variable name like "panicle_number" with a 'standard name' of "number_of_panicles_per_unit_area" (the latter is consistent with the CF standard names and with units of 'm^-2').
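
A sketch of the variable metadata implied by this suggestion; whether these exact keys correspond to columns in BETYdb's variables table is an assumption and should be checked when the variable is actually added:

```python
PANICLE_VARIABLE = {
    'name': 'panicle_number',
    'standard_name': 'number_of_panicles_per_unit_area',  # CF-style standard name
    'units': 'm-2',                                        # panicles per square metre
}
```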