terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Insert derived traits and meta-data from PlantCV pipeline into BETYdb #33

Closed: dlebauer closed this issue 8 years ago

dlebauer commented 8 years ago

After phenotypes are extracted by the PlantCV pipeline, plant-level traits should be inserted into BETYdb. Traits are described in Fahlgren et al. 2015.

Meta-data:

max-zilla commented 8 years ago

Some related discussion in #36 regarding metadata from the LemnaTec source into BETYdb.

dlebauer commented 8 years ago

@gsrohde

Noah defined the output csv file in terraref/computing-pipeline/issues/31#issuecomment-215169871 as:

plant_id,datetime,treatment,genotype,group,date,dap,solidity,outlier,sv_area,tv_area,extent_x,extent_y,height_above_bound,fw_biomass,dw_biomass,height_width_ratio,tiller_count,wue
Dp2AB000127,1386431350,0,B100,B100-0,2013-12-07 09:49:10,11.4091435185185,0.478404862032,False,2086.72707778917,2955.32688355849,1.98029381595354,1.26862572584524,1.26862572584524,-0.192056152391855,-0.0399180898450307,0.640625,3.81555081933358,11.465533394446

which can be mapped to BETYdb as:

plant_id --> entity
datetime
treatment
genotype --> cultivar
group --> ??
date -->
dap --> managements (a single entry w/ planting_date = date - days(dap))
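
A worked example of that planting-date arithmetic, using the sample row above (a sketch, not pipeline code):

from datetime import datetime, timedelta

# Values from the sample row: date and days-after-planting (dap)
date = datetime(2013, 12, 7, 9, 49, 10)
dap = 11.4091435185185

# managements entry: planting_date = date - days(dap)
planting_date = date - timedelta(days=dap)
print(planting_date)  # ~2013-11-26 00:00:00, i.e. midnight of the planting day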

the following are traits:

solidity,outlier,sv_area,tv_area,extent_x,extent_y,height_above_bound,fw_biomass,dw_biomass,height_width_ratio,tiller_count,wue

@nfahlgren can you please define these traits and the methods used to compute them? I've made a spreadsheet that you can edit here: https://docs.google.com/spreadsheets/d/13AiwuzlFNBxV1uEXnkQ4aEg4r5WEbxRFTbocxlISL00/edit?usp=sharing

variables --> name, units, definition
methods --> name, description, citation

nfahlgren commented 8 years ago

@dlebauer @gsrohde I filled out the table more; can you guys take a look and let me know what you think?

dlebauer commented 8 years ago

@nfahlgren can you send a sample csv file + metadata to Scott?

nfahlgren commented 8 years ago

Is this accessible to you guys? http://141.142.209.122/clowder/files/574d17f3e4b0efbe2dc4adf1

max-zilla commented 8 years ago

@nfahlgren after I logged in, I was able to download the file. They'll need an account to download.

nfahlgren commented 8 years ago

Here's the contents of the file also:

plant_barcode,genotype,treatment,imagedate,sv_area,tv_area,hull_area,solidity,height,perimeter
Fp001AA006740-L,BTx642,100%: 217 ml water (47.6% VWC),2014-06-23 16:55:57.625,144539.25,285574.0,1330688.875,0.116381346163,1665,16802.1037833

nfahlgren commented 8 years ago

@gsrohde: for the pilot experiment, the genotypes included were:

BTx623
BTx642
Tx7000
Tx430

The treatments were:

100%: 217 ml water (47.6% VWC)
80%: 173.6 ml water (37.5% VWC)
60%: 130.2 ml water (27.3% VWC)
40%: 86.8 ml water (17.2% VWC)

gsrohde commented 8 years ago

@nfahlgren

Below I’ve listed the column headings used in the sample you sent, together with the sample value you provided.

After each key-value pair, I list the column heading that should be used (if known) along with some further discussion.

After this, I discuss what other information must be included and what other information you may wish to include.

(@dlebauer Please read this over and see if there are other considerations I have neglected.)

Column List

plant_barcode: Fp001AA006740-L

I’m assuming there is a functional dependency whereby genotype determines barcode. Thus, this column need not be included and will be ignored if it is. (Currently, the API allows extraneous columns. This may change, or at least a warning may be returned if extraneous columns are included.)

genotype: BTx642

Use cultivar. Each value in this column must match some value of the name column in the cultivars table.

treatment: 100%: 217 ml water (47.6% VWC)

The column name is correct. Each value must exactly match some value of the name column in the treatments table.

imagedate: 2014-06-23 16:55:57.625

Use utc_datetime if the time is given in UTC; use local_datetime if the time is given in local (site) time. In the former case, the letter Z must appear directly following the last digit of the time (no space). In the latter case, a site column must be included, giving the name of the site at which the data was collected, and that site must have a value in the column time_zone. In both cases, the space between the date and the time in each column value must be replaced by the letter T. (If this requirement proves too burdensome, I will consider changing the API to allow a space here.)
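
The sample imagedate above would therefore need reformatting; a minimal sketch (string handling only, as illustration rather than extractor code):

# Sample value from the CSV
imagedate = "2014-06-23 16:55:57.625"

# Local (site) time: replace the space with "T"; a site column must also be present
local_datetime = imagedate.replace(" ", "T")   # "2014-06-23T16:55:57.625"

# UTC: the letter Z must directly follow the last digit of the time
utc_datetime = local_datetime + "Z"            # only valid if the time really is UTC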

sv_area: 144539.25, tv_area: 285574.0, hull_area: 1330688.875, solidity: 0.116381346163, height: 1665, perimeter: 16802.1037833

I’m assuming these six columns are all either names of trait variables or names of trait covariates. In order to be recognized, the column heading must exactly match the value of the name column in some row of the variables table AND must appear as either a trait or a covariate in the trait_covariate_associations table. The values must be in the units specified in the variables table.
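
One way to verify headings before uploading is to query the database directly; a hedged sketch, assuming psycopg2 and direct SQL access (trait_variable_id is confirmed later in this thread, while covariate_variable_id is a guessed column name):

import psycopg2

headings = ["sv_area", "tv_area", "hull_area", "solidity", "height", "perimeter"]

conn = psycopg2.connect(dbname="bety")  # hypothetical connection settings
cur = conn.cursor()
cur.execute("""
    SELECT v.name
    FROM variables v
    JOIN trait_covariate_associations tca
      ON v.id IN (tca.trait_variable_id, tca.covariate_variable_id)
    WHERE v.name = ANY(%s)
""", (headings,))
recognized = {row[0] for row in cur.fetchall()}
print("Not recognized:", set(headings) - recognized)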

Required additional columns

Optional additional columns

Note that I am considering changing the API to allow values that are constant for all of the data in the CSV file to be specified in the API URL's query string so that it needn’t be repeated in every row of the table. For example, if I implement this, a user could specify that all the data in the file is for Sorghum bicolor by including species=Sorghum+bicolor in the query string instead of including a species column in the CSV file having the value “Sorghum bicolor” in every single row. I’m also considering setting a default value for access_level so that it doesn’t have to be specified if the default value is acceptable.
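
If that change were implemented, a request might look like the following (hypothetical; this is not the current API):

POST <host>/api/beta/traits.csv?species=Sorghum+bicolor&access_level=2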

dlebauer commented 8 years ago

@gsrohde I've added the citation, site, treatments, and cultivars to terraref.ncsa.illinois.edu/bety

The sitename is "Danforth Plant Science Center Bellweather Phenotyping Facility"

@nfahlgren please sign up for an account on that site and check that these are okay, or edit them to add additional information:

Presumably these are all Sorghum bicolor (?)

dlebauer commented 8 years ago

@gsrohde please add sv_area, tv_area, hull_area, solidity, height, and perimeter to the variables table. They are defined in this google doc.

gsrohde commented 8 years ago

@dlebauer

Done. I mapped columns as follows:

name --> name
units --> units
definition --> description
comments --> notes
min_value --> min
max_value --> max (set to +Infinity if not given)

I left type blank, since the table column doesn't refer to the data type as the Google doc table does.

I changed cm^2 to cm2 in line with CF Guidelines.

I didn't fill in standard_name or standard_units.

I left the units for solidity blank, but perhaps we should use "fraction" as we do elsewhere. Also, even though min_value and max_value are left blank in the table, I assumed they should be 0 and 1, respectively.

nfahlgren commented 8 years ago

@dlebauer I signed up for a BETYdb account. The information you added looks good. I will have to check on the additional information about the sorghum varieties used in the pilot.

@gsrohde you are correct about plant_barcode, we don't need it in the database (redundant with cultivar and treatment). For the column mappings above (e.g. genotype => cultivar), do we need to change the nomenclature in Clowder or is this just how they are being mapped? Happy to change it on the Clowder end.

@gsrohde the current imagedate is in local_datetime, so we would need to add the timezone information to the Danforth Center site entry. If UTC is preferred I can have the extractor convert it.

gsrohde commented 8 years ago

@nfahlgren The plant barcode could go in the database (in the cultivar notes) if desired, though it looks like David hasn't done this as yet. My point was that it doesn't need to be in the CSV since it needs only to be entered once for each cultivar.

I think I would prefer keeping the name of the column heading "cultivar" since I imagine for some users of the API, this name will seem more apt. I could make "genotype" a synonym and accept either if this seems preferable to changing it on the Clowder end.

Local datetime is fine; there's no need to convert it.

gsrohde commented 8 years ago

@dlebauer Is there a revised CSV file that is ready for me to upload?

dlebauer commented 8 years ago

@nfahlgren has the format changed since https://github.com/terraref/computing-pipeline/issues/33#issuecomment-228118036 ?

nfahlgren commented 8 years ago

Not yet, but I think I need to update some column names to better match the BETYdb API.


dlebauer commented 8 years ago

@nfahlgren I'm not sure which column names you are referring to, but the most important thing about the column names is that they don't change.

In principle, the output of PlantCV can be independent of where it is going. Any code that is BETYdb-specific can be executed independently of PlantCV.

It would be great if all extractors wrote to a standard format, but we don't currently have one (for traits).

nfahlgren commented 8 years ago

That works too, if it's best to leave them alone and have a translator instead.


ghost commented 8 years ago

Dependent on the extractor pull request.

ghost commented 8 years ago

@max-zilla: is there an extractor for meta-data in place yet? Issue #? @dlebauer: who should write the translator for the files?

max-zilla commented 8 years ago

@nfahlgren @rachelshekar I just merged the extractor pull request. I wasn't sure if the discussion about PlantCV metadata mapping was finalized, but I'll start on this first thing next week if I don't hear otherwise.

max-zilla commented 8 years ago

@nfahlgren @rachelshekar @dlebauer starting work on this under this branch: https://github.com/terraref/computing-pipeline/tree/plantcv_bety_extractor

edit: Actually, this may not need to be a separate extractor... it sounds like the basic step is simply to upload the output CSV from the PlantCV extractor to the BETYdb upload endpoint. This could be done in the PlantCV extractor at the same time the CSV is generated for upload into Clowder.

@gsrohde @nfahlgren can we think of reasons why we'd want the BETY submission piece separate from the PlantCV processing piece? I'm not coming up with any.

max-zilla commented 8 years ago

After discussion with @gsrohde and @dlebauer it sounds like a generic BETYdb extractor is preferable. This could check for any CSV files or metadata that match some criteria (e.g. recognized column names) to insert into BETYdb. We can start very specific (like looking for avg_traits.csv from @nfahlgren's PlantCV extractor) and generalize as we go forward.

Some comments from email traffic with Scott: Here’s a summary of what’s needed that might be more succinct and digestible:

Heading name changes:

imagedate --> local_datetime (or perhaps utc_datetime if that's what it is)
genotype --> cultivar
plant_barcode --> entity: Per conversation with David this morning, this will become the name for the entity associated with the row. I will need to implement handling of this. Currently, an anonymous entity is generated for each row in order to group trait measurements made at the same time on the same plant or plant group.

Missing required columns:

access_level

Missing recommended columns:

species
site (This is required if you want to use local time for the datetime column.)

Other columns we may wish to include:

citation_doi OR citation_author, citation_year, and citation_title

“treatment” is a recognized column name, and according to David, all of the other columns match bona fide trait variable names in the target database, so they are all fine. The CSV upload machinery takes care of looking up metadata ids, so you don’t need to do this in the extractor.
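
Putting the renames and additions together, a revised version of the sample CSV might look like this (a sketch; the access_level, species, and site values are illustrative, and the optional citation columns are omitted):

entity,cultivar,treatment,local_datetime,sv_area,tv_area,hull_area,solidity,height,perimeter,access_level,species,site
Fp001AA006740-L,BTx642,100%: 217 ml water (47.6% VWC),2014-06-23T16:55:57.625,144539.25,285574.0,1330688.875,0.116381346163,1665,16802.1037833,2,Sorghum bicolor,Danforth Plant Science Center Bellweather Phenotyping Facility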

Since it seems likely that access_level, species, and site (and citation*) will have the same value for every row in the table, it might be worthwhile implementing a way of specifying this globally per file. (Note that there already is a way of specifying metadata globally for data submitted in JSON or XML format, so I don't anticipate that extending this to CSV would be too difficult.) For access_level, we could instead or in addition give it a default value so that it wouldn't need to be included if the default is acceptable. David, do you have an opinion about this? On the other hand, since it should be easy to generate the missing columns in the extractor (or elsewhere), perhaps this isn't an important feature.

robkooper commented 8 years ago

I would love to work with Luigi who has written some code to easily create a mapping between a CSV file and other names.

max-zilla commented 8 years ago

@gsrohde @nfahlgren I am modifying the PlantcvClowderIndoorAnalysis script slightly based on Scott's comments above.

dlebauer commented 8 years ago

access_level is 2; this means the data is for internal collaborators.

On Mon, Aug 22, 2016, Max Burnette wrote:

@gsrohde @nfahlgren I am modifying the PlantcvClowderIndoorAnalysis script slightly based on Scott's comments above.

  • renamed genotype, plant_barcode columns
  • ready to rename imagedate - @nfahlgren, do you know if you're writing UTC or local time?
  • what is a suitable default value for access_level?
  • species and site we could hardcode for now, if they already exist in betydb?

nfahlgren commented 8 years ago

@max-zilla it's local time. @dlebauer added the site (https://terraref.ncsa.illinois.edu/bety/sites/6000000866) and species (https://terraref.ncsa.illinois.edu/bety/species/2588) to our BETYdb instance.

gsrohde commented 8 years ago

@max-zilla Note that the columns site.sitename and species.scientificname are the effective keys that are used when looking up a match for the corresponding data in the CSV file (not the id columns), so in this case, the CSV file should have Danforth Plant Science Center Bellweather Phenotyping Facility in the site column and Sorghum bicolor in the species column. (If and when we use JSON or XML input files, there is more flexibility in the way lookups are done.)

max-zilla commented 8 years ago

OK, my branch is here: https://github.com/terraref/computing-pipeline/compare/plantcv_extractor_updates

This changes the fields/default values as suggested above. The only remaining change, I believe, is to add a config parameter with a BETYdb URL which, if given, will be used to POST the CSV into BETY.

@gsrohde do we know the URL/API endpoint to use already?

gsrohde commented 8 years ago

The URL for the API endpoint is <host>/api/beta/traits.csv where I think the host will be https://terraref.ncsa.illinois.edu/bety/ (that's the host that Noah references above). You need to set the API key parameter in a query string (?key=...). See the documentation at https://pecan.gitbooks.io/betydbdoc-dataentry/content/trait_insertion_api.html. One thing that isn't (yet) mentioned there is that if you are posting a CSV file, you should explicitly set the content type in the post request. For example, if you are using curl, add the option -H "Content-Type: text/csv". (Many files will work without this, but if the file happens to include a % character, things are likely to go horribly awry since it will likely default to Content-Type: application/x-www-form-urlencoded and the % character will be interpreted as the beginning of a URL escape sequence.)
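
For illustration, the equivalent POST in Python with the requests library (a sketch, not the extractor code; the key and filename are placeholders):

import requests

url = "https://terraref.ncsa.illinois.edu/bety/api/beta/traits.csv"
params = {"key": "<omitted>"}
# Set the content type explicitly so "%" characters in the CSV are not
# misinterpreted as URL escape sequences.
headers = {"Content-Type": "text/csv"}

with open("avg_traits.csv", "rb") as f:  # equivalent to curl --data-binary
    response = requests.post(url, params=params, headers=headers, data=f)

print(response.status_code, response.text)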

Please let me know before you actually do a post to a real site. You are of course free (and encouraged) to post to any test copy. (You will have to put the relevant metadata in place first for the post to succeed.)

max-zilla commented 8 years ago

OK. I've added BETYdb uploading functionality to the branch I listed shortly above this one. Have not yet tested; will create a pull request once I do.

max-zilla commented 8 years ago

Pull request created: https://github.com/terraref/computing-pipeline/pull/154

@gsrohde, is there a test copy of Bety that already has the relevant metadata to test this, and if not can you briefly list which metadata needs to be in place first? Thanks.

dlebauer commented 8 years ago

The only way to make a test copy would be to dump bety6.

gsrohde commented 8 years ago

So shall I do this? Perhaps set it up on pecandev? All that's really needed is to dump the rows that will be referred to in the CSV file(s).

What is the terraref bety-test site for? Could that be used?

max-zilla commented 8 years ago

I think this is one of the primary reasons for the terraref bety-test site. Scott, if you're willing, you can do the dump+upload faster than I could.

dlebauer commented 8 years ago

bety-test is not just for development; it also provides simulated data so that people can test algorithms against 5 years of high-resolution data, etc.

So no, please don't use it. It's slow anyway.

But you could create the dump as a 'bety-dev' database on the same bety6 server.

gsrohde commented 8 years ago

@max-zilla I think I've copied enough data over to pecandev so you can use it to test. The URL is http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv.

max-zilla commented 8 years ago

@gsrohde thanks! I made an account and I'm testing things out now.

My first POST (screenshot omitted) returned:

"errors": "No trait variable was found in the CSV file."

I can see the columns in the Variables table (sv_area, etc.), but I'm not sure what else I might need to add. You had mentioned:

In order to be recognized, the column heading must exactly match the value of the name column in some row of the variables table AND must appear as either a trait or a covariate in the trait_covariate_associations table. The values must be in the units specified in the variables table.

In the Bety UI I see Traits and Covariates under the Data menu - should there already be associations for these variables? If not, can you briefly suggest what I should create, and whether the extractor should be responsible for creating traits/covariates as part of the upload if necessary?

gsrohde commented 8 years ago

@max-zilla Sorry I'm just getting back to you. I was out of town yesterday.

I'm not sure why you are getting this error. I see that, for example, sv_area is both the name of a variable in the variables table and that its id matches the value of trait_variable_id in the trait_covariate_associations table. (You can see these associations in the Bety UI by clicking the Bulk Upload menu item and then clicking View List of Recognized Traits.)

If you send me a text version of the CSV file you used, I can try this myself and debug if necessary.

max-zilla commented 8 years ago

no worries @gsrohde

I've attached avg_traits.csv (renamed to .txt so I could attach). I was using this to POST with my user key:

POST http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv?key=<omitted>
Content-Type: text/csv

...with CSV file attached.

avg_traits.txt

gsrohde commented 8 years ago

I was able to insert these traits using this curl command:

curl -H "Content-Type: text/csv" -X POST --data-binary @avg_traits.csv "http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv?key=<omitted>"

but not before removing duplicate "height" variables from the database and editing the CSV file to change the date to 2014-06-23T16:55:57.625 (that is, replacing the space between the date and time with the letter "T"; see my first comment above from July 5).

I'm a little baffled by the error you got. Was that the full response string?

max-zilla commented 8 years ago

That's helpful. I'll try with those changes as a starting point - I believe the "T" should be in the PlantCV output timestamp, but I had pasted that into the CSV from elsewhere to test and I didn't include it. The --data-binary flag might also be significant.

The full response was a 400:

{
  "metadata": {
    "URI": "http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv?key=GPNT0AIJleKIIxmj087WCmzAqkp1brY5jQwOrhvQ",
    "timestamp": "2016-08-30T14:32:48-05:00"
  },
  "errors": "No trait variable was found in the CSV file."
}

max-zilla commented 8 years ago

@gsrohde OK, it worked! I got the same 'no trait variable' error using Postman, but it was defaulting to 'form data' - after switching to binary and adjusting the timestamp, I got a 201 Created response.

This will be a slight adjustment to the extractor, but then we should be in good shape. Thanks.

max-zilla commented 8 years ago

Updated the code, will test the full stack and merge. https://github.com/terraref/computing-pipeline/pull/154

gsrohde commented 8 years ago

@max-zilla I only just now remembered that I was supposed to implement handling of the entity column. Currently, an anonymous entity is created for each row, and the traits in that row are associated with it. As I mentioned in my e-mail to you (quoted above in your note from Aug 4), we decided to call the column containing plant bar codes entity and use whatever name is there for the name of the entity.

@dlebauer I'm not clear exactly how this should work. Should there only be one entity for each bar code? Or should things work more like they do now, where each row in an uploaded file gets its own entity? There are no uniqueness constraints on the entities table, and as far as I can tell, none were ever planned, so there is no reason multiple entity rows can't use the same name. Moreover, it's always been my understanding that trait measurements would share a common entity only if they were made on the same plant part, plant, plant stand, etc. and were made at the same time.

dlebauer commented 8 years ago

@gsrohde

short answer: For the case of the PlantCV data, please use select id from entities where name = 'barcode' to identify the entity_id in the traits table.

explanation: Entities do not have to be at the same time. The idea is that they capture a level of replication that is of interest (plot, plant, organ, etc.) and allow a user to correlate multiple observations on the same 'entity'. The entities table has a field called 'parent_id'. This was designed to allow hierarchical nesting, e.g. of a leaf on a plant in a plot. So if the data to be inserted contains an identifier for entity, it would make sense to check whether the entity already exists and, if it does, use the existing entity. Since we are capturing time in the traits table, we don't need a new entity for each time point.

Once we get to measuring a specific leaf and tracking it through time, it would make sense to create a new entity to track the leaf, with a parent_id that links to the plant.
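
A sketch of that lookup-or-create step at insert time (assumes direct SQL access via psycopg2; the helper name and connection details are hypothetical):

import psycopg2

def entity_id_for(cur, barcode):
    # Reuse an existing entity with this name if one exists
    cur.execute("SELECT id FROM entities WHERE name = %s", (barcode,))
    row = cur.fetchone()
    if row:
        return row[0]
    # Otherwise create one; parent_id could later link e.g. a leaf to its plant
    cur.execute("INSERT INTO entities (name) VALUES (%s) RETURNING id", (barcode,))
    return cur.fetchone()[0]

conn = psycopg2.connect(dbname="bety")  # hypothetical connection settings
cur = conn.cursor()
entity_id = entity_id_for(cur, "Fp001AA006740-L")  # stored as traits.entity_id
conn.commit()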

max-zilla commented 8 years ago

Just tested the full stack:

Discovered this morning that the extractor wasn't properly writing attributes from the file metadata (barcode, genotype, treatment, local_datetime), so the upload was failing. I've fixed the extractor and I'm going to merge my pull request. I think we can consider this task completed.

This will be deployed as discussed in https://github.com/terraref/computing-pipeline/issues/147.

max-zilla commented 8 years ago

@gsrohde @dlebauer sorry, I didn't see your new comments from yesterday. Am I OK to close this? Is that update something on the BETY side?

gsrohde commented 8 years ago

@max-zilla I put the "entities" bit into a task in https://github.com/terraref/computing-pipeline/issues/124, so I think it's OK to close this.