Some related discussion in #36 regarding getting metadata from the LemnaTec source into BETYdb.
@gsrohde
Noah defined the output csv file in terraref/computing-pipeline/issues/31#issuecomment-215169871 as:
plant_id,datetime,treatment,genotype,group,date,dap,solidity,outlier,sv_area,tv_area,extent_x,extent_y,height_above_bound,fw_biomass,dw_biomass,height_width_ratio,tiller_count,wue
Dp2AB000127,1386431350,0,B100,B100-0,2013-12-07 09:49:10,11.4091435185185,0.478404862032,False,2086.72707778917,2955.32688355849,1.98029381595354,1.26862572584524,1.26862572584524,-0.192056152391855,-0.0399180898450307,0.640625,3.81555081933358,11.465533394446
which can be mapped to BETYdb as:
plant_id --> entity
datetime -->
treatment -->
genotype --> cultivar
group --> ??
date -->
dap --> managements (a single entry w/ planting_date = date - days(dap); see the sketch below)
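For concreteness, a minimal Python sketch of the planting_date derivation above (the function name and date format are assumptions based on the sample row):

from datetime import datetime, timedelta

def planting_date(date_str, dap):
    """Derive planting_date = date - days(dap), per the mapping above."""
    d = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    return (d - timedelta(days=dap)).date().isoformat()

# e.g. planting_date("2013-12-07 09:49:10", 11.4091435185185) -> "2013-11-25"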
the following are traits:
solidity,outlier,sv_area,tv_area,extent_x,extent_y,height_above_bound,fw_biomass,dw_biomass,height_width_ratio,tiller_count,wue
@nfahlgren can you please define these traits and the methods used to compute them? I've made a spreadsheet that you can edit here: https://docs.google.com/spreadsheets/d/13AiwuzlFNBxV1uEXnkQ4aEg4r5WEbxRFTbocxlISL00/edit?usp=sharing
variables --> name, units, definition
methods --> name, description, citation
@dlebauer @gsrohde I filled out the table more, can you guys take a look and let me know what you think?
@nfahlgren can you send a sample csv file + metadata to Scott?
Is this accessible to you guys? http://141.142.209.122/clowder/files/574d17f3e4b0efbe2dc4adf1
@nfahlgren after I logged in, I was able to download the file. They'll need an account to download.
Here's the contents of the file also:
plant_barcode,genotype,treatment,imagedate,sv_area,tv_area,hull_area,solidity,height,perimeter
Fp001AA006740-L,BTx642,100%: 217 ml water (47.6% VWC),2014-06-23 16:55:57.625,144539.25,285574.0,1330688.875,0.116381346163,1665,16802.1037833
@gsrohde: for the pilot experiment, the genotypes included were:
BTx623
BTx642
Tx7000
Tx430
The treatments were:
100%: 217 ml water (47.6% VWC)
80%: 173.6 ml water (37.5% VWC)
60%: 130.2 ml water (27.3% VWC)
40%: 86.8 ml water (17.2% VWC)
@nfahlgren
Below I’ve listed the column headings used in the sample you sent, together with the sample value you provided.
After each key-value pair, I list the column heading that should be used (if known) along with some further discussion.
After this, I discuss what other information must be included and what other information you may wish to include.
(@dlebauer Please read this over and see if there are other considerations I have neglected.)
plant_barcode (sample value: Fp001AA006740-L): I’m assuming there is a functional dependency whereby genotype determines barcode. Thus, this column need not be included and will be ignored if it is. (Currently, the API allows extraneous columns. This may change, or at least a warning may be returned if extraneous columns are included.)
genotype (sample value: BTx642): Use cultivar. Each value in this column must match some value of the name column in the cultivars table.
treatment (sample value: 100%: 217 ml water (47.6% VWC)): The column name is correct. Each value must exactly match some value of the name column in the treatments table.
imagedate (sample value: 2014-06-23 16:55:57.625): Use utc_datetime if the time is given in UTC; use local_datetime if the time is given in local (site) time. In the former case, the letter Z must appear directly following the last digit of the time (no space). In the latter case, a site column must be included, giving the name of the site at which the data was collected, and that site must have a value in the time_zone column. In both cases, the space between the date and the time in each column value must be replaced by the letter T. (If this requirement proves too burdensome, I will consider changing the API to allow a space here.)
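That datetime normalization is mechanical; a one-function Python sketch (the function name is illustrative):

def to_bety_datetime(ts, utc=False):
    """Replace the space between date and time with 'T';
    append 'Z' directly after the last digit when the time is UTC."""
    iso = ts.replace(" ", "T")
    return iso + "Z" if utc else iso

# to_bety_datetime("2014-06-23 16:55:57.625")       -> "2014-06-23T16:55:57.625"
# to_bety_datetime("2014-06-23 16:55:57.625", True) -> "2014-06-23T16:55:57.625Z"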
sv_area, tv_area, hull_area, solidity, height, perimeter: I’m assuming these six columns are all either names of trait variables or names of trait covariates. In order to be recognized, a column heading must exactly match the value of the name column in some row of the variables table AND must appear as either a trait or a covariate in the trait_covariate_associations table. The values must be in the units specified in the variables table.
species: Since in general cultivar names are guaranteed to be unique only within a given species, the name of the species must be provided. The value should exactly match the value of the scientificname column in some row of the species table.
access_level: A column specifying who should have access to the data must be included. The allowable values are 1, 2, 3, and 4, corresponding to data access levels of “Restricted”, “Internal EBI & Collaborators”, “External Researcher”, and “Public”. @dlebauer probably can tell you what value is appropriate here.
citation_author, citation_year, citation_title, and citation_doi: The data in the column(s) provided must uniquely determine a row in the citations table.
site: Each value should match the value of the sitename column in some row of the sites table. As noted above, site must be included if the column local_datetime is used.
Note that I am considering changing the API to allow values that are constant for all of the data in the CSV file to be specified in the API URL's query string so that they needn’t be repeated in every row of the table. For example, if I implement this, a user could specify that all the data in the file is for Sorghum bicolor by including species=Sorghum+bicolor in the query string instead of including a species column in the CSV file with the value “Sorghum bicolor” in every single row. I’m also considering setting a default value for access_level so that it doesn’t have to be specified if the default value is acceptable.
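For illustration, Noah's sample row reworked to follow these rules might look like this (local_datetime with a site column; the access_level, species, and site values here are assumptions for the example):

cultivar,treatment,local_datetime,sv_area,tv_area,hull_area,solidity,height,perimeter,species,access_level,site
BTx642,100%: 217 ml water (47.6% VWC),2014-06-23T16:55:57.625,144539.25,285574.0,1330688.875,0.116381346163,1665,16802.1037833,Sorghum bicolor,2,Danforth Plant Science Center Bellweather Phenotyping Facility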
@gsrohde I've added the citation, site, treatments, and cultivars to terraref.ncsa.illinois.edu/bety
The sitename is "Danforth Plant Science Center Bellweather Phenotyping Facility"
@nfahlgren please sign up for an account on that site and check that these are okay, or edit to add additional information.
Presumably these are all Sorghum bicolor (?)
@gsrohde please add sv_area, tv_area, hull_area, solidity, height, and perimeter to the variables table. They are defined in this google doc.
@dlebauer
Done. I mapped columns as follows:
name --> name
units --> units
definition --> description
comments --> notes
min_value --> min
max_value --> max (set to +Infinity if not given)
I left type blank, since the table column doesn't refer to the data type as the Google doc table does.
I changed cm^2 to cm2 in line with CF Guidelines.
I didn't fill in standard_name or standard_units.
I left the units for solidity blank, but perhaps we should use "fraction" as we do elsewhere. Also, even though min_value and max_value are left blank in the table, I assumed they should be 0 and 1, respectively.
@dlebauer I signed up for a BETYdb account. The information you added looks good. I will have to check on the additional information about the sorghum varieties used in the pilot.
@gsrohde you are correct about plant_barcode, we don't need it in the database (redundant with cultivar and treatment). For the column mappings above (e.g. genotype => cultivar), do we need to change the nomenclature in Clowder or is this just how they are being mapped? Happy to change it on the Clowder end.
@gsrohde the current imagedate is in local_datetime, so we would need to add the timezone information to the Danforth Center site entry. If UTC is preferred I can have the extractor convert it.
@nfahlgren The plant barcode could go in the database (in the cultivar notes) if desired, though it looks like David hasn't done this as yet. My point was that it doesn't need to be in the CSV since it needs only to be entered once for each cultivar.
I think I would prefer keeping the name of the column heading "cultivar" since I imagine for some users of the API, this name will seem more apt. I could make "genotype" a synonym and accept either if this seems preferable to changing it on the Clowder end.
Local datetime is fine; there's no need to convert it.
@dlebauer Is there a revised CSV file that is ready for me to upload?
@nfahlgren has the format changed since https://github.com/terraref/computing-pipeline/issues/33#issuecomment-228118036 ?
Not yet, but I think I need to update some column names to better match the BETYdb API.
@nfahlgren not sure which column names you are referring to, but the most important thing about the column names is that they don't change.
In principle, the output of PlantCV can be independent of where it is going. Any code that is BETYdb-specific can be executed independent of PlantCV.
It would be great if all extractors wrote to a standard format, but we don't currently have one (for traits).
That works too, if it's best to leave them alone and have a translator instead.
Dependent on extractor pull request
@max-zilla: is there an extractor for metadata in place yet? Issue #? @dlebauer: who should write the translator for the files?
@nfahlgren @rachelshekar I just merged the extractor pull request. I wasn't sure if the discussion about PlantCV metadata mapping was finalized but I'll start up on this first thing next week if I don't hear otherwise.
@nfahlgren @rachelshekar @dlebauer starting work on this under this branch: https://github.com/terraref/computing-pipeline/tree/plantcv_bety_extractor
edit: Actually this maybe doesn't need to be a separate extractor... sounds like the basic step is simply to upload the output CSV from the PlantCV extractor to the BETYdb upload endpoint. This could be done in the PlantCV extractor at the same time the CSV is generated for upload into Clowder.
@gsrohde @nfahlgren can we think of reasons why we'd want the BETY submission piece separate from the PlantCV processing piece? I'm not coming up with any.
After discussion with @gsrohde and @dlebauer it sounds like a generic BETYdb extractor is preferable. This could check for any CSV files or metadata that match some criteria (e.g. recognized column names) to insert into BETYdb. We can start very specific (like looking for avg_traits.csv from @nfahlgren's PlantCV extractor) and generalize as we go forward.
Some comments from email traffic with Scott: Here’s a summary of what’s needed that might be more succinct and digestible:
Heading name changes:
imagedate --> local_datetime (or perhaps utc_datetime if that's what it is)
genotype --> cultivar
plant_barcode --> entity: Per conversation with David this morning, this will become the name for the entity associated with the row. I will need to implement handling of this. Currently, an anonymous entity is generated for each row in order to group trait measurements made at the same time on the same plant or plant group.
Missing required columns:
access_level
Missing recommended columns:
species
site (This is required if you want to use local time for the datetime column.)
Other columns we may wish to include:
citation_doi OR citation_author, citation_year, and citation_title
“treatment” is a recognized column name, and according to David, all of the other columns match bona fide trait variable names in the target database, so they are all fine. The CSV upload machinery takes care of looking up metadata ids, so you don’t need to do this in the extractor.
Since it seems likely that access_level, species, and site (and citation*) will have the same value for every row in the table, it might be worthwhile implementing a way of specifying this globally per file. (Note that there already is a way of specifying metadata globally for data submitted in JSON or XML format, so I don’t anticipate that extending this to CSV would be too difficult.) For access_level, we could instead or in addition give it a default value so that it wouldn’t need to be included if the default is acceptable. David, do you have an opinion about this? On the other hand, since it should be easy to generate the missing columns in the extractor (or elsewhere), perhaps this isn’t an important feature.
I would love to work with Luigi, who has written some code to easily create a mapping between a CSV file and other names.
@gsrohde @nfahlgren I am modifying the PlantcvClowderIndoorAnalysis script slightly based on Scott's comments above:
- renamed genotype, plant_barcode columns
- ready to rename imagedate - @nfahlgren do you know if you're writing UTC or local time?
- what is a suitable default value for access_level?
- species and site we could hardcode for now, if they already exist in betydb?
access_level is 2; this means internal collaborators.
@max-zilla it's local time. @dlebauer added the site (https://terraref.ncsa.illinois.edu/bety/sites/6000000866) and species (https://terraref.ncsa.illinois.edu/bety/species/2588) to our BETYdb instance.
@max-zilla Note that the columns site.sitename and species.scientificname are the effective keys used when looking up a match for the corresponding data in the CSV file (not the id columns), so in this case, the CSV file should have Danforth Plant Science Center Bellweather Phenotyping Facility in the site column and Sorghum bicolor in the species column. (If and when we use JSON or XML input files, there is more flexibility in the way lookups are done.)
OK, my branch is here: https://github.com/terraref/computing-pipeline/compare/plantcv_extractor_updates
This changes the fields/default values as suggested above. The only remaining change, I believe, is to add a config parameter with a BETYdb URL; if given, it will be used to POST the CSV into BETY.
@gsrohde do we know the URL/API endpoint to use already?
The URL for the API endpoint is <host>/api/beta/traits.csv, where I think host is to be https://terraref.ncsa.illinois.edu/bety/ (that's the host that Noah references above). You need to set the API key parameter in a query string (?key=...). See the documentation at https://pecan.gitbooks.io/betydbdoc-dataentry/content/trait_insertion_api.html. One thing that isn't (yet) mentioned there is that if you are posting a CSV file, you should explicitly set the content type in the POST request. For example, if you are using curl, add the option -H "Content-Type: text/csv". (Many files will work without this, but if the file happens to include a % character, things are likely to go horribly awry, since the request will likely default to Content-Type: application/x-www-form-urlencoded and the % character will be interpreted as the beginning of a URL escape sequence.)
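A minimal Python sketch of that POST for the extractor side (the function and parameter names are mine; only the endpoint, the key query parameter, and the content type come from Scott's description):

import requests

def post_traits_csv(csv_path, api_key, host="https://terraref.ncsa.illinois.edu/bety"):
    """POST a traits CSV to BETYdb with an explicit text/csv content type."""
    with open(csv_path, "rb") as f:
        resp = requests.post(
            host + "/api/beta/traits.csv",
            params={"key": api_key},
            headers={"Content-Type": "text/csv"},  # avoids '%' being form-decoded
            data=f,
        )
    resp.raise_for_status()  # a successful insert returns 201 Created
    return resp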
Please let me know before you actually do a post to a real site. You are of course free (and encouraged) to post to any test copy. (You will have to put the relevant metadata in place first for the post to succeed.)
OK. I've added BETYdb uploading functionality to the branch I listed shortly above this one. Have not yet tested; will create a pull request once I do.
Pull request created: https://github.com/terraref/computing-pipeline/pull/154
@gsrohde, is there a test copy of Bety that already has the relevant metadata to test this, and if not can you briefly list which metadata needs to be in place first? Thanks.
The only way to make a test copy would be to dump bety6.
So shall I do this? Perhaps set it up on pecandev? All that's really needed is to dump the rows that will be referred to in the CSV file(s).
What is the terraref bety-test site for? Could that be used?
I think this is one of the primary reasons for the terraref bety-test site. Scott, if you're willing, you can do the dump+upload faster than I could.
bety-test is not just for development, but also for providing simulated data so that people can test algorithms against 5 years of high resolution data, etc.
So no, please don't use it. It's slow anyway.
But you could create the dump as a 'bety-dev' database on the same bety6 server.
@max-zilla I think I've copied enough data over to pecandev so you can use it to test. The URL is http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv.
@gsrohde thanks! I made an account and I'm testing things out now.
My first POST returned:
"errors": "No trait variable was found in the CSV file."
I can see the columns in the Variables table (sv_area, etc.) but I'm not sure about what I might need to add otherwise. You had mentioned:
In order to be recognized, the column heading must exactly match the value of the name column in some row of the variables table AND must appear as either a trait or a covariate in the trait_covariate_associations table. The values must be in the units specified in the variables table.
In the Bety UI I see Traits and Covariates under the Data menu - should there already be associations for these variables? If not, can you briefly suggest what I might create, and whether the extractor should be responsible for creating traits/covariates as part of the upload if necessary?
@max-zilla Sorry I'm just getting back to you. I was out of town yesterday.
I'm not sure why you are getting this error. I see that, for example, sv_area is both the name of a variable in the variables table and that its id matches the value of trait_variable_id in the trait_covariate_associations table. (You can see these associations in the Bety UI by clicking the Bulk Upload menu item and then clicking on View List of Recognized Traits.)
If you send me a text version of the CSV file you used, I can try this myself and debug if necessary.
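As an aside, the two recognition conditions can be checked directly against the database; a psycopg2 sketch (connection parameters are placeholders, and covariates would be matched via the association table's covariate column):

import psycopg2

# Placeholder connection details; point these at a test copy of BETYdb.
conn = psycopg2.connect(dbname="bety", user="bety", host="localhost")
cur = conn.cursor()
# A column heading is recognized as a trait if its name is in `variables`
# AND its id appears as trait_variable_id in trait_covariate_associations.
cur.execute("""
    SELECT DISTINCT v.name
      FROM variables v
      JOIN trait_covariate_associations tca
        ON tca.trait_variable_id = v.id
     WHERE v.name IN ('sv_area', 'tv_area', 'hull_area',
                      'solidity', 'height', 'perimeter')
""")
print([row[0] for row in cur.fetchall()])  # any name missing here is unrecognized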
no worries @gsrohde
I've attached avg_traits.csv (renamed to .txt so I could attach). I was using this to POST with my user key:
POST http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv?key=<omitted>
Content-Type: text/csv
...with CSV file attached.
I was able to insert these traits using this curl command:
curl -H "Content-Type: text/csv" -X POST --data-binary @avg_traits.csv "http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv?key=<omitted>"
but not before removing duplicate "height" variables from the database and changing the date in the CSV file to 2014-06-23T16:55:57.625 (that is, replacing the space between the date and time with the letter "T"; see my first comment above from July 5).
I'm a little baffled by the error you got. Was that the full response string?
That's helpful. I'll try with those changes as a starting point - I believe the "T" should be in the PlantCV output timestamp, but I had pasted that into the CSV from elsewhere to test and I didn't include it. The --data-binary flag might also be significant.
The full response was a 400:
{
"metadata": {
"URI": "http://pecandev.igb.illinois.edu/beta/api/beta/traits.csv?key=GPNT0AIJleKIIxmj087WCmzAqkp1brY5jQwOrhvQ",
"timestamp": "2016-08-30T14:32:48-05:00"
},
"errors": "No trait variable was found in the CSV file."
}
@gsrohde OK, it worked! I got the same 'no trait variable' error using Postman because it was defaulting to 'form data'; after switching to binary and adjusting the timestamp, I got a 201 Created response.
This will be a slight adjustment to the extractor, but then we should be in good shape. Thanks.
Updated the code, will test the full stack and merge. https://github.com/terraref/computing-pipeline/pull/154
@max-zilla I only just now remembered that I was supposed to implement handling of the entity column. Currently, an anonymous entity is created for each row, and the traits in that row are associated with it. As I mentioned in my e-mail to you (quoted above in your note from Aug 4), we decided to call the column containing plant bar codes entity and use whatever name is there for the name of the entity.
@dlebauer I'm not clear exactly how this should work. Should there only be one entity for each bar code? Or should things work more like they do now, where each row in an uploaded file gets its own entity? There are no uniqueness constraints on the entities table, and as far as I can tell, none were ever planned, so there is no reason multiple entity rows can't use the same name. Moreover, it's always been my understanding that trait measurements would share a common entity only if they were on the same plant part, plant, plant stand, etc. and were made at the same time.
@gsrohde
Short answer: For the case of the PlantCV data, please use select id from entities where name = 'barcode' to identify the entity_id in the traits table.
Explanation: Entities do not have to be at the same time. The idea is that they capture a level of replication that is of interest (plot, plant, organ, etc.) and allow a user to correlate multiple observations on the same 'entity'. The entities table has a field called 'parent_id'. This was designed to allow hierarchical nesting, e.g. of a leaf on a plant in a plot. So if the data to be inserted contains an identifier for entity, it would make sense to see if the entity already exists, and if it does, to use the existing entity. Since we are capturing time in the traits table, we don't need a new entity for each time point.
Once we get to measuring a specific leaf and tracking it through time, it would make sense to create a new entity to track the leaf, with a parent_id that links to the plant.
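A sketch of that get-or-create logic (psycopg2 cursor as in the earlier sketch; illustrative only, since the actual handling would live in the BETYdb API):

def entity_id_for(cur, barcode):
    """Reuse the existing entity for this barcode, creating one if absent,
    per David's note that entities are shared across time points."""
    cur.execute("SELECT id FROM entities WHERE name = %s", (barcode,))
    row = cur.fetchone()
    if row:
        return row[0]  # entity already exists; reuse it
    cur.execute("INSERT INTO entities (name) VALUES (%s) RETURNING id", (barcode,))
    return cur.fetchone()[0]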
Just tested the full stack:
Discovered this morning that the extractor wasn't properly writing attributes from the file metadata (barcode, genotype, treatment, local_datetime), so the upload was failing. I've fixed the extractor and I'm going to merge my pull request. I think we can consider this task completed.
This will be deployed as discussed in https://github.com/terraref/computing-pipeline/issues/147.
@gsrohde @dlebauer sorry, didn't see your new comments from yesterday. Am I OK to close this? Is that update something on BETY side?
@max-zilla I put the "entities" bit into a task in https://github.com/terraref/computing-pipeline/issues/124, so I think it's OK to close this.
After phenotypes are extracted from the PlantCV pipeline, plant-level traits should be inserted into BETYdb. Traits are described in Fahlgren et al. 2015.
Meta-data:
Data (traits):