terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License

Define meta-data pipeline for PlantCV, not embedded in filename #36

Closed dlebauer closed 8 years ago

dlebauer commented 8 years ago

Ideally this will be embedded in meta-data uploaded with images

@nfahlgren @max-zilla will discuss

meta-data will be provided at time of upload

nfahlgren commented 8 years ago

I have some code I can reuse for this in the Clowder uploader we build for the Danforth Center data stream. Below are the metadata terms and formats I have proposed using for uploads to iPlant Data Store and BisQue. Happy to rename/reformat, add/subtract as necessary as the data standards committee develops a framework. Examples below are from our published Setaria experiment.

- `snapshot_id`: internal ID number for the group of images taken for one plant at one time point
- `plant_barcode`: internal alpha-numeric code for each plant
- `plant_age`: units: days after planting
- `zoom`: camera zoom setting; units: X optical zoom
- `perspective`: camera position: side-view or top-view
- `rotation_angle`: the angle the plant was rotated for the picture. For side-view images we have 0, 90, 180 and 270 degree images; units: degrees
- `camera_type`: visible/RGB, near-infrared or photosystem II fluorescence
- `treatment`: e.g. 100%: 217 ml water (48% VWC), 66.5%: 144.5 ml water (31% VWC), 33%: 72 ml water (14% VWC), or 0%: 0 ml water. These treatment values will be different for the sorghum experiments and could also include temperature and other environmental values
- `geometry`: image pixel length/width dimensions; units: pixels
- `image_type`: 8-bit/color RGB non-interlaced, 8-bit grayscale non-interlaced, or 16-bit grayscale non-interlaced
- `imagedate`: date-time of the snapshot: YYYY-MM-DD HH:MM:SS
- `experiment`: e.g. Setaria pilot
- `experimenter`: lead PI: Ivan Baxter
- `project`: Drought response in Setaria
- `laboratory`: Baxter Lab, Donald Danforth Plant Science Center
- `growth_medium`: MetroMix360 potting mix with Osmocote
- `filename`: name of the image file
- `species`: e.g. Setaria italica, Setaria viridis or Setaria viridis x Setaria italica
- `genotype`: e.g. A10, B100, RIL102, RIL128, RIL133, RIL161, RIL187, RIL20, RIL70, RIL98
- `sha1`: checksum value for file integrity
- `publication_url`: no value until published; maybe not needed?
- `publication_citation`: no value until published; maybe not needed?
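As a rough illustration, one snapshot's metadata under these terms could be packaged as JSON for upload like this. The values below are illustrative (the barcode in particular is made up), not taken from a real record:

```python
import json

# Hypothetical sketch: one image's metadata using the proposed terms.
# All values are illustrative placeholders, not real Danforth records.
snapshot_metadata = {
    "snapshot_id": 42548,
    "plant_barcode": "AA000001",        # made-up barcode
    "plant_age": 17,                    # days after planting
    "zoom": 1.0,                        # X optical zoom
    "perspective": "side-view",
    "rotation_angle": 90,               # degrees
    "camera_type": "visible/RGB",
    "treatment": "66.5%: 144.5 ml water (31% VWC)",
    "geometry": "2454x2056",            # pixels
    "image_type": "8-bit/color RGB non-interlaced",
    "imagedate": "2014-02-01 03:30:45",
    "experiment": "Setaria pilot",
    "experimenter": "Ivan Baxter",
    "species": "Setaria italica",
    "genotype": "RIL102",
}

payload = json.dumps(snapshot_metadata, indent=2)
```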

dlebauer commented 8 years ago

Need to figure out how data will go to BETYdb.

Either extract it from images or have Noah populate BETYdb with experimental metadata. We also need to look at ICASA metadata formats so that we can support those.

Here are some ideas for metadata mapping to BETYdb:

- plant_barcode, species, genotype --> unique identifier / natural key
- plant_barcode --> entities
- treatment --> treatment, managements
- experimenter --> citations.author
- experiment --> citations.title
- plant_age --> covariate
- planting_date --> managements
- growth_medium --> managements
- camera meta-data --> methods
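The mapping above could be sketched as a small function that splits one image-metadata record into per-table rows. Table and column names follow the list; the function itself (`map_to_betydb`) is hypothetical, not a real BETYdb API:

```python
# Hypothetical sketch of the metadata -> BETYdb mapping listed above.
def map_to_betydb(meta):
    """Split one image-metadata record into per-table row dicts."""
    return {
        "entities": {"name": meta["plant_barcode"]},
        # treatment strings look like "66.5%: 144.5 ml water (31% VWC)"
        "treatments": {"name": meta["treatment"].split(":")[0]},
        "citations": {"author": meta["experimenter"], "title": meta["experiment"]},
        "covariates": {"plant_age": meta["plant_age"]},
        "methods": {"description": "zoom={} perspective={}".format(
            meta["zoom"], meta["perspective"])},
    }

rows = map_to_betydb({
    "plant_barcode": "AA000001",   # made-up barcode
    "treatment": "66.5%: 144.5 ml water (31% VWC)",
    "experimenter": "Ivan Baxter",
    "experiment": "Setaria pilot",
    "plant_age": 17,
    "zoom": 1.0,
    "perspective": "side-view",
})
```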

max-zilla commented 8 years ago

I've started to map out some pseudo-queries to push these data into BETYdb. Here's an Excel file I created based on the images in snapshot42548 of the sample-data: sample_metadata.xlsx

I'm trying to sort out not only which queries we'll need, but the order to execute them, and which (if any) should be done by Danforth prior to sending the image metadata that will be loaded (e.g. adding treatment descriptions).

@gsrohde I have some questions about best practices in terms of what dictates a suitable method name and description, for instance. I generally assume that we want to check for the existence of some of these things before inserting duplicates, but I'm not sure in all cases yet what should be used as the criteria for 'uniqueness' when the primary key is a sequentially generated ID.

Here's the work in progress, if we pretend we're loading the first row of my sample file.

    INSERT INTO entities (id)
    SELECT 500
    WHERE NOT EXISTS (SELECT 1 FROM entities WHERE id = 500);

    INSERT INTO methods (name, description)
    SELECT '<SomeName>', '<zoom> <perspective> <rotation_angle> <camera_type>'
    WHERE NOT EXISTS (SELECT 1 FROM methods WHERE name = '<SomeName>');

    INSERT INTO treatments (name, definition)
    SELECT '66.5%', '144.5 ml water (31% VWC)'
    WHERE NOT EXISTS (SELECT 1 FROM treatments WHERE name = '66.5%');

    INSERT INTO managements (mgmttype, level, units)
    VALUES ('water', 144.5, 'ml');

    INSERT INTO citations (author, year, title, journal, pg, url, pdf, doi)
    SELECT 'Max Burnette', 2015, 'Max test data', 'Danforth unpublished data', 0, 'www.danforthcenter.org', 'www.danforthcenter.org', 0
    WHERE NOT EXISTS (SELECT 1 FROM citations WHERE author = 'Max Burnette' AND year = 2015 AND title = 'Max test data');

max-zilla commented 8 years ago

An example of best practices: the only constraints on the treatments table are that id, name and definition are not null, and id will be automatically assigned if not provided. However, it seems plausible that two different experiments could use two different treatments but want to give them the same name.

In this case, checking for an existing ID doesn't really make much sense.

For this particular problem my first reaction would be to check for redundancy in name+description and insert if there's not an existing record with those values, but there are similar questions for e.g. methods that may be more complex. Perhaps this is already a solved problem.
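The name+definition check could look like the following sketch, using an in-memory SQLite table as a stand-in for the BETYdb treatments table (BETYdb itself is PostgreSQL; the table here is simplified):

```python
import sqlite3

# In-memory stand-in for the treatments table, simplified.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE treatments "
    "(id INTEGER PRIMARY KEY, name TEXT NOT NULL, definition TEXT NOT NULL)")

def insert_treatment(con, name, definition):
    # Insert only if no row already has this (name, definition) pair.
    con.execute(
        """INSERT INTO treatments (name, definition)
           SELECT ?, ?
           WHERE NOT EXISTS (
               SELECT 1 FROM treatments WHERE name = ? AND definition = ?)""",
        (name, definition, name, definition))

insert_treatment(con, "66.5%", "144.5 ml water (31% VWC)")
insert_treatment(con, "66.5%", "144.5 ml water (31% VWC)")  # exact duplicate: skipped
insert_treatment(con, "66.5%", "different definition")      # same name, new definition: inserted
count = con.execute("SELECT COUNT(*) FROM treatments").fetchone()[0]
print(count)  # 2
```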

gsrohde commented 8 years ago

@max-zilla I may have sent you these before, but here's some documentation we have on constraints: https://docs.google.com/spreadsheets/d/1fJgaOSR0egq5azYPCP0VRIWw1AazND0OCduyjONH9Wk/edit?pli=1#gid=956483089 https://www.overleaf.com/2086241dwjyrd#/5297403/ Many of these constraints haven't been implemented, and some may have changed.

Also, some information about the BETY database semantics can be gleaned from the data entry workflow document here: https://www.authorea.com/users/5574/articles/6800/_show_article

Some random comments on your last comment:

entities: I know we said at the meeting that entities should be used for bar codes, but this seems rather different from how they have been used up until now, which is as a way of grouping trait measurements that were made on the same plant at the same time. (In nearly all cases, there is no information associated with an entity other than its uniqueness (and the standard Rails timestamp columns created_at and updated_at): the parent_id, name, and notes columns are usually blank.) I would double-check with @dlebauer that it makes sense to use them to identify particular plants (regardless of measurement time).

In any case, it is important that the id column not be used for any semantic information; it should only serve as a surrogate key. So if barcode has any real-world significance (as a product barcode does), it shouldn't go into the id column. (It wasn't clear in your example if you meant 500 as a barcode.)

methods: Again, I see that the meeting notes have camera meta-data being mapped to methods. I had thought "method" meant experimental method, so I was baffled by this, but looking at the existing table entries, it does seem that these indeed do refer to measurement methods. But I would check with @dlebauer to make sure this mapping makes sense.

The naming has seemed rather arbitrary. Since you have 255 characters available, I would err on the side of being descriptive. The constraint documents suggest that the names should be unique per citation (i.e., (name, citation_id) should be unique (and each non-null!)), but we don't yet enforce this (except for the name being non-null).

treatment: We don't enforce unique names. But the Overleaf document suggests that if two rows have the same name value, they should not be associated with the same citation (via citations_treatments). The boolean attribute control should be non-null, but we don't yet enforce this. Also, the Overleaf document states that there should be only one "control" treatment for a given citation and site, but it's not clear exactly what this means (that is, how the association of a treatment with a site is delineated).

managements: One or more managements may be associated with a given treatment and they help to define that treatment. But a given management may be used to help define other treatments as well, which is why the association is many-many and not many-one.

I'm pretty sure the bulk upload script uses (atomic) transactions in cases where it is necessary to get the id of an inserted row and then use it to make an insertion into a join table.
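That transaction pattern (insert a row, grab its generated id, then use it in a join-table insert, atomically) can be sketched like this; the tables are simplified stand-ins for BETYdb's treatments/citations and their join table:

```python
import sqlite3

# Simplified stand-ins for BETYdb tables and the citations_treatments join.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE treatments (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE citations (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE citations_treatments (citation_id INTEGER, treatment_id INTEGER);
""")

with con:  # one transaction: all three inserts commit together or not at all
    cur = con.execute("INSERT INTO treatments (name) VALUES (?)", ("66.5%",))
    treatment_id = cur.lastrowid  # id generated by the insert above
    cur = con.execute("INSERT INTO citations (title) VALUES (?)", ("Setaria pilot",))
    con.execute(
        "INSERT INTO citations_treatments (citation_id, treatment_id) VALUES (?, ?)",
        (cur.lastrowid, treatment_id))

n_links = con.execute("SELECT COUNT(*) FROM citations_treatments").fetchone()[0]
print(n_links)  # 1
```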


robkooper commented 8 years ago

I think we should not be using SQL statements for this. We should be using the REST endpoints provided by BETY and, if the right ones don't exist, add them. We cannot guarantee that the extractor has access to the database; most likely it will not.

max-zilla commented 8 years ago

@gsrohde is it correct that currently the API only supports search, not data entry? I see information about API queries in the Data Access page, but no mention in the Data Entry Workflow article.

I also took a look at lib/simple_search.rb to see how parameters are handled there. If we're willing to have an external script handle the logic of when to upload the data, a new API endpoint could be very simple, maybe even calling something like UPSERT to simplify, although that's PostgreSQL 9.5 only. Essentially an endpoint to insert values into fields of some table that throws an exception if constraints aren't met. Then, a script at Danforth or an extractor at NCSA could make sure data is added in a sensible order.
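For reference, the UPSERT idea is PostgreSQL 9.5's `INSERT ... ON CONFLICT`; SQLite (3.24+) accepts the same syntax, so a quick in-memory sketch of the behavior (simplified table, illustrative values):

```python
import sqlite3

# Requires SQLite >= 3.24 for ON CONFLICT; PostgreSQL 9.5+ uses the same syntax.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE treatments (name TEXT PRIMARY KEY, definition TEXT)")

upsert = """INSERT INTO treatments (name, definition) VALUES (?, ?)
            ON CONFLICT (name) DO UPDATE SET definition = excluded.definition"""
con.execute(upsert, ("66.5%", "144.5 ml water (31% VWC)"))
con.execute(upsert, ("66.5%", "updated definition"))  # updates instead of erroring

definition = con.execute(
    "SELECT definition FROM treatments WHERE name = '66.5%'").fetchone()[0]
```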

gsrohde commented 8 years ago

@max-zilla As far as I know, the API currently only supports searching.

Re Rob's remark: It seems to me it would be OK to think in terms of SQL statements for now until we firm up exactly what we want. At that point, we could add the application interfaces we need to do what we need without opening up full SQL access.

The code in simple_search.rb is a bit convoluted and ugly, and I've often thought of changing it but have so far contented myself primarily with trying to document it. Just as a caveat. I wouldn't necessarily be using it as a model.

I'm not familiar with UPSERT but may look into it.

dlebauer commented 8 years ago

@gsrohde how can we document the API? Is there something like Swagger (what Clowder uses) for Rails?


robkooper commented 8 years ago

One other thing to think about is if we want to separate the API from the web part. In clowder we have a special subset called api that contains all the api calls. Here is a nice article: https://labs.kollegorna.se/blog/2015/04/build-an-api-now/

gsrohde commented 8 years ago

@dlebauer Ruby has RDoc, and the comments I have in many of the Rails files (simple_search.rb, for example) are RDoc-compatible. But we may want something with different features. I'll look at what Rob sent.

And don't forget that there is some extensive existing manually-written API documentation.

dlebauer commented 8 years ago

> manually-written API documentation

This is useful to me, but @robkooper said developers need something more like https://clowder.ncsa.illinois.edu/clowder/assets/docs/api/index.html

Probably we can reuse some of the text from the data access tutorial.


dlebauer commented 8 years ago

@nfahlgren one task that remains here is to provide experiment-level metadata. I am not sure what is necessary, but things like planting_date, medium, container size, fertilizer / other treatments, irrigation_rate should be included.

nfahlgren commented 8 years ago

@dlebauer right now I added experiment-level metadata to the dataset itself. This is what I added for the pilot experiment; we can add more and edit what's there:

- author: Noah Fahlgren
- growth_medium: MetroMix360 potting mix with 14-14-14 Osmocote
- title: Sorghum Pilot Experiment - Danforth Center Phenotyping Facility - 2014-05-27
- project: TERRA-REF
- instrument: Bellwether Phenotyping Facility
- location: Donald Danforth Plant Science Center
- planting_date: 2014-05-27

dlebauer commented 8 years ago

What about

TinoDornbusch commented 8 years ago

Please continue evolving the experiment metadata structure. I will update here. Regarding experimental details, I need input from the experimenters in Maricopa.

TinoDornbusch commented 8 years ago

Any idea how to include the imaging script (text file) in the metadata?

nfahlgren commented 8 years ago

The metadata is currently uploaded as JSON but I need to update it to JSON-LD eventually. In the JSON-LD file I can provide links to the sources of the information. One new metadata value could be the processing script information with a link to the version (on GitHub) that was used.
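A JSON-LD version with a processing-script link might look roughly like the sketch below. The `@context` terms (borrowed from schema.org) and the repository URL are illustrative assumptions, not an agreed vocabulary:

```python
import json

# Hypothetical JSON-LD sketch: dataset metadata plus a pointer to the exact
# script version used. Context terms and the URL are placeholders.
metadata = {
    "@context": {
        "schema": "http://schema.org/",
        "author": "schema:author",
        "codeRepository": "schema:codeRepository",
    },
    "author": "Noah Fahlgren",
    "title": "Sorghum Pilot Experiment",
    "processing_script": {
        "@type": "schema:SoftwareSourceCode",
        # placeholder URL; would point at the tagged GitHub version actually used
        "codeRepository": "https://github.com/example-org/imaging-scripts/tree/<commit>",
    },
}
doc = json.dumps(metadata, indent=2)
```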