terraref / computing-pipeline

Pipeline to Extract Plant Phenotypes from Reference Data
BSD 3-Clause "New" or "Revised" License
23 stars 13 forks source link

add site cultivars, treatments metadata to plot metadata #573

Open tcnichol opened 5 years ago

tcnichol commented 5 years ago

Description

Add site cultivar and site treatments data to sites metadata.

The change would be made here

https://github.com/terraref/terrautils/blob/master/terrautils/lemnatec.py#L118

This may be possible to accomplish using brapi endpoints, or else will require modification of betydb.

this is the branch i am working on for this issue:

https://github.com/terraref/terrautils/tree/add-cultivars-site-metadata

tcnichol commented 5 years ago

@max-zilla I had a few questions about this one.

At the point where we want to add the sites_cultivars and sites_treatments, we have a site_id but not an experiment id. If I had the experiment_id, there should be endpoints in brapi that will supply what we need.

So I see two ways of handling this.

The first would be that in terrautils, the experiments json is cached. I can add a method that takes a site_id and returns and experiment or experiment_id. That would require parsing that json for every site, which is the drawback.

The second way would be that when the sites json is created, the method iterates over each experiment, and then builds an object for each site. We could just put the experiment_id in with the sites. It's just adding a single id and not a full object.

Let me know if either of those are acceptable.

dlebauer commented 5 years ago

Would it help to have a view pre populated in the database?

max-zilla commented 5 years ago

The second option would probably be better, each extractor creates the cache on startup if it needs BETY stuff (we don't have a shared cache file like traitvis) so inserting the experiment into the site object and saving potentially many scans of the JSON later sounds more efficient.

tcnichol commented 5 years ago

K. I will implement the second then.

tcnichol commented 5 years ago

I have now changed the get_sites method in betydb in the following ways:

Each entry for a site now contains the corresponding experiment id. It also contains the cultivar information.

The cultivar information is obtained by the following:

Using the brapi v1/study/{studyDbId}/layouts endpoint, a dictionary is created that maps site ids to germplasm ids, and gets the corresponding information from the /germplasm endpoint in brapi.

I think that this can be further optimized by using the stored BETYDB_CULTIVARS instead of making a call to the brapi /germplasm enpoint. Will check into that next, but this means that the cultivar info can be added in lemnatec.py without making any calls there.

tcnichol commented 5 years ago

Both the cultivar and treatments information should now be contained in the sites retrieved from betydb. Some sites do not have cultivar or treatment data. Currently the value for those keys is 'no info available' but if there is a better value for those, I can change that.

If someone can take a look, I think this one is pretty close to done.

dlebauer commented 5 years ago

@tcnichol I can take a look if you can point to some of the records that are missing information.

tcnichol commented 5 years ago

Here is one https://brapi.workbench.terraref.org/brapi/v1/studies/6000000014/layouts https://brapi.workbench.terraref.org/brapi/v1/observationunits?studyDbId=6000000014

in layouts, there are entries for these 2 plots MAC Field Scanner Season 4 Range 6 Column 10 W/E and there is a corresponding entry in 'treatments' (but without the W/E)

However, this plot

MAC Field Scanner Season 4 Range 48 Column 11 W

doesn't have any corresponding entries in the /observationunits endpoint.

Since all the entries in observationunits have a corresponding entry in /layouts, but not all entries in /layouts have a corresponding entry in observationunits, are these control plots? What would be the best value to put under 'treatments' if there is no entry for the plot in observationunits?

dlebauer commented 5 years ago

it sounds like there are two things to address:

  1. there are missing records in treatments_sites that link the W/E plots to treatments. Is that the case?
  2. it isn't clear why MAC Field Scanner Season 4 Range 48 Column 11 W wouldn't have an observationunits endpoing, - since it is a plot, it is by definition an observation unit

Other thoughts:

tcnichol commented 5 years ago

I will check into 1 tomorrow.

I can create an issue for changing location_abbrevation to observationUnitName in brapi as well if that does not match the spec.

tcnichol commented 5 years ago

@dlebauer

I synced my local database with bety and did not see a sites_treatments table, but I was able to create something comparable with this sql.

create table experiments_sites_treatments as
select experiments_sites.site_id, experiments_sites.experiment_id, experiments_treatments.treatment_id
from experiments_sites
INNER JOIN experiments_treatments on experiments_treatments.experiment_id=experiments_sites.experiment_id

the next step was running this query

select experiments_sites_treatments.site_id, experiments_sites_treatments.experiment_id, experiments_sites_treatments.treatment_id,
treatments.id, treatments.name, treatments.definition
from experiments_sites_treatments, treatments
where experiments_sites_treatments.treatment_id = treatments.id
and experiments_sites_treatments.site_id= 6000007122;

Here are the results

siteandtreatments

So it looks like bety does have a treatment with a name/definition associated with this plot. Could this mean that there is something up with the brapi endpoint or the query it uses? I will check it and compare what it does with what I get here.

tcnichol commented 5 years ago

i posted in the brapi channel about pagination.

tcnichol commented 5 years ago

i think i figured out the issue with pagination, testing a fix in brapi locally with this.

tcnichol commented 5 years ago

https://github.com/terraref/brapi/pull/29

This pull request fixes the issue in brapi. With pagination working correctly, there should be an entry found for every plot.

tcnichol commented 5 years ago

@robkooper thanks for merging that one in.

Since @dlebauer pointed out a problem with the results returned by observationunits, I have created this issue.

https://github.com/terraref/brapi/issues/30

Will fix that endpoint and then fix the branch associated with this in terrautils to use the correct keys rather than 'location_abbrevation.'

tcnichol commented 5 years ago

https://github.com/terraref/brapi/pull/31

This pull request fixes location_abbrevation, and I have changed terrautils to take this change into account for the branch associated with this issue.

tcnichol commented 5 years ago

sample metadata as would be generated by

terrautils.lemnatec._get_sites

{"6000007362": {"sitename": "MAC Field Scanner Season 4 Range 14 Column 2 E", "experiment_id": 6000000014, "treatments": {"definition": "Pre-plant Nitrogen fertilizer incorporated in the field", "id": "6000000020", "experiment_id": "6000000014"}, "cultivar": {"germplasmName": "PI330182", "species": "bicolor", "genus": "Sorghum", "germplasmDbId": "6000000819"}, "url": ""}, "notes": "sitename is the plot that contains the image centroid", "6000007408": {"sitename": "MAC Field Scanner Season 4 Range 15 Column 9 E", "experiment_id": 6000000014, "treatments": {"definition": "Pre-plant Nitrogen fertilizer incorporated in the field", "id": "6000000020", "experiment_id": "6000000014"}, "cultivar": {"germplasmName": "PI329301", "species": "bicolor", "genus": "Sorghum", "germplasmDbId": "6000000558"}, "url": ""}, "6000007592": {"sitename": "MAC Field Scanner Season 4 Range 21 Column 5 E", "experiment_id": 6000000014, "treatments": {"definition": "Pre-plant Nitrogen fertilizer incorporated in the field", "id": "6000000020", "experiment_id": "6000000014"}, "cultivar": {"germplasmName": "PI563332", "species": "bicolor", "genus": "Sorghum", "germplasmDbId": "6000000903"}, "url": ""}, "6000007726": {"sitename": "MAC Field Scanner Season 4 Range 25 Column 8 E", "experiment_id": 6000000014, "treatments": "no info available", "cultivar": "no info available", "url": ""}}
tcnichol commented 5 years ago

(just a few keys, I didn't want to post something huge.)

tcnichol commented 5 years ago

pull request open

https://github.com/terraref/terrautils/pull/71

max-zilla commented 5 years ago

The PR is ready for review but the get_sites process runs extremely slowly. I will review with an eye toward possible caching or other improvements to make this viable in terrautils metadata cleaning pipeline.