terraref / reference-data

Coordination of Data Products and Standards for TERRA reference data
https://terraref.org
BSD 3-Clause "New" or "Revised" License
9 stars 2 forks source link

Traits missing treatment_id #166

Open dlebauer opened 7 years ago

dlebauer commented 7 years ago

The following trait records have no treatment_id:

select trait, count(*) as n 
   from traits_and_yields_view_private 
   where treatment_id is null group by trait order by n desc
trait n
canopy_cover 1275670
95th_quantile_height 46095
height 22450
emergence_count 3048
leaf_stomatal_conductance 239
ambient_temperature 239

These should probably all be associated with the same 'Observational' treatment. But we need to identify why these are not being associated with treatments in the first place.

For canopy_cover and some or all of the height data, I think that extractors need to be updated by @ZongyangLi .

For the other variables, it is possible to add the treatment, but I want to make sure @NewcombMaria is aware of this.

TODO

dlebauer commented 7 years ago

Added experiments_treatments relationships for seasons 1, 2, 4 https://gist.github.com/dlebauer/24d16485b651b2684c1d4a605e8a0993

NewcombMaria commented 7 years ago

@dlebauer the web bulk upload requires a treatment at the 'variable-wide' step when the access level needs to be defined, and also if date was not entered then date must be defined. I have been entering "Control" for treatment so far. I believe the traits listed in the comment above (height, emergence_count, leaf_stomatal_conductance, and ambient_temperature) without treatment_id were entered using api upload. @gsrohde is that correct? Is "Control" an appropriate treatment designation, or should I be entering "Observational" treatment instead?

ghost commented 7 years ago

@gsrohde how long would this take to implement or is there a workaround you can suggest?

gsrohde commented 7 years ago

Fleshing out the above analysis with the query

SELECT
    v. NAME,
    u. NAME,
    LOGIN,
    COUNT (*) AS n,
    MIN (T . ID) AS "lowest id",
    MAX (T . ID) AS "highest id",
    MIN (T .created_at) AS "start date",
    MAX (T .created_at) AS "end date"
FROM
    traits T
JOIN variables v ON v. ID = T .variable_id
JOIN users u ON u. ID = T .user_id
WHERE
    treatment_id IS NULL
GROUP BY
    v. NAME,
    LOGIN,
    u."name"
ORDER BY
    n DESC;

yields

           name            |     name     |  login   |    n    | lowest id  | highest id |         start date         |          end date          
---------------------------+--------------+----------+---------+------------+------------+----------------------------+----------------------------
 canopy_cover              | Max Burnette | mburnet2 | 1276364 | 6000091938 | 6001796507 | 2017-04-13 14:12:23.629931 | 2017-07-27 18:32:12.122007
 95th_quantile_height      | Zongyang Li  | slb197   |   46095 | 6000028904 | 6000074998 | 2017-01-12 20:21:35.908395 | 2017-01-12 20:32:22.738289
 height                    | Zongyang Li  | slb197   |   22450 | 6000481098 | 6000505869 | 2017-05-09 21:53:00.575255 | 2017-05-09 21:59:29.861345
 emergence_count           | Scott Rohde  | srohde   |    3048 | 6000018709 | 6000028903 | 2016-11-10 17:47:31.017533 | 2016-11-10 22:33:40.545048
 ambient_temperature       | Scott Rohde  | srohde   |     239 | 6001787824 | 6001788300 | 2017-07-21 22:40:45.890149 | 2017-07-21 22:40:54.048118
 leaf_stomatal_conductance | Scott Rohde  | srohde   |     239 | 6001787823 | 6001788299 | 2017-07-21 22:40:45.690856 | 2017-07-21 22:40:54.033956
(6 rows)

The last two rows are the data I uploaded for @NewcombMaria last week (see https://github.com/terraref/reference-data/issues/162). The Bulk Upload Wizard requires a treatment, but I used the API, which does not. I happened also to notice that these rows aren't connected with a citation. Again, the Bulk Upload Wizard always requires a citation, but the API does not. @NewcombMaria, should the citation be Newcomb, Maria, 2016, "Field measurements from Maricopa Agricultural Center"? This citation has four associated treatments: "Control", "high density", "medium density", and "low density". @NewcombMaria, is "Control" the one I should use? (Note that the name "Control" is rather redundant since there is a boolean table column saying whether a particular treatment is a control.)

(As an aside, I notice the the Newcomb, 2017 citation has only one associated treatment, and it doesn't have a name. Please provide one. @dlebauer, should we require a name programmatically?)

The emergence_count rows were in connection with https://github.com/terraref/reference-data/issues/61. Again I used the API, which didn't complain that there was no treatment column in the CSV files. These at least do have citation information. @NewcombMaria, should these be "Control" as well?

@dlebauer Is there a question mark by to-do item 3 because you aren't sure if you want to require treatments be specified or because you are worried about how easy it will be to implement? Note that the GUI interfaces (the "Insert Trait" page and the Bulk Upload Wizard) do ensure traits have treatments. The API (and direct SQL inserts) don't.