Cassava disease incidence traits should have values between 0 and 1

nmenda commented 7 years ago

Many trials in Cassavabase have disease incidence values that are not a proportion of plants but rather a count of plants, e.g. https://www.cassavabase.org/breeders/trial/1598

We cannot update these values, since the total number of plants per plot at the time of measurement is not in the database, but this number should be in the original files.

Need to delete those phenotype rows and reload the corrected values.
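A minimal Python sketch of the correction described above, assuming the original files supply both the number of affected plants and the total number of plants per plot. The function name and its arguments are illustrative and are not part of the Cassavabase code:

```python
from typing import Optional


def incidence_proportion(affected_plants: int, total_plants: Optional[int]) -> Optional[float]:
    """Convert a plant count into a 0-1 incidence proportion.

    Returns None when the plot total is missing, which is the case where
    the stored value cannot be corrected and can only be deleted.
    """
    if total_plants is None or total_plants <= 0:
        return None
    if affected_plants < 0 or affected_plants > total_plants:
        # A count larger than the plot total means the source row is wrong.
        raise ValueError(
            f"affected_plants={affected_plants} is outside 0..{total_plants}"
        )
    return affected_plants / total_plants
```

For example, 3 affected plants out of 12 gives 0.25, while a row with no recorded plot total returns None.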

Some older trials from Chiedozie do not have this data, and the values will have to be deleted without replacement, e.g. https://www.cassavabase.org/breeders/trial/135

Need to add a cvtermprop to all the disease incidence traits so that the values are checked when phenotypes are uploaded.

nmenda commented 7 years ago

Many disease incidence rows have phenotype values > 1. Many of these have a value of null or a string with some spaces. Also, some '0' values were loaded as 0.0 or 0.00 (though this is not a big deal: when the value is cast to numeric it is all read as 0).

select distinct dbxref.accession, cvterm.name, count(phenotype_id)
from phenotype
join cvterm on cvterm_id = observable_id
join dbxref using (dbxref_id)
where cvterm.name ilike '%incidence%evaluation%'
  and value not like '0%'
  and value not like '1'
group by dbxref.accession, cvterm.name
order by accession;

 accession | name                                                            | count
-----------+-----------------------------------------------------------------+-------
 0000178   | cassava bacterial blight incidence 3-month evaluation           | 33749
 0000179   | cassava bacterial blight incidence 6-month evaluation           | 44570
 0000180   | cassava bacterial blight incidence 9-month evaluation           |  8114
 0000187   | cassava green mite incidence first evaluation                   | 81915
 0000188   | cassava green mite incidence second evaluation                  | 53537
 0000195   | cassava mosaic disease incidence 1-month evaluation             | 20268
 0000196   | cassava mosaic disease incidence 3-month evaluation             | 20541
 0000197   | cassava mosaic disease incidence 9-month evaluation             |  1280
 0000198   | cassava mosaic disease incidence 6-month evaluation             | 17261
 0000200   | cassava mosaic disease incidence 12-month evaluation            |  1114
 0000202   | cassava brown streak disease root incidence 12-month evaluation |   561
 0000208   | cassava brown streak disease leaf incidence 3-month evaluation  |  3269
 0000209   | cassava brown streak disease leaf incidence 6-month evaluation  |  5239
 0000210   | cassava brown streak disease leaf incidence 9-month evaluation  |    18
 0000211   | cassava bacterial blight incidence 12-month evaluation          |  4638
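The value problems described above (null or whitespace-only strings, values above 1, zeros stored as 0.0 or 0.00) can be sorted into buckets before deciding what to delete or reload. This is a rough Python sketch; the function name and bucket labels are made up for illustration:

```python
import math
from typing import Optional


def classify_incidence_value(raw: Optional[str]) -> str:
    """Sort one stored phenotype value into a coarse bucket."""
    if raw is None or raw.strip() == "":
        return "blank"  # null, empty, or a string of spaces
    try:
        number = float(raw)
    except ValueError:
        return "non_numeric"  # e.g. '-', '#N/A', '3/10'
    if not math.isfinite(number):
        return "non_numeric"  # 'nan' and 'inf' parse as floats but are not usable
    if 0.0 <= number <= 1.0:
        return "ok"  # '0', '0.0' and '0.00' all land here
    return "out_of_range"
```

Casting to a number first is what makes '0', '0.0' and '0.00' equivalent, which matches the note that the different zero spellings are harmless.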

nmenda commented 7 years ago

Phenotypes from old cassava trials with disease incidence values that need to be deleted - cannot be recovered from the original files

 accession | name                                         |   min |    max | prop_min | prop_max | count
-----------+----------------------------------------------+-------+--------+----------+----------+-------
 0000037   | Cassava bacterial blight incidence           |  0.00 | 177.78 |        1 |        9 |  1712
 0000038   | Cassava anthracnose disease incidence        |  0.00 | 125.00 |        3 |       12 |  1711
 0000039   | Cassava mosaic disease incidence             |     0 | 214.29 |        1 |        9 |  2076
 0000040   | Cassava brown streak disease leaf incidence  |     0 |    100 |        3 |        6 |   306
 0000122   | Cassava green mite incidence                 |  2.63 | 100.00 |        6 |        9 |    99

0000040 and 0000122 - check if values are % instead of proportion
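A first-pass check for the percent-versus-proportion question could look at the min and max per trait or trial. The Python sketch below is a heuristic only, with hypothetical names; a range of 0 to 100 is consistent with percentages but could also hold plant counts, so the original files still need to be checked:

```python
from typing import Iterable


def guess_incidence_scale(values: Iterable[float]) -> str:
    """Guess the scale of a set of incidence values from their range."""
    numbers = list(values)
    if not numbers:
        return "unknown"
    lowest, highest = min(numbers), max(numbers)
    if lowest < 0 or highest > 100:
        # e.g. a maximum of 177.78 fits neither a proportion nor a percentage
        return "out_of_range"
    if highest <= 1:
        return "proportion"
    # Values in (1, 100]: possibly percentages, possibly counts.
    return "possibly_percent"
```

Note that a set containing only zeros is reported as "proportion", since zero is valid on either scale.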

nmenda commented 7 years ago

@aco46 is working on the NaCRRI and NRCRI trials. Here are the trials with potential issues:

 phenotype_rows |  min |               max | project_id | name                                           | name
----------------+------+-------------------+------------+------------------------------------------------+--------
            904 |    0 |             100.0 |        143 | Uganda Cassava Training Population NaCRRI_2012 | NaCRRI
           2983 |  0.0 |             100.0 |        461 | Uganda Cassava Training Population Kasese_2013 | NaCRRI
            467 |  0.0 |             100.0 |        551 | Within Family Prediction Seedling Trial_2013   | NaCRRI
           1277 |  0.0 |             100.0 |       1632 | Uganda Cassava Training Population Kasese_2014 | NaCRRI
           3315 |  0.0 |             100.0 |       1647 | Uganda Cassava Training Population Ngetta_2014 | NaCRRI
           6099 |  0.0 |               100 |       1650 | Uganda Cassava Training Population Ngetta_2013 | NaCRRI
           3538 |    0 |               100 |       1791 | Uganda Training Population NaCRRI_2014         | NaCRRI
            300 |    0 | 0.931034482758621 |       1797 | Degeneration Trial_Arua_2015                   | NaCRRI
            256 |    0 |               100 |        132 | 10uytUM                                        | NRCRI
            256 |    0 |               100 |        133 | 11uyt10RepeatUM                                | NRCRI
            920 | 0.00 |            150.00 |        135 | 10cetNR10seriesUM                              | NRCRI
           1204 | 0.00 |            225.00 |        136 | 10cetNR10BseriesUM                             | NRCRI
            216 | 0.00 |            100.00 |        137 | 09pytIgariam                                   | NRCRI
             56 | 0.00 |             29.17 |        138 | 09pytOtobi                                     | NRCRI
            280 | 0.00 |            100.00 |        139 | 09pytUmudike                                   | NRCRI
           4847 | 0.00 |            100.00 |        145 | 11cetNR11seriesUM                              | NRCRI
            456 | 0.00 |             80.00 |        155 | 11aytNR09seriesUM                              | NRCRI
            375 | 0.00 |            100.00 |        156 | PYT 2010                                       | NRCRI
            396 | 0.00 |            100.00 |        157 | 10uytCIATsetAUM                                | NRCRI
            676 |    1 |                20 |        190 | 11clonal63cpUM                                 | NRCRI
          26470 | 0.00 |            100.00 |        552 | 12nextgen500tp1UM                              | NRCRI
          74576 | 0.00 |            100.00 |       1597 | 13nextgen518tp1UM                              | NRCRI
          14484 | 0.00 |            100.00 |       1598 | 13nextgen518tp1OT                              | NRCRI
           9981 | 0.00 |            100.00 |       1599 | 13nextgen518tp1KA                              | NRCRI
           6270 | 0.00 |               100 |       1600 | 13nextgen489tp2UM                              | NRCRI
            972 | 0.00 |             32.00 |       1603 | 13ayt25nr11UM                                  | NRCRI
           5723 | 0.00 |              5.00 |       1618 | 13clonal261nr13UM                              | NRCRI
            480 | 0.00 |             27.00 |       1619 | 13uyt15nr06-09UM                               | NRCRI
            792 | 0.00 |             21.00 |       1620 | 13pyt20nr12UM                                  | NRCRI
           1253 | 0.00 |             19.00 |       1621 | 13mp70popfOT                                   | NRCRI
           4078 | 0.00 |             20.00 |       1623 | 13mp259popdOT                                  | NRCRI
            360 | 0.00 |             20.00 |       1624 | 13mp17popfUM                                   | NRCRI
            672 | 0.00 |            100.00 |       1625 | 13cmd35gpOT                                    | NRCRI
           2048 | 0.00 |             19.00 |       1626 | 13mp144popdUM                                  | NRCRI
           1042 | 0.00 |             22.00 |       1627 | 13cmd60gpUM                                    | NRCRI
            289 | 0.00 |            100.00 |       1628 | 12cmd36gpOT                                    | NRCRI
            982 | 0.00 |             18.00 |       1629 | 12cmd75gpUM                                    | NRCRI
            503 | 0.00 |             20.00 |       1630 | 11cmd48gpUM                                    | NRCRI
            480 | 0.00 |            100.00 |       1642 | 15cb28cgmUM                                    | NRCRI
            432 |    0 |                30 |       2820 | 14mlt12hbc_UM                                  | NRCRI

nmenda commented 7 years ago

Green mite incidence traits were loaded with integers between 1 and 5, which looks like a scale, e.g. in some IITA trials:

select distinct count(phenotype_id) as phenotype_rows,
       min(cast(phenotype.value as numeric)),
       max(cast(phenotype.value as numeric)),
       pname.project_id, pname.name, bprogram.name
from phenotype
join cvterm on cvterm_id = observable_id
join nd_experiment_phenotype using (phenotype_id)
join nd_experiment_project using (nd_experiment_id)
join project as pname using (project_id)
join project_relationship on project_relationship.subject_project_id = pname.project_id
join project as bprogram on bprogram.project_id = project_relationship.object_project_id
where cvterm.name ilike '%green%incidence%'
  and phenotype.value not like ' %'
  and phenotype.value not like '-'
  and phenotype.value not like '#%'
  and phenotype.value not like '%/%'
  and bprogram.name ilike 'IITA%'
group by pname.project_id, pname.name, bprogram.name
order by bprogram.name, pname.project_id;

 phenotype_rows | min | max | project_id | name             | name
----------------+-----+-----+------------+------------------+------
            112 | 1.0 | 5.0 |        182 | 11ayt14yrt8IB    | IITA
            214 | 1.0 | 5.0 |        196 | 11ayt27yrt9IB    | IITA
            206 | 2.0 | 4.0 |        200 | 11uyt25sgIB      | IITA
            216 | 2.0 | 3.0 |        213 | 12ayt27yrtIB     | IITA
            480 | 2.0 | 5.0 |        214 | 12ayt20yrtIB     | IITA
            240 |   2 |   5 |        216 | 12ayt30yrt11IB   | IITA
            256 | 2.0 | 3.0 |        217 | 12ayt32mixedIB   | IITA
            380 | 2.0 | 4.0 |        218 | 12ayt33mixedpdIB | IITA
            277 | 2.0 | 3.0 |        219 | 12ayt35ppdIB     | IITA
            274 | 1.0 | 4.0 |        220 | 12ayt35yrt10IB   | IITA
            296 | 2.0 | 4.0 |        221 | 12ayt37mixedIB   | IITA
            312 | 2.0 | 5.0 |        222 | 12ayt39yrt12IB   | IITA
            724 | 2.0 | 5.0 |        223 | 12ayt66pdwrtIB   | IITA
            334 | 2.0 | 5.0 |        224 | 12pyt86yrtIB     | IITA

aafolabi commented 7 years ago

I have checked some of the trials listed. These are CGM (cassava green mite) severity scores. We score it twice (at the beginning and at the end of the dry season) as cgm1 and cgm2.

nmenda commented 7 years ago

It looks like it will be safer to reload the NRCRI trials that have issues with disease incidence and severity. These are the 19 trials: 132, 133, 135, 136, 137, 138, 139, 143, 145, 146, 149, 150, 151, 152, 155, 156, 157, 159, 160

nmenda commented 7 years ago

Look at the 9 disease incidence traits that need to be mapped, if possible, to the relevant variables:

 accession | name                                         | rel_type
-----------+----------------------------------------------+----------
 0000037   | Cassava bacterial blight incidence           | is_a
 0000038   | Cassava anthracnose disease incidence        | is_a
 0000039   | Cassava mosaic disease incidence             | is_a
 0000040   | Cassava brown streak disease leaf incidence  | is_a
 0000089   | Cassava brown streak disease root incidence  | is_a
 0000122   | Cassava green mite incidence                 | is_a
 0000339   | Cassava Mealy Bug Incidence                  | is_a
 0000443   | Red Spider Mite Incidence                    | is_a
 0000464   | Spiraling Whitefly Incidence                 | is_a

then do the same for the disease severity traits:

https://github.com/solgenomics/sgn/issues/771#issuecomment-267692736

nmenda commented 7 years ago

Check the values we have for the 24 incidence trait variables and convert them from % to a 0-1 proportion.
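The conversion itself is a division by 100. A small Python sketch, assuming the stored value is text and has already been confirmed to be a percentage (the function name is hypothetical). Decimal is used here so that the converted text does not pick up binary floating point noise:

```python
from decimal import Decimal


def percent_to_proportion(percent_text: str) -> Decimal:
    """Convert a percentage stored as text into a 0-1 proportion."""
    percent = Decimal(percent_text.strip())
    if not Decimal(0) <= percent <= Decimal(100):
        # Values such as 150.00 are not percentages and must not be rescaled.
        raise ValueError(f"{percent_text!r} is not between 0 and 100")
    return percent / Decimal(100)
```

With this, a stored 29.17 becomes 0.2917, while a value such as 150.00 is rejected rather than silently converted to 1.5.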

nmenda commented 7 years ago

Add cvtermprop trait_maximum = 1 and trait_minimum = 0 to the 24 incidence traits

nmenda commented 7 years ago

add the max, min condition to https://github.com/solgenomics/sgn/blob/master/lib/CXGN/Phenotypes/StorePhenotypes.pm#L245

and to

https://github.com/solgenomics/sgn/blob/master/lib/CXGN/Fieldbook/TraitInfo.pm#L88
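The max/min condition could follow the logic sketched below. This is Python for illustration only; the linked modules are Perl and their actual structure is not shown here. The TRAIT_LIMITS dictionary is a hypothetical in-memory stand-in for the trait_minimum and trait_maximum cvtermprop values, and the function name is made up:

```python
from typing import Dict, Optional

# Hypothetical stand-in for the cvtermprop rows, keyed by trait name.
TRAIT_LIMITS: Dict[str, Dict[str, float]] = {
    "cassava mosaic disease incidence 3-month evaluation": {
        "trait_minimum": 0.0,
        "trait_maximum": 1.0,
    },
}


def check_trait_limits(
    trait_name: str,
    raw_value: str,
    limits: Dict[str, Dict[str, float]] = TRAIT_LIMITS,
) -> Optional[str]:
    """Return an error message if the value breaks the trait limits, else None."""
    props = limits.get(trait_name)
    if props is None:
        return None  # no limits defined for this trait, nothing to check
    try:
        value = float(raw_value)
    except (TypeError, ValueError):
        return f"{trait_name}: value {raw_value!r} is not numeric"
    minimum = props.get("trait_minimum")
    maximum = props.get("trait_maximum")
    if minimum is not None and value < minimum:
        return f"{trait_name}: value {value} is below trait_minimum {minimum}"
    if maximum is not None and value > maximum:
        return f"{trait_name}: value {value} is above trait_maximum {maximum}"
    return None
```

Returning a message rather than raising lets an upload collect every offending row and report them together. Traits without limits pass through unchanged, so only the incidence traits that carry the two props would be affected.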

nmenda commented 7 years ago

https://github.com/solgenomics/sgn-home/blob/master/naama/iita_disease_incidence_trials.tab