solgenomics / sgn

The code behind the Sol Genomics Network, Cassavabase and other Breedbase websites
https://solgenomics.net
MIT License

cassava disease incidence traits should have values between 0 - 1 #780

Closed nmenda closed 1 year ago

nmenda commented 7 years ago

There are many trials in Cassavabase whose values are not a proportion of plants but rather a count of plants, e.g. https://www.cassavabase.org/breeders/trial/1598

We cannot update these values, since the total number of plants per plot at the time of measurement is not in the database, but this number should be in the original files.

We need to delete those phenotype rows and reload the corrected values.

Some older trials from Chiedozie do not have this data, and their values will have to be deleted without replacement, e.g. https://www.cassavabase.org/breeders/trial/135
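A minimal sketch of that reload-or-delete decision, assuming the original files give both the number of affected plants and the total number of plants per plot. The function name, the row layout, and the plot names are hypothetical, not taken from the sgn codebase.

```python
from typing import Optional


def count_to_proportion(affected: float, total_plants: Optional[int]) -> Optional[float]:
    """Return affected / total as a 0-1 proportion, or None if it cannot be computed."""
    if total_plants is None or total_plants <= 0:
        return None  # no plot total in the original file: the row can only be deleted
    if affected < 0 or affected > total_plants:
        return None  # the count is inconsistent with the plot total
    return affected / total_plants


# Hypothetical rows as they might be read from an original trial file.
rows = [
    {"plot": "plot_1", "affected": 3, "total_plants": 12},
    {"plot": "plot_2", "affected": 5, "total_plants": None},
]
for row in rows:
    row["proportion"] = count_to_proportion(row["affected"], row["total_plants"])
    # Rows with a usable proportion are reloaded; the rest are deleted without replacement.
    row["action"] = "reload" if row["proportion"] is not None else "delete"
```

Returning None rather than raising keeps the decision with the caller, so a whole file can be classified in one pass before anything is deleted.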

We need to add a cvtermprop to all the disease incidence traits, so that values can be checked at the time phenotypes are uploaded.

nmenda commented 7 years ago

Many disease incidence rows have phenotype values > 1. Many of these have a value of null or a string with some spaces. Also, some '0' values were loaded as 0.0 or 0.00 (though this is not a big deal: when the value is cast as numeric it is all read as 0).
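A rough sketch of how the stored strings could be normalized before any range check, assuming values arrive as text (or null). `parse_phenotype_value` is a hypothetical helper, not an existing function in sgn.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_phenotype_value(raw: Optional[str]) -> Optional[Decimal]:
    """Parse a stored phenotype string; return None for null, blank, or non-numeric input."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None  # empty string or only spaces
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None  # e.g. '-', '#N/A', '1/2'
    # Decimal accepts 'NaN' and 'Infinity'; treat those as unusable too.
    return value if value.is_finite() else None
```

With this, '0', '0.0' and '0.00' all compare equal to 0, and blank or null values can be reported separately from values that are genuinely greater than 1.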


select distinct dbxref.accession, cvterm.name, count(phenotype_id)
from phenotype
join cvterm on cvterm_id = observable_id
join dbxref using (dbxref_id)
where cvterm.name ilike '%incidence%evaluation%'
  and value not like '0%'
  and value not like '1'
group by dbxref.accession, cvterm.name
order by accession;

 accession | name                                                            | count
-----------+-----------------------------------------------------------------+-------
 0000178   | cassava bacterial blight incidence 3-month evaluation           | 33749
 0000179   | cassava bacterial blight incidence 6-month evaluation           | 44570
 0000180   | cassava bacterial blight incidence 9-month evaluation           |  8114
 0000187   | cassava green mite incidence first evaluation                   | 81915
 0000188   | cassava green mite incidence second evaluation                  | 53537
 0000195   | cassava mosaic disease incidence 1-month evaluation             | 20268
 0000196   | cassava mosaic disease incidence 3-month evaluation             | 20541
 0000197   | cassava mosaic disease incidence 9-month evaluation             |  1280
 0000198   | cassava mosaic disease incidence 6-month evaluation             | 17261
 0000200   | cassava mosaic disease incidence 12-month evaluation            |  1114
 0000202   | cassava brown streak disease root incidence 12-month evaluation |   561
 0000208   | cassava brown streak disease leaf incidence 3-month evaluation  |  3269
 0000209   | cassava brown streak disease leaf incidence 6-month evaluation  |  5239
 0000210   | cassava brown streak disease leaf incidence 9-month evaluation  |    18
 0000211   | cassava bacterial blight incidence 12-month evaluation          |  4638

nmenda commented 7 years ago

 accession | name                                        | min  | max    | prop_min | prop_max | count
-----------+---------------------------------------------+------+--------+----------+----------+-------
 0000037   | Cassava bacterial blight incidence          | 0.00 | 177.78 |        1 |        9 |  1712
 0000038   | Cassava anthracnose disease incidence       | 0.00 | 125.00 |        3 |       12 |  1711
 0000039   | Cassava mosaic disease incidence            |    0 | 214.29 |        1 |        9 |  2076
 0000040   | Cassava brown streak disease leaf incidence |    0 |    100 |        3 |        6 |   306
 0000122   | Cassava green mite incidence                | 2.63 | 100.00 |        6 |        9 |    99

nmenda commented 7 years ago

@aco46 is working on the NaCRRI and NRCRI trials. Here are the trials with potential issues:

 phenotype_rows | min  | max               | project_id | trial_name                                     | breeding_program
----------------+------+-------------------+------------+------------------------------------------------+------------------
            904 |    0 |             100.0 |        143 | Uganda Cassava Training Population NaCRRI_2012 | NaCRRI
           2983 |  0.0 |             100.0 |        461 | Uganda Cassava Training Population Kasese_2013 | NaCRRI
            467 |  0.0 |             100.0 |        551 | Within Family Prediction Seedling Trial_2013   | NaCRRI
           1277 |  0.0 |             100.0 |       1632 | Uganda Cassava Training Population Kasese_2014 | NaCRRI
           3315 |  0.0 |             100.0 |       1647 | Uganda Cassava Training Population Ngetta_2014 | NaCRRI
           6099 |  0.0 |               100 |       1650 | Uganda Cassava Training Population Ngetta_2013 | NaCRRI
           3538 |    0 |               100 |       1791 | Uganda Training Population NaCRRI_2014         | NaCRRI
            300 |    0 | 0.931034482758621 |       1797 | Degeneration Trial_Arua_2015                   | NaCRRI
            256 |    0 |               100 |        132 | 10uytUM                                        | NRCRI
            256 |    0 |               100 |        133 | 11uyt10RepeatUM                                | NRCRI
            920 | 0.00 |            150.00 |        135 | 10cetNR10seriesUM                              | NRCRI
           1204 | 0.00 |            225.00 |        136 | 10cetNR10BseriesUM                             | NRCRI
            216 | 0.00 |            100.00 |        137 | 09pytIgariam                                   | NRCRI
             56 | 0.00 |             29.17 |        138 | 09pytOtobi                                     | NRCRI
            280 | 0.00 |            100.00 |        139 | 09pytUmudike                                   | NRCRI
           4847 | 0.00 |            100.00 |        145 | 11cetNR11seriesUM                              | NRCRI
            456 | 0.00 |             80.00 |        155 | 11aytNR09seriesUM                              | NRCRI
            375 | 0.00 |            100.00 |        156 | PYT 2010                                       | NRCRI
            396 | 0.00 |            100.00 |        157 | 10uytCIATsetAUM                                | NRCRI
            676 |    1 |                20 |        190 | 11clonal63cpUM                                 | NRCRI
          26470 | 0.00 |            100.00 |        552 | 12nextgen500tp1UM                              | NRCRI
          74576 | 0.00 |            100.00 |       1597 | 13nextgen518tp1UM                              | NRCRI
          14484 | 0.00 |            100.00 |       1598 | 13nextgen518tp1OT                              | NRCRI
           9981 | 0.00 |            100.00 |       1599 | 13nextgen518tp1KA                              | NRCRI
           6270 | 0.00 |               100 |       1600 | 13nextgen489tp2UM                              | NRCRI
            972 | 0.00 |             32.00 |       1603 | 13ayt25nr11UM                                  | NRCRI
           5723 | 0.00 |              5.00 |       1618 | 13clonal261nr13UM                              | NRCRI
            480 | 0.00 |             27.00 |       1619 | 13uyt15nr06-09UM                               | NRCRI
            792 | 0.00 |             21.00 |       1620 | 13pyt20nr12UM                                  | NRCRI
           1253 | 0.00 |             19.00 |       1621 | 13mp70popfOT                                   | NRCRI
           4078 | 0.00 |             20.00 |       1623 | 13mp259popdOT                                  | NRCRI
            360 | 0.00 |             20.00 |       1624 | 13mp17popfUM                                   | NRCRI
            672 | 0.00 |            100.00 |       1625 | 13cmd35gpOT                                    | NRCRI
           2048 | 0.00 |             19.00 |       1626 | 13mp144popdUM                                  | NRCRI
           1042 | 0.00 |             22.00 |       1627 | 13cmd60gpUM                                    | NRCRI
            289 | 0.00 |            100.00 |       1628 | 12cmd36gpOT                                    | NRCRI
            982 | 0.00 |             18.00 |       1629 | 12cmd75gpUM                                    | NRCRI
            503 | 0.00 |             20.00 |       1630 | 11cmd48gpUM                                    | NRCRI
            480 | 0.00 |            100.00 |       1642 | 15cb28cgmUM                                    | NRCRI
            432 |    0 |                30 |       2820 | 14mlt12hbc_UM                                  | NRCRI

nmenda commented 7 years ago

Green mite incidence traits were loaded with integers between 1 and 5, which looks like a scale, e.g. in some IITA trials:

select distinct count(phenotype_id) as phenotype_rows,
       min(cast(phenotype.value as numeric)),
       max(cast(phenotype.value as numeric)),
       pname.project_id, pname.name, bprogram.name
from phenotype
join cvterm on cvterm_id = observable_id
join nd_experiment_phenotype using (phenotype_id)
join nd_experiment_project using (nd_experiment_id)
join project as pname using (project_id)
join project_relationship on project_relationship.subject_project_id = pname.project_id
join project as bprogram on bprogram.project_id = project_relationship.object_project_id
where cvterm.name ilike '%green%incidence%'
  and phenotype.value not like ' %'
  and phenotype.value not like '-'
  and phenotype.value not like '#%'
  and phenotype.value not like '%/%'
  and bprogram.name ilike 'IITA%'
group by pname.project_id, pname.name, bprogram.name
order by bprogram.name, pname.project_id;

 phenotype_rows | min | max | project_id | trial_name       | breeding_program
----------------+-----+-----+------------+------------------+------------------
            112 | 1.0 | 5.0 |        182 | 11ayt14yrt8IB    | IITA
            214 | 1.0 | 5.0 |        196 | 11ayt27yrt9IB    | IITA
            206 | 2.0 | 4.0 |        200 | 11uyt25sgIB      | IITA
            216 | 2.0 | 3.0 |        213 | 12ayt27yrtIB     | IITA
            480 | 2.0 | 5.0 |        214 | 12ayt20yrtIB     | IITA
            240 |   2 |   5 |        216 | 12ayt30yrt11IB   | IITA
            256 | 2.0 | 3.0 |        217 | 12ayt32mixedIB   | IITA
            380 | 2.0 | 4.0 |        218 | 12ayt33mixedpdIB | IITA
            277 | 2.0 | 3.0 |        219 | 12ayt35ppdIB     | IITA
            274 | 1.0 | 4.0 |        220 | 12ayt35yrt10IB   | IITA
            296 | 2.0 | 4.0 |        221 | 12ayt37mixedIB   | IITA
            312 | 2.0 | 5.0 |        222 | 12ayt39yrt12IB   | IITA
            724 | 2.0 | 5.0 |        223 | 12ayt66pdwrtIB   | IITA
            334 | 2.0 | 5.0 |        224 | 12pyt86yrtIB     | IITA
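The per-trial min/max ranges in this thread can be triaged with a simple heuristic. This is only a sketch: the category names and the cut-offs (1, 5, 100) are assumptions read off the ranges reported here, and a range alone cannot tell a percentage from a plant count, so each flagged trial still needs checking against its original file.

```python
def classify_range(min_value: float, max_value: float) -> str:
    """Guess what kind of data a trial holds for an incidence trait from its min/max."""
    if min_value < 0 or max_value > 100:
        return "out_of_range"      # e.g. max 225.00: neither a proportion nor a percentage
    if max_value <= 1:
        return "proportion"        # already within 0-1
    if min_value >= 1 and max_value <= 5:
        return "possible_score"    # looks like a 1-5 scale rather than an incidence
    return "percent_or_count"      # needs the original file to decide


# Ranges taken from the query results in this thread.
print(classify_range(0, 0.931034482758621))
print(classify_range(1.0, 5.0))
print(classify_range(0.00, 225.00))
print(classify_range(0.00, 100.00))
```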

aafolabi commented 7 years ago

I have checked some of the trials listed. These are cgm severity scores. We score it twice (at the beginning and the end of the dry season) as cgm1 and cgm2.

nmenda commented 7 years ago

It looks like it will be safer to reload the NRCRI trials that have issues with disease incidence and severity. These are the 19 trials: 132, 133, 135, 136, 137, 138, 139, 143, 145, 146, 149, 150, 151, 152, 155, 156, 157, 159, 160

nmenda commented 7 years ago

Look at the 9 disease incidence traits that need to be mapped, if possible, to the relevant variables:

 accession | name                                        | rel_type
-----------+---------------------------------------------+----------
 0000037   | Cassava bacterial blight incidence          | is_a
 0000038   | Cassava anthracnose disease incidence       | is_a
 0000039   | Cassava mosaic disease incidence            | is_a
 0000040   | Cassava brown streak disease leaf incidence | is_a
 0000089   | Cassava brown streak disease root incidence | is_a
 0000122   | Cassava green mite incidence                | is_a
 0000339   | Cassava Mealy Bug Incidence                 | is_a
 0000443   | Red Spider Mite Incidence                   | is_a
 0000464   | Spiraling Whitefly Incidence                | is_a

then do the same for the disease severity traits:

https://github.com/solgenomics/sgn/issues/771#issuecomment-267692736

nmenda commented 7 years ago

Check the values we have for the 24 incidence trait variables, and convert them from % to a 0-1 proportion:

 accession | name                                                            | rel_type
-----------+-----------------------------------------------------------------+-------------
 0000041   | cassava mealy bug incidence by ratio                            | VARIABLE_OF
 0000043   | spiraling whitefly incidence by proportion                      | VARIABLE_OF
 0000045   | red spider mite incidence by proportion                         | VARIABLE_OF
 0000178   | cassava bacterial blight incidence 3-month evaluation           | VARIABLE_OF
 0000179   | cassava bacterial blight incidence 6-month evaluation           | VARIABLE_OF
 0000180   | cassava bacterial blight incidence 9-month evaluation           | VARIABLE_OF
 0000181   | cassava anthractnose disease incidence in 6-month               | VARIABLE_OF
 0000182   | cassava anthractnose disease incidence in 9-month               | VARIABLE_OF
 0000183   | cassava anthractnose disease incidence in12-month               | VARIABLE_OF
 0000187   | cassava green mite incidence first evaluation                   | VARIABLE_OF
 0000188   | cassava green mite incidence second evaluation                  | VARIABLE_OF
 0000195   | cassava mosaic disease incidence 1-month evaluation             | VARIABLE_OF
 0000196   | cassava mosaic disease incidence 3-month evaluation             | VARIABLE_OF
 0000197   | cassava mosaic disease incidence 9-month evaluation             | VARIABLE_OF
 0000198   | cassava mosaic disease incidence 6-month evaluation             | VARIABLE_OF
 0000200   | cassava mosaic disease incidence 12-month evaluation            | VARIABLE_OF
 0000202   | cassava brown streak disease root incidence 12-month evaluation | VARIABLE_OF
 0000207   | cassava brown streak disease leaf incidence 1-month evaluation  | VARIABLE_OF
 0000208   | cassava brown streak disease leaf incidence 3-month evaluation  | VARIABLE_OF
 0000209   | cassava brown streak disease leaf incidence 6-month evaluation  | VARIABLE_OF
 0000210   | cassava brown streak disease leaf incidence 9-month evaluation  | VARIABLE_OF
 0000211   | cassava bacterial blight incidence 12-month evaluation          | VARIABLE_OF
 0000219   | cassava anthractnose disease incidence in 3-month               | VARIABLE_OF
 0000235   | female flower incidence in proportion                           | VARIABLE_OF
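The % to proportion conversion itself is a division by 100. The guard is the more useful part, since values above 100 cannot be percentages. A minimal sketch, assuming the value has already been confirmed to be a percentage and not a plant count; `percent_to_proportion` is a hypothetical helper name.

```python
def percent_to_proportion(percent: float) -> float:
    """Convert a 0-100 percentage to a 0-1 proportion; reject anything outside 0-100."""
    if not 0 <= percent <= 100:
        raise ValueError(f"{percent} is not a valid percentage")
    return percent / 100


print(percent_to_proportion(50))   # 0.5
print(percent_to_proportion(100))  # 1.0
```

A value such as 177.78 raises ValueError instead of being silently scaled to 1.7778, which would still fail the 0-1 check.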

nmenda commented 7 years ago

Add cvtermprop trait_maximum = 1 and trait_minimum = 0 to the 24 incidence traits

nmenda commented 7 years ago

add the max, min condition to https://github.com/solgenomics/sgn/blob/master/lib/CXGN/Phenotypes/StorePhenotypes.pm#L245

and to

https://github.com/solgenomics/sgn/blob/master/lib/CXGN/Fieldbook/TraitInfo.pm#L88
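For illustration only, the max/min condition could look roughly like the sketch below. It is written in Python, not the Perl of the two modules linked above, and it does not reproduce their code. It assumes the `trait_minimum` and `trait_maximum` props are available as a mapping of strings for each trait; `check_trait_bounds` is a hypothetical name.

```python
from typing import Mapping, Optional


def check_trait_bounds(value: float, props: Mapping[str, str]) -> Optional[str]:
    """Return an error message if value violates the trait's min/max props, else None."""
    minimum = props.get("trait_minimum")
    maximum = props.get("trait_maximum")
    if minimum is not None and value < float(minimum):
        return f"value {value} is below trait_minimum {minimum}"
    if maximum is not None and value > float(maximum):
        return f"value {value} is above trait_maximum {maximum}"
    return None  # within bounds, or the trait has no bounds defined


# Assumed props for an incidence trait, as proposed earlier in this thread.
incidence_props = {"trait_minimum": "0", "trait_maximum": "1"}
print(check_trait_bounds(0.25, incidence_props))
print(check_trait_bounds(29.17, incidence_props))
```

Traits without either prop pass unchanged, so the check only affects traits that have been given bounds.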

nmenda commented 7 years ago

https://github.com/solgenomics/sgn-home/blob/master/naama/iita_disease_incidence_trials.tab

list of IITA trials with problematic disease incidence values

lukasmueller commented 3 years ago

Disease incidence values are usually 1-5.