Closed nmenda closed 1 year ago
Many disease incidence rows have phenotype values > 1. Many of these have a value of null or a string containing only spaces. Also, some '0' values were loaded as 0.0 or 0.00 (though this is not a big deal: when the value is cast as numeric it is all read as 0).
select distinct dbxref.accession, cvterm.name, count(phenotype_id)
from phenotype
join cvterm on cvterm_id = observable_id
join dbxref using (dbxref_id)
where cvterm.name ilike '%incidence%evaluation%'
  and value not like '0%'
  and value not like '1'
group by dbxref.accession, cvterm.name
order by accession;
accession | name | count
-----------+-----------------------------------------------------------------+-------
0000178 | cassava bacterial blight incidence 3-month evaluation | 33749
0000179 | cassava bacterial blight incidence 6-month evaluation | 44570
0000180 | cassava bacterial blight incidence 9-month evaluation | 8114
0000187 | cassava green mite incidence first evaluation | 81915
0000188 | cassava green mite incidence second evaluation | 53537
0000195 | cassava mosaic disease incidence 1-month evaluation | 20268
0000196 | cassava mosaic disease incidence 3-month evaluation | 20541
0000197 | cassava mosaic disease incidence 9-month evaluation | 1280
0000198 | cassava mosaic disease incidence 6-month evaluation | 17261
0000200 | cassava mosaic disease incidence 12-month evaluation | 1114
0000202 | cassava brown streak disease root incidence 12-month evaluation | 561
0000208 | cassava brown streak disease leaf incidence 3-month evaluation | 3269
0000209 | cassava brown streak disease leaf incidence 6-month evaluation | 5239
0000210 | cassava brown streak disease leaf incidence 9-month evaluation | 18
0000211 | cassava bacterial blight incidence 12-month evaluation | 4638
accession | name | min | max | prop_min | prop_max | count
-----------+----------------------------------------------+-------+--------+----------+----------+-------
0000037 | Cassava bacterial blight incidence | 0.00 | 177.78 | 1 | 9 | 1712
0000038 | Cassava anthracnose disease incidence | 0.00 | 125.00 | 3 | 12 | 1711
0000039 | Cassava mosaic disease incidence | 0 | 214.29 | 1 | 9 | 2076
0000040 | Cassava brown streak disease leaf incidence | 0 | 100 | 3 | 6 | 306
0000122 | Cassava green mite incidence | 2.63 | 100.00 | 6 | 9 | 99
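For triaging exported rows outside the database, the categories described above (null or blank values, non-numeric strings, values outside 0-1) can be sketched in Python. This is only a rough illustration: `classify_value` and the sample list are hypothetical, not part of cassavabase, and the bucket names are my own.

```python
from collections import Counter
from decimal import Decimal, InvalidOperation


def classify_value(raw):
    """Put one raw phenotype.value string into a coarse bucket (hypothetical helper)."""
    if raw is None or raw.strip() == "":
        return "blank"            # NULL or whitespace-only string
    try:
        number = Decimal(raw.strip())
    except InvalidOperation:
        return "non_numeric"      # e.g. '-' or '3/5'
    if not number.is_finite():
        return "non_numeric"      # 'NaN' and 'Infinity' parse but are not usable
    if number < 0 or number > 1:
        return "out_of_range"     # cannot be a 0-1 proportion
    return "ok"


# Made-up sample values, chosen to mirror the cases mentioned in this issue.
values = [None, "  ", "0", "0.0", "0.00", "1", "0.5", "177.78", "-", "3/5"]
counts = Counter(classify_value(v) for v in values)
print(counts)
```

Note that '0', '0.0' and '0.00' all land in the same bucket, which matches the observation that the different zero spellings are harmless once cast to numeric.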
@aco46 is working on the NaCRRI and NRCRI trials. Here are the trials with potential issues:
phenotype_rows | min | max | project_id | name | name
----------------+------+-------------------+------------+------------------------------------------------+--------
904 | 0 | 100.0 | 143 | Uganda Cassava Training Population NaCRRI_2012 | NaCRRI
2983 | 0.0 | 100.0 | 461 | Uganda Cassava Training Population Kasese_2013 | NaCRRI
467 | 0.0 | 100.0 | 551 | Within Family Prediction Seedling Trial_2013 | NaCRRI
1277 | 0.0 | 100.0 | 1632 | Uganda Cassava Training Population Kasese_2014 | NaCRRI
3315 | 0.0 | 100.0 | 1647 | Uganda Cassava Training Population Ngetta_2014 | NaCRRI
6099 | 0.0 | 100 | 1650 | Uganda Cassava Training Population Ngetta_2013 | NaCRRI
3538 | 0 | 100 | 1791 | Uganda Training Population NaCRRI_2014 | NaCRRI
300 | 0 | 0.931034482758621 | 1797 | Degeneration Trial_Arua_2015 | NaCRRI
256 | 0 | 100 | 132 | 10uytUM | NRCRI
256 | 0 | 100 | 133 | 11uyt10RepeatUM | NRCRI
920 | 0.00 | 150.00 | 135 | 10cetNR10seriesUM | NRCRI
1204 | 0.00 | 225.00 | 136 | 10cetNR10BseriesUM | NRCRI
216 | 0.00 | 100.00 | 137 | 09pytIgariam | NRCRI
56 | 0.00 | 29.17 | 138 | 09pytOtobi | NRCRI
280 | 0.00 | 100.00 | 139 | 09pytUmudike | NRCRI
4847 | 0.00 | 100.00 | 145 | 11cetNR11seriesUM | NRCRI
456 | 0.00 | 80.00 | 155 | 11aytNR09seriesUM | NRCRI
375 | 0.00 | 100.00 | 156 | PYT 2010 | NRCRI
396 | 0.00 | 100.00 | 157 | 10uytCIATsetAUM | NRCRI
676 | 1 | 20 | 190 | 11clonal63cpUM | NRCRI
26470 | 0.00 | 100.00 | 552 | 12nextgen500tp1UM | NRCRI
74576 | 0.00 | 100.00 | 1597 | 13nextgen518tp1UM | NRCRI
14484 | 0.00 | 100.00 | 1598 | 13nextgen518tp1OT | NRCRI
9981 | 0.00 | 100.00 | 1599 | 13nextgen518tp1KA | NRCRI
6270 | 0.00 | 100 | 1600 | 13nextgen489tp2UM | NRCRI
972 | 0.00 | 32.00 | 1603 | 13ayt25nr11UM | NRCRI
5723 | 0.00 | 5.00 | 1618 | 13clonal261nr13UM | NRCRI
480 | 0.00 | 27.00 | 1619 | 13uyt15nr06-09UM | NRCRI
792 | 0.00 | 21.00 | 1620 | 13pyt20nr12UM | NRCRI
1253 | 0.00 | 19.00 | 1621 | 13mp70popfOT | NRCRI
4078 | 0.00 | 20.00 | 1623 | 13mp259popdOT | NRCRI
360 | 0.00 | 20.00 | 1624 | 13mp17popfUM | NRCRI
672 | 0.00 | 100.00 | 1625 | 13cmd35gpOT | NRCRI
2048 | 0.00 | 19.00 | 1626 | 13mp144popdUM | NRCRI
1042 | 0.00 | 22.00 | 1627 | 13cmd60gpUM | NRCRI
289 | 0.00 | 100.00 | 1628 | 12cmd36gpOT | NRCRI
982 | 0.00 | 18.00 | 1629 | 12cmd75gpUM | NRCRI
503 | 0.00 | 20.00 | 1630 | 11cmd48gpUM | NRCRI
480 | 0.00 | 100.00 | 1642 | 15cb28cgmUM | NRCRI
432 | 0 | 30 | 2820 | 14mlt12hbc_UM | NRCRI
Green mite incidence traits were loaded with integers between 1 and 5, which looks like a scale, e.g. in some IITA trials:
select distinct count(phenotype_id) as phenotype_rows,
       min(cast(phenotype.value as numeric)),
       max(cast(phenotype.value as numeric)),
       pname.project_id, pname.name, bprogram.name
from phenotype
join cvterm on cvterm_id = observable_id
join nd_experiment_phenotype using (phenotype_id)
join nd_experiment_project using (nd_experiment_id)
join project as pname using (project_id)
join project_relationship on project_relationship.subject_project_id = pname.project_id
join project as bprogram on bprogram.project_id = project_relationship.object_project_id
where cvterm.name ilike '%green%incidence%'
  and phenotype.value not like ' %'
  and phenotype.value not like '-'
  and phenotype.value not like '#%'
  and phenotype.value not like '%/%'
  and bprogram.name ilike 'IITA%'
group by pname.project_id, pname.name, bprogram.name
order by bprogram.name, pname.project_id;
phenotype_rows | min | max | project_id | name | name
----------------+-----+-----+------------+-----------------------------+------
112 | 1.0 | 5.0 | 182 | 11ayt14yrt8IB | IITA
214 | 1.0 | 5.0 | 196 | 11ayt27yrt9IB | IITA
206 | 2.0 | 4.0 | 200 | 11uyt25sgIB | IITA
216 | 2.0 | 3.0 | 213 | 12ayt27yrtIB | IITA
480 | 2.0 | 5.0 | 214 | 12ayt20yrtIB | IITA
240 | 2 | 5 | 216 | 12ayt30yrt11IB | IITA
256 | 2.0 | 3.0 | 217 | 12ayt32mixedIB | IITA
380 | 2.0 | 4.0 | 218 | 12ayt33mixedpdIB | IITA
277 | 2.0 | 3.0 | 219 | 12ayt35ppdIB | IITA
274 | 1.0 | 4.0 | 220 | 12ayt35yrt10IB | IITA
296 | 2.0 | 4.0 | 221 | 12ayt37mixedIB | IITA
312 | 2.0 | 5.0 | 222 | 12ayt39yrt12IB | IITA
724 | 2.0 | 5.0 | 223 | 12ayt66pdwrtIB | IITA
334 | 2.0 | 5.0 | 224 | 12pyt86yrtIB | IITA
I have checked some of the trials listed. These are cgm severity scores. We score it twice (at the beginning and the end of the dry season) as cgm1 and cgm2.
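A per-trial min/max pair like the ones in the tables above can give a first hint about what kind of value was actually loaded. The sketch below is a heuristic of my own, not an existing cassavabase check: `guess_scale` and its labels are hypothetical, the thresholds are assumptions, and a curator still has to confirm each trial against the original files.

```python
from decimal import Decimal


def guess_scale(minimum, maximum):
    """Guess what a trial's incidence values represent from their range (heuristic only)."""
    lo, hi = Decimal(minimum), Decimal(maximum)
    if lo < 0:
        return "invalid"
    if hi <= 1:
        return "proportion"                 # already in the 0-1 range
    if lo >= 1 and hi <= 5:
        return "possible 1-5 score"         # e.g. a severity score loaded as incidence
    if hi <= 100:
        return "percentage or plant count"  # ambiguous without the source file
    return "above 100: count or error"      # cannot be a percentage


# min/max pairs taken from rows in this thread (trials 1797, 182, 137, 136)
for lo, hi in [("0", "0.931034482758621"), ("1.0", "5.0"),
               ("0.00", "100.00"), ("0.00", "225.00")]:
    print(lo, hi, "->", guess_scale(lo, hi))
```

The ranges overlap, so a trial with a maximum of 5.00 and a minimum of 0.00 is reported as "percentage or plant count" even though it could equally be something else; the output is a prompt for manual checking, not a classification.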
It looks like it will be safer to reload the NRCRI trials that have issues with disease incidence and severity. This applies to these 19 trials: 132, 133, 135, 136, 137, 138, 139, 143, 145, 146, 149, 150, 151, 152, 155, 156, 157, 159, 160
Look at the 9 disease incidence traits that need to be mapped if possible to the relevant variables:
accession | name | rel_type
-----------+-----------------------------------------------------------------+-------------
0000037 | Cassava bacterial blight incidence | is_a
0000038 | Cassava anthracnose disease incidence | is_a
0000039 | Cassava mosaic disease incidence | is_a
0000040 | Cassava brown streak disease leaf incidence | is_a
0000089 | Cassava brown streak disease root incidence | is_a
0000122 | Cassava green mite incidence | is_a
0000339 | Cassava Mealy Bug Incidence | is_a
0000443 | Red Spider Mite Incidence | is_a
0000464 | Spiraling Whitefly Incidence | is_a
Then do the same for the disease severity traits:
https://github.com/solgenomics/sgn/issues/771#issuecomment-267692736
Check the values we have for the 24 incidence trait variables, and convert them from % to a 0-1 proportion:
0000041 | cassava mealy bug incidence by ratio | VARIABLE_OF
0000043 | spiraling whitefly incidence by proportion | VARIABLE_OF
0000045 | red spider mite incidence by proportion | VARIABLE_OF
0000178 | cassava bacterial blight incidence 3-month evaluation | VARIABLE_OF
0000179 | cassava bacterial blight incidence 6-month evaluation | VARIABLE_OF
0000180 | cassava bacterial blight incidence 9-month evaluation | VARIABLE_OF
0000181 | cassava anthractnose disease incidence in 6-month | VARIABLE_OF
0000182 | cassava anthractnose disease incidence in 9-month | VARIABLE_OF
0000183 | cassava anthractnose disease incidence in12-month | VARIABLE_OF
0000187 | cassava green mite incidence first evaluation | VARIABLE_OF
0000188 | cassava green mite incidence second evaluation | VARIABLE_OF
0000195 | cassava mosaic disease incidence 1-month evaluation | VARIABLE_OF
0000196 | cassava mosaic disease incidence 3-month evaluation | VARIABLE_OF
0000197 | cassava mosaic disease incidence 9-month evaluation | VARIABLE_OF
0000198 | cassava mosaic disease incidence 6-month evaluation | VARIABLE_OF
0000200 | cassava mosaic disease incidence 12-month evaluation | VARIABLE_OF
0000202 | cassava brown streak disease root incidence 12-month evaluation | VARIABLE_OF
0000207 | cassava brown streak disease leaf incidence 1-month evaluation | VARIABLE_OF
0000208 | cassava brown streak disease leaf incidence 3-month evaluation | VARIABLE_OF
0000209 | cassava brown streak disease leaf incidence 6-month evaluation | VARIABLE_OF
0000210 | cassava brown streak disease leaf incidence 9-month evaluation | VARIABLE_OF
0000211 | cassava bacterial blight incidence 12-month evaluation | VARIABLE_OF
0000219 | cassava anthractnose disease incidence in 3-month | VARIABLE_OF
0000235 | female flower incidence in proportion | VARIABLE_OF
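The % to 0-1 conversion itself is just a division by 100, but it is only safe for values that really are percentages. A minimal sketch, assuming the values arrive as strings; `percent_to_proportion` is a hypothetical helper, not an existing cassavabase function:

```python
from decimal import Decimal


def percent_to_proportion(raw):
    """Convert a percentage string to a 0-1 proportion.

    Returns None when the value lies outside 0-100 and so cannot be a percentage.
    """
    pct = Decimal(raw.strip())
    if pct < 0 or pct > 100:
        return None
    return pct / Decimal(100)


print(percent_to_proportion("29.17"))   # a maximum seen in trial 138
print(percent_to_proportion("177.78"))  # above 100, so not convertible
```

This should only be applied to trials already known to hold percentages. A value that is already a proportion (e.g. 0.93) passes the range check and would be silently divided by 100 again.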
Add cvtermprop trait_maximum = 1 and trait_minimum = 0 to the 24 incidence traits
https://github.com/solgenomics/sgn-home/blob/master/naama/iita_disease_incidence_trials.tab
list of IITA trials with problematic disease incidence values
In these trials the disease incidence values are usually 1-5.
There are many trials in cassavabase with values that are not a proportion of plants, but rather a count of plants, e.g. https://www.cassavabase.org/breeders/trial/1598
We cannot update these values, since the total number of plants per plot at the time of measurement is not in the database, but this number should be in the original files.
Need to delete those phenotype rows and reload the corrected values.
Some older trials from Chiedozie do not have this data, and the values will have to be deleted without replacement, e.g. https://www.cassavabase.org/breeders/trial/135
Need to add a cvtermprop for all the disease incidence traits, so that values are checked at the time of uploading phenotypes.
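For the trials where the original files do have the plot totals, the corrected value is the affected count divided by the number of plants in the plot at the time of measurement. A small sketch, with `incidence_proportion` and the example numbers being hypothetical:

```python
def incidence_proportion(affected_plants, total_plants):
    """Turn a count of affected plants into a 0-1 proportion for one plot."""
    if total_plants <= 0:
        raise ValueError("total number of plants must be positive")
    if not 0 <= affected_plants <= total_plants:
        raise ValueError("affected count must be between 0 and the plot total")
    return affected_plants / total_plants


# Made-up example: 3 affected plants out of 12 in the plot
print(incidence_proportion(3, 12))
```

Raising an error when the count exceeds the plot total is deliberate: it catches rows where the wrong total was paired with the count, rather than loading another value above 1.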
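As a rough idea of what the upload-time check could look like once trait_minimum and trait_maximum are stored: the sketch below uses an in-memory dict as a stand-in for the cvtermprop rows. `TRAIT_LIMITS` and `check_upload_value` are hypothetical names, and this is not the actual cassavabase upload code.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical stand-in for cvtermprop rows, keyed by trait accession.
TRAIT_LIMITS = {
    "0000178": {"trait_minimum": Decimal(0), "trait_maximum": Decimal(1)},
}


def check_upload_value(accession, raw, limits=TRAIT_LIMITS):
    """Return an error message for a rejected value, or None if it is accepted."""
    bounds = limits.get(accession)
    if bounds is None:
        return None  # no limits stored for this trait, nothing to check
    try:
        value = Decimal(raw.strip())
    except (InvalidOperation, AttributeError):
        return f"{accession}: {raw!r} is not numeric"
    if not value.is_finite():
        return f"{accession}: {raw!r} is not numeric"
    if value < bounds["trait_minimum"] or value > bounds["trait_maximum"]:
        return (f"{accession}: {value} is outside "
                f"{bounds['trait_minimum']}-{bounds['trait_maximum']}")
    return None


print(check_upload_value("0000178", "0.5"))  # accepted
print(check_upload_value("0000178", "33"))   # rejected: above trait_maximum
```

Returning a message instead of raising lets an uploader collect every bad row in a file and report them together.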