legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

Missing Data: mixed.qtl.Hwang_King_2016 #206

Open jd-campbell opened 6 months ago

jd-campbell commented 6 months ago

I have noticed that there is missing data in the mixed.qtl.Hwang_King_2016 directory. The *qtl.tsv file only contains 2 QTLs but the SoyBase MySQL database lists 27 QTLs.

#qtl_identifier trait_name  genetic_map linkage_group   start   end peak
mqCanopy wilt-019   Canopy wilt GmComposite2003 A1  0.98    2.98    1.98
mqCanopy wilt-021   Canopy wilt GmComposite2003 D2  46.8    48.8    47.8

@jd-campbell Will review the paper and SoyBase MySQL to ensure all the data is in the DS.
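A quick way to reproduce the row count described above is to count the non-header lines in the *qtl.tsv file. The following is a minimal sketch, assuming the file is tab-delimited with a single `#`-prefixed header line; the function name `count_qtl_rows` is hypothetical and the sample data is just the two rows quoted in this comment.

```python
import csv
import io


def count_qtl_rows(handle):
    """Count data rows in a tab-delimited QTL file.

    Blank lines and lines whose first field starts with '#' are skipped.
    """
    count = 0
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip() or row[0].startswith("#"):
            continue
        count += 1
    return count


# In-memory stand-in for the datastore file (assumed tab-delimited).
sample = io.StringIO(
    "#qtl_identifier\ttrait_name\tgenetic_map\tlinkage_group\tstart\tend\tpeak\n"
    "mqCanopy wilt-019\tCanopy wilt\tGmComposite2003\tA1\t0.98\t2.98\t1.98\n"
    "mqCanopy wilt-021\tCanopy wilt\tGmComposite2003\tD2\t46.8\t48.8\t47.8\n"
)
print(count_qtl_rows(sample))
```

For a real file, the same function can be called on `open(path, newline="")`, and the result compared against the count reported by the database.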

adf-ncgr commented 6 months ago

@jd-campbell Not 100% sure, but this sounds potentially related to some other issues that I'm guessing stem from the scripts Sam wrote to generate the soybean QTL files from the info in the SoyBase MySQL. At present I have no clue where those scripts are, but I will send a flare up to Sam and see if he has any recollection of where he might have put them.

adf-ncgr commented 6 months ago

Sam was super-fast and helpful in his response. The scripts are here: https://github.com/sammyjava/SoyBase. He did say that the direct outputs were subjected to ad hoc munging due to naming conflicts and the like, but it seems like a good place to start (provided I can actually figure out how to run the scripts, which he said require some ssh tunneling to the MySQL db). Anyway, this may also be relevant for #205, so I'll hopefully be able to make some headway on it.

jd-campbell commented 6 months ago

@adf-ncgr Thanks for the info. This helps in my work. Please send my thanks to Sam also!

adf-ncgr commented 6 months ago

@jd-campbell not sure this one is quite ready to be closed, but here's an update. I got Sam's code to run and for this dataset it seems to have produced 26 QTLs, although one of them (mqCanopy wilt-013) looks like it may be problematic without location info:

mqCanopy wilt-014       Canopy wilt     GmComposite2003_D1b     50.11   52.61   51.36
mqCanopy wilt-019       Canopy wilt     GmComposite2003_A1      0.98    2.98    1.98
mqCanopy wilt-021       Canopy wilt     GmComposite2003_D2      46.8    48.8    47.8
mqCanopy wilt-013       Canopy wilt
mqCanopy wilt-008       Canopy wilt     GmComposite2003_A1      16.16   18.16   17.16
mqCanopy wilt-012       Canopy wilt     GmComposite2003_D2      124.0   126.0   125.0
mqCanopy wilt-015       Canopy wilt     GmComposite2003_D2      114.97  124.02  119.5
mqCanopy wilt-007       Canopy wilt     GmComposite2003_A1      2.54    4.54    3.54
mqCanopy wilt-023       Canopy wilt     GmComposite2003_D1b     47.69   49.69   48.69
mqCanopy wilt-005       Canopy wilt     GmComposite2003_D1b     83.04   85.04   84.04
mqCanopy wilt-011       Canopy wilt     GmComposite2003_D2      56.07   58.07   57.07
mqCanopy wilt-022       Canopy wilt     GmComposite2003_D1b     33.42   35.42   34.42
mqCanopy wilt-016       Canopy wilt     GmComposite2003_D1b     3.79    6.54    5.17
mqCanopy wilt-026       Canopy wilt     GmComposite2003_L       47.2    49.2    48.2
mqCanopy wilt-002       Canopy wilt     GmComposite2003_D1b     0.0     1.0     0.5
mqCanopy wilt-017       Canopy wilt     GmComposite2003_D1b     4.51    6.51    5.51
mqCanopy wilt-024       Canopy wilt     GmComposite2003_B1      33.25   35.25   34.25
mqCanopy wilt-006       Canopy wilt     GmComposite2003_D2      51.4    53.4    52.4
mqCanopy wilt-010       Canopy wilt     GmComposite2003_B1      54.8    56.8    55.8
mqCanopy wilt-027       Canopy wilt     GmComposite2003_D2      125.5   127.5   126.5
mqCanopy wilt-001       Canopy wilt     GmComposite2003_D1b     11.58   13.58   12.58
mqCanopy wilt-009       Canopy wilt     GmComposite2003_B1      75.1    77.1    76.1
mqCanopy wilt-003       Canopy wilt     GmComposite2003_D1b     51.61   53.61   52.61
mqCanopy wilt-020       Canopy wilt     GmComposite2003_B1      64.82   85.59   75.21
mqCanopy wilt-018       Canopy wilt     GmComposite2003_D1b     84.04   85.59   84.82
mqCanopy wilt-025       Canopy wilt     GmComposite2003_L       81.9    83.9    82.9

In any case, I'm not sure why the datastore file would only have 2 QTLs since this one seems more complete (though maybe still not entirely complete?). I'll try to explore a little more but wanted to let you know there's at least some progress on this.
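Rows like mqCanopy wilt-013 can be picked out mechanically by checking how many non-empty fields each row has. This is only a sketch: it assumes the script output is tab-delimited with six fields per complete row (identifier, trait, map and linkage group, start, end, peak), and the helper name `rows_missing_position` is hypothetical.

```python
def rows_missing_position(lines, expected_fields=6):
    """Return identifiers of rows that have fewer non-empty fields than expected."""
    missing = []
    for line in lines:
        # Drop empty fields so a row padded with trailing tabs still counts as incomplete.
        fields = [f for f in line.rstrip("\n").split("\t") if f.strip()]
        if not fields or fields[0].startswith("#"):
            continue
        if len(fields) < expected_fields:
            missing.append(fields[0])
    return missing


# Three rows taken from the output above, re-joined with tabs.
output_lines = [
    "mqCanopy wilt-014\tCanopy wilt\tGmComposite2003_D1b\t50.11\t52.61\t51.36\n",
    "mqCanopy wilt-013\tCanopy wilt\n",
    "mqCanopy wilt-008\tCanopy wilt\tGmComposite2003_A1\t16.16\t18.16\t17.16\n",
]
print(rows_missing_position(output_lines))
```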

adf-ncgr commented 6 months ago

OK, it looks like the issue with that one QTL without location info is probably a data issue, and not the fault of the code. mqCanopy wilt-013 is one of 38 QTLs without an entry in qtl_position_table:

select QTLID, QTLName from qtl_table where QTLID not in (select QTLID from qtl_position_table);
+-------+-----------------------------+
| QTLID | QTLName                     |
+-------+-----------------------------+
|    18 | Chlorimuron sensitivity 1-4 |
|    19 | Chlorimuron sensitivity 1-5 |
|    20 | Chlorimuron sensitivity 1-6 |
|    27 | Chlorimuron sensitivity 2-2 |
|  1208 | cqSeed protein-002          |
|    72 | Fe effic 2-1                |
|   163 | Leaflet ash 1-6             |
|  1410 | Leaflet shape 9-5           |
|   175 | Lodging 4-1                 |
|  4291 | mqCanopy wilt-013           |
|   440 | Plant height 11-4           |
|  4072 | Plant height 37-7           |
|   393 | Plant height 4-3            |
|   408 | Plant height 5-14           |
|   418 | Plant height 6-10           |
|   421 | Plant height 6-13           |
|   417 | Plant height 6-9            |
|   425 | Plant height 7-3            |
|   464 | Pod dehiscence 1-11         |
|   465 | Pod dehiscence 1-12         |
|  2534 | Sclero 8-4                  |
|   736 | SCN 10-2                    |
|   732 | SCN 9-4                     |
|   733 | SCN 9-5                     |
|   950 | SDS 8-4                     |
|   554 | Seed protein 5-5            |
|   555 | Seed protein 5-6            |
|   976 | Seed sucrose 1-11           |
|   977 | Seed sucrose 1-12           |
|   979 | Seed sucrose 1-14           |
|   980 | Seed sucrose 1-15           |
|   981 | Seed sucrose 1-16           |
|   982 | Seed sucrose 1-17           |
|   826 | Seed weight 3-7             |
|   828 | Seed weight 3-9             |
|  1193 | Seed yield 15-14            |
|   894 | Seed yield 3-3              |
|   965 | Stem length, main 1-1       |
+-------+-----------------------------+
38 rows in set (0.0682 sec)

Also note that the db seems to have only 26 not 27 QTLs (at least, per select count(*) from qtl_table where QTLName like 'mqCanopy wilt%'), so I think the version I got out of running the code is probably close to correct. Let me know if you think that one QTL missing a position can be fixed in the db, otherwise I'll just replace the datastore file with the new one.
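For anyone who wants to try the two checks above without access to the SoyBase MySQL instance, here is a minimal sketch that runs the same queries against an in-memory SQLite database. The two-table schema is a reduced stand-in (only the columns named in the queries), the helper names are hypothetical, and every QTLID except 4291 is made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the two tables used in the queries."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE qtl_table (QTLID INTEGER PRIMARY KEY, QTLName TEXT);
        CREATE TABLE qtl_position_table (QTLID INTEGER NOT NULL);
        """
    )
    conn.executemany(
        "INSERT INTO qtl_table (QTLID, QTLName) VALUES (?, ?)",
        [
            (4291, "mqCanopy wilt-013"),  # ID as reported in the thread
            (9001, "mqCanopy wilt-014"),  # hypothetical ID
            (9002, "mqCanopy wilt-019"),  # hypothetical ID
        ],
    )
    # Only two of the three QTLs get a position entry.
    conn.executemany(
        "INSERT INTO qtl_position_table (QTLID) VALUES (?)", [(9001,), (9002,)]
    )
    conn.commit()
    return conn


def qtls_without_position(conn):
    """QTLs in qtl_table that have no row in qtl_position_table."""
    return conn.execute(
        "SELECT QTLID, QTLName FROM qtl_table "
        "WHERE QTLID NOT IN (SELECT QTLID FROM qtl_position_table) "
        "ORDER BY QTLName"
    ).fetchall()


def count_by_prefix(conn, prefix):
    """Number of QTLs whose name starts with the given prefix."""
    return conn.execute(
        "SELECT count(*) FROM qtl_table WHERE QTLName LIKE ?", (prefix + "%",)
    ).fetchone()[0]


conn = build_demo_db()
print(qtls_without_position(conn))
print(count_by_prefix(conn, "mqCanopy wilt"))
```

One caveat with the `NOT IN` form: if the subquery can return a NULL QTLID, the outer query returns no rows at all, so a `NOT EXISTS` or `LEFT JOIN ... IS NULL` variant is safer when that column is nullable.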

maxglycine commented 5 months ago

@adf-ncgr @jd-campbell Since the paper says that mqCanopy wilt-013 (QTL name 5-2) is only associated with Satt229, the position values should be 92.88 94.88 93.88. That is 1 cM on each side of Satt229 which the database says is at 93.88 on LG L or Gm19. I am not sure why it was left out, but this record was problematic and had to be adjusted after the data was originally entered by an undergrad student worker. I have inserted mqCanopy wilt-013 into both stage and production MySQL databases.
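The arithmetic behind those three values is a fixed flank around the marker position. A small sketch, assuming a 1 cM flank on each side and two-decimal rounding to match how positions are written in this thread (the function name `flank_interval` is hypothetical):

```python
def flank_interval(marker_cm, flank_cm=1.0):
    """Return (start, end, peak) for a QTL anchored on a single marker.

    The peak is the marker position; start and end sit flank_cm on either side.
    Values are rounded to two decimals to avoid floating-point noise.
    """
    start = round(marker_cm - flank_cm, 2)
    end = round(marker_cm + flank_cm, 2)
    peak = round(marker_cm, 2)
    return start, end, peak


# Satt229 is reported at 93.88 cM on LG L.
start, end, peak = flank_interval(93.88)
print(f"{start}\t{end}\t{peak}")
```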