Open adf-ncgr opened 4 months ago
per @maxglycine:
I will have to start processing QTL papers, so I have been looking at the DS and how QTL information is held, and I have a couple of questions. The experiment maps are of limited use. This is particularly acute with the older QTL experiments. Many of those experiments were performed before the makeup of the linkage groups was established. As such, they refer to linkage groups or fragments of linkage groups that are no longer used or recognized. So is it necessary to have experiment maps? We (SoyBase) have not been collecting that information for a long time. We just transfer the experiment QTL positions to Composite2003 coordinates, if possible, and leave it there. If we just use the Composite2003 positions, what should the DS name be? Do we keep the cross information in the title? I think I have seen papers where they made separate maps of the same cross, but from plants grown in different years. How would that figure into the file “name”? Would we add a suffix to the cross, i.e. Forrest_x_Wm82a.qtl.Smith_Jones_2023 and Forrest_x_Wm82b.qtl.Smith_Jones_2023?
@maxglycine with respect to the QTL info in the datastore, I think much of what is there for soybean may have resulted from some processing Sam did against info dumped from the mysql version of soybase. He probably assumed that if it was worth being in the database, it was worth going into the datastore. It's certainly OK with me if we decide to drop the experiment maps.
However, looking at the data in one example, /usr/local/www/data/v2/Glycine/max/qtl/Young_x_PI416937.qtl.Lee_Bailey_1996a/glyma.Young_x_PI416937.qtl.Lee_Bailey_1996a.qtlmrk.tsv.gz, it looks like the QTL markers are associated with a variety of maps, though each QTL seems to have only one association. So if we just dropped the ones that aren't GmComposite2003, we'd probably lose QTLs. I suspect this may be a flaw in the procedure Sam used to produce these datastore files from the SoyBase database, though I might need to do some detective work to verify this.
I don't think the decision on using only Composite2003 positions needs to change our DS naming conventions, though I'm also not wedded to the conventions we have for QTLs. Not sure how much of the current code depends on our naming conventions here, but I doubt there's much code outside the intermine loaders that's trying to use it (maybe also cmap-js?)
Maybe this could be an agenda item for Thursday's meeting.
@maxglycine I've now dug into Sam's code a bit and I do think there's a bit of a bug in that he's only reporting a single qtl-linkage group relationship when there may be multiple. For example in the db we have:
MySQL localhost:3306 soybase SQL > select * from qtl_position_table where QTLName='Seed protein 4-4';
+-------+------------------+---------------------+----------+----------+------+----------+
| QTLID | QTLName          | MapName             | LeftEnd  | RightEnd | LG   | Centroid |
+-------+------------------+---------------------+----------+----------+------+----------+
|   540 | Seed protein 4-4 | GmRFLP-GA1996a_C1.1 |  0.00000 |     5.40 | C1.1 |     3.00 |
|   540 | Seed protein 4-4 | GmComposite1999_C1  | 32.10000 |    34.10 | C1   |    33.00 |
|   540 | Seed protein 4-4 | GmComposite2003_C1  | 20.04000 |    22.04 | C1   |    21.00 |
+-------+------------------+---------------------+----------+----------+------+----------+
but in the glyma.Young_x_PI416937.gen.Lee_Bailey_1996a.qtlmrk.tsv file we have only:
Seed protein 4-4 Seed protein A463_1 GmRFLP-GA1996a_C1.1
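To make the difference concrete, here is a minimal sketch (not Sam's actual code, which is not shown in this thread) of collecting every map association per QTL instead of keeping just one. The rows are copied from the query output above; the function name `map_names_by_qtl` is hypothetical.

```python
from collections import defaultdict

# (QTLName, MapName) pairs copied from the qtl_position_table output above.
rows = [
    ("Seed protein 4-4", "GmRFLP-GA1996a_C1.1"),
    ("Seed protein 4-4", "GmComposite1999_C1"),
    ("Seed protein 4-4", "GmComposite2003_C1"),
]


def map_names_by_qtl(rows):
    """Return {QTLName: [MapName, ...]} keeping every association, in input order."""
    grouped = defaultdict(list)
    for qtl_name, map_name in rows:
        grouped[qtl_name].append(map_name)
    return dict(grouped)


all_associations = map_names_by_qtl(rows)
for qtl_name, map_names in all_associations.items():
    print(qtl_name, map_names)
```

With this grouping, Seed protein 4-4 carries three associations, where the qtlmrk file above records only the GmRFLP-GA1996a_C1.1 one.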
Now, it looks like it is pretty easy to simply restrict the target map to GmComposite2003 and report those as the associations, which avoids this complication. But I did notice that some of the QTL end up getting "left behind" using this approach, e.g. Seed protein 4-12, which has only an experimental map position:
MySQL localhost:3306 soybase SQL > select * from qtl_position_table where QTLName='Seed protein 4-12';
+-------+-------------------+-----------------------+---------+----------+--------+----------+
| QTLID | QTLName           | MapName               | LeftEnd | RightEnd | LG     | Centroid |
+-------+-------------------+-----------------------+---------+----------+--------+----------+
|   548 | Seed protein 4-12 | GmRFLP-GA1996a_GA1_25 | 0.00000 |    14.60 | GA1_25 |     7.00 |
+-------+-------------------+-----------------------+---------+----------+--------+----------+
Maybe this is OK, but I thought I should check and also verify that you weren't overlooking GmComposite1999; including it would be possible but would require that I actually fix the bug with multiple linkage group assignments rather than just avoid it by having only the one target map set. Let me know your thoughts when you get a chance.
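As a rough illustration of the trade-off, the sketch below restricts the associations to a single target map set and lists which QTL get left behind. It assumes map names follow the `<map set>_<linkage group>` pattern seen in the query output; `TARGET_MAP_PREFIX` and `split_by_target_map` are hypothetical names, not anything from the existing loader code.

```python
# Assumption: MapName values look like "<map set>_<linkage group>".
TARGET_MAP_PREFIX = "GmComposite2003_"

# (QTLName, MapName) pairs copied from the two query outputs above.
rows = [
    ("Seed protein 4-4", "GmRFLP-GA1996a_C1.1"),
    ("Seed protein 4-4", "GmComposite1999_C1"),
    ("Seed protein 4-4", "GmComposite2003_C1"),
    ("Seed protein 4-12", "GmRFLP-GA1996a_GA1_25"),
]


def split_by_target_map(rows, prefix=TARGET_MAP_PREFIX):
    """Return (associations on the target map, sorted QTL names with none)."""
    kept = [(qtl, map_name) for qtl, map_name in rows if map_name.startswith(prefix)]
    qtl_with_position = {qtl for qtl, _ in kept}
    left_behind = sorted({qtl for qtl, _ in rows} - qtl_with_position)
    return kept, left_behind


kept, left_behind = split_by_target_map(rows)
print(kept)
print(left_behind)
```

On these four rows, only the GmComposite2003_C1 association for Seed protein 4-4 is kept, and Seed protein 4-12 is reported as left behind. Including GmComposite1999 as well would mean more than one association per QTL, which is the case the single-target approach sidesteps.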
Andrew @adf-ncgr, SoyBase Classic has only been displaying the Composite2003 genetic map positions and suppressing any mention of the "other" maps, so doing that in the mines is OK with me. Yes, it means that some QTL will not have a position on the Composite2003 map. We have two ways to go: one is to ignore anything without a Composite2003 position; the other is to return the unique list of all QTL and any Composite2003 position related to them, with the understanding that not all QTL will have a position. I would personally come down on option 2, the unique list of QTL with any Composite2003 position for each QTL. Since I don't know the data model and path queries, it would be up to you to tell me what is possible. @StevenCannon-USDA @jd-campbell
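For reference, option 2 amounts to a left outer join: every QTL appears once, with its Composite2003 position if it has one. The sketch below shows the idea in plain SQL, run through Python's `sqlite3` against an in-memory copy of the rows quoted earlier in this thread. It is only an illustration of the join, not an InterMine path query, and it assumes the `qtl_position_table` columns shown in the MySQL output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qtl_position_table ("
    "QTLID INTEGER, QTLName TEXT, MapName TEXT, "
    "LeftEnd REAL, RightEnd REAL, LG TEXT, Centroid REAL)"
)
# Rows copied from the query outputs earlier in the thread.
conn.executemany(
    "INSERT INTO qtl_position_table VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (540, "Seed protein 4-4", "GmRFLP-GA1996a_C1.1", 0.0, 5.4, "C1.1", 3.0),
        (540, "Seed protein 4-4", "GmComposite1999_C1", 32.1, 34.1, "C1", 33.0),
        (540, "Seed protein 4-4", "GmComposite2003_C1", 20.04, 22.04, "C1", 21.0),
        (548, "Seed protein 4-12", "GmRFLP-GA1996a_GA1_25", 0.0, 14.6, "GA1_25", 7.0),
    ],
)

# Option 2: unique list of QTL, each with its Composite2003 position or NULLs.
QUERY = """
SELECT q.QTLID, q.QTLName, p.MapName, p.LeftEnd, p.RightEnd
FROM (SELECT DISTINCT QTLID, QTLName FROM qtl_position_table) AS q
LEFT JOIN qtl_position_table AS p
  ON p.QTLID = q.QTLID AND p.MapName LIKE 'GmComposite2003%'
ORDER BY q.QTLID
"""
result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

Here Seed protein 4-4 comes back with its GmComposite2003_C1 interval, while Seed protein 4-12 is still listed but with empty position fields. Option 1 would correspond to an inner join, which drops Seed protein 4-12 entirely.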