legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance

GlycineMine: “experiment” maps are not really useable (?) #205

Open adf-ncgr opened 4 months ago

adf-ncgr commented 4 months ago

per @maxglycine :

In the “QTL” type section, it would be nice if we only saw map positions on the GmComposite2003 map. The “experiment” maps are not really useable. Let me know if we can limit the map positions to the GmComposite2003 map only. (I know some QTL are in the mine with GmComposite2003 positions)

per @adf-ncgr :

As far as "GmComposite2003" for QTL is concerned, I think I understand what you're getting at but will have to look into details further to better understand how this is all being done. If the "experiment" maps aren't usable, maybe we should just not load them into the mine at all? Not sure what the implications of that "Gordian knot" approach might be, but anyway I'll put this request into a github issue so I don't forget about it.

adf-ncgr commented 4 months ago

per @maxglycine:

I will have to start processing QTL papers, so I have been looking at the DS and how QTL information is held, and I have a couple of questions. The experiment maps are of limited use. This is particularly acute with the older QTL experiments. Many of those experiments were performed before the makeup of the linkage groups was established. As such, they refer to linkage groups or fragments of linkage groups that are no longer used or recognized. So is it necessary to have experiment maps? We (SoyBase) have not been collecting that information for a long time. We just transfer the experiment QTL positions to Composite2003 coordinates, if possible, and leave it there.

If we just use the Composite2003 positions, what should the DS name be? Do we keep the cross information in the title? I think I have seen papers where separate maps were made from the same cross, using plants grown in different years. How would that figure into the file “name”? Would we add a suffix to the cross, i.e. Forrest_x_Wm82a.qtl.Smith_Jones_2023 and Forrest_x_Wm82b.qtl.Smith_Jones_2023?

adf-ncgr commented 4 months ago

@maxglycine with respect to the QTL info in the datastore, I think much of what is there for soybean may have resulted from some processing Sam did against info dumped from the mysql version of soybase. He probably assumed that if it was worth being in the database, it was worth going into the datastore. It's certainly OK with me if we decide to drop the experiment maps.

However, looking at the data in one example:

/usr/local/www/data/v2/Glycine/max/qtl/Young_x_PI416937.qtl.Lee_Bailey_1996a/glyma.Young_x_PI416937.qtl.Lee_Bailey_1996a.qtlmrk.tsv.gz

it looks like the QTL markers are associated with a variety of maps, though each QTL seems to have only one association. So if we just dropped the ones that aren't GmComposite2003 we'd probably lose QTLs; I suspect this may be a flaw in the procedure Sam used to produce these datastore files from the soybase database, though I might need to do some detective work to verify this.

I don't think the decision on using only Composite2003 positions needs to change our DS naming conventions, though I'm also not wedded to the conventions we have for QTLs. Not sure how much of the current code depends on our naming conventions here, but I doubt there's much code outside the intermine loaders that's trying to use it (maybe also cmap-js?)

Maybe this could be an agenda item for Thursday's meeting.

adf-ncgr commented 3 months ago

@maxglycine I've now dug into Sam's code a bit and I do think there's a bit of a bug in that he's only reporting a single qtl-linkage group relationship when there may be multiple. For example in the db we have:

 MySQL  localhost:3306  soybase  SQL > select * from qtl_position_table where QTLName='Seed protein 4-4';
+-------+------------------+---------------------+----------+----------+------+----------+
| QTLID | QTLName          | MapName             | LeftEnd  | RightEnd | LG   | Centroid |
+-------+------------------+---------------------+----------+----------+------+----------+
|   540 | Seed protein 4-4 | GmRFLP-GA1996a_C1.1 |  0.00000 |     5.40 | C1.1 |     3.00 |
|   540 | Seed protein 4-4 | GmComposite1999_C1  | 32.10000 |    34.10 | C1   |    33.00 |
|   540 | Seed protein 4-4 | GmComposite2003_C1  | 20.04000 |    22.04 | C1   |    21.00 |
+-------+------------------+---------------------+----------+----------+------+----------+

but in the glyma.Young_x_PI416937.gen.Lee_Bailey_1996a.qtlmrk.tsv file we have only:

Seed protein 4-4 Seed protein A463_1 GmRFLP-GA1996a_C1.1

Now, it looks like it is pretty easy to simply restrict the target map to GmComposite2003 and report only those associations, which avoids this complication. But I did notice that some of the QTL end up getting "left behind" using this approach, e.g. Seed protein 4-12, which has only an experimental map position:

 MySQL  localhost:3306  soybase  SQL > select * from qtl_position_table where QTLName='Seed protein 4-12';
+-------+-------------------+-----------------------+---------+----------+--------+----------+
| QTLID | QTLName           | MapName               | LeftEnd | RightEnd | LG     | Centroid |
+-------+-------------------+-----------------------+---------+----------+--------+----------+
|   548 | Seed protein 4-12 | GmRFLP-GA1996a_GA1_25 | 0.00000 |    14.60 | GA1_25 |     7.00 |
+-------+-------------------+-----------------------+---------+----------+--------+----------+

Maybe this is OK, but I thought I should check and also verify that you weren't overlooking GmComposite1999; including it would be possible but would require that I actually fix the bug with multiple linkage group assignments rather than just avoid it by having only the one target map set. Let me know your thoughts when you get a chance.
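To make the two cases concrete, here is a minimal sketch in Python that uses only the rows returned by the two queries above. It is not Sam's code or the actual export procedure; the names `ROWS`, `positions_by_qtl`, and `left_behind` are hypothetical, and the assumption that a MapName is the map name followed by `_` and the linkage group is inferred from the rows shown. It keeps every position for each QTL, rather than only the first, and then lists the QTL that have no GmComposite2003 position.

```python
from collections import defaultdict

# Rows copied from the two qtl_position_table queries in this comment:
# (QTLID, QTLName, MapName, LeftEnd, RightEnd, LG, Centroid)
ROWS = [
    (540, "Seed protein 4-4", "GmRFLP-GA1996a_C1.1", 0.0, 5.4, "C1.1", 3.0),
    (540, "Seed protein 4-4", "GmComposite1999_C1", 32.1, 34.1, "C1", 33.0),
    (540, "Seed protein 4-4", "GmComposite2003_C1", 20.04, 22.04, "C1", 21.0),
    (548, "Seed protein 4-12", "GmRFLP-GA1996a_GA1_25", 0.0, 14.6, "GA1_25", 7.0),
]


def positions_by_qtl(rows):
    """Collect every (MapName, LeftEnd, RightEnd) per QTL, not just the first one."""
    grouped = defaultdict(list)
    for _qtl_id, name, map_name, left, right, _lg, _centroid in rows:
        grouped[name].append((map_name, left, right))
    return dict(grouped)


def left_behind(rows, target_prefix="GmComposite2003_"):
    """Return the QTL names that have no position on the target map.

    Assumes (hypothetically) that MapName is '<map>_<linkage group>'.
    """
    grouped = positions_by_qtl(rows)
    return sorted(
        name
        for name, positions in grouped.items()
        if not any(map_name.startswith(target_prefix) for map_name, _, _ in positions)
    )


if __name__ == "__main__":
    for name, positions in positions_by_qtl(ROWS).items():
        print(name, [map_name for map_name, _, _ in positions])
    print("No GmComposite2003 position:", left_behind(ROWS))
```

Passing `target_prefix="GmComposite1999_"` instead would run the same check against the 1999 composite map.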

maxglycine commented 3 months ago

Andrew @adf-ncgr, SoyBase Classic has only been displaying the Composite2003 genetic map positions and suppressing any mention of the "other" maps, so doing that in the mines is OK with me. Yes, it means that some QTL will not have a position on the Composite2003 map. We have two ways to go. One is to ignore anything without a Composite2003 position; the other is to return the unique list of all QTL along with any Composite2003 position related to them, with the understanding that not all QTL will have a position. I would personally come down on option 2, the unique list of QTL with any Composite2003 position for each QTL. Since I don't know the data model and path queries, it would be up to you to tell me what is possible.

@StevenCannon-USDA @jd-campbell
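In relational terms, the difference between the two options is roughly a plain filter versus a LEFT JOIN. The sketch below illustrates that with an in-memory SQLite table, not the MySQL soybase database and not the mine's data model or path queries. The table is cut down to four of the columns shown earlier in the thread, and the names `build_db`, `OPTION_1`, and `OPTION_2` are hypothetical.

```python
import sqlite3

# Subset of the qtl_position_table rows shown earlier in the thread:
# (QTLName, MapName, LeftEnd, RightEnd)
ROWS = [
    ("Seed protein 4-4", "GmRFLP-GA1996a_C1.1", 0.0, 5.4),
    ("Seed protein 4-4", "GmComposite1999_C1", 32.1, 34.1),
    ("Seed protein 4-4", "GmComposite2003_C1", 20.04, 22.04),
    ("Seed protein 4-12", "GmRFLP-GA1996a_GA1_25", 0.0, 14.6),
]

# Option 1: keep only QTL that have a Composite2003 position.
OPTION_1 = """
SELECT QTLName, MapName, LeftEnd, RightEnd
FROM qtl_position_table
WHERE MapName LIKE 'GmComposite2003%'
ORDER BY QTLName
"""

# Option 2: every distinct QTL, with its Composite2003 position if it has one
# (NULL columns otherwise).
OPTION_2 = """
SELECT q.QTLName, p.MapName, p.LeftEnd, p.RightEnd
FROM (SELECT DISTINCT QTLName FROM qtl_position_table) AS q
LEFT JOIN qtl_position_table AS p
  ON p.QTLName = q.QTLName AND p.MapName LIKE 'GmComposite2003%'
ORDER BY q.QTLName
"""


def build_db(rows):
    """Create an in-memory SQLite stand-in for the reduced table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE qtl_position_table "
        "(QTLName TEXT, MapName TEXT, LeftEnd REAL, RightEnd REAL)"
    )
    conn.executemany("INSERT INTO qtl_position_table VALUES (?, ?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    conn = build_db(ROWS)
    print("Option 1:", conn.execute(OPTION_1).fetchall())
    print("Option 2:", conn.execute(OPTION_2).fetchall())
```

With these four rows, option 1 drops Seed protein 4-12 entirely, while option 2 keeps it with an empty position.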