Create a genetic_map attribute to associate genetic collections with a single genetic map under /maps/?

legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore


Create a genetic_map attribute to associate genetic collections with a single genetic map under /maps/? #23

Closed sammyjava closed 2 years ago

sammyjava commented 2 years ago

The recent massive import of SoyBase QTL data results in 98 GmComposite2003_C2 objects which are, of course, the same thing.

One way to fix this is to do what we did for marker sets, where a genotyping_platform attribute, e.g. SoySNP6K, identifies the collection under /markers/ that is used by multiple /genetic/ collections.

The equivalent for genetic maps would be to go back to having a /maps/ directory holding a unique set of genetic maps (but NOT QTLs or other trait information!), for example,

/maps/SoyBase.map.GmComposite2003/

which would hold those linkage groups, like GmComposite2003_C2.

The README for the /genetic/ collections would then require a genetic_map attribute. Many species will have genetic_maps that came from a solo publication, just like we have for markers, like Wm82.gnm2.mrk.Sonah_ODonoughue_2015.

With this change, if you looked at 'GmComposite2003_C2' in SoyMine you'd see the QTLs from all the publications that associated QTLs with that linkage group. Right now you have to build a query that searches on that linkage group name.

This is particularly relevant for SoyMine, but may play a role for other species that use consensus or shared maps.
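To illustrate the payoff, here is a minimal sketch of the grouping that a shared genetic_map attribute would allow. The collection names and QTL names below are hypothetical placeholders, and the dictionary layout is an assumption made for illustration only; it is not the datastore or mine data model.

```python
from collections import defaultdict

# Hypothetical /genetic/ collections. Each one names the single shared map
# it uses via a genetic_map attribute. All names here are placeholders.
collections = [
    {
        "collection": "example_study_A",
        "genetic_map": "GmComposite2003",
        "qtls": [("example QTL 1", "GmComposite2003_C2")],
    },
    {
        "collection": "example_study_B",
        "genetic_map": "GmComposite2003",
        "qtls": [
            ("example QTL 2", "GmComposite2003_C2"),
            ("example QTL 3", "GmComposite2003_A1"),
        ],
    },
]


def qtls_by_linkage_group(collections):
    """Index QTLs by (genetic_map, linkage_group) across all collections."""
    index = defaultdict(list)
    for coll in collections:
        for qtl_name, linkage_group in coll["qtls"]:
            key = (coll["genetic_map"], linkage_group)
            index[key].append((coll["collection"], qtl_name))
    return index


index = qtls_by_linkage_group(collections)
# One linkage group entry now collects QTLs from both studies.
c2_qtls = index[("GmComposite2003", "GmComposite2003_C2")]
```

Because both collections point at the same map, GmComposite2003_C2 appears once in the index instead of once per publication.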

StevenCannon-USDA commented 2 years ago

This makes sense to me. In a map collection, I would expect to find a file, maybe of type "map", that has these fields: marker_ID, linkage_group, cM_position. What other files would be needed?

sammyjava commented 2 years ago

That's the mrk.tsv file, which we currently have in the specific publication collections. The other one is lg.tsv, which has the names of the linkage groups and their lengths. There are no new files; it's just moving the mrk.tsv and lg.tsv into single /maps/ collections and leaving qtl.tsv, qtlmrk.tsv, and obo.tsv in the publication-specific collections.

So, as a concrete example, the proposed collection /maps/SoyBase.map.GmComposite2003, or however we decide to name it, would contain a genetic map called GmComposite2003, defined by two files in addition to the README:

glyma.SoyBase.map.GmComposite2003.lg.tsv

linkage_group	length
GmComposite2003_A1	102.30
GmComposite2003_A2	165.72
GmComposite2003_B1	131.81
GmComposite2003_B1S	6.70
GmComposite2003_B2	120.98
GmComposite2003_B2P	66.40
GmComposite2003_C1	135.62
GmComposite2003_C2	157.89
GmComposite2003_D1-b	75.80
GmComposite2003_D1a	120.89
GmComposite2003_D1b	137.97
GmComposite2003_D2	134.25
GmComposite2003_E	78.49
GmComposite2003_F	102.80
GmComposite2003_G	116.76
GmComposite2003_H	124.05
GmComposite2003_I	125.18
GmComposite2003_J	114.40
GmComposite2003_K	117.01
GmComposite2003_L	115.07
GmComposite2003_M	142.18
GmComposite2003_N	136.00
GmComposite2003_O	138.38
GmComposite2003_Q	26.13

glyma.SoyBase.map.GmComposite2003.mrk.tsv

marker	linkage_group	position
138GA26	GmComposite2003_O	105.4
A006_1	GmComposite2003_B1	64.77
A007_1	GmComposite2003_I	95.14
A018_1	GmComposite2003_B2	53.54
A020_1	GmComposite2003_G	55.67
A023_1	GmComposite2003_L	36.7
A036_1	GmComposite2003_H	34.29
A043_1	GmComposite2003_B2	11.35
A053_1	GmComposite2003_E	21.05
A053_2	GmComposite2003_A1	34.55
...	...	...
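As a rough sketch of how the two files fit together, the following reads a few of the rows shown above and checks that every marker refers to a linkage group listed in lg.tsv. It assumes tab-separated files with a header row exactly as displayed in this thread. The range check (position between 0 and the linkage group length) is an assumption for illustration, not a stated rule of the specification, and the function names are hypothetical.

```python
import csv
import io

# A few rows copied from the example files above (tab-separated).
LG_TSV = (
    "linkage_group\tlength\n"
    "GmComposite2003_B1\t131.81\n"
    "GmComposite2003_O\t138.38\n"
)
MRK_TSV = (
    "marker\tlinkage_group\tposition\n"
    "138GA26\tGmComposite2003_O\t105.4\n"
    "A006_1\tGmComposite2003_B1\t64.77\n"
)


def read_linkage_groups(handle):
    """Return {linkage_group: length} from an lg.tsv-style file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["linkage_group"]: float(row["length"]) for row in reader}


def read_markers(handle):
    """Return a list of (marker, linkage_group, position) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["marker"], row["linkage_group"], float(row["position"]))
        for row in reader
    ]


def check_markers(linkage_groups, markers):
    """List markers whose linkage group is unknown or position is out of range."""
    problems = []
    for marker, lg, position in markers:
        if lg not in linkage_groups:
            problems.append(f"{marker}: unknown linkage group {lg}")
        elif not 0.0 <= position <= linkage_groups[lg]:
            # Assumed rule: positions lie within the linkage group length.
            problems.append(f"{marker}: position {position} outside {lg}")
    return problems


linkage_groups = read_linkage_groups(io.StringIO(LG_TSV))
markers = read_markers(io.StringIO(MRK_TSV))
problems = check_markers(linkage_groups, markers)
```

With real files, the `io.StringIO` objects would be replaced by handles opened on the lg.tsv and mrk.tsv files in the /maps/ collection.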

StevenCannon-USDA commented 2 years ago

I see. Yes, that looks sensible.

sammyjava commented 2 years ago

I'm kinda tempted to go forward on this. The new SoyMine is broken, slightly (Lee.gnm1 got all its scaffolds loaded as chromosomes because of an extra dot) and if I'm going to do a new rebuild this would be a nice update. It's really just moving the lg.tsv and mrk.tsv files to new collections under /maps/ but the loading/merging then changes. If you look at a "Genetic Map" (aka QTL experiment) you'll see the markers/QTLs for just that experiment on the consensus LGs. Maybe a couple of seed weight QTLs. But if you look at the LG report in this new design, you'll see ALL the QTLs on that LG from all of the experiments. That can be a fancy cmap-style viewer since it'll have a lot of data. I can't do that in a displayer with the current data model.

StevenCannon-USDA commented 2 years ago

I am fine with this change. If we do it for one, we should do it for all (species with maps). Let me know if you'd like me to join in.

sammyjava commented 2 years ago

Yeah, it's a datastore specification update, so it has to be applied to all collections across the board. But I can handle it as I build the mines. It's actually pretty easy: moving lg.tsv and mrk.tsv files, changing names, and tweaking READMEs that already exist.

The hardest part is adding the genetic_map attribute to 273 soybean READMEs, because I have to tweak them after I've built them from the MySQL database. But I've done that twice now, and it's not that horrible. It actually helps that the READMEs are in GitHub, for subtle but handy reasons. (I use git status constantly to help with my data file updates.)
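A bulk README tweak of this kind could look roughly like the sketch below. It assumes each README is plain text made of `key: value` lines, which this thread does not confirm, and the keys in the sample text are placeholders. The function is idempotent, so rerunning it after a rebuild does not add a second genetic_map line.

```python
def add_genetic_map(readme_text: str, map_name: str) -> str:
    """Append a genetic_map line unless the README already has one.

    Assumes a simple "key: value" line format (an assumption, see above).
    """
    lines = readme_text.splitlines()
    if any(line.startswith("genetic_map:") for line in lines):
        return readme_text  # already present, leave the file untouched
    lines.append(f"genetic_map: {map_name}")
    return "\n".join(lines) + "\n"


# Placeholder README content for illustration only.
original = "identifier: example\nsynopsis: example QTL study\n"
updated = add_genetic_map(original, "GmComposite2003")
```

Applied over the README files in a working copy, `git status` would then list exactly the files that changed.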

I'm gonna go ahead with this starting with a datastore-specifications update.

FWIW the new SoyMine is up and running in production, it's 99.9% fine, but I'm going to wait for this new build to say stuff to everyone because I think it'll be a lot spiffier, and will then be ready for my long-awaited cmap-js implementation. But the new SoyMine with its 21 Glycine assembly/annotations is public.

sammyjava commented 2 years ago

Oh, one more thing on this, @cann0010: this update will change what I load from a /genetic/ collection into a mine, since a /maps/ collection will now load into GeneticMap. I think I'll call what a /genetic/ collection loads a "QTL Study," for similarity to "Genome Wide Association Study." (Also, I think of an "experiment" as a smaller thing than a "study," which is what you publish. But speak up if you like another name for it.)