legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

Create a genetic_map attribute to associate genetic collections with a single genetic map under /maps/? #23

Closed sammyjava closed 2 years ago

sammyjava commented 2 years ago

The recent massive import of SoyBase QTL data results in 98 GmComposite2003_C2 objects which are, of course, the same thing.

One way to fix this is to do what we did for marker sets, where a genotyping_platform attribute, e.g. SoySNP6K, identifies the collection under /markers/ that is used in multiple /genetic/ collections.

The equivalent for genetic maps would be to go back to having a /maps/ directory holding a unique set of genetic maps (but NOT QTLs or other trait information!), for example,

/maps/SoyBase.map.GmComposite2003/

which would hold those linkage groups, like GmComposite2003_C2.

The README for the /genetic/ collections would then require a genetic_map attribute. Many species will have genetic_maps that came from a solo publication, just like we have for markers, like Wm82.gnm2.mrk.Sonah_ODonoughue_2015.

With this change, if you looked at 'GmComposite2003_C2' in SoyMine you'd see the QTLs from all the publications that associated QTLs with that linkage group. Right now you have to build a query that searches on that linkage group name.
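A minimal Python sketch of the lookup this would enable. The study names, QTL names, and record layout are hypothetical placeholders, not datastore contents; only the map name GmComposite2003 and the linkage group GmComposite2003_C2 come from this thread.

```python
from collections import defaultdict

# Hypothetical /genetic/ collections, each declaring its shared map through
# the proposed genetic_map attribute. Study and QTL names are invented.
studies = [
    {"study": "StudyA", "genetic_map": "GmComposite2003",
     "qtls": [("Seed weight 1-1", "GmComposite2003_C2"),
              ("Plant height 1-1", "GmComposite2003_A1")]},
    {"study": "StudyB", "genetic_map": "GmComposite2003",
     "qtls": [("Seed weight 2-1", "GmComposite2003_C2")]},
    {"study": "StudyC", "genetic_map": "SomeOtherMap",
     "qtls": [("Seed oil 3-1", "SomeOtherMap_1")]},
]

def qtls_by_linkage_group(studies, genetic_map):
    """Collect (study, qtl) pairs per linkage group for one shared map."""
    by_lg = defaultdict(list)
    for s in studies:
        if s["genetic_map"] != genetic_map:
            continue  # study uses a different map
        for qtl, lg in s["qtls"]:
            by_lg[lg].append((s["study"], qtl))
    return dict(by_lg)

report = qtls_by_linkage_group(studies, "GmComposite2003")
print(report["GmComposite2003_C2"])
```

Because every study points at the same map, the linkage group name becomes a shared key rather than a string to search on.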

This is particularly relevant for SoyMine, but may play a role for other species that use consensus or shared maps.

StevenCannon-USDA commented 2 years ago

This makes sense to me. In a map collection, I would expect to find a file, maybe of type "map", that has these fields: marker_ID, linkage_group, cM_position. What other files would be needed?

sammyjava commented 2 years ago

That's the mrk.tsv file, which we currently have in the specific publication collections. The other one is lg.tsv, which has the names of the linkage groups and their lengths. There are no new files; it's just moving the mrk.tsv and lg.tsv into single /maps/ collections and leaving qtl.tsv, qtlmrk.tsv, and obo.tsv in the publication-specific collections.

So, as a concrete example, the proposed collection /maps/SoyBase.map.GmComposite2003, or however we decide to name it, would contain a genetic map called GmComposite2003, defined by two files other than the README:

glyma.SoyBase.map.GmComposite2003.lg.tsv

linkage_group length
GmComposite2003_A1 102.30
GmComposite2003_A2 165.72
GmComposite2003_B1 131.81
GmComposite2003_B1S 6.70
GmComposite2003_B2 120.98
GmComposite2003_B2P 66.40
GmComposite2003_C1 135.62
GmComposite2003_C2 157.89
GmComposite2003_D1-b 75.80
GmComposite2003_D1a 120.89
GmComposite2003_D1b 137.97
GmComposite2003_D2 134.25
GmComposite2003_E 78.49
GmComposite2003_F 102.80
GmComposite2003_G 116.76
GmComposite2003_H 124.05
GmComposite2003_I 125.18
GmComposite2003_J 114.40
GmComposite2003_K 117.01
GmComposite2003_L 115.07
GmComposite2003_M 142.18
GmComposite2003_N 136.00
GmComposite2003_O 138.38
GmComposite2003_Q 26.13

glyma.SoyBase.map.GmComposite2003.mrk.tsv

marker linkage_group position
138GA26 GmComposite2003_O 105.4
A006_1 GmComposite2003_B1 64.77
A007_1 GmComposite2003_I 95.14
A018_1 GmComposite2003_B2 53.54
A020_1 GmComposite2003_G 55.67
A023_1 GmComposite2003_L 36.7
A036_1 GmComposite2003_H 34.29
A043_1 GmComposite2003_B2 11.35
A053_1 GmComposite2003_E 21.05
A053_2 GmComposite2003_A1 34.55
... ... ...
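
A rough Python sketch of how the two files could be cross-checked once they live together in one /maps/ collection. It assumes the files are tab-delimited with a single header row, and that a linkage group's length is an upper bound on marker positions; neither is stated in this thread. The function names are made up for illustration, and the rows are excerpts of the data above.

```python
import csv
import io

# Excerpts of the two files above, written tab-separated on the assumption
# that .tsv means tab-delimited with one header row.
LG_TSV = (
    "linkage_group\tlength\n"
    "GmComposite2003_B1\t131.81\n"
    "GmComposite2003_O\t138.38\n"
)
MRK_TSV = (
    "marker\tlinkage_group\tposition\n"
    "138GA26\tGmComposite2003_O\t105.4\n"
    "A006_1\tGmComposite2003_B1\t64.77\n"
)

def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def check_markers(lg_rows, mrk_rows):
    """List markers on unknown linkage groups or positioned past the LG length."""
    lengths = {r["linkage_group"]: float(r["length"]) for r in lg_rows}
    problems = []
    for r in mrk_rows:
        lg, pos = r["linkage_group"], float(r["position"])
        if lg not in lengths:
            problems.append(f"{r['marker']}: unknown linkage group {lg}")
        elif pos > lengths[lg]:  # assumed bound, see note above
            problems.append(f"{r['marker']}: position {pos} exceeds {lengths[lg]}")
    return problems

problems = check_markers(read_tsv(LG_TSV), read_tsv(MRK_TSV))
print(problems)
```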

StevenCannon-USDA commented 2 years ago

I see. Yes, that looks sensible.

sammyjava commented 2 years ago

I'm kinda tempted to go forward on this. The new SoyMine is broken, slightly (Lee.gnm1 got all its scaffolds loaded as chromosomes because of an extra dot) and if I'm going to do a new rebuild this would be a nice update. It's really just moving the lg.tsv and mrk.tsv files to new collections under /maps/ but the loading/merging then changes. If you look at a "Genetic Map" (aka QTL experiment) you'll see the markers/QTLs for just that experiment on the consensus LGs. Maybe a couple of seed weight QTLs. But if you look at the LG report in this new design, you'll see ALL the QTLs on that LG from all of the experiments. That can be a fancy cmap-style viewer since it'll have a lot of data. I can't do that in a displayer with the current data model.

StevenCannon-USDA commented 2 years ago

I am fine with this change. If we do it for one, we should do it for all (species with maps). Let me know if you'd like me to join in.

sammyjava commented 2 years ago

Yeah it's a Datastore specification update so it has to be applied to all collections across the board. But I can handle it as I build the mines, it's actually pretty easy, moving lg.tsv and mrk.tsv files and changing names, tweaking READMEs that already exist.

The hardest part is adding the genetic_map attribute to 273 soybean READMEs, because I have to tweak them after I've built them from MySQL. But I've done that twice now, and it's not that horrible. It actually helps that the READMEs are in GitHub, for subtle but handy reasons. (I use git status constantly to help with my data file updates.)
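
A minimal Python sketch of that README tweak, for illustration only. It assumes each README is flat "key: value" text and that genetic_map is a top-level key; the function name and the sample README content are hypothetical, and this is not the script actually used.

```python
def add_genetic_map(readme_text, map_name):
    """Append a genetic_map line unless the README already has one."""
    lines = readme_text.splitlines()
    if any(line.startswith("genetic_map:") for line in lines):
        return readme_text  # already set; leave untouched
    lines.append(f"genetic_map: {map_name}")
    return "\n".join(lines) + "\n"

# Hypothetical README content; the identifier and synopsis are invented.
readme = "identifier: example.qtl.Study_2003\nsynopsis: example QTL study\n"
updated = add_genetic_map(readme, "GmComposite2003")
print(updated)
```

Running it a second time on the same text changes nothing, so a re-run over all READMEs should show a clean git status.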

I'm gonna go ahead with this starting with a datastore-specifications update.

FWIW the new SoyMine is up and running in production, it's 99.9% fine, but I'm going to wait for this new build to say stuff to everyone because I think it'll be a lot spiffier, and will then be ready for my long-awaited cmap-js implementation. But the new SoyMine with its 21 Glycine assembly/annotations is public.

sammyjava commented 2 years ago

Oh, one more thing on this, @cann0010: this update will change what I load from a /genetic/ collection into a mine, since a /maps/ collection will now load into GeneticMap. I think I'll call what loads from a /genetic/ collection a "QTL Study," for similarity to "Genome Wide Association Study." (Also, I think of an "experiment" as a smaller thing than a "study," which is what you publish. But speak up if you'd like another name for it.)