Closed sammyjava closed 2 years ago
This makes sense to me. In a map collection, I would expect to find a file, maybe of type "map", that has these fields: marker_ID, linkage_group, cM_position. What other files would be needed?
That's the mrk.tsv file, which we currently have in the specific publication collections. The other one is lg.tsv, which has the names of the linkage groups and their lengths. There are no new files; it's just moving the mrk.tsv and lg.tsv into single /maps/ collections and leaving qtl.tsv, qtlmrk.tsv, and obo.tsv in the publication-specific collections.
So, as a concrete example, the proposed collection /maps/SoyBase.map.GmComposite2003, or however we decide to name it, would contain: a genetic map called GmComposite2003 defined by two files other than the README:
glyma.SoyBase.map.GmComposite2003.lg.tsv
linkage_group | length |
---|---|
GmComposite2003_A1 | 102.30 |
GmComposite2003_A2 | 165.72 |
GmComposite2003_B1 | 131.81 |
GmComposite2003_B1S | 6.70 |
GmComposite2003_B2 | 120.98 |
GmComposite2003_B2P | 66.40 |
GmComposite2003_C1 | 135.62 |
GmComposite2003_C2 | 157.89 |
GmComposite2003_D1-b | 75.80 |
GmComposite2003_D1a | 120.89 |
GmComposite2003_D1b | 137.97 |
GmComposite2003_D2 | 134.25 |
GmComposite2003_E | 78.49 |
GmComposite2003_F | 102.80 |
GmComposite2003_G | 116.76 |
GmComposite2003_H | 124.05 |
GmComposite2003_I | 125.18 |
GmComposite2003_J | 114.40 |
GmComposite2003_K | 117.01 |
GmComposite2003_L | 115.07 |
GmComposite2003_M | 142.18 |
GmComposite2003_N | 136.00 |
GmComposite2003_O | 138.38 |
GmComposite2003_Q | 26.13 |
glyma.SoyBase.map.GmComposite2003.mrk.tsv
marker | linkage_group | position |
---|---|---|
138GA26 | GmComposite2003_O | 105.4 |
A006_1 | GmComposite2003_B1 | 64.77 |
A007_1 | GmComposite2003_I | 95.14 |
A018_1 | GmComposite2003_B2 | 53.54 |
A020_1 | GmComposite2003_G | 55.67 |
A023_1 | GmComposite2003_L | 36.7 |
A036_1 | GmComposite2003_H | 34.29 |
A043_1 | GmComposite2003_B2 | 11.35 |
A053_1 | GmComposite2003_E | 21.05 |
A053_2 | GmComposite2003_A1 | 34.55 |
... | ... | ... |
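
To make the relationship between the two files concrete, here is a minimal Python sketch that reads lg.tsv-style and mrk.tsv-style text and checks that every marker refers to a known linkage group and sits within that group's length. It assumes the files are tab-separated with no header line, and that positions run from 0 up to the listed length; neither assumption is stated in the thread. The function names are hypothetical, and the sample rows are copied from the tables above.

```python
import csv
import io


def read_linkage_groups(text):
    """Return {linkage_group: length} from lg.tsv-style text (assumed tab-separated)."""
    lengths = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        lengths[row[0]] = float(row[1])
    return lengths


def read_markers(text):
    """Return [(marker, linkage_group, position)] from mrk.tsv-style text."""
    markers = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        markers.append((row[0], row[1], float(row[2])))
    return markers


def check_markers(lengths, markers):
    """Return (marker, message) pairs for markers that do not fit the map."""
    problems = []
    for name, lg, pos in markers:
        if lg not in lengths:
            problems.append((name, "unknown linkage group " + lg))
        elif pos < 0 or pos > lengths[lg]:
            # Assumption: a position must lie within 0..length of its linkage group.
            problems.append((name, "position %s outside 0..%s" % (pos, lengths[lg])))
    return problems


# A few rows taken from the example tables above.
LG_TEXT = "GmComposite2003_B1\t131.81\nGmComposite2003_O\t138.38\n"
MRK_TEXT = "138GA26\tGmComposite2003_O\t105.4\nA006_1\tGmComposite2003_B1\t64.77\n"

lengths = read_linkage_groups(LG_TEXT)
markers = read_markers(MRK_TEXT)
problems = check_markers(lengths, markers)
print(problems)  # prints []
```

A check like this could be run once per /maps/ collection, since the two files only need to be consistent with each other and not with any particular publication.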
I see. Yes, that looks sensible.
I'm kinda tempted to go forward on this. The new SoyMine is broken, slightly (Lee.gnm1 got all its scaffolds loaded as chromosomes because of an extra dot) and if I'm going to do a new rebuild this would be a nice update. It's really just moving the lg.tsv and mrk.tsv files to new collections under /maps/ but the loading/merging then changes. If you look at a "Genetic Map" (aka QTL experiment) you'll see the markers/QTLs for just that experiment on the consensus LGs. Maybe a couple of seed weight QTLs. But if you look at the LG report in this new design, you'll see ALL the QTLs on that LG from all of the experiments. That can be a fancy cmap-style viewer since it'll have a lot of data. I can't do that in a displayer with the current data model.
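As a rough illustration of the LG report idea, the Python sketch below groups QTL records from several studies by the shared linkage group they sit on. The record layout (study, QTL name, linkage group), the study names, and the QTL names are all hypothetical placeholders; the thread does not spell out the columns of the QTL files, and this is not how the mine itself does the merge.

```python
from collections import defaultdict

# Hypothetical records: (study, qtl_name, linkage_group).
# Study and QTL names are placeholders, not real SoyBase data.
QTLS = [
    ("study_A", "qtl_1", "GmComposite2003_C2"),
    ("study_B", "qtl_2", "GmComposite2003_C2"),
    ("study_B", "qtl_3", "GmComposite2003_A1"),
]


def qtls_by_linkage_group(qtls):
    """Group (study, qtl_name) pairs under the linkage group they map to."""
    grouped = defaultdict(list)
    for study, qtl_name, linkage_group in qtls:
        grouped[linkage_group].append((study, qtl_name))
    return dict(grouped)


grouped = qtls_by_linkage_group(QTLS)
for linkage_group in sorted(grouped):
    print(linkage_group, grouped[linkage_group])
```

The point is that once every study refers to the same linkage group object, collecting all QTLs on it is a single lookup instead of a name search across 98 duplicates.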
I am fine with this change. If we do it for one, we should do it for all (species with maps). Let me know if you'd like me to join in.
Yeah it's a Datastore specification update so it has to be applied to all collections across the board. But I can handle it as I build the mines, it's actually pretty easy, moving lg.tsv and mrk.tsv files and changing names, tweaking READMEs that already exist.
The hardest part is adding the genetic_map attribute to 273 soybean READMEs, because I have to tweak them after I've built them from MySQL, but I've done that twice now and it's not that horrible. It actually helps that the READMEs are in GitHub, for subtle but handy reasons. (I use git status constantly to help with my data file updates.)
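For what it's worth, the README tweak could be scripted along these lines. This is only a sketch: it assumes each README is plain text with one `key: value` attribute per line, which the thread does not confirm, and the function name and the `identifier` line in the example are hypothetical.

```python
def set_genetic_map(readme_text, map_name):
    """Return README text with a genetic_map attribute set to map_name.

    Assumes one "key: value" attribute per line. An existing genetic_map
    line is replaced; otherwise a new line is appended at the end.
    """
    lines = readme_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("genetic_map:"):
            lines[i] = "genetic_map: " + map_name
            break
    else:
        # No genetic_map line was found, so add one.
        lines.append("genetic_map: " + map_name)
    return "\n".join(lines) + "\n"


# Hypothetical README content; only the genetic_map attribute comes from the thread.
before = "identifier: example_collection\n"
after = set_genetic_map(before, "GmComposite2003")
print(after, end="")
```

Running it a second time on its own output leaves the text unchanged, so it would be safe to re-run across all the READMEs and let git status show which ones actually changed.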
I'm gonna go ahead with this starting with a datastore-specifications update.
FWIW the new SoyMine is up and running in production, it's 99.9% fine, but I'm going to wait for this new build to say stuff to everyone because I think it'll be a lot spiffier, and will then be ready for my long-awaited cmap-js implementation. But the new SoyMine with its 21 Glycine assembly/annotations is public.
Oh, one more thing on this, @cann0010: this update will change what I load from a /genetic/ collection into a mine, since a /maps/ collection will now load into GeneticMap. I think I'll call what a /genetic/ collection loads into a "QTL Study", for similarity to "Genome Wide Association Study." (Also, I think of an "experiment" as a smaller thing than a "study," which is what you publish. But speak up if you like another name for it.)
The recent massive import of SoyBase QTL data results in 98 `GmComposite2003_C2` objects which are, of course, the same thing.
One way to fix this is to do what we did for marker sets, where we denote a `genotyping_platform`, e.g. SoySNP6K, which denotes the collections under /markers/ which are used in multiple /genetic/ collections.
The equivalent for genetic maps would be to go back to having a /maps/ directory holding a unique set of genetic maps (but NOT QTLs or other trait information!), for example, one which would hold those linkage groups, like `GmComposite2003_C2`.
The README for the /genetic/ collections would then require a `genetic_map` attribute. Many species will have genetic_maps that came from a solo publication, just like we have for markers, like `Wm82.gnm2.mrk.Sonah_ODonoughue_2015`.
With this change, if you looked at `GmComposite2003_C2` in SoyMine you'd see the QTLs from all the publications that associated QTLs with that linkage group. Right now you have to build a query that searches on that linkage group name.
This is particularly relevant for SoyMine, but may play a role for other species that use consensus or shared maps.