legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

RFO: add genetic_map column to qtl.tsv and qtlmrk.tsv files in QTL collections #33

Closed by sammyjava 1 year ago

sammyjava commented 1 year ago

So we made the update to stop prefixing every LG identifier with the genetic map (which is listed in the README), so that the LG identifiers look as they did in the publications. This was heavily motivated by Phaseolus work I was doing (as I recall), but it seems like a good idea under the "make things recognizable from the publication" approach. Of course, I was assuming that each QTL study uses a single genetic map.

But, of course, now I'm updating the Glycine QTL studies and maps, and I have 52 QTL studies with multiple genetic maps like this:

Young_x_PI416937.qtl.Bailey_Mian_1997 genetic_map: GmComposite2003,GmRFLP-GA1996a,GmComposite1999

a study that places QTLs on three different genetic maps (this is a previous-style file with the LG prefixes):

#qtl_identifier      trait_name      linkage_group       start  end    peak
Pod dehiscence 1-9   Pod dehiscence  GmRFLP-GA1996a_J.2  9.25   18.75  14.0
Pod dehiscence 1-1   Pod dehiscence  GmComposite1999_E   92.3   94.3   93.0
Pod dehiscence 1-8   Pod dehiscence  GmRFLP-GA1996a_J.2  0.0    18.5   9.0
Pod dehiscence 1-3   Pod dehiscence  GmComposite1999_E   90.2   92.2   91.0
Pod dehiscence 1-2   Pod dehiscence  GmComposite1999_E   94.3   96.3   95.0
Pod dehiscence 1-10  Pod dehiscence  GmComposite1999_L   109.5  111.5  111.0
Pod dehiscence 1-4   Pod dehiscence  GmComposite1999_E   93.2   95.2   94.0
Pod dehiscence 1-7   Pod dehiscence  GmComposite2003_J   56.2   58.2   57.0
Pod dehiscence 1-6   Pod dehiscence  GmComposite2003_J   26.63  28.63  28.0
Pod dehiscence 1-5   Pod dehiscence  GmComposite2003_J   16.35  18.35  17.0

Definitely qualifies for the "funky" label.

There are 52 Glycine QTL studies with multiple maps and 261 without, all imported from the soybase MySQL database.

My proposed solution is to simply add a genetic_map column to the qtl.tsv files (so we know which map each LG is on, since we can't tell from the genetic_map: attribute in the README) and to the qtlmrk.tsv files (so we know on which map the markers were placed to determine the QTL):

qtl.tsv

#qtl_identifier      trait_name      genetic_map      linkage_group  start  end    peak
Pod dehiscence 1-9   Pod dehiscence  GmRFLP-GA1996a   J.2            9.25   18.75  14.0
Pod dehiscence 1-1   Pod dehiscence  GmComposite1999  E              92.3   94.3   93.0
Pod dehiscence 1-8   Pod dehiscence  GmRFLP-GA1996a   J.2            0.0    18.5   9.0
Pod dehiscence 1-3   Pod dehiscence  GmComposite1999  E              90.2   92.2   91.0
Pod dehiscence 1-2   Pod dehiscence  GmComposite1999  E              94.3   96.3   95.0
Pod dehiscence 1-10  Pod dehiscence  GmComposite1999  L              109.5  111.5  111.0
Pod dehiscence 1-4   Pod dehiscence  GmComposite1999  E              93.2   95.2   94.0
Pod dehiscence 1-7   Pod dehiscence  GmComposite2003  J              56.2   58.2   57.0
Pod dehiscence 1-6   Pod dehiscence  GmComposite2003  J              26.63  28.63  28.0
Pod dehiscence 1-5   Pod dehiscence  GmComposite2003  J              16.35  18.35  17.0

qtlmrk.tsv

#qtl_identifier      trait_name      marker    genetic_map      linkage_group
Pod dehiscence 1-1   Pod dehiscence  BLT049_5  GmComposite1999  E
Pod dehiscence 1-2   Pod dehiscence  cr324_1   GmComposite1999  E
Pod dehiscence 1-3   Pod dehiscence  B124_3    GmComposite1999  E
Pod dehiscence 1-4   Pod dehiscence  cr274_1   GmComposite1999  E
Pod dehiscence 1-5   Pod dehiscence  B074_1    GmComposite2003  J
Pod dehiscence 1-6   Pod dehiscence  B166_1    GmComposite2003  J
Pod dehiscence 1-7   Pod dehiscence  B122_1    GmComposite2003  J
Pod dehiscence 1-8   Pod dehiscence  K375_1    GmRFLP-GA1996a   J.2
Pod dehiscence 1-9   Pod dehiscence  cr392_1   GmRFLP-GA1996a   J.2
Pod dehiscence 1-10  Pod dehiscence  A489_1    GmComposite1999  L

This may be only a Glycine issue, but it seems worthwhile to support multi-map QTL studies.

Note that this is a file format and loader change; the database model does NOT change since LGs and markers are already associated with a genetic map.
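To illustrate the loader side, here is a rough Python sketch (not the actual LIS loader code) that reads the proposed qtl.tsv layout and keys each QTL by the (genetic_map, linkage_group) pair. It assumes the file is tab-delimited with a #-prefixed header line and exactly the seven columns shown above; the function name group_qtls_by_map_and_lg is hypothetical.

```python
from collections import defaultdict


def group_qtls_by_map_and_lg(tsv_text):
    """Group QTL identifiers by (genetic_map, linkage_group).

    Assumed columns (tab-delimited):
    qtl_identifier, trait_name, genetic_map, linkage_group, start, end, peak
    """
    groups = defaultdict(list)
    for line in tsv_text.splitlines():
        # Skip blank lines and the '#'-prefixed header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
        qtl_id, _trait, genetic_map, linkage_group, _start, _end, _peak = fields
        # The same LG name (e.g. "J") can occur on different maps,
        # so the map name is part of the key.
        groups[(genetic_map, linkage_group)].append(qtl_id)
    return dict(groups)
```

With the map name in the key, an LG such as J on GmComposite2003 is never confused with an LG of the same name on another map.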

It's effectively what we had before, but with the genetic map being a column rather than an LG identifier prefix. (And super easy to implement as soybase always uses the underscore separator!)
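For the conversion itself, a minimal Python sketch could look like the following. It assumes the map name never contains an underscore, so splitting on the first underscore separates map from LG (true for the examples above, but an assumption in general). The names split_prefixed_lg and convert_qtl_row are hypothetical.

```python
from __future__ import annotations


def split_prefixed_lg(prefixed_lg: str) -> tuple[str, str]:
    """Split an old-style LG such as 'GmRFLP-GA1996a_J.2' into (map, LG).

    Assumption: the genetic map name contains no underscore, so the
    first underscore is the separator.
    """
    genetic_map, sep, linkage_group = prefixed_lg.partition("_")
    if not sep or not genetic_map or not linkage_group:
        raise ValueError(f"no map prefix found in {prefixed_lg!r}")
    return genetic_map, linkage_group


def convert_qtl_row(fields: list[str]) -> list[str]:
    """Convert one old-style qtl.tsv row to the proposed layout.

    Old: qtl_identifier, trait_name, linkage_group, start, end, peak
    New: qtl_identifier, trait_name, genetic_map, linkage_group, start, end, peak
    """
    qtl_id, trait, prefixed_lg, *rest = fields
    genetic_map, linkage_group = split_prefixed_lg(prefixed_lg)
    return [qtl_id, trait, genetic_map, linkage_group, *rest]
```

The same split would apply to the linkage_group column of qtlmrk.tsv; only the position of the column in the row differs.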

sammyjava commented 1 year ago

I know this is a tiny change, but any opposition, gents? @cann0010 @svengato

sammyjava commented 1 year ago

Thanks for the thumbs up, I'm going for it since I want to keep going on Glycine.