legumeinfo / datastore-specifications

Specifications for directory naming, file naming, file contents in the LIS datastore

RFO: rename linkage groups to short names used in publications (uniqueness by genetic map reference) #30

Closed sammyjava closed 1 year ago

sammyjava commented 1 year ago

I think I am responsible for the rather yucky linkage group names like TT_SunOleic97R_x_NC94022_a-LGS1, which have led to self-inconsistencies and are also inconsistent with how we're naming maps these days. They came from exporting from chado years ago, when I wanted unique LG identifiers.

I have now done a mine model update which keys LGs on (identifier,geneticMap) so "LGS1" can be used for different LGs from different maps. I think this is good because there tends to be a general consensus on what "LGS1" or "B01" means amongst the genetics community (typically corresponding to chromosome 1, but not always, of course).
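As a rough illustration of keying on (identifier, geneticMap), here is a minimal Python sketch. It is not the mine model itself: the `LinkageGroup` and `LinkageGroupStore` classes are hypothetical names made up for this example, and the only real data are the two B04 lengths from the BAT93_x_JALOEEP558 maps quoted later in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageGroup:
    """Hypothetical stand-in for an LG record; not the actual mine class."""
    identifier: str   # short name from the publication, e.g. "B04"
    genetic_map: str  # identifier of the map the LG belongs to
    length: float


class LinkageGroupStore:
    """Keeps LGs unique on the composite key (identifier, genetic_map)."""

    def __init__(self):
        self._by_key = {}

    def add(self, lg):
        key = (lg.identifier, lg.genetic_map)
        if key in self._by_key:
            raise ValueError(f"duplicate linkage group for key {key}")
        self._by_key[key] = lg

    def get(self, identifier, genetic_map):
        return self._by_key[(identifier, genetic_map)]


store = LinkageGroupStore()
# Same short name "B04", two different maps: both records are accepted.
store.add(LinkageGroup("B04", "BAT93_x_JALOEEP558.map.Caldas_Blair_2009", 94.35))
store.add(LinkageGroup("B04", "BAT93_x_JALOEEP558.map.Freyre_Skroch_1998", 95.0))
```

The short name alone is no longer unique, so every lookup has to supply the map as well.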

So, in reference to https://github.com/legumeinfo/datastore-issues/issues/119 I'd like to rename the LGs throughout the DS to the short names used in the publications, given that I'm keeping them distinct in the mines via their geneticMap reference. I think this will make the mines more genetic-user friendly.

Objections, @cann0010 @adf-ncgr @svengato ?

StevenCannon-USDA commented 1 year ago

Just with respect to the "TT_" pre-prefix: that is legacy from ~7-8 years ago, when we were trying to distinguish tetraploid, A-genome diploid, and B-genome diploid maps. I favor dropping that. I am also OK with the more extreme shortening of LG names to e.g. "LGS1" ... as long as we have a way of uniquely identifying the maps themselves. Unfortunately, it is not uncommon for multiple maps to be generated from the same parents -- hence the _a and _b suffixes.
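To make the pieces of a legacy name concrete, here is a hedged Python sketch that splits one apart. The pattern is inferred from the single example name in this thread (TT_SunOleic97R_x_NC94022_a-LGS1), so the two-capital-letter prefix, the single lowercase map suffix, and the hyphen before the short LG name are all assumptions; `split_legacy_lg_name` is a hypothetical helper, not part of any datastore tooling.

```python
import re

# Assumed layout: [PREFIX_]parent1_x_parent2[_suffix]-SHORTNAME
LEGACY_LG = re.compile(
    r"^(?:(?P<genome_prefix>[A-Z]{2})_)?"  # optional legacy pre-prefix such as "TT"
    r"(?P<parents>.+?_x_.+?)"              # the two parents joined by "_x_"
    r"(?:_(?P<suffix>[a-z]))?"             # optional a/b suffix for same-parent maps
    r"-(?P<lg>[^-]+)$"                     # short LG name after the final hyphen
)


def split_legacy_lg_name(name):
    """Return the assumed components of a legacy LG name as a dict."""
    match = LEGACY_LG.match(name)
    if match is None:
        raise ValueError(f"not a recognised legacy LG name: {name!r}")
    return match.groupdict()


parts = split_legacy_lg_name("TT_SunOleic97R_x_NC94022_a-LGS1")
short_name = parts["lg"]  # the name that would remain after the proposed renaming
```

Names that do not follow this assumed layout would need to be handled by hand.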

sammyjava commented 1 year ago

Yup, the genetic map files have the genotypes.map.author1_author2_year naming (with a, b suffixes for the same genotypes, authors, and year, typically from the same publication).

So, for example, the BAT93_x_JALOEEP558.map.Caldas_Blair_2009.lg.tsv file (for which I need to track down markers) will have linkage groups named:

#linkage_group  length
B04        94.35
B06        74.79
B07        67.82
B08        104.23
B10        79.23

which differ from, say, the time-honored BAT93_x_JALOEEP558.map.Freyre_Skroch_1998.lg.tsv:

#linkage_group  length
B01        107
B02        175
B03        132
B04        95
B05        72
B06        113
B07        109
B08        133
B09        105
B10        89
B11        100

So we always consider a linkage group along with the genetic map from which it came, which is in its filename.
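A minimal Python sketch of that pairing, assuming the map identifier is simply the filename with the trailing `.lg.tsv` removed and that each data row is a whitespace-separated name and length. The function names are hypothetical, and the sample text is a two-row excerpt of the Caldas_Blair_2009 listing above.

```python
def genetic_map_from_filename(filename):
    """Derive the map identifier by stripping the assumed '.lg.tsv' ending."""
    ending = ".lg.tsv"
    if not filename.endswith(ending):
        raise ValueError(f"not an lg.tsv file: {filename!r}")
    return filename[: -len(ending)]


def read_linkage_groups(filename, text):
    """Return {(lg_name, genetic_map): length} for the rows in an lg.tsv file."""
    genetic_map = genetic_map_from_filename(filename)
    lengths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the '#linkage_group length' header
        name, length = line.split()
        lengths[(name, genetic_map)] = float(length)
    return lengths


caldas_text = """#linkage_group\tlength
B04\t94.35
B06\t74.79
"""
caldas = read_linkage_groups(
    "BAT93_x_JALOEEP558.map.Caldas_Blair_2009.lg.tsv", caldas_text
)
```

Because the map identifier is part of every key, rows read from the Freyre_Skroch_1998 file can be merged into the same dictionary without its B04 overwriting this one.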

sammyjava commented 1 year ago

This has been implemented.