Desi vs. Kabuli gene assignment into families/trees: problem in Conekt

sdash-github commented 4 years ago

"... ... it looks like the chickpea desi genes have been inappropriately assigned into the families/trees, probably through kabuli/desi identity switching at some point. In the legfed trees as represented in legumeinfo, only CDCFrontier was included in the family+tree building (Steven's choice). However, here's a tree in Conekt that seems to have a desi member: https://conekt.legumeinfo.org/tree/view/6975 however, when you look at the descriptors, it seems clear that the CDCFrontier gene Ca_13056 is the one that is really part of this family, while https://legumeinfo.org/chado_phylotree/legfed_v1_0.L_CSRBJ4?hilite_node=cicar.CDCFrontier.Ca_13056.1

the desi Ca_13056 has been assigned to a completely different family (after the families were constructed and without being included in the tree): https://legumeinfo.org/feature/Cicer/arietinum_ICC4958/gene/cicar.ICC4958.gnm2.ann1.Ca_13056

So, I guess for purposes of Conekt it would probably be better to use kabuli, even if the samples are desi. Alternatively, we might be able to include both, but desi would not be in the trees (assuming we use the current legfed build anyway)

let me know if you think I'm confused or have confused you. adf @adf-ncgr (Ref: email from adf: 2019-12-27::3:29 pm::LIS conekt VM)

sdash-github commented 4 years ago

In database trees table has only ICC (Desi) but no CDC (Kabuli) gene IDs

MariaDB [lis_conekt]> select count(*) from trees where data_newick like '%ICC%';
 14536 
select count(*) from trees where data_newick like '%CDC%';
 0

No Kabuli in sequences either

[lis_conekt]> select  count(name) from sequences where name like '%ICC%';      30257 
select  count(name) from sequences where name like '%CDC%'; 0

Tentative Conclusion:

During gene family loading into Conekt wrong association might have taken place or during tree loading ?? FIND OUT HOW ??

adf-ncgr commented 4 years ago

yes, but I think this must be due to some "munging" that was done, since the corresponding trees in legumeinfo have CDC and not ICC. The presence of ICC sequences is presumably due to your having chosen to load those instead of CDC (probably because the expression samples we have are desi). My guess is that you altered the legumeinfo trees to have cicar "leaves" that matched the sequences you loaded.

also note that if you remove the prefixes, the kabuli and desi genes have names that will match, so if prefixes were stripped from files for loading and then readded later, it might have caused this

sdash-github commented 4 years ago

Yes, something like this. I am trying to trace where the wrong association has taken place and why?? These are my notes as I trace back.

sdash-github commented 4 years ago

The legfed gene families have members both from Desi(ICC) and Kabuli(CDC)

I used a file created by extracting family names and their members from the chado database. Here the families have both the gene-sets as members.

The file was produced using sql below:

www/drupal7]$ psql -c "select concat(family_label,':'), STRING_AGG((SELECT uniquename from chado.feature where feature_id=gene_id), E'\t')  fam_members from gene_family_assignment GROUP BY family_label;" | less -NS

Just for note: I didn't use the DS gene family file because here the cicar genes din't have, at the time, the standard gene names with gensp.strain.gnm#.ann# etc.

sdash-github commented 4 years ago

Critical info from dev-conekt:~/'note2-customtree.txt'

During reconcile tree process:

    704 Error because gen neames don't match in databse and in trees file.
    705
    706 Should be:
    707 glyma.Wm82.gnm2.ann1.Glyma.01G000100 instead of glyma.Glyma.09G150600.1,
    708 cicar.ICC4958.gnm2.ann1.Ca_00001 instead of cicar.Ca_04562,
    709 phavu.G19833.gnm1.ann1.Phvul.001G000100 instead of phavu.Phvul.001G140400.1,
    710 vigun.IT97K-499-35.gnm1.ann1.Vigun01g000100 instead of vigun.Vigun03g161900.1, etc.

Therefore this was done: Replaced 'cicar.' to 'cicar.ICC4958.gnm2.ann1.' and HERE Kabuli gen names got converted to Desi gene names

    714 <$> cat legfed_v1_0DOTL_00CL8T.txt | perl -pie 's/(phavu.Phvul.)(.+?)(\.\d?)(:)/phavu.G19833.gnm1.ann1.Phvul.$2$4/g; s/cicar./cicar.ICC4958.gnm2.ann1./g; s/(glyma.Glyma.)(.+?)(\.\d?)(:)/glyma.Wm82.gnm2.ann1.Glyma.$2$4/g;  s/(vigun.Vigun)(.+?)(\.\d?)(:)/vigun.IT97K-499-35.gnm1.ann1.Vigun$2$4/g'

adf-ncgr commented 4 years ago

yes, the families have both desi and kabuli genes assigned (table gene_family_assignment), but in the trees (table phylotree) we only have kabuli. The trees were built by Steven, but for genomes that he didn't include (for whatever reason), I do a post-build assignment using the HMMs that represent the families to find the best match. I think in conekt there is also a distinction between families and trees, though perhaps for different reasons- at least I know it will support having gene families without having trees, not sure it would allow you to have members of gene families that are not in the trees for those gene families, but it might.

adf-ncgr commented 4 years ago

OK, it sounds like you've found the point at which this happened. I think we need to update the trees so that they reflect the fact that they are really kabuli- we may need to load kabuli genes into the db for this to work properly.

sdash-github commented 4 years ago

I think the whole tree file should be updated to reflect our standard gene names for all species and not just Kabuli. I used this file to start with: DS: data/public/Gene_families/legume.genefam.fam1.M65K//legume.genefam.fam1.M65K.trees_ML_rooted.tar.gz

The above file had names like this then: glyma.Glyma.09G150600.1, cicar.Ca_04562, phavu.Phvul.001G140400.1, vigun.Vigun03g161900.1, etc.

They should be like: glyma.Wm82.gnm2.ann1.Glyma.01G000100 instead of glyma.Glyma.09G150600.1, cicar.CDC???.gnm1.ann1.Ca_00001 instead of cicar.Ca_04562, phavu.G19833.gnm1.ann1.Phvul.001G000100 instead of phavu.Phvul.001G140400.1, vigun.IT97K-499-35.gnm1.ann1.Vigun01g000100 instead of vigun.Vigun03g161900.1, etc.

adf-ncgr commented 4 years ago

yes, this is a discussion Sam and I have also had recently; I may already have something approximating that, as I think I needed to do something similar for chado loading purposes (and before these were officially put into the datastore by Steven). Let me take a look and I will let you know what I dig up.

adf-ncgr commented 4 years ago

OK, it looks like the renaming I did for chado loading is not quite "full yuck" but does enough that it should make applying the remaining pieces of transformation needed to get to "full yuck" relatively easy, once I determine for certain which versions of certain genomes were included. Note that the names in the trees are not gene names, though, they are transcript isoform names. For most species this will just be a question of removing the .\d+ at the end but we do have at least one exception to this, red clover, where gene name/mRNA name pairs aren't easily relatable without an explicit lookup, e.g. tripr.Tp57577_TGAC_v2_gene2455 tripr.mRNA2521; this likely won't be an issue for CoNekT anytime soon, though.

sdash-github commented 2 years ago

legumeinfo / ConektLegumes