jotech / gapseq

Informed prediction and analysis of bacterial metabolic pathways and genome-scale networks
GNU General Public License v3.0
153 stars 32 forks source link

duplicate gene entries and similar reactions per pathway #69

Closed mmpust closed 3 years ago

mmpust commented 3 years ago

Hello,

  1. What is the difference between reactions 1 and 2 in the following output pathway? I mean if an organism has GLY1 and ltaE and can perform the first reaction then why not the second one?
  2. Why are some genes given several times? (reaction 3, "adhE"). Does this have biological meaning?
532/1779: Checking for pathway |PWY-5436| L-threonine degradation IV with 3 reactions
(Degradation,Amino-Acid-Degradation,Proteinogenic-Amino-Acids-Degradation,THREONINE-DEG)
        1) THREONINE-ALDOLASE-RXN L-threonine aldolase 4.1.2.48 GLY1 ltaE
                Merge sequence data from 4.1.2.48.fasta and THREONINE-ALDOLASE-RXN.fasta
                /tmp/tmp.fRALsUT19v (5 sequences)
                Blast hit (1x)
                        bit=515 id=60.976 cov=96 hit=UniRef90_D3DKC4
                NO candidate reaction found for import
        2) THREONINE-ALDOLASE-RXN L-threonine aldolase 4.1.2.5 GLY1 ltaE
                Merge sequence data from 4.1.2.5.fasta and THREONINE-ALDOLASE-RXN.fasta
                /tmp/tmp.DNuvIvu1BI (214 sequences)
                NO good blast hit
                        (best one: bit=97.1 id=26.329 cov=91)
        3) ACETALD-DEHYDROG-RXN acetaldehyde dehydrogenase 1.2.1.10 adhE adhE cmtH todI dmpF ADH1 eutE adhE mhpF eutE
                Merge sequence data from 1.2.1.10.fasta and ACETALD-DEHYDROG-RXN.fasta
                /tmp/tmp.9Cd84wKMOe (124 sequences)
                NO good blast hit
                        (best one: bit=26.9 id=22.800 cov=75)
Pathway completeness: 1/3 (33%)
Hits with candidate reactions in database: 0/3
Key reactions: 0/0

Thank you very much in advance! Marie

Waschina commented 3 years ago

Hi Marie,

these are valid questions :) Here's how to interpret these logs:

  1. The difference between reaction 1) and 2) is the EC number. According to MetaCyc, the reaction "THREONINE-ALDOLASE-RXN" is assigned to two different EC-numbers. Gapseq retrieves reference sequences for each EC-Number from Uniprot individually and merges these sequences with the reference sequences that are specified in MetaCyc directly to this reaction (which are in this case GLY1 and ltaE). Thus these two genes are tested for both reactions 1) and 2) but the reference sequences retrieved from UniProt are different between 1) and 2). In this specific case, the Blast-Hit is obtained from one of the Uniprot sequences and not from GLY1 or ItaE from MetaCyc.
  2. Those gene names correspond to the genes names in different organisms, which MetaCyc has associated to this specific reaction. Thus, it could happen that genes have the same name in different organisms and appear as duplicates in these gapseq logs.

Best Silvio

mmpust commented 3 years ago

Dear Silvio, thank you for this very fast and helpful response!

Answer 1: How do I interpret this biologically? If the genes GLY1 and ltaE are detected in a bacterial genome, then the bacterium has the genetic potential to undergo the THREONINE-ALDOLASE-RXN. In this case, the pathway completeness would be 50 %. and not 33 %, is that correct? Also, very often a pathway has just two reactions assigned to and both reactions only differ in their EC number but have the same genes and reaction name. So, gapseq assigns a 50% completeness value if the genome has sequences that are matching with one of the EC entries. However, the pathway is probably 100 % complete if the genes are present and a good Blast hit for one EC number is found? Does this make sense?

Answer 2: This is very clear now, thanks!

Kind regards, Marie

Waschina commented 3 years ago

Dear Marie, thanks for your careful evaluation :) In cases of two ECs for one reactions, we interpreted the individual ECs as sub-reactions of the overall reaction, which is why we splitted them in the prediction into two. But in the specific case you showed, it looks more like an instance of two different EC numbers, which catalyse both the same reaction (i.e. same stoichiometry) but with different specificity for the focal substrate. We will recheck how we are handling such cases and my do some adjustments.

mmpust commented 3 years ago

Thank you! Great. I just send you one other example (that occurs quite often that way). Maybe this helps in your evaluation process. Here, gapseq assigns a 50% completeness score but all the genes are present for both reactions.

89/1779: Checking for pathway |GLUTAMINDEG-PWY| L-glutamine degradation I with 2 reactions
(Degradation,Amino-Acid-Degradation,Proteinogenic-Amino-Acids-Degradation,GLUTAMINE-DEG)
        1) GLUTAMIN-RXN glutaminase 3.5.1.38 GLS Gls2 SNZ1 SNO1 ybaS yneH asnB glsA glsB
                Merge sequence data from 3.5.1.38.fasta and GLUTAMIN-RXN.fasta
                /tmp/tmp.vXA36kq4Pj (8 sequences)
                Blast hit (1x)
                        bit=241 id=42.667 cov=83 hit=UniRef90_O68897
                Candidate reaction for import: 4
        2) GLUTAMIN-RXN glutaminase 3.5.1.2 GLS Gls2 SNZ1 SNO1 ybaS yneH asnB glsA glsB
                Merge sequence data from 3.5.1.2.fasta and GLUTAMIN-RXN.fasta
                /tmp/tmp.VPwJaGFO5h (487 sequences)
                check subunits: 3
                total subunits found: 0 / 3
                NO hit because of missing subunits
Pathway completeness: 1/2 (50%)
Hits with candidate reactions in database: 1/2
Key reactions: 0/0

Kind regards, Marie

Waschina commented 3 years ago

Another interesting case:

https://metacyc.org/META/NEW-IMAGE?type=PATHWAY&object=PWY-6966

Here, the two EC-numbers refer to different subunits. Currently, gapseq does not predict the pathway/reaction e.g. for Methylococcus capsulatus Bath (GCF_000008325.1), for which the pathway is expected (#66).

jotech commented 3 years ago

hi @mmpust, sorry it took a bit longer to come up with a fix for this!

I revised the behavior of gapseq so that multiple metacyc EC numbers are treated as alternatives belonging to the same reactions now.

In your example PWY-5436 (L-threonine degradation IV), this means that only two reactions are considered and the pathway prediction should work more reasonably! The new approach should also work for more than two alternative EC numbers (e.g. ANAPHENOXI-PWY).

Hope it works as expected now. Thank you for pointing this out :)

mmpust commented 3 years ago

Awesome, thank you!