jotech / gapseq

Informed prediction and analysis of bacterial metabolic pathways and genome-scale networks
GNU General Public License v3.0
161 stars 33 forks source link

Question: GapSeq includes a protein as being present when it got a bad blast? #198

Closed Calvin2077 closed 1 year ago

Calvin2077 commented 1 year ago

Good day,

My name is Calvin, and my overall question is, why did GapSeq include/accept a protein as being present when it got a bad blast?

In more detail, I am using GapSeq to assess the metabolic pathways for a large group of Archaea. One group of pathways we are particularly interested in are Polyamines.

So I ran GapSeq using a bit-score of 60 and -z 2 (unreviewed only if reviewed not available), which all went well with no issues.

But, when I checked my results, "Putrescine Biosynthesis I" stood out as being complete in my species, which is strange as previous analysis (blast and KEGG) showed that my species lacked the "arginine decarboxylase" to make it happen.

Therefore, I checked the Reaction file (output from GapSeq), and I found that for all three examples of "arginine decarboxylase," they were labelled as "bad_blast" (which, according to the GapSeq wiki says it was due to blast hit with lower quality, i.e. lower bit-score, coverage, or identity than needed for cutoffs)

All three had e-values = 5.82e-24 bit-scores = 86.3

Therefore, my question is, why did GapSeq include/accept "arginine decarboxylase" as being present when I got a bad blast?

Is it due to the bit-score being higher than my threshold? However, oddly, when I checked another protein with a "bad blast" that got a similar value (e-value = 5.40e-21; bit-score = 89.7), the Pathway file said it was absent.

Any help would be much appreciated. Thanks

Calvin

Waschina commented 1 year ago

Hi Calvin,

I guess the reaction is added during the gap-filling. There, reactions with "blad_blast" might be added if they complete a metabolic functionality, e.g. a pathway to use a specific compound as an energy source. By default, gapseq considers reactions with a bitscore > 50 for those gap fillings.

You can make the gap-filling step more strict by using the option "-b". E.g.

gapseq fill -m <genomeID>-draft.RDS -n <genomeID>-medium.csv -c <genomeID>-rxnWeights.RDS -g <genomeID>-rxnXgenes.RDS -b 100

This will reset the bitscore threshold during gap-filling to 100 instead of 50.

I hope this helps. Silvio

Calvin2077 commented 1 year ago

Hello Silvio,

Thank you very much for getting back to me so fast as it is much appreciated and means a lot so thank you. Also thank you for your help in my question!

However one thing I accidentally forgot to include in my question is I only ran "Gap-seq find" and I did not run gapseq fill or any other downstream steps.

Thanks and sorry for the confusion

Waschina commented 1 year ago

I see. Could you perhaps share the part of the table "...-all-Reactions.tbl" with the said reaction? Ideally all rows for the pathway, which contains the reaction.

Calvin2077 commented 1 year ago

Of course here you go!

Screenshot 2023-11-09 at 2 35 17 PM
Calvin2077 commented 1 year ago

Also here is a screen shot showing that the pathway is complete

Screenshot 2023-11-09 at 2 37 10 PM
jotech commented 1 year ago

Thank you for reporting this issue! Can you please also post the output of

./gapseq find -p PWY-40 -b 60 -z 2 your.genome.fna
Calvin2077 commented 1 year ago

You're welcome and thank you for your help and assistance it is much appreciated and of course! It is the first reaction 1) ARGDECARBOX-RXN arginine decarboxylase, that is being problematic (GapSeq says it is present despite having Bad_Blast!

Screenshot 2023-11-10 at 9 31 35 AM
jotech commented 1 year ago

okay, thank you! That the reactions are found makes sense since you lowered the bitscore. However, it is still unclear why it is labeled as a "bad blast" hit.

Calvin2077 commented 1 year ago

Okay so the problem lies in lowering the bit-score this does make sense so thank you. However, when I checked another protein/enzyme (ORNDECARBOX-RXN ornithine decarboxylase 4.1.1.17 ) that has a bit-score of 89.7 the output of GapSeq says it is absent. Please see attached photo

Screenshot 2023-11-10 at 2 27 37 PM
Waschina commented 1 year ago

Hi Calvin,

I tried to reproduce the issue with an Nitrosothermus koennekii assembly from NCBI. Everything was just as expected: There were no bad_blast hits for Pathway PWY-40.

Could you run the following commands and post the output?

gapseq find -p PWY-40 -t Archaea -b 60 Nitrosothermus_koennekii.faa
cat Nitrosothermus_koennekii-PWY-40-Reactions.tbl
cat Nitrosothermus-PWY-40-Pathways.tbl

For me, the output was:

/tmp/tmp.S1WcusPC8v
Protein fasta detected.
Checking updates for Archaea /home/silvio/Software/gapseq/src/../dat/seq/Archaea
Reference sequences are up-to-date.
metacyc,custom
Duplicated pathway IDs found: |ARGDEGRAD-PWY2| |NADPHOS-DEPHOS-PWY| will only use /home/silvio/Software/gapseq/src/../dat/custom_pwy.tbl
\|ARGDEGRAD-PWY2\||\|NADPHOS-DEPHOS-PWY\|
1
Checking for pathways and reactions in: Nitrosothermus.faa PWY-40
Number of pathways to be considered: 1

1/1: Checking for pathway |PWY-40| putrescine biosynthesis I with 2 reactions
(Biosynthesis,Polyamine-Biosynthesis,Putrescine-Biosynthesis)
    1) ARGDECARBOX-RXN arginine decarboxylase, biosynthetic 4.1.1.19 speA speA SPE1 MADC2 MADC3 ADC ADC pdaD speA ADC adiA speA
        --> Found sequences: /home/silvio/Software/gapseq/src/../dat/seq/Archaea/rev/4.1.1.19.fasta (32 sequences)
        Final file: /tmp/tmp.S1WcusPC8v/tmp.BE8ElyVNwY (32 sequences)
        Blast hit (2x)
            bit=92.0 id=40.000 cov=80 hit=UniRef90_A8AAB6
            bit=81.6 id=32.456 cov=85 hit=UniRef90_A8MBV3
        Candidate reaction for import: 1
    2) AGMATIN-RXN agmatinase 3.5.3.11 speB AGMAT speB
        --> Found sequences: /home/silvio/Software/gapseq/src/../dat/seq/Archaea/rev/3.5.3.11.fasta (3 sequences)
        Final file: /tmp/tmp.S1WcusPC8v/tmp.brVixrq3ku (3 sequences)
        Blast hit (1x)
            bit=146 id=35.433 cov=88 hit=UniRef90_Q5JI38
        Candidate reaction for import: 1
Pathway completeness: 2/2 (100%)
Hits with candidate reactions in database: 2/2
Key reactions: 0/0

Pathways found:
putrescine biosynthesis I

Candidate reactions found: 2 

%CPU %MEM COMMAND
 4.0  0.0 /bin/bash /home/silvio/Software/gapseq/src/gapseq_find.sh -p PWY-40 -t
Running time: 1 s.

# gapseq version: 1.2 6225829
# Sequence DB md5sum: 748b938 (2022-12-19, Archaea)
# Genome format: prot
rxn name    ec  bihit   qseqid  pident  evalue  bitscore    qcovs   stitle  sstart  send    pathway statuspathway.status    dbhit   complex exception   complex.status
ARGDECARBOX-RXN arginine decarboxylase, biosynthetic    4.1.1.19        UniRef90_A8AAB6 40.000  2.30e-26    92.0    80  RMF31093.1 MAG: adenosylmethionine decarboxylase [Candidatus Nitrosothermus koennekii]  3   116 |PWY-40|    good_blast  full    rxn00405    NA  0   1
ARGDECARBOX-RXN arginine decarboxylase, biosynthetic    4.1.1.19        UniRef90_A8MBV3 32.456  1.79e-22    81.6    85  RMF31093.1 MAG: adenosylmethionine decarboxylase [Candidatus Nitrosothermus koennekii]  3   114 |PWY-40|    good_blast  full    rxn00405    NA  0   1
AGMATIN-RXN agmatinase  3.5.3.11        UniRef90_Q5JI38 35.433  7.94e-44    146 88  RMF31931.1 MAG: agmatinase [Candidatus Nitrosothermus koennekii]    24  272 |PWY-40|    good_blast  full    rxn00858    NA  0   1

# gapseq version: 1.2 6225829
# Sequence DB md5sum: 748b938 (2022-12-19, Archaea)
# Genome format: prot
ID  Name    Prediction  Completeness    VagueReactions  KeyReactions    KeyReactionsFound   ReactionsFound
|PWY-40|    putrescine biosynthesis I   true    100 0   0   0   ARGDECARBOX-RXN AGMATIN-RXN 
Calvin2077 commented 1 year ago

Hello!

Thank you for reproducing my data/.faa to try and find the source of the error. It is much appreciated!

As requested here are the photos of my output (.tsv instead of cat for more clarity for me) and like yourself I am getting no "bad_blast" for PWY-40. Which is incredible strange and odd I have likewise tried running GapSeq on the entire "Polyamine-Biosynthesis" and am still getting now a "good_blast" (in other words no "bad_blast")

Very odd but seems like in the end I think the major source of the problem (in why this protein was included when other analysis didn't) might be related to the low bit-score I set up.

I tried to reproduce the issue with an Nitrosothermus koennekii assembly from NCBI. Everything was just as expected: There were no bad_blast hits for Pathway PWY-40.

Could you run the following commands and post the output?

Screenshot 2023-11-11 at 1 52 15 PM Screenshot 2023-11-11 at 1 50 42 PM Screenshot 2023-11-11 at 1 50 28 PM
Calvin2077 commented 1 year ago

However, if it is okay, I just have another confusing find in that ORNDECARBOX-RXN ornithine decarboxylase 4.1.1.17 (please see the attached photos) says it is has a "bad_blast" yet the bit-score is above my 60 threshold (bit-score 86.9), whereas the output says ARGINASE-RXN (arginase) is present.

One idea, I am thinking it is related to ORNDECARBOX-RXN ornithine decarboxylase having "bihit" as NA while ARGINASE-RXN (arginase) doesn't. Therefore do you think this the reason? If so, and I apologize (as I am still pretty new to bioinfomatics) but what does "bihit" {Bidirectional hit} mean?

Thanks again for all your help it is so very much appreciated!

Screenshot 2023-11-11 at 2 11 57 PM Screenshot 2023-11-11 at 2 12 43 PM
Waschina commented 1 year ago

Hi Calvin,

unfortunately, your screenshot does not show the entire table (That is why I suggested the output of the cat command). You will see that the column "exception" says 1 for EC 4.1.1.17. This means this enzyme requires a higher identity to be identified as "good_blast". The gapseq publication in the section "Pathway and subsystem prediction" explains the exception handling.

One small note: When posting code blocks or parts of command line log outputs, you could consider posting the text in code blocks (https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks) instead of screenshots.

Waschina commented 1 year ago

I am closing the issue for now as we can not reproduce the described problem. Please feel free to re-open if issue remains.