Closed Calvin2077 closed 1 year ago
Hi Calvin,
I guess the reaction is added during the gap-filling. There, reactions with "blad_blast" might be added if they complete a metabolic functionality, e.g. a pathway to use a specific compound as an energy source. By default, gapseq considers reactions with a bitscore > 50 for those gap fillings.
You can make the gap-filling step more strict by using the option "-b". E.g.
gapseq fill -m <genomeID>-draft.RDS -n <genomeID>-medium.csv -c <genomeID>-rxnWeights.RDS -g <genomeID>-rxnXgenes.RDS -b 100
This will reset the bitscore threshold during gap-filling to 100 instead of 50.
I hope this helps. Silvio
Hello Silvio,
Thank you very much for getting back to me so fast as it is much appreciated and means a lot so thank you. Also thank you for your help in my question!
However one thing I accidentally forgot to include in my question is I only ran "Gap-seq find" and I did not run gapseq fill or any other downstream steps.
Thanks and sorry for the confusion
I see. Could you perhaps share the part of the table "...-all-Reactions.tbl" with the said reaction? Ideally all rows for the pathway, which contains the reaction.
Of course here you go!
Also here is a screen shot showing that the pathway is complete
Thank you for reporting this issue! Can you please also post the output of
./gapseq find -p PWY-40 -b 60 -z 2 your.genome.fna
You're welcome and thank you for your help and assistance it is much appreciated and of course! It is the first reaction 1) ARGDECARBOX-RXN arginine decarboxylase, that is being problematic (GapSeq says it is present despite having Bad_Blast!
okay, thank you! That the reactions are found makes sense since you lowered the bitscore. However, it is still unclear why it is labeled as a "bad blast" hit.
Okay so the problem lies in lowering the bit-score this does make sense so thank you. However, when I checked another protein/enzyme (ORNDECARBOX-RXN ornithine decarboxylase 4.1.1.17 ) that has a bit-score of 89.7 the output of GapSeq says it is absent. Please see attached photo
Hi Calvin,
I tried to reproduce the issue with an Nitrosothermus koennekii assembly from NCBI. Everything was just as expected: There were no bad_blast hits for Pathway PWY-40.
Could you run the following commands and post the output?
gapseq find -p PWY-40 -t Archaea -b 60 Nitrosothermus_koennekii.faa
cat Nitrosothermus_koennekii-PWY-40-Reactions.tbl
cat Nitrosothermus-PWY-40-Pathways.tbl
For me, the output was:
/tmp/tmp.S1WcusPC8v
Protein fasta detected.
Checking updates for Archaea /home/silvio/Software/gapseq/src/../dat/seq/Archaea
Reference sequences are up-to-date.
metacyc,custom
Duplicated pathway IDs found: |ARGDEGRAD-PWY2| |NADPHOS-DEPHOS-PWY| will only use /home/silvio/Software/gapseq/src/../dat/custom_pwy.tbl
\|ARGDEGRAD-PWY2\||\|NADPHOS-DEPHOS-PWY\|
1
Checking for pathways and reactions in: Nitrosothermus.faa PWY-40
Number of pathways to be considered: 1
1/1: Checking for pathway |PWY-40| putrescine biosynthesis I with 2 reactions
(Biosynthesis,Polyamine-Biosynthesis,Putrescine-Biosynthesis)
1) ARGDECARBOX-RXN arginine decarboxylase, biosynthetic 4.1.1.19 speA speA SPE1 MADC2 MADC3 ADC ADC pdaD speA ADC adiA speA
--> Found sequences: /home/silvio/Software/gapseq/src/../dat/seq/Archaea/rev/4.1.1.19.fasta (32 sequences)
Final file: /tmp/tmp.S1WcusPC8v/tmp.BE8ElyVNwY (32 sequences)
Blast hit (2x)
bit=92.0 id=40.000 cov=80 hit=UniRef90_A8AAB6
bit=81.6 id=32.456 cov=85 hit=UniRef90_A8MBV3
Candidate reaction for import: 1
2) AGMATIN-RXN agmatinase 3.5.3.11 speB AGMAT speB
--> Found sequences: /home/silvio/Software/gapseq/src/../dat/seq/Archaea/rev/3.5.3.11.fasta (3 sequences)
Final file: /tmp/tmp.S1WcusPC8v/tmp.brVixrq3ku (3 sequences)
Blast hit (1x)
bit=146 id=35.433 cov=88 hit=UniRef90_Q5JI38
Candidate reaction for import: 1
Pathway completeness: 2/2 (100%)
Hits with candidate reactions in database: 2/2
Key reactions: 0/0
Pathways found:
putrescine biosynthesis I
Candidate reactions found: 2
%CPU %MEM COMMAND
4.0 0.0 /bin/bash /home/silvio/Software/gapseq/src/gapseq_find.sh -p PWY-40 -t
Running time: 1 s.
# gapseq version: 1.2 6225829
# Sequence DB md5sum: 748b938 (2022-12-19, Archaea)
# Genome format: prot
rxn name ec bihit qseqid pident evalue bitscore qcovs stitle sstart send pathway statuspathway.status dbhit complex exception complex.status
ARGDECARBOX-RXN arginine decarboxylase, biosynthetic 4.1.1.19 UniRef90_A8AAB6 40.000 2.30e-26 92.0 80 RMF31093.1 MAG: adenosylmethionine decarboxylase [Candidatus Nitrosothermus koennekii] 3 116 |PWY-40| good_blast full rxn00405 NA 0 1
ARGDECARBOX-RXN arginine decarboxylase, biosynthetic 4.1.1.19 UniRef90_A8MBV3 32.456 1.79e-22 81.6 85 RMF31093.1 MAG: adenosylmethionine decarboxylase [Candidatus Nitrosothermus koennekii] 3 114 |PWY-40| good_blast full rxn00405 NA 0 1
AGMATIN-RXN agmatinase 3.5.3.11 UniRef90_Q5JI38 35.433 7.94e-44 146 88 RMF31931.1 MAG: agmatinase [Candidatus Nitrosothermus koennekii] 24 272 |PWY-40| good_blast full rxn00858 NA 0 1
# gapseq version: 1.2 6225829
# Sequence DB md5sum: 748b938 (2022-12-19, Archaea)
# Genome format: prot
ID Name Prediction Completeness VagueReactions KeyReactions KeyReactionsFound ReactionsFound
|PWY-40| putrescine biosynthesis I true 100 0 0 0 ARGDECARBOX-RXN AGMATIN-RXN
Hello!
Thank you for reproducing my data/.faa to try and find the source of the error. It is much appreciated!
As requested here are the photos of my output (.tsv instead of cat for more clarity for me) and like yourself I am getting no "bad_blast" for PWY-40. Which is incredible strange and odd I have likewise tried running GapSeq on the entire "Polyamine-Biosynthesis" and am still getting now a "good_blast" (in other words no "bad_blast")
Very odd but seems like in the end I think the major source of the problem (in why this protein was included when other analysis didn't) might be related to the low bit-score I set up.
I tried to reproduce the issue with an Nitrosothermus koennekii assembly from NCBI. Everything was just as expected: There were no bad_blast hits for Pathway PWY-40.
Could you run the following commands and post the output?
However, if it is okay, I just have another confusing find in that ORNDECARBOX-RXN ornithine decarboxylase 4.1.1.17 (please see the attached photos) says it is has a "bad_blast" yet the bit-score is above my 60 threshold (bit-score 86.9), whereas the output says ARGINASE-RXN (arginase) is present.
One idea, I am thinking it is related to ORNDECARBOX-RXN ornithine decarboxylase having "bihit" as NA while ARGINASE-RXN (arginase) doesn't. Therefore do you think this the reason? If so, and I apologize (as I am still pretty new to bioinfomatics) but what does "bihit" {Bidirectional hit} mean?
Thanks again for all your help it is so very much appreciated!
Hi Calvin,
unfortunately, your screenshot does not show the entire table (That is why I suggested the output of the cat
command). You will see that the column "exception" says 1 for EC 4.1.1.17. This means this enzyme requires a higher identity to be identified as "good_blast". The gapseq publication in the section "Pathway and subsystem prediction" explains the exception handling.
One small note: When posting code blocks or parts of command line log outputs, you could consider posting the text in code blocks (https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks) instead of screenshots.
I am closing the issue for now as we can not reproduce the described problem. Please feel free to re-open if issue remains.
Good day,
My name is Calvin, and my overall question is, why did GapSeq include/accept a protein as being present when it got a bad blast?
In more detail, I am using GapSeq to assess the metabolic pathways for a large group of Archaea. One group of pathways we are particularly interested in are Polyamines.
So I ran GapSeq using a bit-score of 60 and -z 2 (unreviewed only if reviewed not available), which all went well with no issues.
But, when I checked my results, "Putrescine Biosynthesis I" stood out as being complete in my species, which is strange as previous analysis (blast and KEGG) showed that my species lacked the "arginine decarboxylase" to make it happen.
Therefore, I checked the Reaction file (output from GapSeq), and I found that for all three examples of "arginine decarboxylase," they were labelled as "bad_blast" (which, according to the GapSeq wiki says it was due to blast hit with lower quality, i.e. lower bit-score, coverage, or identity than needed for cutoffs)
All three had e-values = 5.82e-24 bit-scores = 86.3
Therefore, my question is, why did GapSeq include/accept "arginine decarboxylase" as being present when I got a bad blast?
Is it due to the bit-score being higher than my threshold? However, oddly, when I checked another protein with a "bad blast" that got a similar value (e-value = 5.40e-21; bit-score = 89.7), the Pathway file said it was absent.
Any help would be much appreciated. Thanks
Calvin