jotech / gapseq

Informed prediction and analysis of bacterial metabolic pathways and genome-scale networks
GNU General Public License v3.0
151 stars 32 forks source link

GapSeq protein is absent when it is .faa shows its present #186

Closed Calvin2077 closed 12 months ago

Calvin2077 commented 1 year ago

Good afternoon,

My name is Calvin, and I am conducting my master's in which I am using bioinformatics to do a comparative metabolic analysis on a group of archaea. I came across GapSeq and found it easy to install and use, so thank you for making a great package.

However, when I ran ~/gapseq/gapseq find -p C1-COMPOUNDS -t Archaea -b 60 -l all Nitrosocosmicus_hydrocola.faa

One pathway, "formate oxidation to CO2," says that the protein NAD-dependent formate dehydrogenase [1.17.1.9] is absent.

Yet when I look into this species .faa file, I find the protein that completes this pathway formate dehydrogenase [EC:1.17.1.9].

Therefore, there seems to be a disconnect between GapSeq and what is present within my .faa file, and thus if I can please have some help in this matter, it would both mean a lot to me and take a massive weight off my shoulders.

Thank you very much, and I hope to hear from you very soon.

Calvin

p.s. I am still very new to bioinformatics, so I apologize for any inconvenience this may cause

#######################

Checking dependencies

####################### GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2) sed (GNU sed) 4.8 grep (GNU grep) 3.1 This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi tblastn: 2.6.0+ exonerate from exonerate version 2.4.0 bedtools v2.31.0 barrnap 0.9 - rapid ribosomal RNA prediction R version 4.2.3 (2023-03-15) -- "Shortstop Beagle" git version 2.17.1 GNU parallel 20161222 HMMER 3.3.2 (Nov 2020); http://hmmer.org/ bc 1.07.1

Missing dependencies: 0

#####################

Checking R packages

##################### data.table 1.14.8 stringr 1.5.0 sybil 2.2.0 getopt 1.20.3 doParallel 1.0.17 foreach 1.5.2 R.utils 2.12.2 stringi 1.7.12 glpkAPI 1.3.4 BiocManager 1.30.21 Biostrings 2.38.4 jsonlite 1.8.7 CHNOSZ 1.4.3 httr 1.4.6

Missing R packages: 0

##############################

Checking basic functionality

############################## Optimization test: OK Building full model: OK Blast test: OK

Passed tests: 3/3

Calvin2077 commented 1 year ago

Minor update when I switched the Taxonomic range (-t) from Archaea to Bacteria, GapSeq successfully identified formate dehydrogenase in my archaeal species, which is great.

However, when I check the Pathway file output, there is a significant difference between the degree of completion for other metabolic pathways depending on whether I run GapSeq using the Taxonomic range for Archaea and Bacteria. This makes sense as the enzymes that make them up pathways can differ between Archaea and Bacteria.

As a result, can I please ask how I should progress or what Taxonomic range I should choose for my project?

Waschina commented 12 months ago

Dear @Calvin2077 ,

If you know your organism is an Archaea, we recommend using the option `-t Archaea".

The results from gapseq find are expected to depend on the chosen taxonomic range since the reference sequence database between Archaea and Bacteria differs.

In your case, gapseq misses the identification of EC 1.17.1.9 because there is no closely related reference protein in the Archaea reference protein sequences. I will check if the reference sequences for Archaea can be improved so it will correctly predict the formate dehydrogenase presence.

Waschina commented 12 months ago

I just checked, and the problem is that there has yet to be a reviewed protein sequence for Archaea for the Enzyme EC 1.17.1.9 on UniProt.

A workaround in your case could be the gapseq command:

~/gapseq/gapseq find -p C1-COMPOUNDS -t Archaea -z 3 Nitrosocosmicus_hydrocola.faa

The option -z 3 includes reference sequences from UniProt that are reviewed and unreviewed. Thus, the prediction is more relaxed.

I hope this helps you.

Calvin2077 commented 11 months ago

Awesome thank you very much for your help much appreciated. Is there a way however to change the taxonomic range to include both bacteria and archaea? (e.g., instead of only -t Archaea I can put -t Archaea,Bacteria).

I did note a option "-m Limit pathways to taxonomic range (default: )". However, it seems to lack any input word as if possible I would love to turn this off and hence expand the homology searching of sequences across all tax

Waschina commented 11 months ago

Currently, the option -t cannot include both, Archaea and Bacteria. However, I remember discussing this with @jotech. The option to include reference protein sequences from both, Archaea and Bacteria, would be a nice feature. I will add this as a feature request that we will try to add in the future (although I can't promise a specific timeframe until when we will be able to implement the feature).

I did note a option "-m Limit pathways to taxonomic range (default: )". However, it seems to lack any input word ...

Thank you for pointing this out. I just corrected it. It now states:

-m Limit pathways to taxonomic range (default: all)
Calvin2077 commented 11 months ago

Fantastic, thank you for telling me and for your quick response; this is much appreciated and helps a lot!

Moreover, thank you for both making it a feature request as well as for updating this parameter!

If it is okay, I just have one more quick question. In I am looking at the Reactions.tbl, in particular at the 'Status' column, I find some proteins have a "no_blast" option, and others have "no_seq_data." therefore, can you please tell me what these mean?

I think it means that in the reference database (e.g., Archaea), there isn't a corresponding sequence to compare against; therefore, I was wondering if there is some way to circumvent that and have gapseq search the entire database for sequences to compare to.

Thanks again!

Waschina commented 11 months ago

no_blast means that there was no hit between the reference protein database for that enzyme and the input genome sequence. no_seq_data means that the reference protein database does not contain any sequence for the corresponding enzyme.

At the moment, there is no way that in those cases, the search includes sequences from a different domain than the one that is set with the option -t.

Calvin2077 commented 11 months ago

Awesome thank you for explaining these too me it is much appreciated, and helps a lot. Therefore, is there a way to update the reference protein database to include these sequences?

Waschina commented 11 months ago

You can tell gapseq to retrieve the latest sequences from UniProt with the option -U. For instance:

gapseq find -p TRPSYN -U  ecoli.faa.gz

But please keep in mind that this is very slow due to HTTP queries to Uniprot's API. And it will only retrieve the sequences for the taxonomic range set by -t.

Calvin2077 commented 11 months ago

Makes sense and understandable! Thanks for telling me this information it is much appreciated. Overall I really enjoy Gapseq it is easy to use and install and can give great accessible results so thank you so much for creating an amazing program.