nmquijada / tormes

Making whole bacterial genome sequencing data analysis easy
GNU General Public License v3.0

custom protein database #56

Open shlomobl opened 2 years ago

shlomobl commented 2 years ago

Hi, I am analyzing multiple bacterial genomes with very little programming knowledge. The way tormes parses and summarizes the results from all the genomes in tabular files is very helpful! Tormes now has an option to query the genomes with a custom nucleotide database, but what I have is a protein database. Is there any way to do this with tormes? Any other suggestions? In the end, I really need a genome X protein sort of table. Thanks!

nmquijada commented 2 years ago

Hi @shlomobl

I am afraid that in the current version of tormes, only custom nucleotide databases are supported as an integrated option for gene searches. We have added custom amino acid database searches to the ongoing development version of the tool, which we hope to release after the summer. I will keep you posted.

In the meantime, if you want to use an amino acid database, I can guide you through doing so with blastp, taking advantage of the tormes file hierarchy. Would that be an option for you? The predicted proteins of your genomes will be in the gene_prediction or annotation directories (depending on the option you used to run the pipeline).

Alternatively, you can add those proteins to the database that prokka uses for annotation and then look for them in the annotation results.

shlomobl commented 2 years ago

Hi, Yes, please, I appreciate it! Especially if results can be summarized in a presence/absence table with all genomes, similar to VFs/AMR. I guess it is easier to generate a table from BLAST than by adding these genes to annotation? Thanks! S.

nmquijada commented 2 years ago

Hi @shlomobl

Sorry for the late reply. Both running BLAST and adding the proteins to the annotation are straightforward processes. However, with the latter you would still have to retrieve the information for the genes you are looking for from the annotation results.

If you would like the results to appear in the tormes report, that would require some experience with R Markdown, which is used to generate the report. If you don't have experience with it, I would encourage you to wait a bit until we release the next version, which will allow protein databases to be used for direct "blasting".

In the meantime, if you would like to look for some proteins in your dataset with BLAST, you need to make a blast-formatted database first:

makeblastdb -in my_proteins.faa -title my_prot -out my_db/my_prot -dbtype prot -hash_index

Then, you can run BLASTP on the protein file predicted by prodigal (and/or annotated with prokka). For instance:

blastp -query tormes_output/annotation/genome_01_annotation/genome_01.faa -db my_db/my_prot -out blastp_output.txt -max_target_seqs 1000 -culling_limit <culling limit to be used (>1)> -evalue 1e-25 -num_threads <num of CPUs> -outfmt "6 qseqid sseqid length qstart qend sstart send mismatch gaps pident evalue bitscore slen"

# You can add a header line describing the fields, for instance (GNU sed syntax):
sed -i '1i qseqid\tsseqid\tlength\tqstart\tqend\tsstart\tsend\tmismatch\tgaps\tpident\tevalue\tbitscore\tslen' blastp_output.txt
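
To get from per-genome BLASTP outputs to a presence/absence table across all genomes, something along these lines may work. It is only a sketch and not part of tormes: the function names `run_blastp_all` and `presence_absence_table`, the `<genome>.blastp.tsv` file naming, and the `MIN_ID` variable are all made up for illustration, and it assumes every genome follows the `annotation/<genome>_annotation/<genome>.faa` layout shown in the command above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, file naming and MIN_ID are assumptions, not tormes features.

# Run BLASTP for every predicted-protein file, writing one output file per genome.
# $1 = tormes output directory, $2 = BLAST database prefix, $3 = results directory
run_blastp_all() {
    mkdir -p "$3"
    for faa in "$1"/annotation/*_annotation/*.faa; do
        [ -e "$faa" ] || continue            # skip if the glob matched nothing
        genome=$(basename "$faa" .faa)
        blastp -query "$faa" -db "$2" -out "$3/$genome.blastp.tsv" \
            -evalue 1e-25 -max_target_seqs 1000 \
            -outfmt "6 qseqid sseqid length qstart qend sstart send mismatch gaps pident evalue bitscore slen"
    done
}

# Print a tab-separated genome x protein table (1 = at least one hit, 0 = none).
# Arguments: the per-genome *.blastp.tsv files. Column 2 is the database protein
# (sseqid) and column 10 is pident; hits below MIN_ID (default 0) are ignored.
presence_absence_table() {
    awk -F'\t' -v min_id="${MIN_ID:-0}" '
        BEGIN {
            # Register every genome up front so genomes with no hits still get a row.
            for (i = 1; i < ARGC; i++) {
                g = ARGV[i]
                sub(/.*\//, "", g)
                sub(/\.blastp\.tsv$/, "", g)
                name[ARGV[i]] = g
                genomes[++ng] = g
            }
        }
        $1 == "qseqid" { next }               # skip an optional header line
        NF >= 10 && $10 + 0 >= min_id + 0 {
            if (!($2 in seen)) { seen[$2] = 1; proteins[++np] = $2 }
            hit[name[FILENAME], $2] = 1
        }
        END {
            printf "genome"
            for (j = 1; j <= np; j++) printf "\t%s", proteins[j]
            printf "\n"
            for (i = 1; i <= ng; i++) {
                printf "%s", genomes[i]
                for (j = 1; j <= np; j++)
                    printf "\t%d", (((genomes[i], proteins[j]) in hit) ? 1 : 0)
                printf "\n"
            }
        }
    ' "$@"
}
```

A possible invocation would be `run_blastp_all tormes_output my_db/my_prot blast_results` followed by `presence_absence_table blast_results/*.blastp.tsv > protein_presence_absence.tsv`. Note that this treats any hit reported at the chosen e-value as "present"; stricter identity or coverage cut-offs may be more appropriate depending on the proteins.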

As I said, I hope we can release the next version soon. I hope this helps in the meantime and that you can search for your proteins of interest!

Best, Narciso
