blastx metagenome assembly aligned to custom database too slowly

Ash1One commented 4 years ago

Hello @tseemann, I have built the DeepARGs database from https://bitbucket.org/gusphdproj/deeparg-ss/src/master/database/ and it has nearly 12000 sequences. When I ran abricate on my metagenome assembly ORFs file against the card database, it took dozens of minutes. But when I ran abricate on the same ORFs file against the DeepARGs database, it had already taken more than 20 hours and still had not finished.

DATABASE       SEQUENCES  DBTYPE  DATE
argannot       2223       nucl    2019-Sep-27
card           2594       nucl    2019-Sep-27
ecoh           597        nucl    2019-Sep-27
ecoli_vf       2701       nucl    2019-Sep-27
ncbi           5029       nucl    2019-Sep-27
plasmidfinder  460        nucl    2019-Sep-27
resfinder      3077       nucl    2019-Sep-27
vfdb           2597       nucl    2019-Sep-27
DeepARGs       12279      prot    2019-Sep-27

I would like to know: is this just because blastx is slow, or did I do something wrong?

(abricate-env) [huanghao@localhost diamond]$ head -n 4 ~/conda/envs/abricate-env/db/DeepARGs/sequences
>DeepARGs~~~VIM~~~JN129451.1.gene1.p01~~~beta-lactam From DeepARGs DB
MFKLLSKLLVYLTASIMAIASPLAFSVDSSGEYPTVNEIPVGEVRLYQIADGVWSHIATQSFDGAVYPSNGLIVRDGDELLLIDTAWGAKNTAALLAEIEKQIGLPVTRAVSTHFHDDRVGGVDVLRAAGVATYASPSTRRLAEVEGNEIPTHSLEGLSSSGDAVRFGPVELFYPGAAHSTDNLVVYVPSASVLYGGCAIYELSRTSAGNVADADLAEWPTSIERIQQHYPEAQFVIPGHGLPGGLDLLKHTTNVVKAHTNRSVVE
>DeepARGs~~~mdtN~~~YP_002385298~~~multidrug From DeepARGs DB
MESTPKNATRNKLPALILTVAAVVALVYVIWRVDSAPATNDAYASADTVDVVPEVSGRIVELAVKDNQLVKQGDLLFRIDPRPYEASLAKAQASLTALDKQIMLTQRSVEAQQLGAAAVKTSVEKALAIVHQTSKTFQRTESLLAEGYVSDEDVDRARTAHRSAQVDYAALLLQAQSAVSGVGGVDALVAQREAVLADIALTKLHLEMATVRAPFDGRVVSLKTSVGQFASAMRPIFTLIDTRHWYVIANFRETELNNIRAGTPATVRLMSDSGKTFEGKVDSIGYGVLPDDGGMVLGGLPRVSRSINWVRVAQRFPVKIMVDNPDPEMFRIGASAVANLEPQ

Thanks!

tseemann commented 4 years ago

The protein mode of Abricate is undocumented and should not be used.

It is doing the search the wrong way around. It will never finish running. The proper way is to use tblastn of DeepARG to contigs, not blastx of contigs to DeepARG.
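As a rough sketch of that reversed search direction (proteins as the query, contigs as the database), the two BLAST+ commands could be assembled as below. The file names `contigs.fa` and `deeparg.faa`, the database prefix `contigs_db`, the E-value and the thread count are all placeholders chosen for illustration; none of this is what abricate itself runs.

```python
import shlex


def tblastn_commands(contigs_fasta, protein_fasta, db_prefix="contigs_db",
                     evalue="1e-20", threads=4):
    """Return two argv lists: index the contigs, then search proteins against them."""
    # Build a nucleotide BLAST database from the assembled contigs.
    make_db = ["makeblastdb", "-in", contigs_fasta, "-dbtype", "nucl",
               "-out", db_prefix]
    # Protein queries vs. translated nucleotide database (tabular output).
    search = ["tblastn", "-query", protein_fasta, "-db", db_prefix,
              "-outfmt", "6", "-evalue", evalue, "-num_threads", str(threads)]
    return make_db, search


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be inspected first.
    for cmd in tblastn_commands("contigs.fa", "deeparg.faa"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.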

Does DeepARG really have more true gene families than --db ncbi? Or just more minor alleles?

Does DeepARG have its own annotation tool? I see a diamond database in their repo.

Ash1One commented 4 years ago

Thank you for your reply, @tseemann. As you said:

The proper way is to use tblastn of DeepARG to contigs, not blastx of contigs to DeepARG.

From your advice, I understand that BLAST is a local alignment tool, so it is appropriate to BLAST shorter sequences against a database that consists of longer sequences. But I have also read the abricate code:

    my $blastcmd = $dbinfo->{DBTYPE} eq 'nucl'
                 ? "blastn -task blastn -dust no -perc_identity $minid"
                 : "blastx -task blastx-fast -seg no"
                 ;

    my $cmd = "(any2fasta -q -u \Q$file\E |"
            . " $blastcmd -db \Q$db_path\E -outfmt '$format' -num_threads $threads"
            . " -evalue 1E-20 -culling_limit $CULL"
    #       . " -max_target_seqs ".$dbinfo->{SEQUENCES}   # Issue #76
            . ") 2>&1"
            ;

and I was confused by it. Should abricate BLAST the CARD or NCBI sequences against the contigs, which are much longer than typical antibiotic resistance genes? By the way, in several papers I have read, the authors usually use Prodigal or MetaGeneMark to predict open reading frames (ORFs) from the assembled contigs, then blastx the ORFs against the CARD or DeepARG database. I do not know whether tblastn of ARGs against ORFs or blastx of ORFs against ARGs is the proper way to identify AR-like genes, since it is not clear which of the two is longer. Looking forward to your reply. 😀

tseemann commented 4 years ago

In my opinion, relying on an ORF/gene predictor, then using BLASTP, is a bad idea. You could miss important AMR genes due to assembly issues, or bad RBS/promoter. Best to scan directly against the contigs.

I use the special -culling_limit option to ensure only the "best hit" in any region is returned. This avoids getting 800 betalactamase hits all to the same part of the contig.
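To illustrate the idea behind `-culling_limit`, here is a minimal sketch that post-filters hits in memory: a hit is dropped when its query range is enveloped by the ranges of at least `limit` higher-scoring hits. This follows the wording of the option's help text but is not BLAST's actual implementation, and the `Hit` tuple, the subject names and the coordinates are made up for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject: str     # database sequence name (hypothetical)
    qstart: int      # start on the contig, 1-based inclusive
    qend: int        # end on the contig, inclusive
    bitscore: float


def cull(hits: List[Hit], limit: int = 1) -> List[Hit]:
    """Drop hits enveloped by at least `limit` higher-scoring hits."""
    kept = []
    for h in hits:
        enveloping = sum(
            1 for o in hits
            if o is not h
            and o.bitscore > h.bitscore
            and o.qstart <= h.qstart and o.qend >= h.qend
        )
        if enveloping < limit:
            kept.append(h)
    return kept


# Three near-identical alleles hitting the same contig region, plus one
# unrelated gene elsewhere on the contig.
hits = [
    Hit("bla_allele_1", 100, 900, 1500.0),
    Hit("bla_allele_2", 120, 880, 1200.0),
    Hit("bla_allele_3", 150, 850, 900.0),
    Hit("other_gene", 2000, 3100, 700.0),
]
print([h.subject for h in cull(hits)])  # ['bla_allele_1', 'other_gene']
```

With `limit=1` only the top-scoring hit per region survives; raising the limit lets more overlapping alleles through.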

The local alignment property means it works either way, long vs short or short vs long. If it were glocal (like glsearch36), then you would need to put the short sequence as the query.

If you already have ORFs, then you should translate them, and do BLASTP (protein : protein) against the DeepARG or CARD proteins.
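A minimal sketch of that route, assuming the ORFs are in-frame nucleotide sequences on the coding strand: translate each ORF with the standard codon table, then assemble a protein-vs-protein search. The function names, file names and database prefix are hypothetical, and the translator deliberately ignores details such as alternative start codons (GTG/TTG are not converted to M).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(orf: str) -> str:
    """Translate an in-frame ORF, stopping at the first stop codon."""
    orf = orf.upper().replace("U", "T")
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def blastp_commands(orf_proteins, arg_proteins, db_prefix="arg_prot_db",
                    evalue="1e-20", threads=4):
    """Return argv lists: index the ARG proteins, then search ORF proteins."""
    make_db = ["makeblastdb", "-in", arg_proteins, "-dbtype", "prot",
               "-out", db_prefix]
    search = ["blastp", "-query", orf_proteins, "-db", db_prefix,
              "-outfmt", "6", "-evalue", evalue, "-num_threads", str(threads)]
    return make_db, search


print(translate("ATGTTTAAATAA"))  # MFK
```

If the ORFs came from Prodigal, its `-a` option can write the protein translations directly, which makes the translation step unnecessary.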

Do you have contigs or genes/ORFs ?

Ash1One commented 4 years ago

Yes, I already have ORFs. I will run BLASTP against the DeepARG database as you advised. Thanks for your patient explanation. :smiley: @tseemann

tseemann commented 4 years ago

You are welcome. And good luck with your search :)

tseemann / abricate

blastx metagenome assembly aligned to custom database too slowly #113