jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

errors when predict viruses #81

Closed mujiezhang closed 2 years ago

mujiezhang commented 3 years ago

Hi, I ran into a strange error when using VirSorter2. The log looks like this:

    [2021-07-16 10:29 INFO] # of seqs < 5000 bp and removed: 0
    [2021-07-16 10:29 INFO] # of circular seqs: 0
    [2021-07-16 10:29 INFO] # of linear seqs : 1
    [2021-07-16 10:29 INFO] No circular seqs found in contig file
    [2021-07-16 10:29 INFO] Finish spliting linear contig file with common rbs
    Fatal exception (source file p7_pipeline.c, line 697):
    Target sequence length > 100K, over comparison pipeline limit.
    (Did you mean to use nhmmer/nhmmscan?)
    /usr/bin/bash: line 34: 236456 Aborted hmmsearch -T 30 --tblout iter-0/all.pdg.faa.splitdir/all.pdg.faa.0.split.Mixed.splithmmtbl --cpu 1 --noali -o /dev/null $Hmmdb $Tmp/$Bname
    [Fri Jul 16 10:30:44 2021]
    Error in rule hmmsearch:
    jobid: 87
    output: iter-0/all.pdg.faa.splitdir/all.pdg.faa.0.split.Mixed.splithmmtbl
    conda-env: /lustre/home/acct-clsjhh/clsjhh/zmj/db/conda_envs/59c18b67
    shell:

    Domain=Mixed
    if [ $Domain = "Viruses" ]; then
        Hmmdb=/lustre/home/acct-clsjhh/clsjhh/zmj/db/hmm/viral/combined.hmm

...

What is causing this, and how can I fix the error? Thanks!

jiarong commented 3 years ago

The error shows that your input sequence is producing a protein sequence longer than 100K AA, which is quite unlikely to be real and should probably be discarded. You can take a look at the contig sequence and check whether there is anything strange.

mujiezhang commented 3 years ago

Thanks for your reply! But I wonder why a sequence that produces a protein longer than 100K AA should be discarded. The input sequence is actually the bacterial genome GCF_001499735.1, which was downloaded from RefSeq, and I get the same error with seven other bacterial genomes that were also downloaded from RefSeq. So I wonder why VirSorter2 simply quits rather than issuing a warning when it encounters such sequences.

I also have another question. The number of contigs in the final-viral-score.tsv file is slightly smaller than in the final-viral-boundary.tsv file. How can I explain that?

jiarong commented 3 years ago

The 100K AA limit is set by a dependency (HMMER), which I cannot control on the VirSorter2 side. It is possible to improve the error handling, though. A gene producing a 100K AA protein is unlikely to be real (either VirSorter2 cannot predict the genes well for these specific bacterial genomes, or there are some issues with the genome sequences).
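To check whether a predicted-protein FASTA file contains sequences over the limit reported in the log, something like the following minimal sketch could be used. It is not part of VirSorter2; the function names `read_fasta` and `find_overlong` are hypothetical, and the 100,000-residue threshold is taken from the hmmsearch message above ("Target sequence length > 100K").

```python
# Hypothetical helper, not part of VirSorter2 or HMMER.
# Threshold taken from the hmmsearch error message in this thread.
HMMER_MAX_TARGET_LEN = 100_000


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Use the first whitespace-delimited token as the record name.
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_overlong(handle, max_len=HMMER_MAX_TARGET_LEN):
    """Return (name, length) for every sequence longer than max_len."""
    return [
        (name, len(seq))
        for name, seq in read_fasta(handle)
        if len(seq) > max_len
    ]
```

Opening the protein file with `open(path)` and passing the handle to `find_overlong` would list the records to inspect; which intermediate file holds the predicted proteins depends on your run directory.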

final-viral-combined.fa and final-viral-score.tsv are the final results to look at.
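To see which entries account for the difference between the two tables, the sequence IDs can be compared directly. The sketch below is a generic TSV comparison, not VirSorter2 code: the helper names are hypothetical, and the default column name `seqname` is an assumption, so check the header line of your own files first. IDs may also be formatted differently in the two files, in which case they would need to be normalized before comparing.

```python
import csv


def read_column(path, column="seqname"):
    """Return the set of values in one column of a tab-separated file.

    The default column name is an assumption; pass the name that
    actually appears in the header of your file.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return {row[column] for row in reader}


def only_in_first(first_path, second_path, column="seqname"):
    """Return sorted IDs present in the first file but not the second."""
    first = read_column(first_path, column)
    second = read_column(second_path, column)
    return sorted(first - second)
```

Calling `only_in_first("final-viral-boundary.tsv", "final-viral-score.tsv")` would then list the IDs that appear only in the boundary table.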

mujiezhang commented 3 years ago

Thank you very much! That is helpful!