jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

issue with ValueError when running virsorter #57

Open lizzymomo opened 3 years ago

lizzymomo commented 3 years ago

Dear jiarong, I'm using VirSorter 2.2.1 and am stuck on an error when running virsorter. The error occurs after "Step 2 - extract-feature" has finished. The Linux command is:

    virsorter run -w vs2_out -i contigs.fa --min-score 0.5 -j 20 all

Here is the error message:

[2021-04-14 12:25 INFO] # of seqs < 0 bp and removed: 0
[2021-04-14 12:25 INFO] # of circular seqs: 82
[2021-04-14 12:25 INFO] # of linear seqs : 199918
[2021-04-14 12:25 INFO] Finish spliting circular contig file with common rbs
[2021-04-14 12:25 INFO] Finish spliting linear contig file with common rbs
[2021-04-14 12:28 INFO] Step 1 - preprocess finished.
[2021-04-14 15:18 INFO] Step 2 - extract-feature finished.
Traceback (most recent call last):
  File "/lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/virsorter/./scripts/add-extra-to-table.py", line 72, in <module>
    main()
  File "/lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/virsorter/./scripts/add-extra-to-table.py", line 64, in main
    if max_score_group_ser.loc[i] != df_info.loc[i, 'max_score_group']:
  File "/lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/db/conda_envs/e6a1828d/lib/python3.8/site-packages/pandas/core/generic.py", line 1478, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
[Wed Apr 14 15:53:51 2021]
Error in rule finalize:
    jobid: 7
    output: final-viral-score.tsv, final-viral-combined.fa, final-viral-boundary.tsv
    conda-env: /lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/db/conda_envs/e6a1828d
    shell:

        echo iter-0/*/all.pdg.gff.splitdir/all.pdg.gff.*.split | xargs rm -f
        python /lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/virsorter/./scripts/add-extra-to-table.py iter-0/viral-combined-proba.tsv iter-0/viral-combined.fa iter-0/viral-combined-proba-more-cols.tsv
        python /lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/virsorter/./scripts/filter-score-table.py config.yaml iter-0/viral-combined-proba-more-cols.tsv iter-0/viral-combined.fa final-viral-score.tsv final-viral-combined.fa.trim
        python /lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/virsorter/./scripts/keep-original-seq.py final-viral-combined.fa.trim /lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/1kbp_all_testSet_vs2.fa > final-viral-combined.fa.original
        cp iter-0/viral-fullseq.tsv final-viral-boundary.tsv
        tail -n +2 iter-0/viral-partseq.tsv >> final-viral-boundary.tsv

        if [ False = "True" ]; then
            cp final-viral-combined.fa.original final-viral-combined.fa
        else
            cp final-viral-combined.fa.trim final-viral-combined.fa
        fi

        if [ False = "True" ]; then
            mkdir -p for-dramv
            python /lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/virsorter/./scripts/modify-seqname-for-dramv.py final-viral-combined.fa.original final-viral-score.tsv -o for-dramv/final-viral-combined-for-dramv.fa
            cp iter-0/viral-affi-contigs-for-dramv.tab for-dramv
        fi
        rm -f final-viral-combined.fa.trim final-viral-combined.fa.original

        N_lt2gene=$(grep -c '^>.*||lt2gene$' final-viral-combined.fa || :)
        N_lytic=$(grep -c '^>.*||full$' final-viral-combined.fa || :)
        N_lysogenic=$(grep -c '^>.*||.*_partial$' final-viral-combined.fa || :)
        if [ False = True ]; then
            Dramv_notes="for-dramv                   ==> dir with input files for dramv"
            Dramv_notes2="For seqnames in files for dramv,
                | is replaced with _ to be compatible with DRAMv
            "
        else
            Dramv_notes=""
            Dramv_notes2=""
        fi

        if [ False = True ]; then
            sed -i -E 's/\|\|full([[:space:]]+)/\1/; s/\|\|[0-9]+_partial([[:space:]]+)/\1/; s/\|\|lt2gene([[:space:]]+)/\1/;' final-viral-score.tsv
            sed -i -E 's/\|\|full$//; s/\|\|[0-9]+_partial$//; s/\|\|lt2gene$//;' final-viral-combined.fa final-viral-boundary.tsv
            if [ False = True ]; then
                sed -i -E 's/__full(\|[0-9]+\|(c|l)$)/\1/; s/__[0-9]+_partial(\|[0-9]+\|(c|l)$)/\1/; s/\|\|lt2gene(\|[0-9]+\|(c|l)$)/\1/;  s/__full(__[0-9]+\|)/\1/; s/__[0-9]+_partial(__[0-9]+\|)/\1/; s/__lt2gene(__[0-9]+\|)/\1/;' for-dramv/viral-affi-contigs-for-dramv.tab
                sed -i -E 's/__full(-cat_[1-6]$)/\1/; s/__[0-9]+_partial(-cat_[1-6][[:space:]]+)/\1/; s/\|\|lt2gene(-cat_[1-6][[:space:]]+)/\1/;' for-dramv/final-viral-combined-for-dramv.fa
            fi
            Suffix_notes=""
        else
            Suffix_notes="
            Suffix is added to seq names in final-viral-combined.fa:
            full    seqs (>=2 genes) as viral:      ||full
            partial seqs (>=2 genes) as viral:      ||partial
            short   seqs (< 2 genes) as viral:      ||lt2gene
            $Dramv_notes2
            "
        fi

        printf "
        ====> VirSorter run (provirus mode) finished.
        # of full    seqs (>=2 genes) as viral:     $N_lytic
        # of partial seqs (>=2 genes) as viral:     $N_lysogenic
        # of short   seqs (< 2 genes) as viral:     $N_lt2gene

        Useful output files:
            final-viral-score.tsv       ==> score table
            final-viral-combined.fa     ==> all viral seqs
            final-viral-boundary.tsv    ==> table with boundary info
            $Dramv_notes
        $Suffix_notes
        NOTES:
        Users can further screen the results based on the following
            columns in final-viral-score.tsv:
            - contig length (length)
            - hallmark gene count (hallmark)
            - viral gene %% (viral)
            - cellular gene %% (cellular)
        The "group" field in final-viral-score.tsv should NOT be used
            as reliable taxonomy info

        <====
        " | python /lustre1/hqzhu_pkuhpc/mli/6_proj/comparison/vs2/VirSorter2/virsorter/./scripts/echo.py

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Exiting because a job execution failed. Look above for error message

What might be the problem? Thank you for your help!

jiarong commented 3 years ago

It's highly likely that there are sequences with the same name in your input contig file. Can you double check?
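
That would fit the traceback: when an index label is duplicated, pandas `.loc[i]` returns a Series rather than a single value, so the `!=` comparison in `add-extra-to-table.py` likely produces a Series as well, and using it in an `if` raises the ValueError. Below is a minimal sketch for checking a FASTA file for repeated names. It assumes the sequence name is the first whitespace-delimited token of each header line; `find_duplicate_names` is a hypothetical helper, not part of VirSorter2.

```python
import sys
from collections import Counter


def find_duplicate_names(fasta_lines):
    """Return {name: count} for sequence names that appear more than once."""
    counts = Counter(
        line[1:].split()[0]  # name = first token after '>'
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    )
    return {name: n for name, n in counts.items() if n > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (script name is a placeholder): python find_dup_names.py contigs.fa
    with open(sys.argv[1]) as handle:
        for name, n in sorted(find_duplicate_names(handle).items()):
            print(f"{name}\t{n}")
```

If the script prints nothing, no header name is repeated under that assumption.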

lizzymomo commented 3 years ago

Indeed. It now runs successfully after I removed the duplicated sequence names. Thank you for your reply!
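
The thread does not say how the duplicated names were removed. One possible approach, sketched here under the same assumption that the name is the first token of the header, is to rename repeats rather than drop them, so that no sequence is lost. `make_names_unique` and the `_dup<N>` suffix are hypothetical choices, not something VirSorter2 provides.

```python
from collections import Counter


def make_names_unique(fasta_lines):
    """Yield FASTA lines, appending _dup<N> to repeated sequence names.

    The first occurrence of a name is left untouched. This sketch does not
    check whether a generated name collides with a name already in the file.
    """
    seen = Counter()
    for line in fasta_lines:
        parts = line[1:].rstrip("\r\n").split(None, 1) if line.startswith(">") else []
        if not parts:
            yield line  # sequence line (or an empty header): pass through
            continue
        name = parts[0]
        seen[name] += 1
        if seen[name] == 1:
            yield line  # first time this name appears
            continue
        desc = f" {parts[1]}" if len(parts) > 1 else ""
        yield f">{name}_dup{seen[name] - 1}{desc}\n"


# Example (file names are placeholders):
# with open("contigs.fa") as src, open("contigs.unique.fa", "w") as dst:
#     dst.writelines(make_names_unique(src))
```

The cleaned file can then be given to `virsorter run` through `-i` in place of the original contig file.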