jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0
210 stars 28 forks source link

Need help in understanding the output file #195

Closed sahilrishav2 closed 3 months ago

sahilrishav2 commented 5 months ago

Hi,

I am using metatranscriptome data to identify bacteriophages using the vs2 tool. The problem is that I am unable to understand how to know, which virus is identified from the viral contigs. Also, which viral genes have been identified? This is the output file:

head final-viral-score.tsv seqname dsDNAphage RNA max_score max_score_group length hallmark viral cellular k141_21416||full 0.567 0.825 0.825 RNA 2115 0 33.300 0.000 k141_18152||full 0.767 0.720 0.767 dsDNAphage 774 0 33.300 0.000 k141_29671||full 0.513 0.395 0.513 dsDNAphage 322 0 50.000 0.000 k141_3518||full 0.727 0.445 0.727 dsDNAphage 479 0 50.000 0.000 k141_15115||full 0.920 0.690 0.920 dsDNAphage 383 0 50.000 0.000 k141_16734||full 0.560 0.950 0.950 RNA 3788 1 66.700 0.000 k141_31654||full 0.640 0.200 0.640 dsDNAphage 571 0 50.000 0.000 k141_30002||full 0.827 0.235 0.827 dsDNAphage 412 0 50.000 0.000 k141_13663||full 0.527 0.090 0.527 dsDNAphage 652 0 33.300 0.000

Please help me regarding this concern, thank you

jiarong commented 5 months ago

Hi, VS2 is design to tell if a sequence is virus or non-virus, not optimized to assigning specific virus groups which is a classification task. But, the "hallmark" column (>0) is strong signal that the assigned virus group in "max-score-group" is correct. In your example, you have a high confidence RNA virus:

k141_16734||full 0.560 0.950 0.950 RNA 3788 1 66.700 0.000

For more detailed viral gene annotation, you can take a look at DRAM-v. You can see an example in this SOP.

sahilrishav2 commented 5 months ago

Thank you so much, sir. I check this and come again if any further assistance is required.

sahilrishav2 commented 4 months ago

Hi, I want to ask if there is a way to speed up the processing. I am running vs2 using 40 threads, still, it is taking too much time

jiarong commented 4 months ago

Other than -j, you could split up your data and run them in parallel.

sahilrishav2 commented 4 months ago

ok, thank you for your response

sahilrishav2 commented 4 months ago

Hello, As per your suggestions, I am following the SOP tutorial for the annotation process but it gave an error on one of the files while the analysis on the other files is completed. This is the error:

virsorter run --seqname-suffix-off --viral-gene-enrich-off --provirus-off --prep-for-dramv -i SRR2083773.out/final-viral-combined_output_directory/combined.fna -w SRR2083773.out/final-viral-combined_output_directory/vs2-pass2 --include-groups dsDNAphage,RNA --min-score 0.5 -j 20 all

sed: can't read for-dramv/final-viral-combined-for-dramv.fa: No such file or directory
[Tue Apr 23 10:03:19 2024]
Error in rule finalize:
    jobid: 7
    output: final-viral-combined.fa, final-viral-score.tsv
    conda-env: /jbod_new/home4_jbod/rishav1/VirSorter2/virsorter_db/conda_envs/72fd9e48
    shell:

            echo iter-0/*/all.pdg.gff.splitdir/all.pdg.gff.*.split | xargs rm -f
            python /jbod_new/home4_jbod/rishav1/VirSorter2/virsorter/./scripts/filter-score-table.py config.yaml iter-0/viral-combined-proba-more-cols.tsv iter-0/viral-combined.fa final-viral-score.tsv final-viral-combined.fa

            if [ True = "True" ]; then
                mkdir -p for-dramv
                python /jbod_new/home4_jbod/rishav1/VirSorter2/virsorter/./scripts/modify-seqname-for-dramv.py final-viral-combined.fa final-viral-score.tsv -o for-dramv/final-viral-combined-for-dramv.fa
                cp iter-0/viral-affi-contigs-for-dramv.tab for-dramv
            fi

            N_viral_fullseq=$(grep -c '^>.*||full$' final-viral-combined.fa || :)
            N_viral_lt2gene=$(grep -c '^>.*||lt2gene$' final-viral-combined.fa || :)
            if [ True = True ]; then
                Dramv_notes="for-dramv                  ==> dir with input files for dramv"
                Dramv_notes2="For seqnames in files for dramv, 
                    | is replaced with _ to be compatible with DRAMv"
            else
                Dramv_notes=""
                Dramv_notes2=""
            fi
            if [ True = True ]; then
                sed -i -E 's/(\|\|full([[:space:]]+)|\|\|[0-9]+_partial([[:space:]]+)|\|\|lt2gene([[:space:]]+))/\2\3\4/;' final-viral-score.tsv
                sed -i -E 's/(\|\|full$|\|\|[0-9]+_partial$|\|\|lt2gene$)//;' final-viral-combined.fa
                if [ True = True ]; then
                    sed -i -E 's/(__full(\|[0-9]+\|(c|l)$)|__[0-9]+_partial(\|[0-9]+\|(c|l)$)|__lt2gene(\|[0-9]+\|(c|l)$))/\2\4\6/;'  for-dramv/viral-affi-contigs-for-dramv.tab
                    sed -i -E 's/(__full(__[0-9]+\|)|__[0-9]+_partial(__[0-9]+\|)|__lt2gene(__[0-9]+\|))/\2\3\4/;' for-dramv/viral-affi-contigs-for-dramv.tab
                    sed -i -E 's/(__full(-cat_[1-6]$)|__[0-9]+_partial(-cat_[1-6]$)|__lt2gene(-cat_[1-6]$))/\2\3\4/;' for-dramv/final-viral-combined-for-dramv.fa 
                fi
                Suffix_notes=""
            else
                Suffix_notes="
                Suffix is added to seq names in final-viral-combined.fa:
                contigs (>=2 genes) as viral:   ||full
                contigs (< 2 genes) as viral:   ||lt2gene
                $Dramv_notes2
                "
            fi

            printf "
            ====> VirSorter run (non-provirus mode) finished.
            # of contigs w/ >=2 genes as viral: $N_viral_fullseq
            # of contigs w/ < 2 genes as viral: $N_viral_lt2gene

            Useful output files:
            final-viral-score.tsv      ==> score table
            final-viral-combined.fa    ==> all viral seqs
            $Dramv_notes
            $Suffix_notes

            NOTES: 
            Users can further screen the results based on the 
                following columns in final-viral-score.tsv
                - contig length (length) 
                - hallmark gene count (hallmark)
                - viral gene %% (viral) 
                - cellular gene %% (cellular)
            The "group" field in final-viral-score.tsv should NOT be used
                as reliable taxonomy info
            We recommend this SOP/tutorial for quality control 
                (make sure to use the lastest version):
                https://dx.doi.org/10.17504/protocols.io.bwm5pc86

            <====
            " | python /jbod_new/home4_jbod/rishav1/VirSorter2/virsorter/./scripts/echo.py

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Exiting because a job execution failed. Look above for error message

Could you please help me with this concern ? Thank you

jiarong commented 4 months ago

There might be not viral seqs found. You can take a look at the following two files in output directory to check.

final-viral-combined.fa
final-viral-score.tsv
sahilrishav2 commented 4 months ago

Ok Sir, thank you so much