jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

Run stop #199

Closed ntromas closed 4 months ago

ntromas commented 6 months ago

Hi VS2 team,

I am running VS2 on a large file (contigs from metagenomes) and the run stops with the following error:

2/lib/python3.10/site-packages/virsorter/./scripts/provirus.py iter-0/dsDNAphage/all.pdg.gff.splitdir/all.pdg.gff.0.split iter-0/dsDNAphage/all.pdg.hmm.tax /mfs/nicot/virus/new_analysis/VIRSORTER_DRAM/vir_db/rbs/rbs-catetory.tsv /mfs/nicot/virus/new_analysis/VIRSORTER_DRAM/vir_db/group/dsDNAphage/model iter-0/dsDNAphage/all.pdg.gff.splitdir/all.pdg.gff.0.split.prv.bdy iter-0/dsDNAphage/all.pdg.gff.splitdir/all.pdg.gff.0.split.prv.ftr --fullseq-clf iter-0/all-fullseq-proba.tsv --group dsDNAphage --proba 0.5 2> $Log || { echo "See error details in $Log" | python /mfs/nicot/miniconda3/envs/vs2/lib/python3.10/site-packages/virsorter/./scripts/echo.py --level error; exit 1; } fi rm -f $Log

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[2024-05-17 02:47 ERROR] See error details in /mfs/nicot/virus/new_analysis/VIRSORTER_DRAM/VS2/log/iter-0/step3-classify/pick-viral-fullseq.log
[Fri May 17 02:47:04 2024]
Error in rule pick_viral_fullseq:
jobid: 38
output: iter-0/viral-fullseq.fa, iter-0/all-hallmark-cnt.tsv, iter-0/viral-lt2gene-w-hallmark.fa
conda-env: /mfs/nicot/virus/new_analysis/VIRSORTER_DRAM/vir_db/conda_envs/5631f754
shell:

    Log=/mfs/nicot/virus/new_analysis/VIRSORTER_DRAM/VS2/log/iter-0/step3-classify/pick-viral-fullseq.log
    python /mfs/nicot/miniconda3/envs/vs2/lib/python3.10/site-packages/virsorter/./scripts/pick-viral-contig-from-clf.py 0.5 iter-0/all-fullseq-proba.tsv iter-0/all.fna > iter-0/viral-fullseq.fa.tmp 2> $Log || { echo "See error details in $Log" | python /mfs/nicot/miniconda3/envs/vs2/lib/python3.10/site-packages/virsorter/./scripts/echo.py --level error; exit 1; }

This is the log information:

cat /mfs/nicot/virus/new_analysis/VIRSORTER_DRAM/VS2/log/iter-0/step3-classify/pick-viral-fullseq.log

Traceback (most recent call last):
  File "/mfs/nicot/miniconda3/envs/vs2/lib/python3.10/site-packages/virsorter/./scripts/add-extra-to-fullseq-fasta-header.py", line 114, in <module>
    main()
  File "/mfs/nicot/miniconda3/envs/vs2/lib/python3.10/site-packages/virsorter/./scripts/add-extra-to-fullseq-fasta-header.py", line 92, in main
    start_ind, end_ind, viral, cellular, hallmark = d_name2info[name]
KeyError: 'S1Ck141NC54963'

Is there a specific format required for the header?

Command and version:

VirSorter 2.2.4

/mfs/nicot/miniconda3/envs/vs2/bin/virsorter run -i ../virus_postCheckV_5000.nored_nodup.fa --include-groups dsDNAphage,dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae -j 50 --prep-for-dramv -d vir_db/ -w VS2

Cheers,

Nico

jiarong commented 5 months ago

Hi, sorry for the late reply. I missed this issue somehow. I cannot tell what the exact problem is from the information posted. It is generally not recommended to have punctuation other than underscores or dots in the original fasta headers. If you search for "S1Ck141NC54963" in the input fasta, you might be able to find out what in the fasta header is causing the issue.
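As a rough illustration of the header check suggested above, the sketch below flags FASTA headers containing anything other than letters, digits, underscores, or dots. The function name `find_suspect_headers` and the exact allowed character set are assumptions made for this example; they are not part of VirSorter2.

```python
import re

# Assumption: "safe" means letters, digits, underscore, and dot only,
# following the recommendation in the comment above.
SAFE_ID = re.compile(r"[A-Za-z0-9_.]+")


def find_suspect_headers(lines):
    """Yield (line_number, header) for FASTA headers with unexpected characters."""
    for lineno, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue
        # Strip only the newline so a stray carriage return or trailing
        # space stays in the header and gets flagged.
        header = line[1:].rstrip("\n")
        if not SAFE_ID.fullmatch(header):
            yield lineno, header


# Small in-memory demo; repr() makes invisible characters visible.
demo = [">S1Ck141NC54963\n", "GACC\n", ">bad id\n", "ACGT\n", ">crlf\r\n", "AC\n"]
for lineno, header in find_suspect_headers(demo):
    print(lineno, repr(header))
```

For a real file, one could pass an open handle instead of the list, for example `open("input.fa", newline="")` (a placeholder filename), so that carriage returns are not silently translated away before the check.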

ntromas commented 5 months ago

Hi,

Thanks for the answer. The fasta header consists only of S1Ck141NC54963, which is why I don't understand what the issue with the header name would be.

Thanks for the help!

Cheers

Nico

jiarong commented 5 months ago

There might be hidden characters around it, which can happen if the file became corrupted during big data processing. If this is the case, I would suggest splitting the big file into smaller pieces and running them separately.
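A minimal sketch of the splitting suggested above, assuming plain-text FASTA input and a round-robin distribution of records. The helper names `read_fasta` and `split_records` are hypothetical and not part of VirSorter2; dedicated tools can do the same job.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, n_parts):
    """Distribute records round-robin into n_parts lists."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return parts


# In-memory demo: three records split into two parts.
demo = [">a\n", "AC\n", "GT\n", ">b\n", "TT\n", ">c\n", "GG\n"]
for part in split_records(read_fasta(demo), 2):
    print(part)
```

Each part can then be written to its own file and run separately. Round-robin keeps the parts roughly equal in record count, though not necessarily in total sequence length.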

ntromas commented 5 months ago

Hi,

I did it and got the same issue with smaller files.

Split input:

nicot@SuperPhelix5000:~/virus/new_analysis/splitted_virus_fasta$ ls
out_0.fasta out_1.fasta out_2.fasta out_3.fasta out_4.fasta out_5.fasta out_6.fasta out_7.fasta out_8.fasta out_9.fasta

Checking the header that causes the issue:

nicot@SuperPhelix5000:~/virus/new_analysis/splitted_virus_fasta$ grep -A 1 "S1Ck141NC54963" out_0.fasta
S1Ck141NC54963
GACCTATTGATTTTGTGACAAGGCGCAAAGCATCAAATTCGTTCATGGGCTTGCGTTCTAACTCTGCCAG

nicot@SuperPhelix5000:~/virus/new_analysis/splitted_virus_fasta$ grep -e "S1Ck141NC54963" out_0.fasta
S1Ck141NC54963

I don't see any special characters, or maybe it is a space. I can send you an example of the input.

Cheers,

Nico

jiarong commented 5 months ago

So the other files ran successfully, right? To see special characters, you need to open the file in a text editor. If the file is too big, you can run

grep -A 1 -B 2 "S1Ck141NC54963" out_0.fasta > tmp.fasta

Then open tmp.fasta in a text editor.

ntromas commented 5 months ago

Unfortunately, every file fails with a similar error, each time for a different header. I have attached two examples of headers that triggered it.

tmp.txt tmp2.txt

jiarong commented 5 months ago

I did not find any hidden special characters in the attached files, but there must be some issue in your input file. How big is out_0.fasta? Can you send me the smallest of the files that failed the run?

ntromas commented 5 months ago

100.tar.gz

Sure, this one failed.

Cheers,

jiarong commented 5 months ago

I just had time to take a look. It turns out the duplicate group in your command --include-groups dsDNAphage,dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae is causing the issue. If you remove the duplicated dsDNAphage, it should run successfully.
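To guard against this kind of slip, the group list can be deduplicated before it is passed to `--include-groups`. The sketch below is only an illustration; `dedupe_groups` is a hypothetical helper, not something VirSorter2 provides.

```python
def dedupe_groups(value):
    """Remove repeated names from a comma-separated list, keeping first-seen order."""
    groups = [g.strip() for g in value.split(",") if g.strip()]
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return ",".join(dict.fromkeys(groups))


print(dedupe_groups("dsDNAphage,dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae"))
# prints: dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae
```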

ntromas commented 5 months ago

Huh... I focused on the error message and did not look closely enough at the command. Feeling a bit stupid now :) Thanks!