viralInformatics / VIGA

13 stars 1 forks source link

The Ref/*.fa output generated from Step 1 is empty #3

Closed nyzhoulang closed 9 months ago

nyzhoulang commented 9 months ago

Hi,

VIGA is a highly anticipated tool.

I used ERR843912 as a test. After completing Step 1: Virus Identification, I found that the generated output_test/Ref/ERR843912.fa file was empty, which caused an error in Step 2.

image

This is the command I used to run Step 1:

python /data3/Group7/zhoulang/GV/VIGA/VIGA-master/bin/0_run1_paired.py \
--fastq_1 /data3/Group7/zhoulang/GV/IBD_Virome/fastq_output/ERR843912_1.fastq.gz  \
--fastq_2  /data3/Group7/zhoulang/GV/IBD_Virome/fastq_output/ERR843912_2.fastq.gz \
--outdir  /data3/Group7/zhoulang/GV/VIGA/output_test \
--Diamond_VirusProtein_db /data3/Group7/zhoulang/GV/VIGA/db/Diamond_RefSeqVirusProtein.dmnd \
--Diamond_nr_db /data3/Group7/zhoulang/GV/VIGA/db/Dimond_nr.dmnd \
--evalue 1e-5  \
--threads 64

And, the log indicates that some viral strains need to be downloaded from NCBI and named ERR843912.fa, but I don't know which strains correspond to these IDs.

########################################################################

The identified virus id is as follows: ['10757', '2029672', '2079401', '2202644', '2507125', '2772519', '2825454', '2825709', '2826345', '2827858', '38018', '432261', '93678', '977912'] numbers of virus species: 14

######################################################################## 93678 refseqid is ['JX134044, JX134046, KF764701, KX810063, KX810064, KY856742, MK212031'] 93678 lineage: kViruses;p;c;o;fAnelloviridae;gBetatorquevirus;s__TTV-like mini virus

977912 refseqid is ['HM030970, HM030971'] 977912 lineage: kViruses;pKitrinoviricota;cMagsaviricetes;oNodamuvirales;fNodaviridae;g;s__Orsay virus

10757, 2029672, 2079401, 2202644, 2507125, 2772519, 2825454, 2825709, 2826345, 2827858, 38018, 432261 is not in our database, please download the sequence from NCBI and put it in the ./genome/virus/ directory and named as ERR843912.fa ######################################################################## JX134044 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## JX134046 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## KF764701 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## KX810063 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## KX810064 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## KY856742 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## MK212031 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## HM030970 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory ######################################################################## HM030971 get fasta to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Ref/ERR843912 directory [INFO] 0 duplicated records removed [INFO] 0 duplicated records removed [bwa_index] Pack FASTA... 0.00 sec [bwa_index] Construct BWT for the packed sequence... [E::bwa_idx_load_from_disk] fail to locate the index files [bwa_index] Pack FASTA... 0.00 sec [bwa_index] Construct BWT for the packed sequence... [E::bwa_idx_load_from_disk] fail to locate the index files [bwa_index] Pack FASTA... 0.00 sec [bwa_index] Construct BWT for the packed sequence... [E::bwa_idx_load_from_disk] fail to locate the index files {'2065056': ['TRINITY_DN1124_c0_g1_i1', '|gGammatorquevirus', '|fAnelloviridae'], '2163636': ['TRINITY_DN1416_c0_g1_i1', '|gCarpasinavirus', '|f'], '2202644': ['TRINITY_DN1484_c0_g1_i1', 'TRINITY_DN217_c0_g1_i1', 'TRINITY_DN602_c0_g1_i1', 'TRINITY_DN689_c0_g1_i1', '|g', '|fMicroviridae'], '2809423': ['TRINITY_DN1378_c0_g1_i1', '|gGammatorquevirus', '|fAnelloviridae'], '2825789': ['TRINITY_DN506_c0_g1_i1', '|g', '|f'], '2827822': ['TRINITY_DN1053_c0_g1_i1', '|g', '|f'], '2828174': ['TRINITY_DN96_c0_g1_i1', '|g', '|f'], '2886034': ['TRINITY_DN1453_c0_g1_i1', '|gBeograduvirus', '|f'], '38018': ['TRINITY_DN116_c0_g1_i1', 'TRINITY_DN1247_c0_g1_i1', 'TRINITY_DN1293_c0_g1_i1', 'TRINITY_DN1314_c0_g1_i1', 'TRINITY_DN1409_c0_g1_i1', 'TRINITY_DN1437_c0_g1_i1', 'TRINITY_DN1455_c0_g1_i1', 'TRINITY_DN1460_c0_g1_i1', 'TRINITY_DN255_c0_g1_i1', 'TRINITY_DN342_c0_g1_i1', 'TRINITY_DN39_c0_g19_i1', 'TRINITY_DN39_c0_g6_i1', 'TRINITY_DN578_c0_g1_i1', 'TRINITY_DN581_c0_g1_i1', 'TRINITY_DN688_c0_g1_i1', 'TRINITY_DN750_c0_g1_i1', 'TRINITY_DN893_c0_g1_i1', 'TRINITY_DN906_c0_g1_i1', 'TRINITY_DN939_c0_g1_i1', 'TRINITY_DN945_c0_g1_i1', 'TRINITY_DN960_c0_g1_i1', '|g', '|f'], '93678': ['TRINITY_DN1122_c0_g1_i1', '|gBetatorquevirus', '|fAnelloviridae']} ['gGammatorquevirus', 'g', 'gBetatorquevirus', 'gBeograduvirus', 'g__Carpasinavirus'] totalreads contigsumlen avglen 2104474.0 0 3080.56 totalreads contigsumlen avglen 2104474.0 0 12632.93 totalreads contigsumlen avglen 2104474.0 0 2881.69 *****fastp begins : Mon Feb 19 16:21:08 2024

*****Trinity denovo assembly begins : Mon Feb 19 16:21:28 2024

*****diamond blastx against virus protein database begins : Mon Feb 19 16:21:31 2024

*****diamond blastx agaisnt nr begins : Mon Feb 19 16:21:34 2024

*****Novol virus discovery : Mon Feb 19 16:41:59 2024

*****Virus Identification and taxonomy finished Mon Feb 19 16:44:10 2024

This is the complete log file: log.test.ERR843912.txt

hope to receive your advice, thanks!

Lang

viralInformatics commented 9 months ago

Hello, sorry for the confusion. During the Virus Identification process, we used blast homology matching to identify the viruses in this dataset that had taxid from the NR library. Since the eukaryotic viral sequences we collected were derived from the complete viral genomes in the NCBI refseq and genbank, there may be cases where some of the taxids correspond to viruses for which there is no complete fasta sequence available for extraction or not eukaryotic virus. If you still wish to run the second step of Virus Genome Assemble, you can enter the corresponding taxid in NCBI taxonomy, then jump to the nucleotide database, find the virus sequence, and put it into the . /Ref/sampleID.fa file. We will also continue to update the eukaryotic virus sequences to provide more comprehensive support.

In the meantime, we have made changes to bin/filter.py, so please re-update that file.

Regarding the ERR3253399 example, since it is a known viral reference genome, we directly use this reference genome for the second step of Virus Genome Assemble in order to compare it with other software of the same type, you can download it directly on Github at. /test/Ref/ERR3253399.fa and proceed to the second step of assembly. The example command is as follows: python /data/12T/fp/software/VIGA/bin/0_run2.py \ --clean_fastq_1 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_1.clean.fastq.gz \ --clean_fastq_2 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_2.clean.fastq.gz \ --virus_fasta /data/12T/fp/testVIGA/test/Ref/ERR3253399.fa \ --outdir /data/12T/fp/testVIGA/test/Genome \ --MetaCompass_dir /data/12T/fp/software/MetaCompass \ --threads 20

We hope the above note explains the situation more clearly, and if you have any questions, please contact us.

nyzhoulang commented 9 months ago

Thanks for the reply.

After updating bin/filter.py, the problem has been resolved. And I have the following additional questions:

1 .How to map virus IDs to NCBI IDs

Similar to "977912 refseqid is ['HM030970, HM030971']", how do I find the NCBI ID corresponding to 10757, 2029672, 2079401,...

########################################################################

The identified virus id is as follows: ['10757', '2029672', '2079401', '2202644', '2507125', '2772519', '2825454', '2825709', '2826345', '2827858', '38018', '432261', '93678', '977912'] numbers of virus species: 14

######################################################################## 93678 refseqid is ['JX134044, JX134046, KF764701, KX810063, KX810064, KY856742, MK212031'] 93678 lineage: kViruses;p;c;o;fAnelloviridae;gBetatorquevirus;s__TTV-like mini virus

977912 refseqid is ['HM030970, HM030971'] 977912 lineage: kViruses;pKitrinoviricota;cMagsaviricetes;oNodamuvirales;fNodaviridae;g;s__Orsay virus

10757, 2029672, 2079401, 2202644, 2507125, 2772519, 2825454, 2825709, 2826345, 2827858, 38018, 432261 is not in our database, please download the sequence from NCBI and put it in the ./genome/virus/ directory and named as ERR843912.fa ########################################################################

  1. Bug in RagTag

In step 2 of the testing process, I encountered a bug in RagTag v2.1.0(based Python 3.6.8 ), which I resolved by modifying the code in ragtag_agp2fa.py based on this issue. However, I am concerned that such modifications may affect subsequent data processing. log.step2.ERR3253399.bug.txt ragtag.scaffold.err: image

python /data/12T/fp/software/VIGA/bin/0_run2.py --clean_fastq_1 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_1.clean.fastq.gz --clean_fastq_2 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_2.clean.fastq.gz --virus_fasta /data/12T/fp/testVIGA/test/Ref/ERR3253399.fa --outdir /data/12T/fp/testVIGA/test/Genome --MetaCompass_dir /data/12T/fp/software/MetaCompass --threads 20

debug: ragtag_agp2fa.py line71 sys.stdout.write(fai.fetch(agp_line.comp, agp_line.comp_beg-1, agp_line.comp_end)) change to: sys.stdout.write(str(fai.fetch(agp_line.comp, agp_line.comp_beg-1, agp_line.comp_end)))

Although this resolved the error, the final output results( ./test/Genome/result) are inconsistent with the example

3.The test data ERR3253399 output in Step 2 is inconsistent with example(Github)

After resolving the RagTag bug, I successfully applied step 2.

The my output using ERR3253399 as a test do not match the recorded results in /VIGA-master/test/Genome/result, however, no issues were found in the log.step2.ERR3253399.txt. And my output file does not contain result.fasta

my output: image example output (in github): image

python /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/bin/0_run2.py \
--clean_fastq_1 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_1.clean.fastq.gz \
--clean_fastq_2 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_2.clean.fastq.gz \
--virus_fasta /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Ref/ERR3253399.fa \
--outdir /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Genome.ERR3253399 \
--MetaCompass_dir /data3/Group7/zhoulang/FMT_GV/VIGA/MetaCompass-master \
--threads 64

Thanks, Lang

viralInformatics commented 9 months ago
  1. Regarding the problem that there is no corresponding sequence in some taxid, as we said before, our library only contains complete eukaryotic viral sequences, if you have a more accurate reference genome, you can first search for the taxid in the NCBI Taxonomy database, and click on the red box in the picture below to jump to the Nucleotide database to get the nucleic acid sequences of the viruses, e.g. taxid:10757 for phage. 1708430386609

2&3. I checked your log file and found that the reference genome is used correctly, you can try exporting the results to a new folder maybe it will be different. I also retested the results and uploaded to github with the same results. 1708432382917

The command is modified as follows: python /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/bin/0_run2.py \ --clean_fastq_1 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_1.clean.fastq.gz \ --clean_fastq_2 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_2.clean.fastq.gz \ --virus_fasta /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Ref/ERR3253399.fa \ --outdir /data3/Group7/zhoulang/FMT_GV/VIGA/test/New.ERR3253399 \ --MetaCompass_dir /data3/Group7/zhoulang/FMT_GV/VIGA/MetaCompass-master \ --threads 64

nyzhoulang commented 9 months ago

Thanks,
glad to receive your reply.

Following your advice, I successfully resolved Q1.

1 Regarding the problem that there is no corresponding sequence in some taxid, as we said before,

Abuout Q2&Q3, I fixed it by change the code in ragtag_agp2fa.py.

The problem lies in Ragtag exporting sequences in an incorrect format, causing MetaQuast to be unable to recognize them. However, MetaQuast's error message will not be displayed in the VIGA log.

metaquast.log:

Version: 5.2.0

System information: OS: Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core (linux_64) Python version: 3.6.8 CPUs number: 112

Started: 2024-02-20 18:34:51

Logging to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Genome.ERR3253399/metaquast/MK720628.1/metaquast.log NOTICE: Maximum number of threads is set to 28 (use --threads option to set it manually)

Reference(s): /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Genome.ERR3253399/ref/MK720628.1.fa ==> MK720628.1 All references were combined in combined_reference.fasta

Contigs: Pre-processing... WARNING: Skipping ragtag.scaffold.fasta1 because it doesn't contain contigs >= 0 bp.

ERROR! None of the assembly files contains correct contigs. Please, provide different files or decrease --min-contig threshold.

Perhaps it would be advisable to add a prompt in VIGA to locate MetaQuast's error message.

Thanks sincerely again for your answer, Lang