Closed nyzhoulang closed 9 months ago
Hello, sorry for the confusion. During the Virus Identification process, we used blast homology matching to identify the viruses in this dataset that had taxid from the NR library. Since the eukaryotic viral sequences we collected were derived from the complete viral genomes in the NCBI refseq and genbank, there may be cases where some of the taxids correspond to viruses for which there is no complete fasta sequence available for extraction or not eukaryotic virus. If you still wish to run the second step of Virus Genome Assemble, you can enter the corresponding taxid in NCBI taxonomy, then jump to the nucleotide database, find the virus sequence, and put it into the . /Ref/sampleID.fa file. We will also continue to update the eukaryotic virus sequences to provide more comprehensive support.
In the meantime, we have made changes to bin/filter.py, so please re-update that file.
Regarding the ERR3253399 example, since it is a known viral reference genome, we directly use this reference genome for the second step of Virus Genome Assemble in order to compare it with other software of the same type, you can download it directly on Github at. /test/Ref/ERR3253399.fa and proceed to the second step of assembly. The example command is as follows: python /data/12T/fp/software/VIGA/bin/0_run2.py \ --clean_fastq_1 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_1.clean.fastq.gz \ --clean_fastq_2 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_2.clean.fastq.gz \ --virus_fasta /data/12T/fp/testVIGA/test/Ref/ERR3253399.fa \ --outdir /data/12T/fp/testVIGA/test/Genome \ --MetaCompass_dir /data/12T/fp/software/MetaCompass \ --threads 20
We hope the above note explains the situation more clearly, and if you have any questions, please contact us.
Thanks for the reply.
After updating bin/filter.py
, the problem has been resolved.
And I have the following additional questions:
1 .How to map virus IDs to NCBI IDs
Similar to "977912
refseqid is ['HM030970
, HM030971
']",
how do I find the NCBI ID corresponding to 10757, 2029672, 2079401,...
########################################################################
The identified virus id is as follows: ['10757', '2029672', '2079401', '2202644', '2507125', '2772519', '2825454', '2825709', '2826345', '2827858', '38018', '432261', '93678', '977912'] numbers of virus species: 14
######################################################################## 93678 refseqid is ['JX134044, JX134046, KF764701, KX810063, KX810064, KY856742, MK212031'] 93678 lineage: kViruses;p;c;o;fAnelloviridae;gBetatorquevirus;s__TTV-like mini virus
977912 refseqid is ['HM030970, HM030971'] 977912 lineage: kViruses;pKitrinoviricota;cMagsaviricetes;oNodamuvirales;fNodaviridae;g;s__Orsay virus
10757, 2029672, 2079401, 2202644, 2507125, 2772519, 2825454, 2825709, 2826345, 2827858, 38018, 432261 is not in our database, please download the sequence from NCBI and put it in the ./genome/virus/ directory and named as ERR843912.fa ########################################################################
In step 2 of the testing process, I encountered a bug in RagTag v2.1.0
(based Python 3.6.8 ), which I resolved by modifying the code in ragtag_agp2fa.py
based on this issue. However, I am concerned that such modifications may affect subsequent data processing. log.step2.ERR3253399.bug.txt
ragtag.scaffold.err:
python /data/12T/fp/software/VIGA/bin/0_run2.py --clean_fastq_1 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_1.clean.fastq.gz --clean_fastq_2 /data/12T/fp/testVIGA/test/Fastp/ERR3253399_2.clean.fastq.gz --virus_fasta /data/12T/fp/testVIGA/test/Ref/ERR3253399.fa --outdir /data/12T/fp/testVIGA/test/Genome --MetaCompass_dir /data/12T/fp/software/MetaCompass --threads 20
debug:
ragtag_agp2fa.py line71
sys.stdout.write(fai.fetch(agp_line.comp, agp_line.comp_beg-1, agp_line.comp_end))
change to:
sys.stdout.write(str(fai.fetch(agp_line.comp, agp_line.comp_beg-1, agp_line.comp_end)))
Although this resolved the error, the final output results( ./test/Genome/result) are inconsistent with the example
3.The test data ERR3253399
output in Step 2 is inconsistent with example(Github)
After resolving the RagTag bug, I successfully applied step 2.
The my output using ERR3253399 as a test do not match the recorded results in /VIGA-master/test/Genome/result, however, no issues were found in the
log.step2.ERR3253399.txt. And my output file does not contain result.fasta
my output: example output (in github):
python /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/bin/0_run2.py \
--clean_fastq_1 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_1.clean.fastq.gz \
--clean_fastq_2 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_2.clean.fastq.gz \
--virus_fasta /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Ref/ERR3253399.fa \
--outdir /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Genome.ERR3253399 \
--MetaCompass_dir /data3/Group7/zhoulang/FMT_GV/VIGA/MetaCompass-master \
--threads 64
Thanks, Lang
2&3. I checked your log file and found that the reference genome is used correctly, you can try exporting the results to a new folder maybe it will be different. I also retested the results and uploaded to github with the same results.
The command is modified as follows: python /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/bin/0_run2.py \ --clean_fastq_1 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_1.clean.fastq.gz \ --clean_fastq_2 /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Fastp/ERR3253399_2.clean.fastq.gz \ --virus_fasta /data3/Group7/zhoulang/FMT_GV/VIGA/VIGA-master/test/Ref/ERR3253399.fa \ --outdir /data3/Group7/zhoulang/FMT_GV/VIGA/test/New.ERR3253399 \ --MetaCompass_dir /data3/Group7/zhoulang/FMT_GV/VIGA/MetaCompass-master \ --threads 64
Thanks,
glad to receive your reply.
Following your advice, I successfully resolved Q1.
1 Regarding the problem that there is no corresponding sequence in some taxid, as we said before,
Abuout Q2&Q3, I fixed it by change the code in ragtag_agp2fa.py.
The problem lies in Ragtag
exporting sequences in an incorrect format, causing MetaQuast
to be unable to recognize them. However, MetaQuast's error message will not be displayed in the VIGA
log.
metaquast.log:
Version: 5.2.0
System information: OS: Linux-3.10.0-693.5.2.el7.x86_64-x86_64-with-centos-7.4.1708-Core (linux_64) Python version: 3.6.8 CPUs number: 112
Started: 2024-02-20 18:34:51
Logging to /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Genome.ERR3253399/metaquast/MK720628.1/metaquast.log NOTICE: Maximum number of threads is set to 28 (use --threads option to set it manually)
Reference(s): /data3/Group7/zhoulang/FMT_GV/VIGA/output_test/Genome.ERR3253399/ref/MK720628.1.fa ==> MK720628.1 All references were combined in combined_reference.fasta
Contigs: Pre-processing... WARNING: Skipping ragtag.scaffold.fasta1 because it doesn't contain contigs >= 0 bp.
ERROR! None of the assembly files contains correct contigs. Please, provide different files or decrease --min-contig threshold.
Perhaps it would be advisable to add a prompt in VIGA to locate MetaQuast's error message.
Thanks sincerely again for your answer, Lang
Hi,
VIGA is a highly anticipated tool.
I used
ERR843912
as a test. After completingStep 1: Virus Identification
, I found that the generatedoutput_test/Ref/ERR843912.fa
file was empty, which caused an error in Step 2.This is the command I used to run Step 1:
And, the
log
indicates that some viral strains need to be downloaded from NCBI and named ERR843912.fa, but I don't know which strains correspond to these IDs.This is the complete log file: log.test.ERR843912.txt
hope to receive your advice, thanks!
Lang