metagentools / VStrains

VStrains is a de novo approach for reconstructing strains from viral quasispecies.
MIT License

Fail to assembly de-novo genome #2

Open Dv1t opened 1 week ago

Dv1t commented 1 week ago

Hello VStrains team, thank you for developing such a great tool. While using it, I ran into the following problem. I attempted to assemble the complete HIV genome from this sample: SRR29407826. I used corona-spades and it worked fine. However, VStrains crashes when assembling it.

First, there was an error related to rev_dict in VStrains_PE_Inference.py. The dictionary had no lowercase nucleotides in it and therefore raised a KeyError. I fixed it by replacing:

    rev_dict = {"A": "T", "T": "A", "C": "G", "G": "C"}

with this:

    rev_dict = { "A": "T", "T": "A", "C": "G", "G": "C", "a": "t", "t": "a", "c": "g", "g": "c" }

But a new issue occurred after these messages in the CLI log:

----------------------Paired-End Information Alignment----------------------
Start aligning reads to gfa nodes
Number of processed reads: 0

It then hangs indefinitely and does not proceed any further.
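For the rev_dict change described above, a case-preserving reverse complement can also be written with a translation table, which avoids a KeyError on characters that are not in the mapping (such as N). This is only a sketch: `reverse_complement` is a hypothetical helper, not a function from VStrains, and it is not meant to reflect how VStrains_PE_Inference.py is actually structured.

```python
# Sketch only: reverse_complement is a hypothetical helper, not part of VStrains.
# Upper- and lowercase bases are complemented; any other character
# (for example N) is left unchanged instead of raising KeyError.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of seq, preserving letter case."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Unlike a dictionary lookup per base, `str.translate` passes unmapped characters through, so mixed-case or ambiguous bases in the input do not crash the run.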

Worth mentioning details

In the same log there is a suspicious message:

INFO - graph kmer size: 0

Also, VStrains can't read the assembly_graph_after_simplification.gfa file (which is the output of SPAdes) without manually changing the version in its header from 1.2 to 1.0.
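The manual header change can be scripted. The sketch below assumes the version is stored in the `VN:Z:` tag of the GFA `H` (header) line, as the GFA 1 format defines it; `set_gfa_version` is a hypothetical helper name. It only rewrites the version label and does not convert any records, so it is a workaround for parsing, not a real format conversion.

```python
import re


def set_gfa_version(src: str, dst: str, version: str = "1.0") -> None:
    """Copy a GFA file from src to dst, rewriting VN:Z:<x> on header lines.

    Hypothetical helper: only the version tag on H lines is changed;
    every other line is copied unchanged.
    """
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith("H\t"):
                line = re.sub(r"VN:Z:\S+", f"VN:Z:{version}", line)
            fout.write(line)
```

Writing to a separate output file keeps the original SPAdes graph intact in case the edited copy needs to be compared against it.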

Steps to reproduce

  1. Assemble with SPAdes:
    spades.py --corona -1 SRR29407826_1.fastq -2 SRR29407826_2.fastq -o spades_G_SRR29407826
  2. Start VStrains:
    vstrains -a spades -g spades_G_SRR29407826/assembly_graph_after_simplification.gfa \
    -p spades_G_SRR29407826/contigs.paths \
    -o vstrains_G -fwd SRR29407826_1.fastq -rve SRR29407826_2.fastq

Files with reads: SRR29407826.zip
VStrains log: vstrains.log
SPAdes log: spades.log

RunpengLuo commented 3 days ago

Hi, thanks for trying VStrains, and sorry for the late reply.

  1. The GFA version conflicts with gfapy, the external Python library used for parsing GFA files, which is not up to date with GFA version 1.2. I'll try to fix this later on, but for now manually changing the version to 1.0 in the GFA file should make it parse. Thanks a lot for pointing this out!

  2. VStrains was not tested with coronaSPAdes, but mainly with SPAdes. I've run your dataset with SPAdes (the standard version) + VStrains. It might not be helpful to run VStrains on coronaSPAdes output, since coronaSPAdes already reports a single collapsed strain and the graph has no edges to process further. I've attached the result and a Bandage visualization in case they are helpful.
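Whether a graph has any edges can be checked directly on the GFA file before running VStrains. The sketch below assumes standard GFA 1 records, where `S` lines are segments and `L` lines are links whose sixth column is the overlap CIGAR; `summarize_gfa` is a hypothetical helper, not part of VStrains. Whether VStrains derives its reported "graph kmer size" from these overlaps is an assumption and is not confirmed in this thread.

```python
from collections import Counter


def summarize_gfa(lines):
    """Return (segment_count, link_count, overlap_counter) for GFA 1 lines.

    Hypothetical helper: counts S records, and tallies the overlap field
    (column 6) of each L record.
    """
    segments = 0
    overlaps = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments += 1
        elif fields[0] == "L" and len(fields) >= 6:
            overlaps[fields[5]] += 1
    return segments, sum(overlaps.values()), overlaps
```

It can be called on an open file, for example `summarize_gfa(open(path))`. A link count of zero means the graph consists of isolated segments only.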

Feel free to let me know if there are further questions,
John

out_SRR29407826.zip

Dv1t commented 1 day ago

I've also tried the --rnaviral option of SPAdes, and there are two types of outcome with VStrains when the graph kmer size is 0:

  1. A strain is assembled (but it is the same as the SPAdes scaffold)
  2. VStrains runs forever

Attaching logs and files for both cases:

First: vstrains_kmer_0_success.log spades_kmer_0_success.log kmer_0_success_reads.zip

Second: vstrains_kmer_0_fail.log spades_kmer_0_fail.log kmer_0_fail_reads.zip
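To tell the two outcomes apart without waiting indefinitely, the run can be wrapped in a time limit. This is a generic sketch using Python's `subprocess.run` with its `timeout` argument; `run_with_timeout` is a hypothetical helper, and the time limit is an arbitrary choice. The commented command only reuses the flags already shown in the steps to reproduce.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s seconds.

    Hypothetical helper: subprocess.run kills the child process when the
    timeout expires and raises TimeoutExpired, which is mapped to None here.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example (not executed here), using the command from the steps to reproduce:
# run_with_timeout(
#     ["vstrains", "-a", "spades",
#      "-g", "spades_G_SRR29407826/assembly_graph_after_simplification.gfa",
#      "-p", "spades_G_SRR29407826/contigs.paths",
#      "-o", "vstrains_G",
#      "-fwd", "SRR29407826_1.fastq", "-rve", "SRR29407826_2.fastq"],
#     timeout_s=3600,
# )
```

A return value of `None` marks a run that exceeded the limit, which corresponds to the second case above.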