maickrau / GraphAligner

MIT License

Corrected read coverage (number of bases) ~3x higher after aligning raw reads to unitig graph #50

Closed jelber2 closed 2 years ago

jelber2 commented 2 years ago

Hi,

Preface: since these are simulated rather than real reads, I am not sure whether this is a bug or merely an artifact of the simulation.

I simulated 150x Illumina read coverage (I doubt the details matter much; I used ReSeq) and 100x Nanopore read coverage (using pbsim2) for an E. coli genome.

I then took the raw Illumina reads and used bcalm2 to make a unitig graph (shortreads.unitigs.gfa.gz).

Finally, I built GraphAligner from GitHub commit 02c8e26 and ran the following:

~/bin/GraphAligner/bin/GraphAligner -g shortreads.unitigs.gfa -f sd_0001.fa \
--corrected-clipped-out sd_0001.graphaligner.bcalm.fa -x dbg --threads 20

but I end up with 1,322,106,902 bases in the --corrected-clipped-out output, compared with 464,165,200 bases in the original simulated reads (sd_0001.fa.gz).
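For anyone wanting to reproduce the comparison, here is a minimal Python sketch of how the base totals and the resulting ratio can be computed. The `count_bases` helper is hypothetical (not part of GraphAligner), and the genome length of 4,641,652 bp for NC_000913.3 is an assumption used only to express the totals as coverage.

```python
import gzip


def count_bases(path):
    """Sum the sequence lengths in a FASTA file (plain or gzip-compressed)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    total = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


# Totals reported in this issue.
raw_bases = 464_165_200
corrected_bases = 1_322_106_902
genome_length = 4_641_652  # assumed length of NC_000913.3 in bp

print(f"corrected / raw: {corrected_bases / raw_bases:.2f}")
print(f"raw coverage: {raw_bases / genome_length:.1f}x")
print(f"corrected coverage: {corrected_bases / genome_length:.1f}x")
```

With real files, `count_bases("sd_0001.fa.gz")` and `count_bases("sd_0001.graphaligner.bcalm.fa")` would replace the hard-coded totals.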

Interestingly, de novo assembly with flye v. 2.9-b1774 (nano-corr option) seems to work fine if I first use rasusa to randomly subsample the corrected-clipped-out output back down to ~100x coverage: aligning the assemblies to the reference genome (NCBI accession: NC_000913.3) with minimap2 shows no detectable differences.
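The idea behind subsampling to a target coverage can be sketched in a few lines of Python. This is only an illustration of the concept, not rasusa's actual implementation; the function name, its parameters, and the in-memory list of `(name, sequence)` tuples are all assumptions made for the example.

```python
import random


def subsample_to_coverage(reads, genome_length, target_coverage, seed=0):
    """Randomly keep reads until their total length reaches the target.

    reads: list of (name, sequence) tuples (hypothetical in-memory format).
    Returns the kept reads in random order.
    """
    target_bases = genome_length * target_coverage
    order = list(range(len(reads)))
    random.Random(seed).shuffle(order)  # seeded for reproducibility

    kept, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        kept.append(reads[index])
        total += len(reads[index][1])
    return kept
```

Because whole reads are kept or dropped, the resulting coverage can overshoot the target by up to one read length.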

So I guess my question is: how can there be ~3x more bases in the --corrected-clipped-out output (the result is similar for --corrected-out)?
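One way to narrow down where the extra bases come from is to compare records per read name between the input and the corrected output. The Python sketch below is a diagnostic only and does not assume any particular cause. It does assume that the first whitespace-delimited token of each output header matches the input read name, which has not been verified for GraphAligner's output; `record_lengths` and `compare` are hypothetical helper names.

```python
import gzip
from collections import defaultdict


def record_lengths(path):
    """Map each read name to the lengths of all FASTA records carrying it."""
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = defaultdict(list)
    name = None
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name].append(0)  # start a new record for this name
            elif name is not None:
                lengths[name][-1] += len(line)
    return lengths


def compare(raw_path, corrected_path):
    """Summarize how the corrected output differs from the raw reads."""
    raw = record_lengths(raw_path)
    corrected = record_lengths(corrected_path)
    return {
        "raw_names": len(raw),
        "corrected_names": len(corrected),
        # Names that appear in more than one output record.
        "names_with_multiple_records": sum(
            1 for recs in corrected.values() if len(recs) > 1
        ),
        # Names whose output bases exceed the bases of the raw read.
        "names_with_more_bases_than_raw": sum(
            1
            for name, recs in corrected.items()
            if name in raw and sum(recs) > sum(raw[name])
        ),
    }
```

If many names show multiple records or more bases than their raw read, that would point at the output step rather than at the simulation.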

jelber2 commented 2 years ago

Hmmm. I no longer have this issue with commit 9d60782c.