Discrepancies genome size

ruanjue / smartdenovo

Ultra-fast de novo assembler using long noisy reads

GNU General Public License v3.0

Discrepancies genome size #36

Closed by begsch 5 years ago

begsch commented 5 years ago

Hi,

Thanks for developing smartdenovo and wtdbg2. I am currently working on a highly heterozygous insect genome. The smartdenovo log file reports a genome size of around 400 Mb (which is close to the estimated haploid genome size), whereas the consensus (cns) sequence is only around 180 Mb. Is it possible to parse the layout files in order to retrieve all FASTA sequences belonging to the 400 Mb genome assembly?

P.S.: I am also trying to optimise wtdbg2 parameters on my read data (right now I get rather messy results).

Thank you very much in advance, Ben

ruanjue commented 5 years ago

Grep the sequence headers in the lay file to see the approximate assembly size. BTW, the -R option in wtdbg-2.4 will help.
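As a rough illustration of the suggestion above, here is a minimal Python sketch that pulls the header lines out of a layout file and adds up their lengths. The thread does not describe the layout format, so two things here are assumptions: that header lines start with `>`, and that each header carries a `len=<number>` field. The function name `summarize_layout_headers` and the example lines are made up for illustration; check a few real headers first and adjust the pattern to match.

```python
import re

# ASSUMPTION: headers carry a "len=<digits>" field. This field name is
# hypothetical; inspect your own lay file and change the pattern if needed.
LEN_FIELD = re.compile(r"\blen=(\d+)")


def summarize_layout_headers(lines):
    """Return (number of header lines, sum of the lengths found in them).

    ASSUMPTION: a header line is any line that starts with '>'.
    Headers without a recognisable length field are counted but add 0.
    """
    n_headers = 0
    total_len = 0
    for line in lines:
        if not line.startswith(">"):
            continue  # skip the per-read layout lines
        n_headers += 1
        match = LEN_FIELD.search(line)
        if match:
            total_len += int(match.group(1))
    return n_headers, total_len


# Made-up example lines, not real smartdenovo output.
example = [
    ">utg1 len=100",
    "some read placement line",
    ">utg2 len=250",
]
print(summarize_layout_headers(example))
```

For a real file, pass the open file handle instead of the list, for example `summarize_layout_headers(open(path))` with `path` pointing at your lay file.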

bnwaweru commented 5 years ago

Hi ruanjue,

I am using smartdenovo to assemble a plant genome from nanopore reads, with an estimated genome size of about 600-700 Mb from a jellyfish estimation. The log file reports a genome size and coverage. Could you shed some light on how smartdenovo calculates the genome size? I ask because the assembly size that QUAST reports for dmo.lay.utg (292,207,975) is about half of what smartdenovo reports (510,489,103). Thanks

ruanjue commented 5 years ago

In wtclp, I count the alignment coverage of each read, then sum over all reads to estimate the mean sequencing coverage: GSize = sum(read_length) / mean_seq_cov.
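The formula above can be sketched in a few lines of Python. This is not the wtclp code, only an illustration of the arithmetic. It assumes the per-read coverage values are already available and that the mean is a plain unweighted average over reads, which the thread does not specify. The function name `estimate_genome_size` and the numbers are hypothetical.

```python
def estimate_genome_size(read_lengths, read_coverages):
    """GSize = sum(read_length) / mean_seq_cov.

    read_lengths:   length of each read in bases
    read_coverages: alignment coverage counted for each read
    ASSUMPTION: mean_seq_cov is the unweighted mean of the per-read values.
    """
    if not read_lengths or len(read_lengths) != len(read_coverages):
        raise ValueError("need one coverage value per read, and at least one read")
    total_bases = sum(read_lengths)
    mean_seq_cov = sum(read_coverages) / len(read_coverages)
    if mean_seq_cov <= 0:
        raise ValueError("mean coverage must be positive")
    return total_bases / mean_seq_cov


# Toy numbers: 60,000 bases in total, mean coverage 30.
print(estimate_genome_size([10000, 20000, 30000], [20, 30, 40]))  # 2000.0
```

Because the estimate depends only on total read length and mean coverage, it is computed independently of the assembly, so it can differ from the size of the assembled unitigs.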

SMARTdenovo is no longer updated. If the assembly size is smaller than expected, please try wtdbg-2.4 or other assemblers.

Jue

bnwaweru commented 5 years ago

Thanks Jue,

I will try wtdbg.