ruanjue / smartdenovo

Ultra-fast de novo assembler using long noisy reads
GNU General Public License v3.0

Assembly versus genome size #47

Closed ChuShin closed 4 years ago

ChuShin commented 4 years ago

Dear RuanJue,

I noticed that at the end of the wtclp step there is an estimate of genome size, in this case 1.0 Gbp (shown below), which is the size we expect for the genome. However, the output from wtlay (with parameters: -w 300 -s 200 -m 0.1 -r 0.97 -c 1) has a total assembly size of only 898 Mb. Is there any way to recover the missing ~100 Mb (10%) of sequence to see what it is?

Thank you in advance.

Regards, Chu Shin

-- Total aviable sequences: 51310363234 bp
Average Coverage(?): 51
Genome Size(?): 1006085553 bp

--
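The figures in the quoted log can be checked by hand. The short sketch below only reproduces the arithmetic: it shows that the reported genome size equals the total bases divided by the average coverage (rounded down), and it computes the gap to the 898 Mb assembly. That the estimate is derived this way is an inference from the numbers, not a confirmed description of how wtclp works, and 898 Mb is taken here to mean 898,000,000 bp.

```python
# Figures copied from the wtclp log quoted above.
total_bases = 51_310_363_234          # "Total aviable sequences" (bp)
average_coverage = 51                 # "Average Coverage(?)"
reported_genome_size = 1_006_085_553  # "Genome Size(?)" (bp)

# Assumption: the estimate is total bases / coverage, rounded down.
implied_genome_size = total_bases // average_coverage

# Assumption: "898Mb" is read as 898,000,000 bp.
assembly_size = 898_000_000
shortfall_bp = reported_genome_size - assembly_size
shortfall_pct = 100 * shortfall_bp / reported_genome_size

print(implied_genome_size == reported_genome_size)
print(f"{shortfall_bp / 1e6:.1f} Mb missing ({shortfall_pct:.1f}%)")
```

Because the estimate scales inversely with the coverage value, any error in the coverage figure carries straight through to the genome size.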

ruanjue commented 4 years ago

Do you mean the genome size estimated by SMARTdenovo? It is not accurate. I still do not have a good way to estimate genome size using long noisy reads.

ChuShin commented 4 years ago

Hello, so far in our experience the genome sizes estimated by SMARTdenovo have been surprisingly good for the few plant genomes we sequenced (sizes from 500 Mb to 16 Gb). However, the final assembly outputs (*.dmo.lay.utg) are always ~10-20% smaller, and the percentage varies from case to case. I am curious whether this is because the software filters out less-reliable unitigs, in which case we might be able to adjust the filtering stringency, but it sounds like that may not be the case.

Thank you for the reply.

Regards, Chu Shin
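For anyone wanting to track the 10-20% gap across several assemblies, a minimal sketch follows. It assumes the unitig file is plain FASTA, and the helper names `total_fasta_length` and `shortfall_percent` are hypothetical, written only for illustration. The demo input is a made-up two-record file, not real SMARTdenovo output.

```python
import tempfile
from pathlib import Path


def total_fasta_length(path):
    """Sum sequence lengths over all records in a FASTA file."""
    total = 0
    with open(path) as handle:
        for line in handle:
            # Header lines start with '>'; everything else is sequence.
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def shortfall_percent(assembly_bp, estimated_genome_bp):
    """Percentage of the estimated genome size absent from the assembly."""
    return 100 * (estimated_genome_bp - assembly_bp) / estimated_genome_bp


# Made-up demo file standing in for a *.dmo.lay.utg output.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.utg"
    demo.write_text(">utg1\nACGTACGT\nACGT\n>utg2\nGGCC\n")
    demo_total = total_fasta_length(demo)

print(demo_total)                         # 16
print(shortfall_percent(demo_total, 20))  # 20.0
```

This only quantifies the difference. It does not say which sequences are missing or why.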