More fragmented assembly after updating from version 1.3.0

lbcb-sci / raven

De novo genome assembler for long uncorrected reads

MIT License

More fragmented assembly after updating from version 1.3.0 #50

Open ilyavs opened 3 years ago

ilyavs commented 3 years ago

Hello, I have been using raven for a while. I recently reran an assembly of the same bacterial data with a newer version of raven and got a more fragmented genome. With version 1.3.0 I got the complete bacterial genome in one contig. With every later version I tried, the genome was more fragmented and the total assembly size was smaller. Is it possible to keep the improvements made in recent raven versions but restore the better contiguity observed in version 1.3.0? Sorry, but I can't share the data. Thanks, Ilya.

rvaser commented 3 years ago

Hi Ilya, which versions have you tried so far? What data type do you have and how fragmented is the assembly? From version 1.4.x, bubble similarity check via minimizers was replaced with alignments, while versions 1.5.x have different repeat annotations to save execution time.

Best regards, Robert

ilyavs commented 3 years ago

Hi, version 1.3.0 produced a 2.8 Mbp Staphylococcus aureus genome in a single contig. I tried versions 1.4.0, 1.5.1 and 1.5.3 (all via the Docker images on quay.io). None of these versions produced the 2.8 Mbp contig; the largest contig was around 1 Mbp. The data is MinION nanopore sequencing, basecalled with Guppy 4.2.2. The FASTQ file contains 3.6e8 bp. Best, Ilya.

rvaser commented 3 years ago

The dataset seems to have enough coverage and reasonable accuracy, so I am not sure why the later versions do not perform as well as 1.3.0. You could try v1.6.0 from the options branch (you can also try different k, w values). Sorry for my delayed reply.
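For reference, the coverage implied by the numbers quoted in this thread can be estimated as total sequenced bases divided by genome size. A quick sketch using the 3.6e8 bp and 2.8 Mbp figures from above (the function name is only for illustration, and the result is a rough average that ignores read length and quality):

```python
def estimated_coverage(total_bases: float, genome_size: float) -> float:
    """Return the average depth of coverage: sequenced bases / genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Figures quoted in this thread: 3.6e8 bp of reads, 2.8 Mbp genome.
coverage = estimated_coverage(3.6e8, 2.8e6)
print(f"approx. {coverage:.0f}x coverage")  # prints "approx. 129x coverage"
```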

ilyavs commented 3 years ago

Can you please elaborate on how the k and w values are expected to affect the assembly? When do you expect to have the next version released to bioconda? Thanks, Ilya.

rvaser commented 3 years ago

I have created a new release; it will be picked up automatically by bioconda soon.

Regarding parameters, I think you can first try k = 19. We have recently evaluated higher k values (up to 25) on Guppy 5 data, where they tend to increase contiguity. Earlier Raven versions used (k, w) = (29, 9) (option --weaken, now removed) for HiFi data to improve assembly. I am not sure how this will affect Guppy 4.x datasets, but your dataset is quite small, so you can try a couple of values around the default (k, w) = (15, 5).
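A sweep over a few (k, w) pairs could be scripted roughly as follows. This is only a sketch under some assumptions: that the installed version accepts -k, -w and -t options (worth checking with raven --help), that raven writes the assembly to stdout, and that reads.fastq and the output file names are placeholders. The script only prints the command lines; it does not run raven.

```python
import itertools
import shlex


def raven_commands(reads, k_values, w_values, threads=4):
    """Build one (command line, output file) pair per (k, w) combination."""
    commands = []
    for k, w in itertools.product(k_values, w_values):
        # Hypothetical output name, chosen so each run is easy to tell apart.
        out = f"assembly_k{k}_w{w}.fasta"
        # ASSUMPTION: -k, -w and -t are the k-mer length, window length and
        # thread count options; verify against `raven --help`.
        args = ["raven", "-k", str(k), "-w", str(w), "-t", str(threads), reads]
        commands.append((" ".join(shlex.quote(a) for a in args), out))
    return commands


# Values around the default (15, 5), plus the k = 19 suggested above.
for cmd, out in raven_commands("reads.fastq", [15, 19, 21], [5, 9]):
    print(f"{cmd} > {out}")
```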

ilyavs commented 3 years ago

Thank you for the new release and the information. Version 1.6.0 assembled the complete 2.8 Mbp genome but failed to circularize the chromosome, while version 1.3.0 assembled the complete 2.8 Mbp genome and circularized it. In version 1.6.0, increasing the k value resulted in a shorter largest contig. In version 1.5.3, running with --weaken resulted in a 2.7 Mbp non-circular largest contig. So for now, it seems that version 1.3.0 is still the best option for my data, although version 1.6.0 comes close.
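For anyone repeating this kind of comparison, the contig count, total assembly size and largest contig can be read off each output FASTA with a few lines of Python. A minimal sketch (the function name and the toy sequences are made up for illustration; it does not look at circularity):

```python
def contig_stats(fasta_text):
    """Return contig count, total length and largest contig for FASTA text."""
    lengths = []
    current = None  # length of the contig being read, None before first header
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous contig, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return {
        "contigs": len(lengths),
        "total": sum(lengths),
        "largest": max(lengths, default=0),
    }


# Toy input only; a real run would pass the contents of an assembly FASTA.
example = ">ctg1\nACGTACGT\nACGT\n>ctg2\nACG\n"
print(contig_stats(example))  # {'contigs': 2, 'total': 15, 'largest': 12}
```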