final assmelby much smaller tha raw assembly

jdmontenegro commented 6 years ago

Hi all, First of all thank for an amazing tool, I have been using Flye quite successfully for assembly mammalian genomes using relatively low coverage PacBio reads (~30X). However, recently I tried an assembly of a non-model species with an expected haploid genome size of 2.2 Gbp (flow cytometry) and using 35X coverage of PacBio reads with an average read legnth of 3.4 Kbp. I am using Flye 2.3 with the following command line:

flye --pacbio-raw ${reads} -g 2.5g -m 1000 -o ${OD} -t 64 -i 3

The raw assembly and consensus stages produced assemblies of ~2.1 Gbp, but after the repeat solving and polishing steps the final assembly is only 781 Mbp (roughly 1/3). The genome is expected to be quite repetitive (~50% of simple sequence repeats and transposable elements), but that still does not quite explain such big difference between the raw assembly and the final polished assembly.

Is this expected behaviour? Is there any parameter that can be tweaked to improve the final assembly?

I look forward to to hearing back from you, any suggestions would be more than welcome.

Kind regards,

Juan Montenegro

mikolmogorov commented 6 years ago

Hi Juan,

Most like what happened was that the genome indeed has high repetitive content, and these repeats were collapsed in the repeat graph and remained unresolved. What was the intuition behind selection minimum overlap 1000? What was the automatically selected value (based on reads N50)? In general, you want to keep this parameter as high as possible to avoid extra complexity in the repeat graph.

We are also working on the new version now, that we think should handle highly repetitive genome better - I can let you know as soon as we have a relatively stable beta version.

jdmontenegro commented 6 years ago

Hi Mikhail, Thanks for your reply. The average read length for my dataset was 3.4kbp, that's why I used a smaller overlap value. I did not check what was the automatically assigned value. I can check in the next run. I was surprised by the sharp reduction, because SSR are usually smaller than 1kb, so I figured 1kb overlaps would use most reads while at the same time solving these kind of repeat which are the dominant repeat in the genome. Besides letting flyer select its own overlap value, would you suggest any other change to the command? I look forward to the new version!!! Best regards, Juan Montenegro

On 2 Jul. 2018 5:38 am, "Mikhail Kolmogorov" notifications@github.com wrote:

Hi Juan,

Most like what happened was that the genome indeed has high repetitive content, and these repeats were collapsed in the repeat graph and remained unresolved. What was the intuition behind selection minimum overlap 1000? What was the automatically selected value (based on reads N50)? In general, you want to keep this parameter as high as possible to avoid extra complexity in the repeat graph.

We are also working on the new version now, that we think should handle highly repetitive genome better - I can let you know as soon as we have a relatively stable beta version.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/fenderglass/Flye/issues/51#issuecomment-401628242, or mute the thread https://github.com/notifications/unsubscribe-auth/AI8lupytDghGZ66DsdFX14utMA_oN5Auks5uCSVIgaJpZM4U1YFr .

jdmontenegro commented 5 years ago

Hi all, I repeated the assembly with the default minoverlap (5 Kbp) using the same dataset. Again the genome size was greatly reduced from 900Mbp after the consensus stage to 350Mbp after the repeat and 343 Mbp after polishing.

The command used was the following: '''flye --pacbio-raw ${reads} -g 2.2g -o ${OD} -t 64 -i 2''' I have 35X coverage of raw pacbio reads for a 2.2Gbp genome. This assembly represents less than 1/5 the expected genome size, so it looks worse now than it did when I used a minimum overlap of 1 Kbp. Is there any way to improve this assembly?

Kind regards

Juan Montenegro

mikolmogorov commented 5 years ago

Could you send me the log file from the last run?

jdmontenegro commented 5 years ago

Dear Mikhail, Please find the lopg attached to this reply. flye.log.tar.gz

mikolmogorov commented 5 years ago

Hi Juan,

I looked through the log - there is definitely a high fragmentation of the draft assembly, most likely because of the read length / min overlap inconsistency. I was assuming that you are running the latest 2.3.3 version, which should have selected minimum overlap to 3k, but not 5k. In the end, it seems that only ~30% of reads were aligned - not a good sign as well. But this also suggest that if ~340mb of assebmled final sequence correspond to ~30% of reads, then the genome size should be around 1Gb. Are you sure about your 2Gb estimate?

I have just pushed the new 2.3.5b version into the flye branch (it is not in releases yet). I suggest to run it with the minimum overlap 2000 and see how it performs. Once it finishes you can send me the new log again, so I can comment if there is anything suspicious in the run.

jdmontenegro commented 5 years ago

Dear Mikhail, Thank you for your email and the time taken to follow this up. I have just submitted a new job using 3kb for the minimum overlap. I have discussed your email with a few colleagues and the evidence suggests that the genome is in fact 2.2 Gbp long. This is based on flow cytometry and genome survey sequencing of a fosmid library. Additionally, kmer analysis of illumina data (60x) also point to 1.8-2.4gbp genome. Speaking to the lab team, it seems that the raw PB reads were produce from relatively few templates (several passes over each template). And it seems that the effective non-refundable coverage is actually 18x of the pacbio data even though we have 35x coverage in total. Do you think this is also an issue for the assembler?

Kind regards,

Juan Montenegro

On Fri., 20 Jul. 2018, 12:54 pm Mikhail Kolmogorov, < notifications@github.com> wrote:

Hi Juan,

I looked through the log - there is definitely a high fragmentation of the draft assembly, most likely because of the read length / min overlap inconsistency. I was assuming that you are running the latest 2.3.3 version, which should have selected minimum overlap to 3k, but not 5k. In the end, it seems that only ~30% of reads were aligned - not a good sign as well. But this also suggest that if ~340mb of assebmled final sequence correspond to ~30% of reads, then the genome size should be around 1Gb. Are you sure about your 2Gb estimate?

I have just pushed the new 2.3.5b version into the flye branch (it is not in releases yet). I suggest to run it with the minimum overlap 2000 and see how it performs. Once it finishes you can send me the new log again, so I can comment if there is anything suspicious in the run.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/fenderglass/Flye/issues/51#issuecomment-406474090, or mute the thread https://github.com/notifications/unsubscribe-auth/AI8luperrVXIwl66uWsXBC2AKeKcJF4rks5uIUZ8gaJpZM4U1YFr .

mikolmogorov commented 5 years ago

Dear Juan,

If you are running Flye on the split subreads, it should not be an issue. On the other hand, we do not take advantage of ccs technology either. If a significant number of templates have multiple subreads covering them, you might actually consider to call consensuses (using the PacBio tools) and then run Flye on the corrected reads. However, it is currently not possible to mix reads with significant difference in error rate (such as raw subreads and ccs consensuses).

mikolmogorov commented 5 years ago

Closing this one due to inactivity.

mikolmogorov / Flye

final assmelby much smaller tha raw assembly #51