mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs
Other
767 stars 165 forks source link

Assembling multiple sub-regions at once #475

Closed GuillaumeHolley closed 2 years ago

GuillaumeHolley commented 2 years ago

Hi,

I am looking for some advice with Flye. I am interested in assembling different regions of a human sample. These regions are quite small and rarely exceed 20kb. Because there are quite a few of these regions and quite a few samples to process, I extend the regions of interest of 50kb on each side, I align my long reads to hg38 and extract only the reads overlapping all 50-kb extended regions I am interested in (so even if a region is 1bp long, I extract 100kb worth of reads centered on that region). Now I send the extracted reads all together to assemble with Flye 2.9 for each sample and this works well most of the time. However, there are some regions which do not assemble though, I do not get any contigs for these, But if I assemble only the reads overlapping one specific (extended) region, I do get a contig. So I assume that when I do my first "all regions" assembly, some reads from different regions "conflict" with each other and it is why some regions do not get any contigs.

I was wondering if you had any tip on how to improve this? Let me know if any of this is unclear.

Thanks, Guillaume

mikolmogorov commented 2 years ago

Hi Guillaume,

My first guess would be that some of your regions belong to repetitive parts of the genome (segmental duplications?). When you are trying to assemble all regions at once, multiple copies of one region may get collapsed if there is not enough difference between them (for raw ont reads this could be up to 5-10% difference for collapse). When using a reference, the aligner can take advantage of these small differences between the copies and place the reads more accurately.

Could that be the case?

Mikhail

GuillaumeHolley commented 2 years ago

Hi Mikhail,

You are correct that many of these regions are overlapping Segmental Duplications and I am fairly confident that multiple copies of one region get collapsed indeed. I thought at first that maybe 50kb extension wasn't enough since SDs can be much longer than 50kb. I have merged my region of interest with the overlapping SD regions from the UCSC Genome Browser. I have then extended these new merged regions of 50kb on each side and repeated the process (extract reads from primary, secondary and supplementary alignments overlapping ROIs -> assemble) but the same unassembled regions remain unassembled. I am using Illumina-corrected ONT reads here so there is obviously the risk that the correction is incorrect but it seems good in general (compared to the PacBio HiFi data we have for some of the same samples). I tried different assembly preset (nano-raw and nano-hq) and also --meta but none really made a difference. I tried exactly the same process (map reads -> extract reads -> assemble) from PacBio data and it worked much better. Hence, I assume the ONT reads, even corrected, have a too high error rate for these regions that do not assemble. It doesn't seem there is much more I can do with the data I have so unless you have some suggestions, I'll close this issue.

Thank you for your feedback!

Guillaume

mikolmogorov commented 2 years ago

@GuillaumeHolley makes sense. By PacBio do you mean HiFi? Also, what is the estimated error for error-corrected ONT reads? I would try a combination of --nano-corr and --read-error options to tune up for the expected read error rate.

Also, if you are recruiting secondary alignment - you are likely bringing reads from other copies of this SD. I would try with just primary and supplementary (maybe also require MAPQ > 10 or 20).

GuillaumeHolley commented 2 years ago

@fenderglass PacBio HiFi yes. For the error-corrected reads, I don't have the numbers with me but it was something like 5-6% in those difficult regions (and around 2-3% genome-wide). Thanks for the tip on these options, I will give it a shot tomorrow.

Regarding secondary alignments, don't I want to bring in reads from other copies of this SD? Otherwise, I risk assembling only some copies of the SD or even misassemble the SD (I mean, right now it is not assembling at all so this might not be a bad thing).

mikolmogorov commented 2 years ago

@GuillaumeHolley depending on your goals. If you are assembling all regions at the same time, then secondary alignments have to have matching primaries on another SD copy. If you are interested in a single SD region, it might make sense to focus on primary only.

mikolmogorov commented 2 years ago

Marking as resolved - but feel free to follow up if you have more questions.