mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

tuning parameters to avoid collapsing? #165

Closed dcopetti closed 4 years ago

dcopetti commented 4 years ago

Hi, I am assembling a heterozygous 2.5 Gb (haploid size, so we expect a ~5 Gb assembly) plant genome. We have about 35x coverage per allele and many copies of very few repeats (LTR-retrotransposons). The assembly is running now, but I wonder whether for the next iterations I should/can tweak some of these parameters:

[2019-10-04 14:58:11] DEBUG:    assemble_ovlp_relative_divergence=0.10
[2019-10-04 14:58:11] DEBUG:    repeat_graph_ovlp_divergence=0.15
[2019-10-04 14:58:11] DEBUG:    read_align_ovlp_divergence=0.25
[2019-10-04 14:58:11] DEBUG:    max_coverage_drop_rate=5

to reduce the collapsing of repeats into one node and the collapsing of the two allelic sequences into one contig. Would you change anything, with the goal of being more selective when finding overlaps? Would the coverage be of any help in identifying the two sets of allelic sequences in a pile of alignments? Thanks, Dario

mikolmogorov commented 4 years ago

Hi Dario,

I wouldn't recommend tweaking those, since that might cause chimeric connections from disjointigs to be propagated to the graph. What is the expected divergence rate? If it is relatively high, Flye should be able to reconstruct the alternative alleles with the default parameters. The assembly will be more fragmented though, since Flye currently does not do pseudo-haplotyping (as in Falcon).

I'd wait for the assembly to complete and see how it looks. Coverage will definitely be helpful in identifying haploid/diploid contigs.
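As a rough illustration of the coverage idea (a minimal sketch, not something Flye does for you): with about 35x per allele, contigs that represent a single allele should sit near 35x, while contigs where both alleles were collapsed should sit near 70x. The function name `classify_by_coverage`, the labels, and the cutoffs (geometric midpoints between 0.5x, 1x, 2x and 4x of the per-allele coverage) are all assumptions chosen for this example; the input is a plain list of `(name, length, mean_coverage)` tuples that you would fill from your own per-contig coverage table.

```python
from math import sqrt


def classify_by_coverage(contigs, allele_cov=35.0):
    """Label contigs by mean coverage relative to the per-allele coverage.

    contigs: iterable of (name, length, mean_coverage) tuples (hypothetical input).
    allele_cov: expected coverage of one allele (35x in this thread).
    Returns a dict mapping contig name to a label.
    """
    # Cutoffs are illustrative assumptions: geometric midpoints between
    # 0.5x/1x, 1x/2x and 2x/4x of the per-allele coverage.
    low = allele_cov / sqrt(2)
    high = allele_cov * sqrt(2)
    upper = 2 * allele_cov * sqrt(2)

    labels = {}
    for name, _length, cov in contigs:
        if cov < low:
            labels[name] = "low_coverage"
        elif cov < high:
            labels[name] = "single_allele"      # roughly 1x allele coverage
        elif cov < upper:
            labels[name] = "collapsed_alleles"  # roughly 2x allele coverage
        else:
            labels[name] = "repeat_like"        # well above 2x
    return labels


if __name__ == "__main__":
    example = [
        ("contig_1", 1_200_000, 33.0),
        ("contig_2", 800_000, 72.0),
        ("contig_3", 15_000, 300.0),
        ("contig_4", 40_000, 10.0),
    ]
    for name, label in classify_by_coverage(example).items():
        print(name, label)
```

Short contigs have noisy coverage, so in practice it is probably worth looking at the coverage histogram first and adjusting the cutoffs to where the 1x and 2x peaks actually fall.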

Mikhail

mikolmogorov commented 4 years ago

I'm closing the issue for now, but feel free to reopen if you have more questions about this assembly later!