mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Assembly graph with several collapsed contigs #716

Closed pguenzi-tiberi closed 1 week ago

pguenzi-tiberi commented 1 month ago

Hello everyone,

First of all, thank you very much for what you have done and are currently doing for the community with this tool!

Over the last month, we have assembled the genomes of two different strains of the same species (a green alga), as determined from phylogenetic markers. For each strain, I used Inspector (https://github.com/ChongLab/Inspector) to check the quality of the assembly. For one strain, everything seems correct: Inspector didn't detect any errors, and the graph looks odd for only one big contig (see the image just below this line).

[image]

For the other strain, the graph is very odd. Many sequences appear to be shared between many edges, and I can't tell whether this is biological or a Flye error. Inspector also detected collapsed contigs. Do these odd structures in the graph correspond to the collapsed contigs? How can we solve this problem?

[image]

For the first assembly, I used PacBio HiFi reads. For the second, I used PacBio CLR reads (i.e. not HiFi).

First command line:
flye --pacbio-hifi /bettik/guenzitp/data/HiFi.fastq.gz -i 1 -t 16 --out-dir ./flye_assembly_first

Second command line:
flye --pacbio-raw /bettik/guenzitp/data/subreads.fasta.gz -i 1 -t 16 --out-dir ./flye_assembly_second

Thank you very much !

mikolmogorov commented 1 month ago

Hi,

The collapsed edges likely represent unresolved repeats, and the high-degree tangles in the assembly graph likely represent telomeres. The differences between the graphs are likely because you are using HiFi mode for the first assembly and CLR mode for the second; the two modes are pretty different in terms of assembly parameters. I don't think there is anything wrong with the assembly.

Misha
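
For anyone who wants to look at the high-degree tangles mentioned above outside of a graph viewer, here is a minimal sketch in Python. It is not part of Flye: the function names, the toy graph, and the degree threshold of 4 are all assumptions made for illustration. It relies only on the GFA 1 convention that `S` lines declare segments and `L` lines declare links between them, and it counts how many distinct neighbours each segment has.

```python
from collections import defaultdict


def segment_degrees(gfa_lines):
    """Return {segment_name: number of distinct neighbouring segments}."""
    neighbours = defaultdict(set)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            # Touch the entry so segments without links are still reported.
            neighbours[fields[1]]
        elif fields[0] == "L" and len(fields) >= 5:
            # GFA 1 link line: L <from> <from_orient> <to> <to_orient> <overlap>
            src, dst = fields[1], fields[3]
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    return {name: len(adj) for name, adj in neighbours.items()}


def high_degree_segments(gfa_lines, min_degree=4):
    """List segments with at least min_degree neighbours (threshold is arbitrary)."""
    degrees = segment_degrees(gfa_lines)
    return sorted(name for name, deg in degrees.items() if deg >= min_degree)


# Hypothetical toy graph: edge_5 is shared by four other edges.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\tedge_1\t*",
    "S\tedge_2\t*",
    "S\tedge_3\t*",
    "S\tedge_4\t*",
    "S\tedge_5\t*",
    "L\tedge_1\t+\tedge_5\t+\t0M",
    "L\tedge_2\t+\tedge_5\t+\t0M",
    "L\tedge_3\t+\tedge_5\t-\t0M",
    "L\tedge_5\t+\tedge_4\t+\t0M",
]

print(high_degree_segments(toy_gfa))
```

On a real run, the same functions could be applied to the lines of the assembly_graph.gfa file in the Flye output directory, for example `high_degree_segments(open("flye_assembly_second/assembly_graph.gfa"))`. A high neighbour count only flags candidates to inspect; it does not by itself say whether a tangle is a telomere, another repeat, or something else.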

mikolmogorov commented 1 week ago

Assuming this is resolved, feel free to follow up if you have more questions!