marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.

yields twice the genome size #152

Closed Zero-Sun closed 1 year ago

Zero-Sun commented 1 year ago

Hi, I assembled a diploid fish genome with verkko using CCS and ONT data. The resulting assembly.fasta is twice the size of the target genome. I also have Hi-C data. I read "Using Hi-C reads" (#62) and "Using phasing blocks generated by hifiasm+Hi-C" (#136), among others, but it is still not clear to me how to get a separate genome for each haplotype. I saw you said "We should have a better solution by the end of this year." and "We should hopefully have an integrated version of Hi-C in the verkko pipeline within a month." Is there a newer and better solution now, and if not, how should I proceed in the meantime? Thank you! From a bioinformatics novice

skoren commented 1 year ago

Approximately twice the genome size is expected: without long-range phasing information, verkko generates phased unitigs for both haplotypes, so both copies end up in the output.
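If it helps to sanity-check this, here is a minimal sketch that sums the sequence lengths in a FASTA file and compares the total to the expected haploid genome size. The function names and the toy sequences are made up for illustration and are not part of verkko; for a real run you would open assembly.fasta and pass your own haploid size estimate.

```python
import io


def fasta_total_length(handle):
    """Sum the lengths of all sequence lines in a FASTA stream."""
    total = 0
    for line in handle:
        line = line.strip()
        # Skip blank lines and header lines, which start with '>'.
        if line and not line.startswith(">"):
            total += len(line)
    return total


def size_ratio(assembly_bp, haploid_bp):
    """Ratio of the assembly size to the expected haploid genome size."""
    return assembly_bp / haploid_bp


# Toy input (hypothetical): two 12 bp unitigs, one per haplotype,
# for a genome whose haploid size is 12 bp.
demo = io.StringIO(">hap1_utig\nACGTACGT\nACGT\n>hap2_utig\nACGTACGTACGT\n")
total = fasta_total_length(demo)
print(total, size_ratio(total, 12))
```

A ratio close to 2 is consistent with both haplotypes of a diploid genome being present in the output, rather than with a misassembly.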

@Dmitry-Antipov can comment, but there is a Hi-C pipeline available in the master branch. You'd have to install verkko from source to get that version, or wait for a release. Alternatively, you can use the pipeline at https://github.com/Dmitry-Antipov/verkkohic; see the nopstools_wrapper.sh script for details. The Hi-C integration has mostly been tested on mammals so far, so we're definitely interested to see how it works on other genomes.

skoren commented 1 year ago

Idle, and somewhat of a duplicate of #157, which has been added as a feature request. With Hi-C data, the latest version will phase and output two haplotypes as expected.