marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.
300 stars 29 forks source link

Assigned reads for each phased contig #288

Closed luanelandau closed 1 month ago

luanelandau commented 1 month ago

I am looking to locally reassemble a locus of interest using the phased reads from verkko. I would like to find the file that indexes which reads make up each of the phased contigs (paternal and maternal). How can I find that information, is it in any default output? Thank you!

Dmitry-Antipov commented 1 month ago

Hi, We have this information for all HiFi reads, but only for few of ONT reads - those ones used to close gaps in HiFi

You'll need 6-layoutContigs/unitig-popped.layout and 6-layoutContigs/unitig-popped.layout.scfmap (for runs with hi-c/pore-c should use 8-hicPipeline/final_contigs/6-layoutContigs/unitig-popped.layout & 8-hicPipeline/final_contigs/6-layoutContigs/unitig-popped.layout.scfmap)

.layout.scfmap will give you mapping between scaffold names of form haplotype-smth to several pieces pieceXXXXX (just splitting on gaps), and .layout will give you lists of reads assigned to each piece.

For most of ONT reads we do not have such assignment. There is some indirect data, but actually I'd just realign them to the assemblies.

skoren commented 1 month ago

I'll add that both the *.layout and the *.scfmap files will be promoted to the top-level assembly folder so you don't nee to worry about the 6-layout or 8-hicPipeline distinction. Just grab assembly.scfmap and assembly.homopolymer-compressed.layout files. The scfmap will look like:

path haplotype1-0000001 haplotype1_from_utig4-0
piece000001
end
path haplotype2-0000002 haplotype2_from_utig4-3
piece000002
end

while the layout will look like:

tig     piece000001
len     2866921
rds     6298
m84005_220827_014912_s1/71567556/ccs    18121   0
m84039_230418_213342_s3/19269101/ccs    17844   5246
m84039_230418_213342_s3/225903430/ccs   20947   5506
m84005_220827_014912_s1/126684173/ccs   17954   7546
m84005_220827_014912_s1/263785287/ccs   22734   9688
m84039_230418_213342_s3/33034490/ccs    13474   26619
m84039_230418_213342_s3/142280257/ccs   13525   27248
...
end

So if you want to find reads that belong to phased contig haplotype1-0000001, look at reads in piece000001 in the layout. Note that some paths may have multiple pieces in case of gaps in the assembly.

If you want a friendlier way to get this information, including for all ONT reads, you can add the --consensus-bam option to verkko (assuming you're on v2.2). This will make an assembly.bam with each of the pieces having all their HiFi and ONT reads aligned to the final consensus sequence. This would require re-running part of the assembly though.