Output noseq gfa for graph visualization in bandage

marbl / verkko

Telomere-to-telomere assembly of accurate long reads (PacBio HiFi, Oxford Nanopore Duplex, HERRO corrected Oxford Nanopore Simplex) and Oxford Nanopore ultra-long reads.


Output noseq gfa for graph visualization in bandage #73

Closed baozg closed 2 years ago

baozg commented 2 years ago

Hi,

Can verkko output the noseq.gfa for looking at the graph with Bandage? For large genomes, the sequence info cannot be loaded into Bandage easily.

skoren commented 2 years ago

If you ran with a trio, there will be a noseq.gfa under 6-rukki. Otherwise, you can convert a sequence gfa to noseq using awk:

cat assembly.homopolymer-compressed.gfa | awk '{if (match($1, "^S")) print $1"\t"$2"\t*\tLN:i:"length($3); else print $0}' > assembly.homopolymer-compressed.noseq.gfa
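For anyone who prefers not to use awk, the same conversion can be sketched in Python. This is only an illustration of the one-liner above, not part of verkko: the function name `to_noseq` is made up here, and like the awk command it drops any extra tags on segment lines and keeps only the name and length.

```python
def to_noseq(lines):
    """Yield GFA lines with each segment sequence replaced by '*' and an LN tag.

    Mirrors the awk one-liner: S lines keep only the name and the sequence
    length (other tags on S lines are dropped); all other lines pass through.
    """
    for line in lines:
        text = line.rstrip("\n")
        fields = text.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            yield f"S\t{name}\t*\tLN:i:{len(seq)}"
        else:
            yield text


def convert_file(in_path, out_path):
    """Stream a GFA file line by line so large graphs are never held in memory."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for out_line in to_noseq(src):
            dst.write(out_line + "\n")
```

Calling `convert_file("assembly.homopolymer-compressed.gfa", "assembly.homopolymer-compressed.noseq.gfa")` should give a file equivalent to the awk output for a standard tab-separated GFA.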

We should update the pipeline to always generate the noseq.gfa though.

baozg commented 2 years ago

Thank you for the quick answer. I have already converted it with bash.

Another question, about trio mode: we have an autotetraploid genome with 120x HiFi + 150x UL ONT (coverage based on the haploid size). Can rukki support polyploid genomes?

skoren commented 2 years ago

Rukki supports two colors only currently so you can't separate into more than two haplotypes. However, depending on the heterozygosity between the two copies, it's possible they will segregate anyway and you'd end up with a diploid-like graph with two separate diploid copies of each chromosome. I'd suggest running verkko with all your input data and seeing what the graph looks like.

baozg commented 2 years ago

The tetraploid graph is much more complex than a diploid one. I am running the pipeline with all defaults except the -k and -w of MBG, for faster graph building. We already have a haplotype-resolved assembly which was phased by genetic grouping (using a low-coverage selfing population). The size of the MBG graph (1.9G) is much smaller than the HiCanu (2.4G) or hifiasm (3.0G) output. Are there any parameters I can adjust to increase the graph size?

skoren commented 2 years ago

Is this post-ONT processing or just MBG? Also, all MBG output is homopolymer-compressed so 1.9G would be 3G+ post consensus.
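To illustrate why a homopolymer-compressed graph looks smaller than the final consensus, here is a minimal Python sketch of the compression itself: every run of identical bases is collapsed to a single base. The function names are invented for this example and the toy sequence is not real data; the actual ratio for a genome depends on its sequence content.

```python
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse each run of identical bases to one base, e.g. AAAC -> AC."""
    return "".join(base for base, _run in groupby(seq))


def compression_ratio(seq):
    """Compressed length divided by original length for one sequence."""
    return len(homopolymer_compress(seq)) / len(seq)


# Toy example: 10 bases collapse to 5.
print(homopolymer_compress("AAACCGTTTA"))
print(compression_ratio("AAACCGTTTA"))
```

Sizes reported in compressed space therefore need to be scaled up before comparing them with assemblers that report uncompressed sequence.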

baozg commented 2 years ago

Just MBG, verkko is in the align ONT step.

skoren commented 2 years ago

Yeah so that's homopolymer-compressed so the size is comparable to others. I wouldn't worry about graph complexity now, the ONT makes a very large difference in resolution and will separate some of these components. If you're not already, definitely update to the latest v1.0 release as that uses a faster version of GraphAligner too.

baozg commented 2 years ago

Actually, I am using the latest 1.0 from conda. Several gaf files have already finished. Let me wait and see what happens with the tetraploid graph.

skoren commented 2 years ago

The noseq output is automated now. Please feel free to open a new issue when your assembly is complete with other questions/issues.

baozg commented 2 years ago

The assembly.homopolymer-compressed.gfa is ~1.7G, and the final consensus assembly size is 2.63G with an N50 of 4.5Mb. With the ONT data it did exceed the other assemblers, but the size is smaller than the true size of the tetraploid genome (~3.1G based on flow cytometry and k-mers).

skoren commented 2 years ago

The output graph (and assembly) isn't really comparable to hifiasm or hicanu outputs because it is completely phased. So that N50 is pretty good given that; the comparable output would be the utg files from hifiasm. This is likely why you have a slightly smaller size too: there are homozygous regions which need to be duplicated, but they're not until you add longer-range phasing information.

So, the next step would be to visualize the graph and see if it looks diploid or tetraploid and try to use your phasing information to see if you can color the nodes appropriately.

baozg commented 2 years ago

The verkko N50 is higher than the hifiasm utg N50 (3.08G, 1.45Mb). In our previous assembly we did use the genetic grouping to unzip the homozygous regions and achieve a more contiguous assembly. The genetic grouping is a little complex, but I have a colored hifiasm version. If rukki could support polyploidy, could the verkko unitigs be made more contiguous (unitigs -> contigs)?

skoren commented 2 years ago

I'd suggest coloring the verkko graph and seeing how it looks, it might be simpler than what you have from hifiasm. The graph is in homopolymer-compressed space so make sure whatever you're using for coloring is too.

Rukki needs two colors but verkko can support arbitrary paths (including ones w/gaps) if you can generate them. It might be possible to run rukki multiple times setting the two colors as hap1 and not hap1 first, then hap2, not hap2, and so on and then combining the resulting paths. You can also manually generate some of the paths in the more complex areas if you want and again, verkko can generate consensus for them.
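The "hap1 vs. not hap1" idea can be sketched as a one-vs-rest collapse of per-node marker counts. Everything below is hypothetical: the dictionary layout, the function names, and the node names are made up for illustration and are not rukki's input format, and the step of combining the resulting paths is not covered.

```python
def one_vs_rest(marker_counts, target):
    """Collapse multi-haplotype marker counts into two colors for one round.

    marker_counts: {node: {haplotype: count}} (hypothetical layout).
    Returns {node: (target_count, rest_count)}.
    """
    two_color = {}
    for node, counts in marker_counts.items():
        target_count = counts.get(target, 0)
        rest_count = sum(c for hap, c in counts.items() if hap != target)
        two_color[node] = (target_count, rest_count)
    return two_color


def all_rounds(marker_counts, haplotypes):
    """Build one two-color table per haplotype: hapA vs rest, hapB vs rest, ..."""
    return {hap: one_vs_rest(marker_counts, hap) for hap in haplotypes}


# Made-up counts for two nodes in a tetraploid (haplotypes A-D).
counts = {
    "node1": {"A": 10, "B": 0, "C": 2, "D": 0},
    "node2": {"B": 7, "C": 7},
}
rounds = all_rounds(counts, ["A", "B", "C", "D"])
print(rounds["A"]["node1"])  # (10, 2)
```

Each round's table would then need to be written in whatever format the coloring step actually expects before running it.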

baozg commented 2 years ago

I will color the verkko graph using our previous haplotype-resolved assembly (using the different haplotypes as hapmers).

If I can color the four haplotypes appropriately, how do I use verkko for the next-step (haplotype-aware) consensus with manually assigned paths?

Also, for a tetraploid background (hap ABCD), if I set hap1 (hapA-specific k-mers) and hap2 (BCD), this can only find the haplotype-specific paths. But for homozygous paths, such as those shared by 2/3/4 haplotypes, how can I assign them?

baozg commented 2 years ago

Is ont_subset.id all the ONT reads for consensus or just seed reads? I only found 487 reads in our tetraploid run.

skoren commented 2 years ago

Those are all the reads needed for consensus where no HiFi reads exist. There are typically not too many of them, and the number varies between species depending on the frequency of HiFi dropouts. That doesn't mean only 487 reads were used for repeat resolution.