pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

What is the recommended way to extract haplotype from the graph based on VCF traversal path? #439

Open Griffan opened 2 years ago

Griffan commented 2 years ago

Hi,

Thanks for this awesome tool to enable graph file manipulation!

I wonder if there is an easy way to extract haplotypes based on specific traversal paths? I understand that odgi can extract subgraphs based on genomic coordinates or on a node list (which takes only one node per line and leads to fragmented haplotypes in the output). Is it possible to extract the haplotypes of each sample in the graph based on a traversal path, e.g. "321>322>324>325>326" and "321>322>323" for the ref and alt alleles?

ekg commented 2 years ago

I believe there is a node list input option in odgi extract. Just convert the traversal paths to a file with one node ID per line and feed it in.
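The conversion suggested here could look roughly like the sketch below. It is not part of odgi: the function names and the output file name are hypothetical, and it assumes that `>` (or `<`) in the traversal string only separates node IDs, so orientation is ignored. Check `odgi extract --help` for the exact name of the node list option.

```python
def traversal_to_node_ids(traversal):
    """Split a traversal such as '321>322>323' into node IDs, in order."""
    # Assumption: '>' and '<' are plain separators; orientation is ignored.
    tokens = traversal.replace("<", ">").split(">")
    return [int(token) for token in tokens if token]


def write_node_list(traversals, out_path):
    """Write one node ID per line, keeping first-seen order and dropping repeats."""
    seen = set()
    with open(out_path, "w") as handle:
        for traversal in traversals:
            for node_id in traversal_to_node_ids(traversal):
                if node_id not in seen:
                    seen.add(node_id)
                    handle.write(f"{node_id}\n")


if __name__ == "__main__":
    # "nodes.txt" is a hypothetical file name for the node list input.
    write_node_list(["321>322>324>325>326", "321>322>323"], "nodes.txt")
```

Note that the resulting file carries no ordering or adjacency information, so the extracted subgraph contains every path fragment that touches any listed node.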

(Email reply to Griffan's original post of Tue, Aug 9, 2022, 06:54.)

Griffan commented 2 years ago

Thanks for your reply, but as you and I mentioned above, the node list option only accepts one node ID per line. If the event is really large (let's not consider multi-allelic sites that share some nodes for now), this leads to output with the following problems:

1) It is a huge flattened matrix of N (samples) by M (nodes) haplotype fragments for each path.
2) The N dimension is dynamic if some samples do not pass through one or more of the nodes, which is not ideal because only the haplotypes of samples that traverse the entire path are of interest.
3) Concatenating the haplotypes for each sample from this matrix basically means reconstructing the local graph structure. So why not output the haplotypes using "321>322>323" in the first place, which is exactly the benefit of maintaining the graph structure?
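To make point 3 concrete, here is a minimal sketch of what "outputting the haplotype from the traversal" would mean: concatenating node sequences in traversal order. It works on GFA `S` lines (for example from a graph exported to GFA), is not an odgi feature, and assumes every node is traversed in forward orientation. The node sequences in the toy example are invented for illustration.

```python
def node_sequences(gfa_lines):
    """Map node ID -> sequence from GFA segment ('S') lines."""
    sequences = {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            sequences[fields[1]] = fields[2]
    return sequences


def spell_traversal(gfa_lines, traversal):
    """Concatenate node sequences along a traversal like '321>322>323'."""
    sequences = node_sequences(gfa_lines)
    # Assumption: '>' is only a separator and all nodes are on the forward strand.
    node_ids = [token for token in traversal.split(">") if token]
    return "".join(sequences[node_id] for node_id in node_ids)


if __name__ == "__main__":
    # Hypothetical toy graph; these sequences are not from a real dataset.
    toy = ["S\t321\tAC", "S\t322\tG", "S\t323\tTT", "S\t324\tA"]
    print(spell_traversal(toy, "321>322>323"))  # ACGTT
```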

Please correct me if I did not describe the question clearly or misunderstood the corresponding tutorial description.

Thanks!

subwaystation commented 2 years ago

Just to clarify: you feed a list of node identifiers into odgi, and you only want those paths returned that follow those nodes in the exact order you specified? Otherwise we don't report anything. @Griffan
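The behaviour described in this question can be sketched outside odgi as a filter over GFA `P` lines: keep only the paths whose steps contain the requested nodes contiguously and in the given order. This is an illustration under assumptions, not an existing odgi option: the function names and the toy paths are hypothetical, and step orientation (`+`/`-`) is stripped, so reverse-strand traversals are not handled.

```python
def parse_path_steps(p_line):
    """Return (path name, node IDs) from a GFA 'P' line, dropping the +/- suffix."""
    fields = p_line.rstrip("\n").split("\t")
    steps = [step[:-1] for step in fields[2].split(",") if step]
    return fields[1], steps


def contains_in_order(steps, wanted):
    """True if 'wanted' occurs in 'steps' as a contiguous run, in the same order."""
    n = len(wanted)
    return n > 0 and any(steps[i:i + n] == wanted for i in range(len(steps) - n + 1))


def paths_following(gfa_lines, traversal):
    """Names of the paths that follow the whole traversal, e.g. '321>322>323'."""
    wanted = [token for token in traversal.split(">") if token]
    hits = []
    for line in gfa_lines:
        if line.startswith("P\t"):
            name, steps = parse_path_steps(line)
            if contains_in_order(steps, wanted):
                hits.append(name)
    return hits


if __name__ == "__main__":
    # Hypothetical toy paths for three samples.
    toy = [
        "P\tsampleA\t320+,321+,322+,323+,327+\t*",
        "P\tsampleB\t320+,321+,322+,324+,325+,326+,327+\t*",
        "P\tsampleC\t320+,321+\t*",
    ]
    print(paths_following(toy, "321>322>323"))          # ['sampleA']
    print(paths_following(toy, "321>322>324>325>326"))  # ['sampleB']
```

Paths that touch only some of the nodes, such as sampleC here, are not reported at all, which matches the "else we don't report anything" reading.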