pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

How to extract the non-backbone reference sequences from pggb graph? #186

Open biozzq opened 2 years ago

biozzq commented 2 years ago

Dear all,

We have already constructed a graph from six genomes (five newly assembled genomes plus one backbone reference). I want to focus on the non-backbone references added to the graph. Can I use odgi to achieve this? Thank you in advance.

Best, Zheng zhuqing

ekg commented 2 years ago

What do you mean by backbone? Non-core?

You can use odgi depth to get a collection of BED intervals with depth below 6x. You can then pass these intervals to odgi extract, which will produce a fragmented graph. It would perhaps be easiest to use that graph as input to odgi pav to tabulate presence/absence variation.
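The depth-filtering step can be illustrated with a minimal Python sketch. It assumes the depth information has already been parsed into tuples of (path name, start, end, depth) with 0-based, half-open coordinates. That tuple layout, the function names, and the path names are hypothetical and do not reflect the exact output format of odgi depth. The sketch keeps intervals whose depth is below the number of genomes and merges intervals that touch or overlap, giving BED-style lines.

```python
from collections import defaultdict


def low_depth_intervals(records, n_genomes):
    """Keep intervals with depth < n_genomes and merge touching/overlapping ones.

    records: iterable of (path_name, start, end, depth) tuples,
             0-based half-open coordinates (an assumed layout, not odgi's own).
    Returns a list of (path_name, start, end) sorted by path and position.
    """
    by_path = defaultdict(list)
    for path, start, end, depth in records:
        if depth < n_genomes:
            by_path[path].append((start, end))

    merged = []
    for path in sorted(by_path):
        cur_start, cur_end = None, None
        for start, end in sorted(by_path[path]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Touching or overlapping: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                merged.append((path, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((path, cur_start, cur_end))
    return merged


def to_bed(intervals):
    """Format (path, start, end) tuples as tab-separated BED lines."""
    return "\n".join(f"{p}\t{s}\t{e}" for p, s, e in intervals)


# Hypothetical example data for a six-genome graph.
example = [
    ("genomeA#chr1", 0, 100, 6),
    ("genomeA#chr1", 100, 150, 2),
    ("genomeA#chr1", 150, 200, 5),
    ("genomeA#chr1", 300, 400, 1),
    ("genomeB#chr1", 0, 50, 6),
]
print(to_bed(low_depth_intervals(example, n_genomes=6)))
```

Merging adjacent intervals before extraction keeps the BED file short, but whether merging is appropriate depends on how the intervals are used downstream.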

This workflow does need a deduplication step for the PAV BED. I'll be working on this next.

biozzq commented 2 years ago

Dear @ekg

Sorry for the unclear wording. By backbone I mean the linear reference genome that is used to maintain the conceptual “linearity” of the input genomes. I want to extract all sequences that are novel relative to the backbone. Do you have any ideas on this? Thank you.

Best regards, Zheng zhuqing

ekg commented 2 years ago

There is no backbone in pggb graphs. All of the input genomes are simultaneously used to order and organize the graph.

The input sequences are all aligned directly to all others. The graph is built from this symmetric comparison.

Is there something in the documentation that suggests there is a backbone? We should adjust if so.

As an explanation: it seems problematic to me to limit a pangenome graph to one reference genome. Then you have to make a new graph for every reference genome you want to use. That can get messy, and so we specifically avoid this.

biozzq commented 2 years ago

Dear @ekg

I have not seen any documentation suggesting that a backbone is needed when generating a graph genome. However, when analyzing structural variations, I want to classify them as deletions, insertions, inversions, or others. I find this difficult because I cannot select one reference genome to provide a persistent structure to compare against. What do you think about this?

Best regards, Zheng zhuqing

ekg commented 2 years ago

I understand! Sorry for the confusion. You simply need to pick a reference for this polarization. You specify this reference when generating a VCF from the graph using vg deconstruct. This will provide a reference-relative description of the variation. The same graph can be used for multiple VCFs against different references.
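As a rough sketch of reusing one graph for several references, the Python snippet below only builds the shell command strings, one per chosen reference path, without running them. It assumes that the `-p` option of vg deconstruct selects the reference path (check `vg deconstruct --help` for your version). The graph file name, the path names, and the output naming scheme are hypothetical.

```python
import shlex


def deconstruct_commands(graph, reference_paths):
    """Build one `vg deconstruct` command string per reference path.

    Assumes `-p` names the reference path; the output file naming is an
    illustrative convention, not something pggb or vg prescribes.
    """
    commands = {}
    for ref in reference_paths:
        out = f"deconstruct.{ref}.vcf"
        commands[ref] = (
            f"vg deconstruct -p {shlex.quote(ref)} "
            f"{shlex.quote(graph)} > {shlex.quote(out)}"
        )
    return commands


# Hypothetical graph file and path names.
cmds = deconstruct_commands("graph.gfa", ["ref.chr1", "asm1.chr1"])
for cmd in cmds.values():
    print(cmd)
```

Each command yields a VCF polarized against a different path of the same graph, so no graph has to be rebuilt when the reference changes.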

I suggest applying this pipeline to parse complex structural variant calls into primitive alleles (e.g. indels and SNPs):

vcfbub -a 100000 deconstruct.vcf | vcfallelicprimitives >decomposed.vcf

The resulting output will generally be simple SNPs and indels up to ~100kb. Note that vg deconstruct generates nested sites, and vcfbub uses the nesting information in the VCF to "pop" bubbles (variant sites) bigger than its parameters allow.
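Once the variants are expressed relative to one reference, a first-pass classification can compare the REF and ALT allele lengths. The Python sketch below is a minimal illustration that assumes standard tab-separated VCF records with REF in column 4 and ALT in column 5. The function names and example records are hypothetical. It does not detect inversions, which cannot be recognized from allele lengths alone, and it labels symbolic, breakend, and missing alleles as "other".

```python
def classify_allele(ref, alt):
    """Classify one ALT allele against REF using allele lengths only."""
    # Symbolic alleles (<DEL>), breakends ([ or ]), '*' and '.' are not sequences.
    if alt in (".", "*") or alt.startswith("<") or "[" in alt or "]" in alt:
        return "other"
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def classify_vcf_lines(lines):
    """Count allele classes over VCF text lines, skipping headers and blanks."""
    counts = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):
            kind = classify_allele(ref, alt)
            counts[kind] = counts.get(kind, 0) + 1
    return counts


# Hypothetical records in the style of decomposed.vcf.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ref.chr1\t100\t.\tA\tG\t60\t.\t.",
    "ref.chr1\t200\t.\tA\tATTT\t60\t.\t.",
    "ref.chr1\t300\t.\tGCCC\tG,GC\t60\t.\t.",
]
print(classify_vcf_lines(example_vcf))
```

Running this on the decomposed VCF rather than the raw vg deconstruct output should give cleaner classes, since complex nested sites have already been broken into primitive alleles.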

We will probably make this postprocessing standard in pggb, pending a few updates to vcflib. One update will integrate BiWFA, which will allow decomposition of extremely large variants (1 Mbp and up) in low memory.
