pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

Could we use odgi to remove small variant paths #381

Closed biozzq closed 1 year ago

biozzq commented 2 years ago

Hi all,

With a large number of genomes, I think some compromises are needed to keep computation efficient, such as removing the small variants embedded in the graph genome. What do you think? Thank you in advance.

sincerely Zheng zhuqing

AndreaGuarracino commented 2 years ago

I would agree. Pangenome graphs have the power to fully represent input variation, and this can become a limitation, particularly if one is only interested in, for example, medium/large variants. One work in this direction is the generation of consensus graphs in pggb, types of graphs with different levels of resolution. Moreover, there are commands in odgi, such as prune and break (or combinations of commands, like depth and extract to remove regions with very high depth), which go in the direction of simplifying graph topology.

I ping @ekg for more broad and in-depth insight into this topic.

biozzq commented 2 years ago

Dear @AndreaGuarracino

Thank you. When using odgi to clean the graph, I could not find an option for variant size.

Yes, I can filter out the nodes with high depth (here set to 100). However, after pruning with odgi, every node in the pruned graph has depth 0, unlike the nodes in the original graph (see the commands below).

odgi prune -i pggb.og -C 100 -o pggb.prune.og
odgi depth -i pggb.og -d > pggb.node.depth
odgi depth -i pggb.prune.og -d > pggb.prune.node.depth
head pggb.node.depth
#node.id    depth   depth.uniq
1   1   1
2   1   1
3   1   1
4   1   1
5   1   1
...
head pggb.prune.node.depth
#node.id    depth   depth.uniq
1   0   0
2   0   0
3   0   0
4   0   0
5   0   0

Best regards, Zheng zhuqing

ekg commented 2 years ago

To keep path fragments, use odgi extract. Perhaps we should make the limitations of odgi prune more clear, or update it to use the same approach as in extract.

You'll want to use odgi depth to collect regions in your desired size range.

odgi depth -i graph.og -w 0:100:0 >good.bed
odgi extract -i graph.og -b good.bed -o pruned.og

The -w parameter takes a tuple of min / max / separation length. It'll iterate over each path, finding all ranges where the depth of paths in the graph is within the given boundaries. If two regions in the boundaries are within 0bp, they'll be merged. In other words, this will give exactly all ranges that are within the depth boundaries.

You can also invert the process, using odgi depth -W to get regions outside of the target depth. Then, we could use odgi extract -I to get the inverse (the parts not outside the target depth range). You may want to test various approaches and thresholds. These are the current best pruning approaches in odgi. We've used them to trim out centromeres from human graphs we built.

ekg commented 2 years ago

For removing small variants, first: we're working on a bubble detection method that is also path-depth based. This should address some limitations with existing approaches, and be more flexible for variant detection. In theory, this could be used to drive pruning.

Currently, the odgi extract pruning will break paths, adding many fragmentary ones as small variants are removed. Unfortunately, resolving this rigorously will take some effort. We'd like to have a kind of tuning parameter (call it epsilon) that, when set to 0 would imply a lossless graph, but when set to e.g. 0.5 would replace variants with MAF < 0.5 with the major allele, fixing up paths and such.

You could achieve something that at least removed the low-frequency short sequences by using odgi depth -w to target rare variants, and then keeping only ranges shorter than a given length (e.g. 1bp). Finally, you'd apply this in odgi extract to execute the pruning. I'm really curious how this works for you. It sounds like it's approximately what you want.
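The length-filtering step described above could be sketched as follows. This is a minimal illustration, not part of odgi: it assumes the BED written by odgi depth -w is tab-separated with the path name, start, and end in the first three columns, and the function name keep_short_ranges, the max_len parameter, and the example path names are hypothetical.

```python
from typing import Iterable, Iterator


def keep_short_ranges(bed_lines: Iterable[str], max_len: int) -> Iterator[str]:
    """Yield BED records whose length (end - start) is at most max_len.

    Assumption: columns are path name, start, end (tab-separated).
    Blank lines and lines starting with '#' are skipped.
    """
    for line in bed_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[1]), int(fields[2])
        if end - start <= max_len:
            yield line.rstrip("\n")


# Hypothetical ranges, standing in for the output of odgi depth -w.
example_bed = [
    "#path\tstart\tend",
    "sampleA#chr6\t100\t101",
    "sampleA#chr6\t500\t900",
    "sampleB#chr6\t42\t45",
]
short = list(keep_short_ranges(example_bed, max_len=3))
print("\n".join(short))
```

The surviving records would then be written to a BED file and handed to odgi extract as the set of ranges to act on.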

biozzq commented 2 years ago

Dear @ekg

Thank you for your suggestions.

odgi depth -i graph.og -w 0:100:0 >good.bed
odgi extract -i graph.og -b good.bed -o pruned.og

A minor correction: here, -w should be 0:0:100.

You could achieve something that at least removed the low-frequency short sequences by using odgi depth -w to target rare
variants, and then keeping only ranges shorter than a given length (e.g. 1bp). Finally, you'd apply this in odgi extract to
execute the pruning. I'm really curious how this works for you. It sounds like it's approximately what you want.

Here, I want to confirm something with you. If we set -w to 10:0:100, it will merge the paths separated by no more than 10 bp; that is, it will remove the nodes shorter than 10 nt. Is this right?

Best regards, Zheng zhuqing

biozzq commented 2 years ago

Dear @ekg @AndreaGuarracino

I found that if we use odgi extract to remove the paths with very high depth, the backbone reference genome is broken into small fragments (as shown below). You can see that chromosome 6 has been divided into a number of small paths. Are there any parameters that can maintain the integrity and continuity of the backbone genome?

contig=
contig=
contig=
contig=
contig=
contig=
contig=
contig=
contig=
contig=

Best regards, Zheng zhuqing

ekg commented 2 years ago

To fix this, specify paths to fully "lace" back into the graph using the -R --lace-paths option. You'll give that a file which has one path per line. These will be fully included in the output graph.
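One way to prepare the file for -R/--lace-paths is sketched below. It assumes you already have the list of path names in the graph (for example from odgi paths -L) and that the backbone paths share a common name prefix; the function name write_lace_list, the "ref#" prefix, and the example path names are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_lace_list(path_names: Iterable[str], prefix: str, out_file: Path) -> List[str]:
    """Write one path name per line for every name starting with prefix.

    Returns the selected names so the caller can check what was written.
    """
    selected = [name for name in path_names if name.startswith(prefix)]
    out_file.write_text("".join(name + "\n" for name in selected))
    return selected


# Hypothetical path names; real ones depend on how the graph was built.
all_paths = ["ref#chr6", "ref#chr7", "sampleA#chr6", "sampleB#chr6"]
lace_file = Path(tempfile.gettempdir()) / "lace_paths.txt"
laced = write_lace_list(all_paths, "ref#", lace_file)
print(lace_file.read_text(), end="")
```

The resulting file is what would be given to odgi extract through -R, so that the listed backbone paths are kept whole in the output graph.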

AndreaGuarracino commented 2 years ago

@biozzq take a look at the Remove artifacts and complex regions tutorial. In particular, this step shows how to use the -R/--lace-paths option to preserve a set of paths in the extracted subgraph.

biozzq commented 2 years ago

Dear all,

Thank you. I will give it a try. I would also like to follow up on my previous question.

Here, I want to confirm something with you. If we set -w to 10:0:100, it will merge the paths separated by no more than 10 bp; that is, it will remove the nodes shorter than 10 nt. Is this right?

Sincerely, Zheng zhuqing

AndreaGuarracino commented 2 years ago

odgi depth ... -w 10:0:100 will emit regions where the depth is between 0 and 100, but if two regions are separated by less than 10 bases, those regions will be merged. This means that the emitted intervals may include small parts with a depth greater than 100.
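The merging behaviour described above can be illustrated with a small sketch. This is not odgi's implementation: it works on a plain list of per-base depths for a single path, and it assumes two in-range regions are merged when the gap between them is at most merge_len bases (the exact boundary rule in odgi may differ). The function name depth_windows and the example depths are hypothetical.

```python
from typing import List, Tuple


def depth_windows(
    depths: List[int], merge_len: int, min_depth: int, max_depth: int
) -> List[Tuple[int, int]]:
    """Return half-open [start, end) ranges with min_depth <= depth <= max_depth.

    Assumption: ranges whose gap is at most merge_len bases are merged,
    so a merged range may span bases that are outside the depth bounds.
    """
    # First pass: collect maximal runs of in-range positions.
    windows: List[Tuple[int, int]] = []
    start = None
    for pos, depth in enumerate(depths):
        in_range = min_depth <= depth <= max_depth
        if in_range and start is None:
            start = pos
        elif not in_range and start is not None:
            windows.append((start, pos))
            start = None
    if start is not None:
        windows.append((start, len(depths)))

    # Second pass: merge runs separated by a small enough gap.
    merged: List[Tuple[int, int]] = []
    for s, e in windows:
        if merged and s - merged[-1][1] <= merge_len:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged


depths = [5, 5, 150, 150, 5, 5, 5, 300, 5]
strict = depth_windows(depths, merge_len=0, min_depth=0, max_depth=100)
loose = depth_windows(depths, merge_len=10, min_depth=0, max_depth=100)
print(strict)  # [(0, 2), (4, 7), (8, 9)]
print(loose)   # [(0, 9)]
```

With a merge length of 10, the single merged range covers the bases with depth 150 and 300, which is the effect described in the comment above; with a merge length of 0, those bases stay excluded.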

biozzq commented 2 years ago

Dear @AndreaGuarracino ,

Thank you. So my understanding of -w 10:0:100 was wrong. If we want to completely remove the regions (which can also be called paths) where the depth is higher than 100, we must set -w to 0:0:100. Is this right?

Sorry, but going back to the original question, I still don't know how to remove small variants using odgi.

Sincerely. Zheng zhuqing

AndreaGuarracino commented 2 years ago

Yes, with -w 0:0:100 your ranges will strictly contain regions with depth between 0 and 100. You can use these regions to extract a sub-graph that respects the 0-100 depth range. Note that you will not change the embedded sequences themselves. When you say you want to remove small variants, if what you want to do is replace the ALT allele with the REF allele, that is not what will happen.

biozzq commented 2 years ago

Thanks. Yes, we may need to change the allele sequences in the graph. So, do you mean that we cannot use odgi to remove small variants?

ekg commented 2 years ago

It may not be ideal, but you could use the frequency filtering we are talking about and then realign the sequences to the simplified graph using GraphAligner with the corrected-out mode.

There is not a method in odgi to use the graph bubble information to change sequences of paths. Do you want to do sequence correction?

biozzq commented 2 years ago

Sorry for taking a while to get back here.

It may not be ideal, but you could use the frequency filtering we are talking about and then realign the sequences to the simplified graph using GraphAligner with the corrected-out mode.

I cannot follow your idea, as I do not see the connection between frequency filtering and small-variant filtering. I do not think that filtering out low-frequency variants will also filter out small variants.

There is not a method in odgi to use the graph bubble information to change sequences of paths. Do you want to do sequence correction?

No, I do not want to do sequence correction. I just want to remove the small variants, such as the variants smaller than 50bp.

Also, I found that when --vcf-spec is set, the pggb pipeline generates a VCF file named *smooth.fix.*.vcf. Could I do some filtering on this VCF file before converting it to a GFA file? Thank you in advance.

Best regards, Zheng zhuqing

subwaystation commented 2 years ago

I think what @biozzq wants is to remove nodes with a size of 50bp or smaller?

You could filter the VCF. Will you use vg construct to go from VCF to GFA? If so, be prepared that you might lose some information compared to the original GFA, because the VCF is reference-centric!

Best, Simon

ekg commented 2 years ago

Yes, you can just remove variants smaller than 50bp from the VCF.

Note that you will get a VCF with nested sites represented as a tree of bubbles, with the LV field and PS field explaining the hierarchy. To make a GFA from this I would suggest applying vcfbub to remove extremely large sites and then the filtering you describe to remove small ones. You can also keep only LV=0 sites and do this to avoid duplication.
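The VCF-side filtering could look roughly like the sketch below. It is a plain-text filter, not vcfbub, and not a replacement for it. It assumes that a variant's size is the length of its longest allele (REF or any ALT), that LV appears as an LV=<n> entry in the INFO column, and that records without an LV entry should be dropped when only top-level sites are wanted. The function names and the example records are hypothetical.

```python
from typing import Iterable, Iterator, Optional


def info_value(info: str, key: str) -> Optional[str]:
    """Return the value of key=value in a VCF INFO string, or None."""
    for item in info.split(";"):
        if item.startswith(key + "="):
            return item.split("=", 1)[1]
    return None


def variant_size(ref: str, alts: str) -> int:
    """Assumption: size is the length of the longest allele."""
    return max([len(ref)] + [len(alt) for alt in alts.split(",")])


def filter_vcf(
    lines: Iterable[str], min_size: int = 50, top_level_only: bool = True
) -> Iterator[str]:
    """Keep header lines and records with size >= min_size (and LV=0 if asked)."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        ref, alts, info = fields[3], fields[4], fields[7]
        if top_level_only and info_value(info, "LV") != "0":
            continue
        if variant_size(ref, alts) < min_size:
            continue
        yield line


# Hypothetical records: a SNV, a 60 bp insertion, and a nested deletion.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr6\t100\t.\tA\tT\t60\tPASS\tLV=0",
    "chr6\t200\t.\tA\t" + "A" + "C" * 60 + "\t60\tPASS\tLV=0",
    "chr6\t300\t.\t" + "G" * 80 + "\tG\t60\tPASS\tLV=1",
]
kept = list(filter_vcf(example_vcf))
print("\n".join(kept))
```

Only the two header lines and the record at position 200 survive here: the SNV is below the 50 bp threshold, and the deletion is a nested (LV=1) site. Removing extremely large sites, as suggested above, would still be done with vcfbub before this step.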

The filtering idea is different on the graph than in VCF space. Hope this makes sense for your application.

subwaystation commented 1 year ago

Looks like this is not an issue anymore.