vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Right method to annotate genes on pangenome graph #4357

Open yeeus opened 1 month ago

yeeus commented 1 month ago

PLEASE DO NOT MAKE SUPPORT REQUESTS HERE

Please use the Biostars forum instead:

https://www.biostars.org/new/post/?tag_val=vg

Ok I will post on Biostars later.

Hello dear friends! Thanks for developing vg, such a useful and magical tool for pangenome graphs. First, I should say that I'm new to manipulating graphs, not least because of the many formats (e.g. .vg, .xg, .gbz, .gam ...). As a beginner, I need some help.

I have a human pangenome graph built from several genomes, with genome_a as the reference. I want to see the locations of some gene regions of interest in my graph, as in Fig. 5d of the HPRC publication. Because regions such as the MHC are highly complex, the existing gene annotations there are not reliable enough to simply draw gene locations from them. I therefore turned to the graph to get locally detailed, confident gene annotations. I first tried the following method, which follows the odgi tutorial:

  1. extract subgraphs with odgi
  2. get a BED file of the genes of interest and inject it into the graph
  3. run odgi untangle on the injected graph to see the gene locations on each path

However, I found that for genes with copy-number variation (CNV), this method often fails to capture all gene copies (usually it finds just one), so I have been looking for another method. For now, I intend to:

  1. align the sequences of the genes of interest (e.g. HLA genes extracted from GRCh38.p14) to the graph using GraphAligner
  2. use the alignments from step 1 to get the gene locations on each haplotype in my graph

For step 2, I initially used vg annotate, but it seems to work only for the reference path (#4158). I then tried vg surject with these commands:

vg paths -x graph.vg -L > graph.vg.paths
vg surject -x graph.vg -t 8 -F graph.vg.paths -M -b genes_sequence_To_graph.gam > genes_sequence_To_graph.bam

which has not produced results as I write this. Also, in #4158 the developers suggested:

but if you have the GAF and you have the GFA you can compare the node names that the GAF reads visit against the node names that each GFA path visits, and find the nodes at which each read intersects with each path it touches.

I think I could also use this admittedly crude method to get the gene locations from the GAF file that GraphAligner generated.
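A minimal Python sketch of that node-intersection idea, offered as an illustration and not as a vg feature. It assumes a blunt GFA whose S lines carry sequences and whose haplotypes are stored as P lines (W lines are not handled), and a GAF whose path column uses the `>node<node` form. The function names (`parse_gfa`, `gaf_nodes`, `locate_on_paths`) and the toy graph are hypothetical. Coordinates are node-level only: the alignment's start and end offsets within the first and last node are ignored, so the intervals may be slightly too wide.

```python
import re


def parse_gfa(lines):
    """Collect segment lengths and path steps from GFA S and P lines."""
    seg_len = {}
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # Assumes the sequence is present (not "*").
            seg_len[fields[1]] = len(fields[2])
        elif fields[0] == "P":
            # Each step looks like "12+" or "12-": node name plus orientation.
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return seg_len, paths


def gaf_nodes(gaf_line):
    """Return (query name, visited node names) for one GAF record."""
    fields = gaf_line.rstrip("\n").split("\t")
    # Column 6 holds the graph path, e.g. ">3>4<7".
    return fields[0], re.findall(r"[<>]([^<>]+)", fields[5])


def locate_on_paths(seg_len, paths, read_nodes):
    """Map the nodes a read visits onto every path that shares them.

    Returns {path name: [(start, end), ...]} as 0-based half-open
    intervals; adjacent shared nodes are merged into one interval, so
    separate gene copies on a path show up as separate intervals.
    """
    wanted = set(read_nodes)
    hits = {}
    for name, steps in paths.items():
        intervals = []
        offset = 0
        for node, _orientation in steps:
            end = offset + seg_len[node]
            if node in wanted:
                if intervals and intervals[-1][1] == offset:
                    intervals[-1][1] = end  # extend the current run
                else:
                    intervals.append([offset, end])
            offset = end
        if intervals:
            hits[name] = [tuple(iv) for iv in intervals]
    return hits


if __name__ == "__main__":
    # Toy graph: hap1 visits node 3 twice, mimicking a duplicated gene.
    gfa = [
        "S\t1\tACGT", "S\t2\tGG", "S\t3\tTTT", "S\t4\tA",
        "P\tref\t1+,2+,4+\t*",
        "P\thap1\t1+,3+,4+,2+,3+\t*",
    ]
    gaf = "gene1\t4\t0\t4\t+\t>3>4\t4\t0\t4\t4\t4\t60"
    seg_len, paths = parse_gfa(gfa)
    gene, nodes = gaf_nodes(gaf)
    print(gene, locate_on_paths(seg_len, paths, nodes))
```

On a real graph, one would feed in the full GFA and loop over every GAF record. A run of shared nodes is only a candidate location, because repeated or collapsed nodes can produce spurious hits that need filtering, for example by the fraction of the gene's nodes covered.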

I don't know whether the vg surject command above will produce a correct alignment file containing the gene locations on each path. Could anybody give me some advice on my process, or suggest any other helpful method? Please!

Best wishes! Thanks!

adamnovak commented 1 month ago

I don't think we have a known good way to get annotations against all the different samples in the graph using vg. Your idea of injecting into the path you have annotations on and then surjecting that sequence to each other path you are interested in, as an alignment, might work OK.

If you actually have assemblies you want annotated, I think we'd probably recommend using the Comparative Annotation Toolkit instead of vg. CAT is designed to annotate new assemblies using alignments and annotations on previous assemblies, and it actually thinks about things like paralogs and ortholog matching and pseudogenization. I'm not sure how well it works on e.g. MHC, but I also wouldn't lean on vg inject and vg surject and the HPRC graphs to get "reliable" annotations for the assemblies.

Maybe @ph09 or @glennhickey can speak to how well CAT's ortholog matchings are likely to agree with the HPRC graph's Minigraph-Cactus alignments?