vgteam / vg

tools for working with genome variation graphs

Aligning pan-transcriptome to pangenome graph #3173

Closed brettChapman closed 3 years ago

brettChapman commented 3 years ago

Hi

I'm currently constructing a pangenome from 20 varieties of the same species. The pan-transcriptome is also being generated, with a transcriptome for each individual across different tissues. Referring to https://github.com/vgteam/vg/issues/35, I'm wondering if it's possible to convert a pangenome graph of 20 haplotypes into a spliced graph (based on a GFF from each individual), and then align the raw RNA-seq data to the spliced pangenome graph. I'm producing my pangenome graph using PGGB (https://github.com/pangenome/pggb). Thanks.

ekg commented 3 years ago

If you can hack pggb to add a representation of the reference transcriptome into the graph, then you should get something compatible with vg rna. There might be difficulties using it but nothing fundamental should block this approach.

It'd of course be good to add all transcripts from all 20 genomes in, if you have that.

jonassibbesen commented 3 years ago

Hi, I have been working on a wiki for transcriptomic analyses using vg, which you might find useful: https://github.com/vgteam/vg/wiki/Transcriptomic-analyses. It is still missing a lot, but it contains some information on how to run vg rna and which mapper to use for RNA-seq data. The major thing it is still missing is how to use vg rna to project transcripts between haplotype paths, but since you already have an annotation for each strain, that feature is not that important for you.

brettChapman commented 3 years ago

Thanks for your feedback @ekg, and the very helpful wiki you have provided @jonassibbesen.

We're currently acquiring the RNA-seq data for each of the 20. At the moment we have a GFF for each of the 20 genomes, but it's mostly de novo and supported by RNA-seq from only 3 of the 20 genomes. The plan is to perform gene prediction and annotation for each genome. I'll then be able to prepare the graph for use with vg rna.

@ekg with regard to hacking pggb, do you mean adding a few lines of code to the pggb script for vg convert, or do you mean modifying how the smooth.gfa file is generated, such as removing consensus graphs, if you think vg rna wouldn't work with them present? From the wiki @jonassibbesen has provided, it sounds like there may be a conflict if consensus paths are in the graph output but not present in the GFF file. If that's the case, when the time comes, I'll get around this by forking pggb, removing the consensus graph parameters of smoothxg, and resuming the run from the smoothxg step.

@ekg It might be useful to add an option to pggb, to either add or leave out the consensus paths, especially if the consensus is not needed, or to provide a smoothed graph with and without the consensus paths, to provide choices for any downstream analysis.

ekg commented 3 years ago

We might want to have a graph that is consensus paths plus transcript paths. Does that match the needs of vg rna?

Also I meant adding a few lines to change the way the alignments are made initially. The GFF file would need to be converted to alignments plus FASTA representing the transcripts. Or, maybe better, you could take the seqwish graph and use tools in vg to convert the GFF to embedded paths. Then, this would go forward into the smoothing.

jonassibbesen commented 3 years ago

Glad you found the wiki helpful.

It is not a problem for vg rna if there are additional paths in the graph, such as the consensus paths, that are not in the annotation GFF files. What is required is that all chromosomes (column 1) specified in the GFF files are present as paths in the graph. All existing paths will still be there after vg rna, and their nucleotide sequences should be unchanged (the individual node sequences will likely differ). The only exception is if the option --remove-non-gene is used, which tells vg rna to create a splice graph without the intergenic and intronic regions. Also, if you want to add the transcripts in the annotation to the graph as embedded paths, you can do this with the option --add-ref-paths.
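The requirement above (every column 1 name in the annotation must exist as a path in the graph) can be checked before running vg rna. The following is only a minimal sketch: `missing_seqnames` is a hypothetical helper, not part of vg, and it assumes you already have a text file with one path name per line (for example from `vg paths -L`, assuming that option is available in your vg version).

```shell
# Hypothetical helper: print annotation sequence names (column 1) that have
# no matching graph path name. File names are placeholders.
# $1: annotation file (GFF/GTF)
# $2: text file with one graph path name per line
missing_seqnames() {
    annotation="$1"
    path_names="$2"
    awk -F '\t' '
        FILENAME == ARGV[1] { paths[$1] = 1; next }  # first file: remember path names
        /^#/ || NF < 9 { next }                      # skip comments and non-record lines
        !($1 in paths) && !seen[$1]++ { print $1 }   # report each missing name once
    ' "$path_names" "$annotation"
}
```

An empty output means every sequence name in the annotation has a matching path; any name that is printed would need to be renamed in the annotation or added to the graph.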

brettChapman commented 3 years ago

Thanks @ekg and @jonassibbesen.

Embedding the transcripts in the seqwish graph using vg rna and then normalising with smoothxg sounds like the better approach.

jeizenga commented 3 years ago

It looks like this question was largely resolved. I'm going to close the issue, but let me know if you'd like it re-opened.

brettChapman commented 3 years ago

Hi @jonassibbesen

I'm revisiting this issue. I now have splice regions from all my genomes based on some RNA-seq analysis I have carried out. I merged all high-confidence transcripts into a GTF file using stringtie (using the --conservative flag and merging with the original annotations), and next I'll merge all the GTF files, amending the chromosome names to match the paths in my pangenome graph.

For my pangenome graph I currently have a separate GFA file for each chromosome (I generated them separately due to memory limits). I imagine it's recommended to generate a single pangenome graph prior to generating a splice graph to align my RNA-seq data to.

I'm in the process of generating a single GBWT file from a single indexed graph built from multiple VG graphs (vg ids -j, followed by vg index), for the purpose of aligning the low- and high-coverage reads I have for variant calling. I could use the merged single GBWT file for vg rna as well. Is it recommended to generate the PackedGraph from the GBWT file for use with vg rna, using
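Amending the chromosome names as described above could be sketched as follows. This is an illustration only: `prefix_gtf_seqnames` is a hypothetical helper, and the prefix scheme is an assumption. Use whatever prefix makes column 1 identical to the path names in your own graph.

```shell
# Hypothetical helper: prepend a prefix to the sequence name (column 1)
# of every GTF record, leaving comment and blank lines untouched.
# $1: prefix to prepend (placeholder, e.g. a variety name plus separator)
# $2: input GTF file
prefix_gtf_seqnames() {
    awk -F '\t' -v OFS='\t' -v prefix="$1" '
        /^#/ || NF == 0 { print; next }   # keep comments and blank lines as-is
        { $1 = prefix $1; print }         # rewrite only the sequence name
    ' "$2"
}
```

For example, `prefix_gtf_seqnames 'varietyA#' varietyA.gtf > varietyA.renamed.gtf` would turn `chr1H` into `varietyA#chr1H` in every record. The renamed per-variety files can then be concatenated into one annotation.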

vg convert -b -p graph.gbwt > graph.pg

or should I first try and combine all my graphs into a single GFA prior to generating a packed graph?

I notice vg rna takes a haplotype GBWT file with the -l parameter, so I thought that generating the packed graph from the GBWT file and then supplying the same GBWT file to vg rna with -l might not be the ideal way to run vg rna.

Thanks.

jeizenga commented 3 years ago

Do you mean an XG rather than a GBWT? The GBWT is an index that stores a collection of walks through the graph (often haplotypes), whereas the XG is a memory-efficient representation of the graph itself. GBWTs are usually constructed from phased variant calls or from a GFA with W lines.

The GBWT doesn't store the DNA sequences of nodes. However, there's another data structure, the GBWTGraph, that will store node sequences. A paired GBWTGraph and GBWT can be used as a graph in vg convert.

brettChapman commented 3 years ago

No, I mean GBWT. From my 7 VG files (generated from 7 GFA files), I did the following (the barley genome has 7 chromosomes):

vg ids -j $(for i in $(seq 1 7); do echo barley_pangenome_graph_${i}H.vg; done)
vg index -x barley_pangenome_graph.xg $(for i in $(seq 1 7); do echo barley_pangenome_graph_${i}H.vg; done)
vg gbwt -x barley_pangenome_graph.xg --buffer-size 1000 --index-paths -o barley_pangenome_graph.gbwt

My GBWT is still being generated. It's taken a few weeks and is still running.

As far as I can tell there is no way to generate a GFA from an XG using vg convert.

How would I generate a GBWTGraph from my single large XG? Thanks.

jeizenga commented 3 years ago

To generate a GFA from an XG:

vg convert -f graph.xg > graph.gfa

To make a GBWT graph from a GBWT and XG/VG/PackedGraph:

vg gbwt -x graph.[xpv]g -g graph.gg haplotypes.gbwt

Are you aware that the --index-paths option will only include the embedded paths in the GBWT? In most graphs, that's only the reference path, in which case the GBWTGraph and GBWT will essentially be just the primary reference scaffolds. If you are using vg rna, it could also include transcript paths.
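Since --index-paths only picks up embedded paths, it may help to see which paths a graph actually carries before building the GBWT. One rough way, assuming you have a GFA export of the graph (for example from the vg convert -f command above), is to list its P-lines and W-lines. `list_gfa_paths` is a hypothetical helper, not a vg command, and it says nothing about how vg itself treats W-lines on import.

```shell
# Hypothetical helper: summarise the paths stored in a GFA file.
# P-lines are embedded paths (name in column 2); W-lines are walks
# (sample, haplotype index and sequence id in columns 2 to 4).
# $1: GFA file (placeholder name)
list_gfa_paths() {
    awk -F '\t' '
        $1 == "P" { print "P\t" $2 }
        $1 == "W" { print "W\t" $2 "#" $3 "#" $4 }
    ' "$1"
}
```

If the output lists only one reference path per chromosome, the resulting GBWT would be expected to contain just those paths; if all 20 varieties appear as P-lines, they should all be candidates for indexing.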

Maybe @jltsiren would have some ideas about how to speed up the GBWT construction.

Hope that helps!

brettChapman commented 3 years ago

Thanks @jeizenga. I wasn't aware that vg convert could take an XG as input. There is nothing in the usage text that indicates this.

Yes. My graph contains all the haplotypes I'm interested in aligning to. This is in relation to a previous issue I raised: https://github.com/vgteam/vg/issues/3303 where I want to align genomic reads from other barley varieties for variant calling.

Since I'm also interested in aligning RNA-seq reads for RNA-seq quantitation, I'll also generate a spliced graph using vg rna, given multiple GTF files which I've previously prepared.