Closed inti closed 8 years ago
@inti YES!!!! vg should in theory be able to all of these things. However, the intron as indel interpretation issue is somewhat complicated. I need to think about that. (perhaps you could elaborate?)
You can start testing by building a graph with a few transcripts for a gene. However, there is a blocking issue: #39.
Ideally, we should align assembled trancripts. This would get us the "indel" corresponding to the intron splicing events. However, it may be more feasible to align just the exons from each transcript, using the transcript name as the alignment name. Then, we'd have to modify the graph to include the introns that are excised. I'm not sure at present which way will work better. Do you have test data to work with?
Hi @ekg , So what I had in mind was: align to the reference to identify transcript structures, then use vg to refine the alignments. In that way the introns would be known. One option would be to have a reference + known variation + known transcripts and build the variation graph and annotate the transcripts paths on the graph or at least annotate the exon-exon boundaries and the transcripts they belong to. In that way when aligning RNA-seq data one could exploit those paths explicitly. I can see your suggestion would be doing this as well. Several aligners, like GSNAP, would take a exon-intro map and used that for the spliced alignments. Would it be easier to use this strategy? Either use this information during mapping or simply take the exon-intro map and edit the graph file to add the corresponding paths?
I think a test could be to takes some mRNA-seq data of the human genome, potentially for one of the 1KG samples, and testing and check whether we can find the known variation. Makes sense to you? I do not have that data at hand but for the sake of trying I can find it and send you small bam files.
Several aligners, like GSNAP, would take a exon-intro map and used that for the spliced alignments. Would it be easier to use this strategy? Either use this information during mapping or simply take the exon-intro map and edit the graph file to add the corresponding paths?
I really like this approach. It's not completely general and will require more coding to enable (ideally, we just align assembled transcripts to build up the RNA+DNA variation graph), but it will work cleanly and the input is well-curated.
To get both variation and transcripts, all that needs to be done is to modify the 1000G+GRCh37 graph to include the intronic deletions that we see in the RNA-seq data. Then we'd just need to clean up the graph, ensuring partial ordering of ids and such (vg mod -c
), and re-index to enable alignment.
Code will need to be written to do the inclusion. It would look like this.
for each transcript:
make an alignment to represent the transcript
for each intron:
find the node(s) at the particular path positions of the start and end of the intron
(uses vg::Index::get_path)
append a vg::Mapping to cut the nodes and introduce an edge for the intron skip
to the alignment
emit the alignment
use vg mod to include the alignments in the graph
So the basic idea is that we'd use the exon/intron map to build up alignments (vg::Alignment) whose paths are the transcript alignment and whose names are the transcript names. Then we can use an existing tool vg mod -i
to add the alignments to the graph.
vg::Index::get_path
is here: https://github.com/ekg/vg/blob/master/index.hpp#L188 and https://github.com/ekg/vg/blob/master/index.cpp#L1371-L1389
What kind of format are the exon-intro maps in?
Regarding the format of the exon-intro maps. it depends on the software (as always?). GNSAP does it like this, from the README http://research-pub.gene.com/gmap/src/README
On the other hand, to perform known intron-level splicing, you will need
to create a file with the following format:
>NM_004448.ERBB2.intron1 17:35110090..35116769
>NM_004448.ERBB2.intron2 17:35116920..35118100
>NM_004449.ERG.intron1 21:38955452..38878739
>NM_004449.ERG.intron2 21:38878638..38869541
Again, coordinates are 1-based, and specify the exon coordinates
surrounding the intron, with the first coordinate being from the donor
exon and the second one being from the acceptor exon.
There are several ways to help you generate these files. First, if
you have a GTF file, you can use the included programs gtf_splicesites
and gtf_introns like this:
cat <gtf file> | gtf_splicesites > foo.splicesites
cat <gtf file> | gtf_introns > foo.introns
Second, if you retrieve an alignment tracks from UCSC, like this:
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/refGene.txt.gz
if you are aligning to genome hg18, or
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
if you are aligning to genome hg19, you can process this track using
the included program psl_splicesites or psl_introns, like this:
gunzip -c refGene.txt.gz | psl_splicesites -s 1 > foo.splicesites
gunzip -c refGene.txt.gz | psl_introns -s 1 > foo.introns
Note that alignment tracks in UCSC sometimes have an extra column on
the left. The "-s" flag allows you to indicate how many columns
should be skipped.
the gtf_splicesites and gtf_introns come bundled with GSNAP and they come in handy to go from GTF/GFF3 annotations to into-exon maps.
This approach could be flexible since people can either rely on known annotations or first map the RNA-seq data to the genome identify likely variation and transcript structures and then re-map/refine the variation/bam files using variation graphs.
Is it possible to modify a variation graph with a VCF file? would it be easier to take a GTF/GFF file and convert it into a mock VCF file to code the introns as just another variant, perhaps with a special label like NM_004448.ERBB2.intron1 (as example above)?
Is it possible to modify a variation graph with a VCF file? would it be easier to take a GTF/GFF file and convert it into a mock VCF file to code the introns as just another variant, perhaps with a special label like NM_004448.ERBB2.intron1 (as example above)?
It would be possible, but again there needs to be some coding to make this happen. The simplest way is to convert the variants in the file into reads of the novel haplotype represented by the variant, then map these into the existing graph.
One thing that's lacking in vg now is naming mechanisms. Read and sequence names need to be able to be used to annotate paths in the graph. You could actually script out the above solution using the vg command line API if this were the case. So, keep an eye on #39.
Have you had a chance to test this out further?
No yet. I got busy with something else. I am expecting to come back to this within 2-3 weeks. helps?
On Oct 7, 2015, at 12:41, Erik Garrison notifications@github.com wrote:
Have you had a chance to test this out further?
— Reply to this email directly or view it on GitHub https://github.com/ekg/vg/issues/35#issuecomment-146237976.
With banded/split read mapping and the graph editing stuff stabilizing, I'd say it's the perfect time to come back to this. --- These enable the alignment of long contigs (such as transcript sequences) to the reference, and then the editing of the reference with them. Also, I think I've got GCSA indexing working for the whole human genome in <200G of RAM, which is critical for making a usable high-performance mapper. Fingers crossed.
Ping me if you come back to this and have any questions.
We've done a couple proof of concept things with RNA graphs, and I think we're definitely ready for someone to do RNA-seq with vg as an actual project. I'm going to close this.
Hi,
I would like to follow-up on this interesting thread. While I understand that pan-transcriptome and pan-genome indices can be build and used right away, I was wondering how splice alignment in spirit of HiSAT2 could be performed. Any feedback is highly appreciated.
If the splicing is represented in the graph then we should have no problem aligning directly against them. Alternative splicings look the same as the reference from the perspective of the aligner.
To detect novel splicing we could adapt the banded aligner or adjust the scoring function that is used during clustering in single ended read alignment.
I am happy to work on this in the near future. I expect there may be some problems because we have not attempted this before, but in principle the algorithms in vg should have no problem aligning reads to RNAseq graphs.
Further, we can align reads to RNAseq assemblies. This would be interesting in that the reference would then represent the full data set down to some noise threshold set in the assembler. Do others think that would be a useful thing to explore?
On Wed, Mar 22, 2017, 9:58 AM CDieterich notifications@github.com wrote:
Hi,
I would like to follow-up on this interesting thread. While I understand that pan-transcriptome and pan-genome indices can be build and used right away, I was wondering how splice alignment in spirit of HiSAT2 could be performed. Any feedback is highly appreciated.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/35#issuecomment-288350371, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EcpHf7vvzsT7jEcJyplzEvxpqxfRks5roPDWgaJpZM4EVO3k .
Ok, first part sounds great (i.e. incorporating known splice sites).
The second part is even more interesting (i.e. finding novel splice or even back.splice sites)
I am very happy to see your positive reply. Could you pinpoint me towards possible options for the alignment module ?
You can start testing by building a graph with a few transcripts for a gene.
Hi @ekg, I tried to build a graph for a gene with different transcripts. But what I got was separate linear reference for each transcript. Could you help me with that?
vg construct -r gene.fa > tmp/x.vg
What I've got:
Looking forward to your reply. Best
Hi @btrspg, since this issue was closed we've developed methods for constructing variation graphs with splice edges in vg rna
. I would recommend checking that out to start. The vg mpmap
mapping tool also has some specialized features for mapping RNA-seq data, which is discussed here.
@jeizenga Appreciate.
Hi, @jeizenga, I still have some questions about vg rna
. I used a fake fasta and a fake gtf file to construct the graph, but I don's know if I have already successful done the construction of graph.
I manipulated like this:
vg construct -r fake.fasta > fake.vg
vg rna -n fake.gtf --use-embeded-paths fake.vg > fake_rna.vg
Then I used sequenceTubeMap to see is there any new path in the graph but what I got was a linear reference.
If I recall correctly, sequenceTubeMap is designed primarily for genomic variation graphs with phased haplotypes. There are some pitfalls that might be catching you up on a splice graph. I would recommend vg view
and dot
instead. It's not as pretty, but it works well for debugging and troubleshooting. Documentation is here:
https://github.com/vgteam/vg/wiki/Visualization
Hi, @btrspg, transcript paths are not added to the graph by default in vg rna
. You can add them by using the option -r
for reference transcript paths. You can also use -a
for adding haplotype-specific transcript paths if you have embedded non-reference paths in your graph or if you supplied at set of haplotypes using -l
.
@btrspg If you run vg construct
with FASTA of sequences, and no VCF, it just creates a graph with one linear node or run of nodes for each sequence. vg construct
is meant to build a graph from a FASTA reference and a VCF (perhaps GRCh38's FASTA and the 1000 Genomes VCF, for example); it doesn't align the sequences in the FASTA together and squish the shared parts down to make a graph.
If you have multiple homologous sequences to make a graph from, you want https://github.com/ekg/seqwish or maybe vg msga
. Both of those methods will align the sequences against each other.
Thank you all @jeizenga @jonassibbesen @adamnovak . In deed, what I need is to use vg to do some RNA-seq work. I have known the reason why I could not see the graph path was that I need to build the index for fake_rna.vg
with a specific parameter -L
(include alt paths in xg)
vg construct -r fake.fasta > fake.vg
vg rna -n fake.gtf --use-embeded-paths fake.vg > fake_rna.vg
vg index -L -T -G fake_rna.gbwt -g fake_rna.gcsa -x fake_rna.xg
Then I could see the different path from different transcripts in sequenceTubeMap.
I used vg view
and dot
as @jeizenga mentioned to check the graph, so I knew the vg rna
did the right thing. Then I figured out the index influenced the graph visualization.
Thank you all again.
Hi, This is great work and I look forwards to try it! I had some questions and I thought it would be better to ask rather than not to.
thanks in advance,
Inti