Open RenzoTale88 opened 4 years ago
Hi Andrea,
There are unfortunately not yet a set of recommendations for mapping RNA-seq data using vg
, but I am currently working on it together with @jeizenga .
Generally, I have had pretty good results with mapping simulated RNA-seq data using both map
and mpmap
with default parameters when transcripts are available as paths in the graph. mpmap
using the min distance clusterer (--min-dist-cluster
) seem to currently give the best results when the paths are not available. Note, that none of the mappers are able to discover novel splice-junctions (they do not have a splice-site model) and known junctions are therefore needed in the graph to get good results. You can add these from a gtf/gff3 and/or bed file using vg rna
.
Best,
Jonas
Hi Jonas,
thank you for your answer. Regardin the vg rna
command: should I apply it to the original graph, prior mapping of the reads? Or should I do that while augmenting the graph downstream?
Also, my graph is derived from the alignments of 5 different assemblies through cactus. 3 out of 5 have annotations. Is it possible to include all of them in the graph, annotating the different paths?
Thanks you very much, Andrea
Are we talking about prior mapping of genomic or RNA-seq reads? vg rna
should be run before mapping of RNA-seq reads. However, it also might be a good idea to run vg rna
before mapping genomic data as the node ids can change when adding splice-junctions. This will ensure that mapped genomic and RNA-seq reads are compatible with the same graph.
There shouldn't be a problem using all 3 annotations as long as assembly (or reference) paths are available for each of the annotations.
Sorry for my late reply. So, I got reads both for DNA and RNA-seq short reads to map to a cactus graph genome. For one of the genomes used, there is also the annotation available on NCBI, that I would like to include in the graph. What I though is to proceed as follow:
vg mod -X 32 mygraph.raw.vg | vg mod -s - > mygraph.vg
vg rna -e -n annot.gff mygraph.vg > mygraph.annot.vg
vg index -x mygraph.annot.xg mygraph.annot.vg
vg map -m mapping mygraph.annot.vg
vg prune -u -a -k 32 -m mapping mygraph.annot.vg > mygraph.annot.pruned.vg
vg index -g mygraph.annot.gcsa mygraph.annot.pruned.vg -p -Z 1536 -t 4 -f mapping mygraph.annot.pruned.vg
Is this approach correct? Thank you very much, Andrea
The pipeline looks correct, except for the line: vg map -m mapping mygraph.annot.vg
. I assume it is a typo and was not supposed to be vg map
?
@jonassibbesen yes, that's a typo. The actual command for mapping is the following:
vg map -R $sample -N $sample -S 0 -u 1 -t $NCORES_MAP -x all.xg -g all.gcsa lib1.R1.fq.gz lib1.R2.fq.gz > ./ALIGN/$sample/lib1.tmp.gam
For mapping RNA-seq data I would recommend using mpmap
with a distance index (-d
). mpmap
have recently been improved for RNA-seq data and generally we see better mapping results using this algorithm for RNA-seq mapping. You can use the -S
option if you want it to write the same mapping output format (gam) as map
. You can generate the distance index using vg index -j
.
@jonassibbesen thank you again for the help. Sorry for the silly question, but the index you are talking about needs to be generated for the reads or the graph?
The distance index is for the graph. In order to create it you first need to create a snarl index:
vg snarls -T graph.xg > snarls.pb
Afterwards you can create the snarl-based distance index using:
vg index -x graph.xg -s snarls.pb -j distance_index.dist
Please let me know if you have any other question regarding vg and rna-seq data.
Hello, I've got some short read RNA-seq data that I'd like to process against a graph genome. How do you recommend to proceed? Is there a wiki guide to follow? Thank you in advance, Andrea