vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.12k stars 194 forks source link

RNA-seq data processing #2621

Open RenzoTale88 opened 4 years ago

RenzoTale88 commented 4 years ago

Hello, I've got some short read RNA-seq data that I'd like to process against a graph genome. How do you recommend to proceed? Is there a wiki guide to follow? Thank you in advance, Andrea

jonassibbesen commented 4 years ago

Hi Andrea,

There are unfortunately not yet a set of recommendations for mapping RNA-seq data using vg, but I am currently working on it together with @jeizenga .

Generally, I have had pretty good results with mapping simulated RNA-seq data using both map and mpmap with default parameters when transcripts are available as paths in the graph. mpmap using the min distance clusterer (--min-dist-cluster) seem to currently give the best results when the paths are not available. Note, that none of the mappers are able to discover novel splice-junctions (they do not have a splice-site model) and known junctions are therefore needed in the graph to get good results. You can add these from a gtf/gff3 and/or bed file using vg rna.

Best,

Jonas

RenzoTale88 commented 4 years ago

Hi Jonas, thank you for your answer. Regardin the vg rna command: should I apply it to the original graph, prior mapping of the reads? Or should I do that while augmenting the graph downstream?

Also, my graph is derived from the alignments of 5 different assemblies through cactus. 3 out of 5 have annotations. Is it possible to include all of them in the graph, annotating the different paths?

Thanks you very much, Andrea

jonassibbesen commented 4 years ago

Are we talking about prior mapping of genomic or RNA-seq reads? vg rna should be run before mapping of RNA-seq reads. However, it also might be a good idea to run vg rna before mapping genomic data as the node ids can change when adding splice-junctions. This will ensure that mapped genomic and RNA-seq reads are compatible with the same graph.

There shouldn't be a problem using all 3 annotations as long as assembly (or reference) paths are available for each of the annotations.

RenzoTale88 commented 4 years ago

Sorry for my late reply. So, I got reads both for DNA and RNA-seq short reads to map to a cactus graph genome. For one of the genomes used, there is also the annotation available on NCBI, that I would like to include in the graph. What I though is to proceed as follow:

vg mod -X 32 mygraph.raw.vg |  vg mod -s - > mygraph.vg
vg rna -e -n annot.gff mygraph.vg > mygraph.annot.vg
vg index -x mygraph.annot.xg mygraph.annot.vg
vg map -m mapping mygraph.annot.vg
vg prune -u -a -k 32 -m mapping mygraph.annot.vg > mygraph.annot.pruned.vg
vg index -g mygraph.annot.gcsa mygraph.annot.pruned.vg -p -Z 1536 -t 4 -f mapping mygraph.annot.pruned.vg

Is this approach correct? Thank you very much, Andrea

jonassibbesen commented 4 years ago

The pipeline looks correct, except for the line: vg map -m mapping mygraph.annot.vg. I assume it is a typo and was not supposed to be vg map?

RenzoTale88 commented 4 years ago

@jonassibbesen yes, that's a typo. The actual command for mapping is the following:

vg map -R $sample -N $sample -S 0 -u 1 -t $NCORES_MAP -x all.xg -g all.gcsa lib1.R1.fq.gz lib1.R2.fq.gz > ./ALIGN/$sample/lib1.tmp.gam
jonassibbesen commented 4 years ago

For mapping RNA-seq data I would recommend using mpmap with a distance index (-d). mpmap have recently been improved for RNA-seq data and generally we see better mapping results using this algorithm for RNA-seq mapping. You can use the -S option if you want it to write the same mapping output format (gam) as map. You can generate the distance index using vg index -j.

RenzoTale88 commented 4 years ago

@jonassibbesen thank you again for the help. Sorry for the silly question, but the index you are talking about needs to be generated for the reads or the graph?

jonassibbesen commented 4 years ago

The distance index is for the graph. In order to create it you first need to create a snarl index:

vg snarls -T graph.xg > snarls.pb

Afterwards you can create the snarl-based distance index using:

vg index -x graph.xg -s snarls.pb -j distance_index.dist

Please let me know if you have any other question regarding vg and rna-seq data.