jonassibbesen / rpvg

Method for inferring path posterior probabilities and abundances from pangenome graph read alignments
MIT License
47 stars 6 forks source link

rpvg

Method for inferring the expression of haplotype-specific transcript paths in a pantranscriptome using RNA-seq reads aligned to a spliced pangenome graph.

For each paired-end RNA-seq read mapped to a vg pangenome graph the probability of it originating from each transcript path is calculated from the mapping quality, mapping score and estimated fragment length distribution. These read-path probabilities are used to calculate posterior probabilities of haplotype combinations and infer transcript abundances.

Installation

Linux binary

A static binary for Linux is available here for the latest release.

Compilation

rpvg requires that CMake (3.10 or higher), protobuf (version 3), HTSlib, Jansson and OpenMP are installed before compilation. Most of these can easily be installed using agt-get on Linux (see Dockerfile) or Homebrew on Mac. To compile rpvg run the following:

  1. git clone --recursive https://github.com/jonassibbesen/rpvg.git
  2. cd rpvg && mkdir build && cd build
  3. cmake ..
  4. make -j <threads>

Compiling rpvg should take 5-10 minutes using 4 threads (-j). rpvg has been successfully built and tested on Linux (CentOS Linux 7 with GCC 8.1.0 and Ubuntu 18.04 with GCC 7.5.0) and Mac (macOS 10.14.6 with Clang 10.0.1 and macOS 13.2.1 with Clang 15.0.7).

Docker container

A Docker container of the latest commit to master is available here. On modern CPUs using the Docker container might be slower than compiling rpvg since it is not taking advantage of newer instructions in order for the container to be more compatible.

Running rpvg

rpvg requires the following five arguments:

rpvg -g graph.xg -p paths.gbwt -a alignments.gamp -o rpvg_results -i <inference-model>

The prefix used for all output files are given using -o. The number of threads can be given using -t.

An example dataset containing a small pantranscriptome and 100,000 read pairs is available under example. To infer the expression of the 36,120 haplotype-specific transcripts in the pantranscriptome using 4 threads use the following command within the example folder:

../bin/rpvg -t 4 -g graph.xg -p pantranscriptome.gbwt -f pantranscriptome.txt.gz -a mpmap_align.gamp -o rpvg --inference-model haplotype-transcripts

This should take less than a minute to run and will create two files:

Paths:

The pantranscriptome paths should be compressed and indexed using the GBWT. For transcriptome analyses a GBWT with transcript paths can be created using the vg rna subcommand in vg. See the Transcriptomic analyses wiki on the vg github for more information on how to use vg rna.

To decrease the computation time of rpvg it is recommended that a r-index of the paths is supplied together with the GBWT index. The vg gbwt subcommand in vg can be used to construct the r-index from a GBWT index (see the VG GBWT Subcommand wiki on the vg github). The name of the r-index should be the same as the GBWT index with an added .ri extension (e.g. paths.gbwt.ri).

Inference models:

rpvg currently contains four different inference models. Each model have been written with a particular path type and corresponding inference problem in mind:

Alignment types:

rpvg supports mapped reads (both primary and multi-mapping alignments) in the vg alignment format (.gam) and the vg mpmap multipath alignment format (.gmap). Using multipath alignment from vg mpmap is recommended.

Note that rpvg assumes that the default scoring parameters were used for the alignment using either vg map or vg mpmap.

Fragment length distribution:

The fragment length distribution parameters are learned by rpvg. However, in order to learn this the maximum expected fragment length is needed. This is calculated from the expected fragment length distribution mean and standard deviation, which can be given using -m and -d, respectively. If these are not given the method will look for the parameters in the alignment file and pick the first values that it finds. The input parameters (-m and -d) are overwritten by the values estimated by rpvg when calculating the read-path probabilities. When the input is single-end reads (-s) the expected mean (-m) and standard deviation (-d) is required as it can not be estimated by rpvg and is needed for the effective path length calculation.

Citing rpvg

Sibbesen, J. A., Eizenga, J. M. et al. Haplotype-aware pantranscriptome analyses using spliced pangenome graphs. Nature Methods 20, 239–247 (2023).

Acknowledgement

We thank the developers of the tools and libraries used by rpvg.