jonassibbesen / vgrna-project-paper

Bash scripts and data used in pantranscriptomic paper
MIT License

Haplotype-aware pantranscriptome analyses using spliced pangenome graphs

This repository contains the scripts used to generate the results presented in "Haplotype-aware pantranscriptome analyses using spliced pangenome graphs", bioRxiv (2021).

For more up-to-date information on how to run the different methods, please see the GitHub pages of the vg toolkit and rpvg. The spliced pangenome graphs and pantranscriptomes (haplotype-specific transcripts) presented in the paper are available to download in the Data section for use in other projects.

This repository is organized in four subdirectories.

  1. The installation_and_demo directory contains installation directions for vg and rpvg. It also includes a short demo of using the tools for transcriptomic inference, with example data included.

  2. The scripts directory contains the scripts used for analysis and plotting in this project. It is further subdivided by the language the scripts are written in. Note that the scripts in the bash subdirectory are not the exact scripts we used. They have been simplified to make them easier for others to use, mainly by removing hard-coded paths and replacing environment-defined variables with variables that can be easily edited.

  3. The originals directory contains the raw, unedited bash scripts, as well as the log files. These files are not particularly user-friendly, as they include many hard-coded paths, but we have included them for transparency and reproducibility. By looking at the scripts and log files you can see exactly how each method was run in the paper. Most of the log files include a short header that specifies the Docker image that was used. The Dockerfiles used to build the Docker images are available in the dockerfiles directory. For the log files without this header, it should be clear from the script itself which version was used.

  4. The dockerfiles directory contains the recommended Dockerfiles for running the scripts in this repository.
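
The simplification described for the bash subdirectory (hard-coded paths and environment-defined variables replaced by editable variables) can be pictured with a minimal sketch like the one below. It is an illustration only, not taken from the repository: every variable name and path here (CPU, GRAPH_PREFIX, READS, OUT_DIR, OUT_PREFIX) is hypothetical, and no vg or rpvg command is shown because the actual invocations are documented in the scripts themselves.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical settings: edit these for your own system.
# None of these names or paths come from the repository.
CPU=4
GRAPH_PREFIX="graphs/example_graph"
READS="reads/example_reads.fq.gz"
OUT_DIR="output"

# Derived values are built from the editable variables above,
# so no path needs to be hard-coded further down in the script.
OUT_PREFIX="${OUT_DIR}/$(basename "${GRAPH_PREFIX}")_mapped"

# Print the resolved configuration before running any tool.
echo "Threads: ${CPU}"
echo "Graph prefix: ${GRAPH_PREFIX}"
echo "Reads: ${READS}"
echo "Output prefix: ${OUT_PREFIX}"
```

Collecting all user-specific values at the top of a script in this way means that adapting it to another machine only requires changing those few assignments.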

Data

Here you can find links to the data used in the paper. This includes both raw data and data constructed as part of the analyses in the paper. The constructed data included here are either not guaranteed to be reproducible (the subsampled transcript annotation and the simulated reads) or deemed potentially useful in other projects (graphs, pantranscriptomes and indexes).

Graphs, pantranscriptomes and indexes

The spliced pangenome graphs, pantranscriptomes and indexes:

Genome

The GRCh38 (primary assembly) reference genome:

Transcripts

The GENCODE v29 (primary assembly) transcript annotation:

The subsampled (80%) GENCODE v29 transcript annotation:

Variants and haplotypes

The 1000 Genomes Project variants and haplotypes lifted to GRCh38:

The IPD-IMGT/HLA gene allele sequences:

Reads

The simulated RNA-seq reads:

The real RNA-seq reads:

The Iso-Seq alignments:

Mapping and expression data

Mapping benchmark tables and haplotype-specific expression estimates: