guigolab / sQTLseekeR

R package to detect splicing QTLs (sQTLs)
http://big.crg.cat/computational_biology_of_rna_processing/sqtlseeker
4 stars 3 forks source link

sQTLseekeR

sQTLseekeR is a R package to detect splicing QTLs (sQTLs), which are variants associated with change in the splicing pattern of a gene. Here, splicing patterns are modeled by the relative expression of the transcripts of a gene.

For more information about the method and performance see article : Monlong, J. et al. Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat. Commun. 5:4698 doi: 10.1038/ncomms5698 (2014).

Installation

First some Bioconductor packages are required. They can be installed with :

source("http://bioconductor.org/biocLite.R")
biocLite(c("vegan", "Rsamtools", "qvalue"))

Then, install devtools package.

install.packages("devtools")

Finally, install either the latest development version:

devtools::install_github("guigolab/sQTLseekeR")

Or a specific release:

devtools::install_github("guigolab/sQTLseekeR", ref="2.1")

All these installation commands are also written in install.R script (source it to install). These commands work for R 3.2 or higher. For R 3.1, some packages need their older version (see installOnR.3.1.R).

R 3.1 or higher is required.

Analysis steps

For an example analysis with the dummy data in Data, see the R-markdown overview.

The first step is to prepare the input data. sQTLseekeR requires three inputs:

When all input files are correctly formatted sQTLseekeR prepares the data through functions prepare.trans.exp and index.genotype.

Once the input files are ready, sqtl.seeker function will compute the P-values for each pair of gene/SNP testing the association between the genotype and transcript relative expression. Here is a quick description of the parameters that would most likely be tweaked:

The output is a data.frame with for each gene/SNP:

Finally, function sqtls is used to retrieve significant associations. The user can manually define a false discovery rate(FDR) or perform further filtering afterwards. Of note, there is a separate FDR threshold for svQTL removal (if svQTLs were computed), which is usually preferred to stay low (e.g. around 0.01).

An example of an analysis can be found in folder scripts.

Running sQTLseekeR on computing clusters

sQTLseekeR can be used on a cluster using package BatchJobs. An example of an analysis using BatchJobs can be found in folder scripts.

BatchJobs is a potent package but basic functions are enough in our situation. Here is a quick practical summary of BatchJobs commands used in the script:

Another important point about BatchJobs is its configuration for the computing cluster. An example of the configuration files can be found in the scripts folder:

FAQ

sqtl.seeker outputs NULL. What am I doing wrong ?

An output of NULL means there were no gene/SNPs to analyze. It is likely due to inconsistent input files. To debug check that :

If the input files are fine, an output of NULL might be caused by inappropriate transcript expression (e.g. genes with low expression) or genotypes (e.g. many missing values).

What are these svQTLs ?

svQTLs are SNPs associated with splicing variability of a gene. Here the relative transcript expression wight be globally similar between genotype groups but much more variable in specific one. Although the biological interpretation is not straightforward, we use them to flag potential false sQTLs. Indeed, the test for differential transcript relative expression assumes similar variability between genotype groups. Hence if a conservative approach to find sQTLs would be to retrieve significant sQTLs that are not svQTLs.

What about trans-sQTLs ?

By default, sQTLseekeR tests association between a gene and SNPs within or close by (defined by genic.window parameter). Testing association between all SNPs and all genes is not feasible : it would require too much computation and we don't see a good multiple-testing correction for this design that would fit our permuted/simulated P-values.

However, the user can specify exactly which regions should be tested for each gene. Instead of using the gene location for gene.loc parameter, the user can feed the locations of the regions to test. Similarly to the gene location, chr, start, end and geneId columns define the region and the gene to test. Several regions can be defined for a same gene. Of note, in this design, genic.window=0 could be used to ensure that no flanking regions are added to the regions to test.

In the paper you associated a splicing event from the tr.first/tr.second. How ?

We can either use AStalavista or a build-in function in the package (classify.events). For more infos, have a look at the R-markdown tutorial on this.

What is this "maximum difference" (md) column in the output ?

In addition to the F score and P-value, the maximum difference(MD) in relative expression between genotype groups is reported. This is to be used as a measure of the size of the effect. For example, if 'md' is 0.2, it means that there is one transcript whose relative expression shifted by 20% between two genotype groups. Hence the higher the more dramatic the change in transcript usage.

For the same gene I get a several sQTLs but the transcripts (tr.first/tr.second) are different. Why ?

The primary result is the association between SNP(s) and a gene. The two transcripts IDs in the output are here to provide some quick information about what changes the most but it's far from the complete picture and should be taken with a grain of salt.

If the same gene has several sQTLs with different tr.first/tr.second it could be the same changes affecting more than two transcripts. Especially if the SNPs are close together or in LD, it is likely the same association but because of technical noise the most affected transcripts are not always the same two.

In practice, the first thing is to look at the transcript relative expression in the different genotype groups. There you will see if it's mostly a pair of transcript affected or more. Then, it's good to look at the structure of the transcripts. Some transcripts might be very similar. Or maybe they differ only in parts not affected by the SNP. In this situation, it's not surprising for more than two transcripts to be affected.

sqtl.plot function was implemented for this purpose. There is an example at the end of the R-markdown Overview.