kaustubhad / fastman

An R package for fast and efficient visualizing of GWAS results using Q-Q and Manhattan plots directly from PLINK output files.
GNU General Public License v3.0
36 stars 8 forks source link
chromosome chromosome-positions manhattan-plot snps

fastman

Description

An R package for fast and efficient visualizing of GWAS results using Q-Q and Manhattan plots directly from PLINK output files.

Installation

If you are using Rstudio you can use the following code to install the package.

devtools::install_github('kaustubhad/fastman',build_vignettes = TRUE)

If you are not using Rstudio, we would recommend that you install the package without building the vignette.

devtools::install_github('kaustubhad/fastman',build_vignettes = FALSE)

Functions:

1. fastman

Description

Creates a Manhattan plot directly from a PLINK assoc output (or any data frame with chromosome, position, and p-value).

Usage

fastman (m, chr = "CHR", bp = "BP", p = "P", snp, chrlabs, speedup=TRUE, logp = TRUE, scattermore = FALSE,
        col="matlab", maxP=14, sortchr=TRUE, bybp=FALSE, chrsubset, bprange, highlight,
        annotateHighlight=FALSE, annotatePval, colAbovePval=FALSE, col2="greys", annotateTop=TRUE,
        annotationWinMb, annotateN, annotationCol, annotationAngle=45, baseline=NULL, suggestiveline,
        genomewideline, cex=0.4, cex.text=0.4, cex.axis=0.6, scattermoresize = c(3000,1800), 
        geneannotate = FALSE, build, sep="|", border=0, xlab, ylab, xlim, ylim, gap.axis=NA, ...)

Parameters:

Value

A Manhattan Plot

2. fastqq

Description

Creates a quick quantile-quantile plot from GWAS outputs. On a typical imputed assoc file of 10 million SNPs, it reduces plotting time from 400s in qqman to 24s.

Usage

fastqq (p1, p2=NULL, colour, logtransform=TRUE, pairwisecompare=TRUE, speedup=TRUE, lambda=TRUE, maxP=14,
        fix_zero=TRUE, cex=0.6, cex.axis=0.9, xlab, ylab, ...)

Parameters:

Value

3. fastman_gg

Description

Creates a Manhattan ggplot object directly from a PLINK assoc output (or any data frame with chromosome, position, and p-value). The user needs to install and load the ggplot2 package to use this function.

Usage

fastman_gg (m, chr = "CHR", bp = "BP", p = "P", snp, chrlabs, speedup=TRUE, logp = TRUE,
        scattermore = FALSE, repel = FALSE, col="matlab", maxP=14, sortchr=TRUE, bybp=FALSE, chrsubset,
        bprange, highlight, annotateHighlight=FALSE, annotatePval, colAbovePval=FALSE, col2="greys",
        annotateTop=TRUE, annotationWinMb, annotateN, annotationCol, annotationAngle=45, baseline=NULL,
        suggestiveline, genomewideline, cex=0.9, cex.text=1.8, cex.axis=0.6,
        scattermoresize = c(3000,1800), geneannotate = FALSE, build, sep="|",xlab, ylab, xlim, ylim, ...)

Parameters:

Value

A ggplot object

Examples

The fastman package includes functions for creating Manhattan plots and Q-Q plots from GWAS results. Let us first try using the package on a regular PLINK assoc output file. This is a GWAS summary statistics output file from a study by the authors (https://doi.org/10.1038/ncomms10815), available from the GWAS Central repository ( https://www.gwascentral.org/study/HGVST2597)

Reading a regular PLINK assoc output dataset

m=read.table("beard.assoc.linear",header=TRUE,stringsAsFactors=FALSE,sep="\t")

This dataset has results for 8,776,423 SNPs on 22 chromosomes. Let us take a look at the data.

str(m)
'data.frame':   8776423 obs. of  9 variables:
 $ CHR  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ SNP  : chr  "rs58108140" "rs180734498" "rs116400033" "rs62637813" ...
 $ BP   : num  10583 13302 51479 52058 52185 ...
 $ A1   : chr  "A" "T" "A" "C" ...
 $ TEST : chr  "ADD" "ADD" "ADD" "ADD" ...
 $ NMISS: num  1551 1911 1552 2348 2421 ...
 $ BETA : num  0.1486 0.07006 0.02572 0.1391 -0.00482 ...
 $ STAT : num  1.793 0.7667 0.3814 2.035 -0.0403 ...
 $ P    : num  0.0731 0.4433 0.703 0.042 0.9678 ...
head(m)
  CHR         SNP    BP A1 TEST NMISS      BETA     STAT       P
1   1  rs58108140 10583  A  ADD  1551  0.148600  1.79300 0.07314
2   1 rs180734498 13302  T  ADD  1911  0.070060  0.76670 0.44330
3   1 rs116400033 51479  A  ADD  1552  0.025720  0.38140 0.70300
4   1  rs62637813 52058  C  ADD  2348  0.139100  2.03500 0.04199
5   1 rs201374420 52185  T  ADD  2421 -0.004824 -0.04033 0.96780
6   1 rs150021059 52238  T  ADD  2290 -0.103600 -0.86640 0.38640
tail(m)
        CHR         SNP       BP A1 TEST NMISS      BETA     STAT      P
8776418  22 rs144549712 51229855  A  ADD  2206  0.029490  0.52490 0.5997
8776419  22  rs62240042 51233300  T  ADD  1536 -0.009488 -0.27510 0.7833
8776420  22 rs200507571 51236013 AT  ADD  1847  0.003518  0.08454 0.9326
8776421  22   rs3896457 51237063  C  ADD  1802 -0.018700 -0.56280 0.5737
8776422  22 rs149733995 51238249  C  ADD  2268  0.064180  0.99660 0.3190
8776423  22 rs181833046 51243297  T  ADD  2250  0.100500  1.06600 0.2866

Creating Manhattan Plots

From the above dataset, lets generate a basic Manhattan plot.

png("md1.png", width=10, height=6, units="in", res=300)
fastman(m)
dev.off()

Before moving into any further details, let us look into the plethora of options that this package provides us in a single plot. We will understand these options more rigorously as we go into the later parts of this exercise.

Let us compare the time of plot generation with qqman. For this purpose, we are going to use a library tictoc which will record the run time.

library(tictoc)
tic(); png("md1.png", width=10, height=6, units="in", res=300); fastman(m); dev.off(); toc();

This will record the run time of fastman.

77.238 sec elapsed

Now, lets run the same code for qqman.

library(qqman)
tic(); png("md1a.png", width=10, height=6, units="in", res=300); manhattan(m); dev.off(); toc();

Lets find out the run time of qqman.

630.995 sec elapsed

As seen above, fastman reduces the run time drastically.

We recommend using CairoPNG() command instead of the regular png() for generating the plots, as it provides high-quality image outputs. It requires the user to install and load the Cairo package beforehand. Let us generate the previous plot again using CairoPNG().

CairoPNG("cmd1.png", width=10, height=6, units="in", res=300)
fastman(m)
dev.off()

We can change some basic graph parameters. Let us reduce the point size (cex=) to 30% and reduce the font size of the axis labels (cex.axis=) to 50%. We can also change the colour palette (col=) and remove the suggestive (suggestiveline=) and genome-wide (genomewideline=) significance lines.

CairoPNG("cmd2.png", width=10, height=6, units="in", res=300)
fastman(m, cex=0.3, cex.axis=0.5, col="rainbow1", suggestiveline=FALSE, genomewideline=FALSE)
dev.off()

In this plot, if we want to show p-values till 1E-10, we can set maxP=10, instead of changing the ylim.

CairoPNG("cmd2a.png", width=10, height=6, units="in", res=300)
fastman(m, cex=0.3, cex.axis=0.5, col="rainbow1", suggestiveline=FALSE, genomewideline=FALSE, maxP=10)
dev.off()

We can now look into the SNPs of a single chromosome.

CairoPNG("cmd3.png", width=10, height=6, units="in", res=300)
fastman(m, chrsubset=3)
dev.off()

Let us say we are interested in highlighting some particular 3876 SNPs in chromosomes 2 and 5. We have the names of the required SNPs in a character vector called snp1.

str(snp1)
chr [1:3876] "rs1317448" "rs66535929" "rs12616526" "rs185377827" "rs28528792" "rs13411576" "rs28473143" ...
CairoPNG("cmd4.png", width=10, height=6, units="in", res=300)
fastman(m, highlight = snp1)
dev.off()

We can also annotate the highlighted SNPs. By default, only the top SNP in every chromosome will be annotated.

CairoPNG("cmd7.png", width=10, height=6, units="in", res=300)
fastman(m, highlight = snp1, annotateHighlight = TRUE)
dev.off()

Let us discuss the annotations in detail. We have seven input parameters with respect to annotations: annotateHighlight, annotatePval, annotateTop, annotationWinMb, annotateN, annotationCol and annotationAngle. We have already discussed about annotateHighlight. Among the rest, annotatePval and annotateN are annotation criteria, while the other four parameters are annotation modifiers.

As stated above, we can annotate our plot in only two ways, either by providing a p-value threshold or by providing the required number of top SNPs. The p-value threshold can be provided using the annotatePval criterion. The user can provide either the p-value or the negative logarithm as an input for this parameter, whichever is convenient. For example, both 1E-5 and 5 are acceptable input values.

CairoPNG("cmd5.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5)
dev.off()

We can see that, among the SNPs that exceed the provided threshold, only the top SNP in every chromosome is annotated. We have observed the same for annotateHighlight as well. We will come to that later, when we discuss the annotation modifiers.

For now, let us try annotating with gene names instead of SNP IDs. This can be accomplished using the geneannotate=TRUE argument. We have to specify gene build for our data as well. We use the same p-value threshold as the previous plot.

CairoPNG("cmd5g.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, geneannotate = TRUE, build=37)
dev.off()

Let us explore the other annotation criterion annotateN. Lets say we want to annotate the top 20 SNPs of the data.

CairoPNG("cmd8.png", width=10, height=6, units="in", res=300)
fastman(m, annotateN=20)
dev.off()

Lets go back to the observation we made while using annotateHighlight and annotatePval. By default, only the top SNP in every chromosome is annotated. We can override this default rule using the annotation modifier annotateTop. This modifier has a default value TRUE unless mentioned otherwise by the user. Lets consider the example where we are annotating the SNPs beyond a specified p-value threshold. By setting annotateTop=FALSE, we can annotate all the SNPs beyond our threshold instead of only the top SNP per chromosome.

CairoPNG("cmd6.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotateTop=FALSE)
dev.off()

Now we move on to the next annotation modifier annotationWinMb. Instead of annotating the top SNP in every chromosome, if we want to annotate the top SNP within our chosen megabase window, then this is the modifier for us. Now, lets specify a 5-megabase window.

CairoPNG("cmd10.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationWinMb=5)
dev.off()

Among the remaining two modifiers, annotationCol sets the annotation colour and annotationAngle specifies the angle of annotation. The default colour is "gray50" and the default angle is 45 degrees. Let us illustrate with an example.

CairoPNG("cmd11.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationCol="red", annotationAngle= 60)
dev.off()

As stated already, the default annotation colour in our function is "gray50". However, if the user chooses the colour palette "greys" for the plot, then the default annotation colour changes to "green4".

CairoPNG("cmd18.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, col = "greys")
dev.off()

We have the option to provide a column indicating the annotation colour intended for each SNP as a part of the input data frame. In that case, we just need to specify the name of the column in the annotationCol argument. Let us show this in the example below. In our example, we have added a separate column named ANNCOL to the input data frame with our intended annotation colours for each SNP. Using the column we have annotated the first 4 chromosomes with green, chromosomes 5-10 with red and the rest with blue as can be seen in the plot below.

CairoPNG("cmd17.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationCol="ANNCOL")
dev.off()

In case we do not want to annotate any particular set of SNPs we can just provide NA values in the annotationCol column for those SNPs. In our example, we have another column ANNCOL2 in our input dataframe that has NA values corresponding to chromosomes 3,4,5 and 6. The rest of the values of ANNCOL are the same as that of the column ANNCOL. We have provided NA values in this column because we do not want to annotate SNPs in chromosomes 3,4,5 and 6. Let us see what the plot looks like when we run the command using this column for annoationCol argument.

CairoPNG("cmd19.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, annotationCol="ANNCOL2")
dev.off()

We can choose to colour only the points above our specified p-value threshold. The rest of the plot will become grey by default.

CairoPNG("cmd12.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, colAbovePval=TRUE)
dev.off()

In the above plot, we can change the colour of the region below our specified threshold as well.

CairoPNG("cmd13.png", width=10, height=6, units="in", res=300)
fastman(m, annotatePval=1E-5, colAbovePval=TRUE, col2="greens")
dev.off()

As stated previously, our package also supports plotting of results from genomes of non-model organisms (often with hundreds of contigs or many scaffolds). This is an incremental feature, as the qqman package does not support direct plotting of results from non-model organisms (The chromosome column for the input dataset needs to be numeric for qqman).

m=read.table("plink.assoc.fisher",header=TRUE,stringsAsFactors=FALSE,sep= '')

This dataset has results for 57,712 SNPs on 760 chromosomes. Let us look at the data.

str(m)
'data.frame':   57712 obs. of  9 variables:
 $ CHR: chr  "NC_015762.1" "NC_015762.1" "NC_015762.1" "NC_015762.1" ...
 $ SNP: chr  "1139:88:-" "1139:45:-" "1139:28:-" "2518:86:+" ...
 $ BP : int  32821 32864 32881 58664 68491 83326 83332 92777 92937 92945 ...
 $ A1 : chr  "T" "C" "G" "T" ...
 $ F_A: num  0 0.5217 0 0.0222 0.5217 ...
 $ F_U: num  0.0454 0.375 0.0454 0.0465 0.3636 ...
 $ A2 : chr  "C" "T" "A" "C" ...
 $ P  : num  0.0551 0.0528 0.0551 0.436 0.0366 ...
 $ OR : num  0 1.818 0 0.466 1.909 ...
head(m)
          CHR       SNP    BP A1     F_A     F_U A2       P     OR
1 NC_015762.1 1139:88:- 32821  T 0.00000 0.04545  C 0.05513 0.0000
2 NC_015762.1 1139:45:- 32864  C 0.52170 0.37500  T 0.05276 1.8180
3 NC_015762.1 1139:28:- 32881  G 0.00000 0.04545  A 0.05513 0.0000
4 NC_015762.1 2518:86:+ 58664  T 0.02222 0.04651  C 0.43600 0.4659
5 NC_015762.1 3702:96:- 68491  C 0.52170 0.36360  T 0.03664 1.9090
6 NC_015762.1 4414:33:- 83326  A 0.50000 0.31820  G 0.01556 2.1430
tail(m)
                 CHR           SNP  BP A1     F_A     F_U A2      P     OR
57707 NW_003570988.1 18013690:37:- 547  G 0.06818 0.08333  C 0.7782 0.8049
57708 NW_003570988.1 18013681:25:- 550  A 0.00000 0.01190  G 0.4941 0.0000
57709 NW_003570988.1 18013687:31:- 550  A 0.01282 0.01389  G 1.0000 0.9221
57710 NW_003570988.1  18013670:8:- 556  T 0.12820 0.19230  A 0.3830 0.6176
57711 NW_003570988.1  18013666:3:- 557  A 0.05128 0.08571  G 0.5178 0.5766
57712 NW_003570988.1 18013689:26:- 557  A 0.02632 0.01282  G 0.6176 2.0810

The dataset contains P-values (column P) and two scores (F_A and F_U). From the dataset, lets generate a basic Manhattan plot for P-values.

CairoPNG("cmd14.png", width=10, height=6, units="in", res=300)
fastman(m)
dev.off()

Now let us plot the Manhattan plots for one of the scores. We must note that the scores do not need log transformations and so we must specify that while running.

CairoPNG("cmd15.png", width=10, height=6, units="in", res=300)
fastman(m, p = "F_A", logp = FALSE)
dev.off()

The colour schemes available for this package are provided below.