iq: an R package for protein quantification
This R package provides an implementation of the MaxLFQ algorithm by Cox et al. (2014) in a comprehensive pipeline for DIA-MS (Pham et al. 2020). It also offers options for protein quantification using the N most intense fragment ions, using all fragment ions, and the Tukey's median polish algorithm. In general, the tool can be used to integrate multiple proportional observations into a single quantitative value.
Citation
Pham TV, Henneman AA, Jimenez CR. iq: an R package to estimate relative protein abundances from ion quantification in DIA-MS-based proteomics, Bioinformatics 2020 Apr 15;36(8):2611-2613.
https://doi.org/10.1093/bioinformatics/btz961
Installation
The package is hosted on CRAN. It is best to install from within R.
install.packages("iq")
Usage
The package can be loaded in the usual manner
library("iq")
See a recent example for processing a Spectronaut output.
Or an older vignette for processing output from Spectronaut, OpenSWATH and MaxQuant with some visualization.
To process a DIA-NN output
process_long_format("report.tsv",
output_filename = "report-pg-global.tsv",
annotation_col = c("Protein.Names", "Genes"),
filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"))
Use the paramter peptide_extractor
if you want to get the number of peptides per protein.
process_long_format("report.tsv",
output_filename = "report-pg-global.tsv",
annotation_col = c("Protein.Names", "Genes"),
filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"),
peptide_extractor = function(x) gsub("[0-9].*$", "", x))
To process a Spectronaut output
Use this export schema iq.rs to make a long report, for example "Spectronaut_Report.xls".
process_long_format("Spectronaut_Report.xls",
output_filename = "iq-MaxLFQ.tsv",
sample_id = "R.FileName",
primary_id = "PG.ProteinGroups",
secondary_id = c("EG.Library", "FG.Id", "FG.Charge", "F.FrgIon", "F.Charge", "F.FrgLossType"),
intensity_col = "F.PeakArea",
annotation_col = c("PG.Genes", "PG.ProteinNames", "PG.FastaFiles"),
filter_string_equal = c("F.ExcludedFromQuantification" = "False"),
filter_double_less = c("PG.Qvalue" = "0.01", "EG.Qvalue" = "0.01"),
log2_intensity_cutoff = 0)
Examples
Processing a DIA-NN output
#### Processing a DIA-NN output
We processed the Bruderer15 dataset (Bruderer et al., _MCP_ 2015) using [DIA-NN](https://github.com/vdemichev/DiaNN) 1.8.1 (Demichev et al., _Nature Methods_ 2020) with the switch `--report-lib-info` turned on. The switch gives an extra column `Fragment.Info` to test iq's new feature. Usually, the user does not need to turn it on.
Download [DIA-NN output report.zip](https://github.com/tvpham/iq/releases/download/v1.1/report.zip) and unzip the file to obtain the long format ouput `report.tsv`.
A protein group report for downstream analysis can be obtained with a single statement in R
```
library(iq)
process_long_format("report.tsv", output_filename = "report-pg.tsv")
```
As an aside, usually the user wants to include additional annotation columns in the final report. This is possible by specifying the `annotation_col` parameter.
```
process_long_format("report.tsv", output_filename = "report-pg-annotated.tsv",
annotation_col = c("Protein.Names", "Genes"))
```
Also, by default, we filter the report at 1% run-sepcific FDR at both precursor level and protein group level. To filter global FDR (available with a recent version of DIA-NN), use the parameter `filter_double_less`.
```
process_long_format("report.tsv", output_filename = "report-pg-global.tsv",
annotation_col = c("Protein.Names", "Genes"),
filter_double_less = c("Global.Q.Value" = "0.01", "Global.PG.Q.Value" = "0.01"))
```
We continue with loading the result file `report-pg.tsv` to check the quantitative values of the spike-in proteins. For convenience, we will use the spike-in protein names instead of the protein group names.
```
pg <- read.delim("report-pg.tsv")
rownames(pg) <- pg$Protein.Group
spike_ins <- c("P02754", "P80025", "P00921", "P00366", "P02662", # mix 1
"P61823", "P02789", "P12799", "P02676", "P02672", # mix 2
"P02666", "P68082") # mix 3
# change rownames to spike_ins names
for (i in 1: length(spike_ins)) {
rownames(pg)[grep(spike_ins[i], rownames(pg))] <- spike_ins[i]
}
```
##### Mix 1
Here is the ground truth for proteins in Mix 1 (P02754, P80025, P00921, P00366, P02662). Each sample was measured in triplicate.
|Sample|relative|fmol/ul|
|------|--------|-------|
|1|1|1.5|
|2|1.1|1.65|
|3|1.21|1.815|
|4|1.33|1.995|
|5|10|15|
|6|11.01|16.515|
|7|12.11|18.165|
|8|13.33|19.995|
```
matplot(t(pg[spike_ins[1:5], 2:25]), type = 'b', col = 1:5 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Mix 1", xlab = "8 samples x 3 replicates")
legend("topleft", legend = spike_ins[1:5], col = 1:5, pch=19, bty = "n")
```
![Mix 1](images/spikein-mix1.svg)
We see that the triplicates are consistent and the spikeins in first 4 samples (12 runs) are about 10 fold down as expected (log2(10) ~ 3.3).
##### Mix 2
Here is the ground truth of proteins in Mix 2 (P61823, P02789, P12799, P02676, P02672).
|Sample|relative|fmol/ul|
|------|--------|-------|
|1|200|100|
|2|125.99|62.995|
|3|79.37|39.685|
|4|50|25|
|5|4|2|
|6|2.52|1.26|
|7|1.59|0.795|
|8|1|0.5|
```
matplot(t(pg[spike_ins[6:10], 2:25]), type = 'b', col = 1:5 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Mix 2", xlab = "8 samples x 3 replicates")
legend("topright", legend = spike_ins[6:10], col = 1:5, pch=19, bty = "n")
```
![Mix 2](images/spikein-mix2.svg)
In this mix, proteins in the first 4 samples are higher. Fold change between sample 1 and sample 5 is 50, which is approximately 5.6 in log2 space.
##### Mix 3
Here is the ground truth of proteins in Mix 3 (P02666, P68082).
|Sample|relative|fmol/ul|
|------|--------|-------|
|1|1|0.05|
|2|4|0.2|
|3|16|0.8|
|4|64|3.2|
|5|256|12.8|
|6|1024|51.2|
|7|4096|204.8|
|8|16384|819.2|
```
matplot(t(pg[spike_ins[11:12], 2:25]), type = 'b', col = 1:6 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Mix 3", xlab = "8 samples x 3 replicates")
legend("topleft", legend = spike_ins[11:12], col = 1:2, pch=19, bty = "n")
```
![Mix 3](images/spikein-mix3.svg)
In this mix, the protein concentration increases 4 fold from sample 1 to sample 8. Thus, we expect a staircase shape with an increase of 2 in log2 space.
##### Random 12 proteins
```
set.seed(0)
matplot(t(pg[sample(1:nrow(pg), 12), 2:25]), type = 'b', col = 1:6 , pch=19, lwd = 3,
ylab="log2 MaxLFQ", main = "Random 12 proteins", xlab = "8 samples x 3 replicates")
```
![Random 12](images/spikein-random.svg)
This is a set of 12 random proteins in the background. High abundance proteins show a consistent pattern while lower abundance proteins exhibits more variation and missing data, which is expected from mass spectrometry-based proteomics data.
Spectronaut output versus DIA-NN output
#### Spectronaut output versus DIA-NN output
The iq package has a dataset of 12 spikein protein from the Bruderer 15 dataset processed by the Spectronaut software. We will turn this dataset into something like DIA-NN output to better understand the parameters. The following will produce a file `spikeins-diann.tsv` in your workspace.
```
library(iq)
data("spikeins")
precursor <- paste(spikeins$EG.ModifiedSequence, spikeins$FG.Charge, sep = "_")
d <- NULL
for (p in unique(precursor)) {
b <- spikeins[precursor == p,]
for (s in unique(as.character(b$R.Condition))) {
ss <- b[b$R.Condition == s,]
d <- rbind(d, list(File.Name = s,
Protein.Group = as.character(ss$PG.ProteinGroups[1]),
Genes = as.character(ss$PG.Genes[1]),
Precursor.Id = p,
Fragment.Quant.Corrected = paste(ss$F.PeakArea, collapse = ";"),
Fragment.Info = paste(paste(ss$F.FrgIon, ss$F.Charge, sep="_"),
collapse = ";")))
}
}
write.table(d, "spikeins-diann.tsv", sep = "\t", row.names = FALSE, quote = FALSE)
```
Examine the file `spikeins-diann.tsv` to see the difference with Spectronaut output, for example by typing `head(spikeins)` in your R console.
##### Standard iq processing
This is an example code in standard iq processing. The result `result$estimate` contains quantitative values of the 12 proteins. Note that the data has been normalized. So in the preprocessing, we set median normalization to `FALSE`.
```
norm_data <- preprocess(spikeins, median_normalization = FALSE, pdf_out = NULL)
protein_list <- create_protein_list(norm_data)
result <- create_protein_table(protein_list)
```
##### Process DIA-NN output using pure R
The standard R processing for DIA-NN output is as follows. Because the default values were set to Spectronaut output, you will need to specify the columns values. Note that `intensity_col` and `intensity_col_id` contain `;` separated values.
```
raw <- read.delim("spikeins-diann.tsv")
norm_data2 <- preprocess(raw,
sample_id = "File.Name",
primary_id = "Protein.Group",
secondary_id = "Precursor.Id",
intensity_col = "Fragment.Quant.Corrected",
median_normalization = FALSE,
intensity_col_sep = ";",
intensity_col_id = "Fragment.Info")
protein_list2 <- create_protein_list(norm_data2)
result2 <- create_protein_table(protein_list2)
```
The result of processing DIA-NN output should be equal to that of processing Spectronaut data (up to machine numerical accuracy).
```
max(result$estimate - result2$estimate)
min(result$estimate - result2$estimate)
identical(rownames(result$estimate), rownames(result2$estimate))
identical(colnames(result$estimate), colnames(result2$estimate))
```
You can plot a particular protein to see all its observed fragments.
```
plot_protein(protein_list2$P00366, main = "Protein P00366")
```
![P00366](images/P00366.svg)
##### Fast processing
You can also use the fast implementation of MaxLFQ. Again, the parameters need to be specified because the data has been filtered and normalized.
```
iq_dat <- fast_read("spikeins-diann.tsv",
sample_id = "File.Name",
primary_id = "Protein.Group",
secondary_id = "Precursor.Id",
intensity_col = "Fragment.Quant.Corrected",
intensity_col_sep = ";",
intensity_col_id = "Fragment.Info",
filter_string_equal = NULL,
filter_double_less = NULL,
annotation_col = NULL)
iq_norm_data <- fast_preprocess(iq_dat$quant_table,
median_normalization = FALSE)
result_fastest <- fast_MaxLFQ(iq_norm_data,
row_names = iq_dat$protein[, 1],
col_names = iq_dat$sample)
max(result$estimate - result_fastest$estimate)
min(result$estimate - result_fastest$estimate)
identical(rownames(result$estimate), rownames(result_fastest$estimate))
identical(colnames(result$estimate), colnames(result_fastest$estimate))
```
##### Oneliner
Finally, the data can be processed in one statement. Note that the parameter `normalization` is not a logical (`TRUE` or `FALSE`), but a string `"none"` or `"median"` (default). This is to support different normalization methods in the future.
```
process_long_format("spikeins-diann.tsv", output_filename = "spikeins-diann-out.tsv",
filter_double_less = NULL,
normalization = "none")
```
The output `spikeins-diann-out.tsv` should be equal to the existing R processing.
```
out <- read.delim("spikeins-diann-out.tsv")
rownames(out) <- out$Protein.Group
out <- out[rownames(result$estimate),]
max(result$estimate - out[, colnames(result$estimate)])
min(result$estimate - out[, colnames(result$estimate)])
```