jeffsocal / tidyproteomics

An S3 data object and framework for common quantitative proteomic analyses
https://jeffsocal.github.io/tidyproteomics/
MIT License
36 stars 5 forks source link

tidyproteomics

An R package for the tidy-ing, post processing and analysis of quantitative proteomic data.

Proteomics analysis software, available either through a paid subscription or as an open-source tool, fail to output data in a well conceived tidy format. A majority of these tools generate output formats that have either mixed wide- and long-format data structures, columns headers with messy names and added symbols, and often confusing variable names. This leads researchers to create one-off scripts for cleaning and importing data from various formats, often creating an environment of unmaintained, bespoke code. This package attempts to solve that problem by creating a flexible import tool to unify multiple formats and create an new tidy R object for proteomics analysis.

This package supports at a high level:

Links

Publication

Jones, J., MacKrell, E.J., Wang, TY. et al. Tidyproteomics: an open-source R package and data object for quantitative proteomics post analysis and visualization. BMC Bioinformatics 24, 239 (2023)

R Package

RShiny App

Data Support

Importing is currently implemented for a few platforms and assume peptide level FDR (at the user's desired level) has already been accounted for. See vignette("importing"). Importing is flexible enough to accept other data platforms in flat files (.csv, .tsv, and .xlsx) with a custom configuration.

Platform peptides proteins notes
ProteomeDiscoverer *.xlsx peptides export *.xlsx proteins export requires layout configuration
MaxQuant evidence.txt proteinGroups.txt
FragPipe combined_peptide.tsv combined_protein.tsv
Skyline *.csv MSstats peptide report requires MSstats install
DIA-NN *.tsv peptide report
mzTab *.mzTab (v1.0.0) *.mzTab (v1.0.0) does not track MBR

Ease of Use

This package supports the same syntactic sugar utilized in the tidy-verse functions like filter, and introduces the %like% operator, see vignette("subsetting") . These operations can extend to all aspects of the data set, including sample names, protein IDs, annotations and accountings like match_between_runs and num_peptides.

operator description example
== equals sample == 'wt' , match_between_runs == FALSE
!= does not equal biological_function != 'DNA metabolism'
\<, > less, greater than num_unique_peptides >= 2
%like% contains description %like% 'ribosome'
! %like% does not contain !description %like% 'ribosome'
/ ratio (expression) experiment / control

Expression analysis also utilizes this type of syntax when referencing samples for analysis. For example data %>% expression(knockdown/control) would know to run the differential expression of the sample ko with respect to the sample wt such that positive log2 difference would be up-expressed in ko and a negative log2 differences would be down-expressed in ko.

Nomenclature

The tidyprotoemics package uses a conserved naming convention that facilitates a "contract" with funtions that operate on the data, making it easier to maintain and modify the code. This naming convention ensures that functions are generic to their inputs, being able to operate on a variety of data without having to explicitly name each column and variable which may be named different between data sets. All of the effort of standardizing the variable names in the tidyproteomics data object is done upfront when importing the data. See vignette("importing").

This package divides data into tables similar to those in SQL, making it easy to organize and analyze data. This table structure offers a convenient way to explore relationships between different variables and to filter data according to specific criteria.

Experimental Table

Variable Reference
sample_id A checksum id on the individual data imported, allows for an unambiguous identifier usefull in differentiating samples of the same name
import_file The file referenced at import, allows for multiple files to be merged
sample_file The individual MS file, or for TMT the channel, that contains data for a single observation of a given protein
sample The name of the MS data collected for a given class, collection, of a specimen or a cell culture, etc.
replicate A tidyproteomics derived integer assignment to a collection of same-named samples

[Example]{.underline}

library(tidyproteomics)

hela_proteins$experiments

# A tibble: 6 × 5
  sample_id import_file                sample_file sample    replicate
  <chr>     <chr>                      <chr>       <chr>     <chr>    
1 e9b20ea7  p97KD_HCT116_proteins.xlsx f1          control   1        
2 ef79cc4c  p97KD_HCT116_proteins.xlsx f4          control   2        
3 eebba67b  p97KD_HCT116_proteins.xlsx f5          control   3        
4 ebf4b0fe  p97KD_HCT116_proteins.xlsx f2          knockdown 1        
5 ea36dac9  p97KD_HCT116_proteins.xlsx f3          knockdown 2        
6 ecfd1822  p97KD_HCT116_proteins.xlsx f6          knockdown 3   

Quantitation Table

Variable Reference
sample_id A checksum id on the individual data imported, allows for an unambiguous identifier useful in differentiating samples of the same name
sample The name of the MS data collected for a given class, collection of sampling of tissue, a culture etc.
replicate A tidyproteomics derived integer assignment to a collection of same-named samples
identifier The identifier of the "thing" being measured.
protein In the case of protein-level data, this is simply protein and is usually populated with UniProt accession numbers, names, or other protein identifiers.
protein, peptide, modification In the case of peptide-level data, this is a multiple of protein (accession number), peptide (amino acid sequence) and modification (post-translational modification).
abundance_raw The imported quantitative measure for each protein or peptide. The tidyproteomics data object stores abundance values for the raw data and each normalization. This allows for the direct comparison of normalization methods against the raw data, and for one set of abundance values to be used for subsequent analyses.

[Example]{.underline}

library(tidyproteomics)

hela_proteins$quantitation

# A tibble: 42,330 × 5
   sample_id sample    replicate protein abundance_raw
   <chr>     <chr>     <chr>     <chr>           <dbl>
 1 e9b20ea7  control   1         Q15149    1011259992.
 2 ef79cc4c  control   2         Q15149    1093277593.
 3 eebba67b  control   3         Q15149     980809516.
 4 ebf4b0fe  knockdown 1         Q15149    1410445367.
 5 ea36dac9  knockdown 2         Q15149    1072305561.
 6 ecfd1822  knockdown 3         Q15149    1486561518.
# … with 42,324 more rows

Annotation Table

Variable Reference
identifier The identifier of the "thing" being measured.
term The name of the collection of annotations, such as gene_name or description.
annotation The descriptive annotation.

[Example]{.underline}

library(tidyproteomics)

print(hela_proteins$annotations)

# A tibble: 63,495 × 3
   protein term               annotation                                                                                                          
   <chr>   <chr>              <chr>                                                                                                               
 1 Q15149  description        Plectin OS=Homo sapiens OX=9606 GN=PLEC PE=1 SV=3                                                                   
 2 Q15149  gene_id_entrez     5339                                                                                                                
 3 Q15149  gene_id_ensemble   ENSG00000178209                                                                                                     
 4 Q15149  gene_name          PLEC                                                                                                                
 5 Q15149  biological_process cell differentiation;cell growth;cellular homeostasis;coagulation;defense response;metabolic process
# … with 63,490 more rows

Installation

To install, open R and type:

install.packages("devtools")
devtools::install_github("jeffsocal/tidyproteomics")

You will also need the Bioconductor packages limma, qvalue, and fgsea, to install these type:

install.packages("BiocManager")
BiocManager::install(c("limma","qvalue","fgsea"))

NOTE: There are several other packages required that will be prompted and automatically downloaded from CRAN when installing. Depending on your current system some packages require the installation of OS level libraries for advanced math and string manipulation.

Get Started

If you want to test the features and capabilities of this project and get a better understanding of how it works, you can experiment with the two datasets that are included within the package. These datasets have been specially selected to provide you with the necessary working examples that will help you to get started with the project.

The first dataset is a protein-level dataset, and the second one is a small peptide-level dataset. Both datasets are easily accessible after loading the package. You can use these datasets to practice and explore the different features and functionalities of the package. They are also used throughout the vignettes and code documentation to provide you with clear and concise examples.

Protein level data.

library(tidyproteomics)

hela_protiens

── Quantitative Proteomics Data Object ──

Origin          ProteomeDiscoverer 
                proteins (10.76 MB) 
Composition     6 files 
                2 samples (control, knockdown) 
Quantitation    7055 proteins 
                4 log10 dynamic range 
                28.8% missing values 
Accounting      (4) match_between_runs num_peptides num_unique_peptides num_psms 
Annotations     (9) description gene_id_entrez gene_id_ensemble gene_name biological_process
                cellular_component molecular_function wiki_pathway
                reactome_pathway 

Peptide level data.

library(tidyproteomics)

hela_peptides
── Quantitative Proteomics Data Object ──

Origin          ProteomeDiscoverer 
                peptides (188.87 kB) 
Composition     6 files 
                2 samples (control, knockdown) 
Quantitation    8 proteins 
                3.3 log10 dynamic range 
                21.3% missing values 
Accounting      (3) modifications match_between_runs num_psms 

Import Your Data

Its simple to get started. Make a new project, drop your raw data in a folder labeled data. For more information see vignette("workflow-simple")

library(tidyproteomics)

data <- "./data/some_ProteomeDiscoverer_data.xlsx" %>%
  import("ProteomeDiscoverer", "proteins")

data %>% summary("samples")