
# ComplexFinder

Finds complexes from Blue-Native and SEC fractionation data analyzed by Liquid Chromatography coupled to Mass Spectrometry (LC-MS/MS). In principle, it works with any separation technique resulting in co-elution signal profiles. To avoid license issues and the accumulation of old database files, please first download the database of choice (see Download Protein-Protein Interaction Data below).

## Next Feature (Testing) Implementations (05.2021)

We list here several features that we will implement in the next versions. If checked, they are already available but might still be experimental.

- [ ] Plotting a selected feature's profile: `ComplexFinder(analysisName="").plotFeature()`
- [ ] Plotting a selected feature's distance metrics compared to the complete population (all features). Due to the scaling of features prior to training of the predictor, the boxplot should display scaled and raw features: `ComplexFinder(analysisName="").plotFeatureDistanceMetrics()`
- [ ] Plotting features of a complex found by ComplexFinder (clusterLabels == ID): `ComplexFinder(analysisName="").plotComplexProfileByClusterLabel()`
- [ ] Plotting features of a known complex in the database (corresponding to the ComplexID column in the database, see below): `ComplexFinder(analysisName="").plotComplexProfileInDatabaseByID()`
- [ ] Test a deep learning implementation
- [ ] AlignedUMAP for stable complexes

## Workflow

For thousands of features (peptides/proteins), a signal was measured over different fractions. The applied technique separates protein complexes from each other. This package aims at different things:

* signal processing including filtering and smoothing (a generic smoothing sketch follows the workflow figure below)
* alignment of profiles over fractions if more than one replicate is analyzed
* identification of protein-protein interactions
* identification of protein clusters using dimensional reduction and density-based clustering

![Signal processing and protein-protein interaction prediction](/img/workflow.png)
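
As a minimal illustration of the filtering/smoothing step, a co-elution profile can be smoothed with a Savitzky-Golay filter. This is a generic sketch, not ComplexFinder's internal code; the window length and polynomial order are arbitrary example values.

```python
import numpy as np
from scipy.signal import savgol_filter

profile = np.random.rand(60)  # hypothetical signal of one feature over 60 fractions
smoothed = savgol_filter(profile, window_length=5, polyorder=2)  # smooth before peak fitting
```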

Importantly, ComplexFinder can also be utilized to analyze the data without prior knowledge of protein connectivity (e.g., a positive database). In this case, there are two options:

* using raw profile signal intensities
* distance between profile pairs 

which are then subjected to dimensional reduction and HDBSCAN clustering. Importantly, when using the raw profile intensities, the derived UMAP representation is aligned using the top N correlated features between samples (e.g., the same protein across all samples).
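
Such an alignment can be computed, for instance, with umap-learn's AlignedUMAP by passing relation dictionaries that map the indices of shared (e.g., top-correlated) features between consecutive samples. A minimal sketch, not ComplexFinder's internal code; all data and parameter values are illustrative.

```python
import numpy as np
import umap

# hypothetical data: the same 100 features measured over 40 fractions in two samples
sample1 = np.random.rand(100, 40)
sample2 = np.random.rand(100, 40)

# relation dict: feature i in sample1 corresponds to feature i in sample2
relations = [{i: i for i in range(100)}]

mapper = umap.AlignedUMAP(n_neighbors=15, min_dist=0.1).fit(
    [sample1, sample2], relations=relations
)
embeddings = mapper.embeddings_  # one aligned 2D embedding per sample
```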

As a next step, we want to identify clusters of proteins with predicted interactions. To this end, we use the interaction probability matrix obtained by the random forest classifier. We then compute the UMAP embedding and apply HDBSCAN clustering. Again, we use the CORUM database to quantify the clustering result using the v-measure. Both techniques, UMAP and HDBSCAN, are run over a parameter grid to cycle through different options and find the best clustering, as sketched below.
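
A minimal sketch of such a grid search (assumed parameter values and helper names, not ComplexFinder's API):

```python
import itertools
import umap
import hdbscan
from sklearn.metrics import v_measure_score

def best_clustering(X, corum_labels, mask):
    """X: interaction-probability matrix; corum_labels/mask: known complex
    memberships (e.g. from CORUM), used only for scoring via the v-measure."""
    best_labels, best_score = None, -1.0
    for n_neighbors, min_cluster_size in itertools.product([5, 15, 30], [3, 5, 10]):
        emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=0.0).fit_transform(X)
        labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(emb)
        score = v_measure_score(corum_labels[mask], labels[mask])
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels, best_score
```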

![Quantification](/img/quantDetails.png)

When using the raw signal intensities or the distance metrics, the data are subjected to dimensional reduction (UMAP) and clustering (HDBSCAN). Noteworthy, other clustering algorithms are available and can be utilized; HDBSCAN is, however, the default.

## Depositing Data analyzed using ComplexFinder

If you analyzed your data using ComplexFinder, we highly recommend uploading the data alongside the raw files deposited at mass spectrometry repositories such as PRIDE / ProteomeXChange or similar. The params.json file, which is written to the results folder, is of particular interest in order to reproduce the data analysis. Of note, if you upload the complete result folder, other users will be able to analyze these data using the plotting utilities of ComplexFinder.

## Installation

Download the zip file containing the source code from GitHub and navigate to the folder in a terminal/command-line tool.

On Mac / Linux:

```bash
# create a virtual environment
python3 -m venv env

# activate it
source env/bin/activate

# install packages from the requirements file
pip install -r requirements.txt
```

For Windows users:

```bash
# create a virtual environment
py -m venv env

# activate it
.\env\Scripts\activate

# install packages from the requirements file
pip3 install -r requirements.txt
```


## Usage Example

Upon download and extraction of the package, you can find example data in the example-data folder.
To run the analysis, you can enter:

```python
import pandas as pd
from .src.main import ComplexFinder

X = pd.read_table("./example-data/SILAC_01.txt", sep="\t") # load the tab-delimited txt file
ComplexFinder(analysisName="ExampleRun_01").run(X)
```

You can also pass a folder path to run. This will result in the analysis of each txt file in the folder.

```python
import os
from .src.main import ComplexFinder

folderPath = os.path.join(".", "<my folder>")
ComplexFinder().run(folderPath)
```

Additionally, you can pass a list of datasets. However, we recommend copying the required datasets into a separate folder. Results are saved by default in the results folder within the ComplexFinder folder.

## Download Protein-Protein Interaction Data

To run the ComplexFinder pipeline, you have to provide a positive set of protein-protein interactions. Below we provide examples and specific settings for frequently used databases of protein-protein interactions.

### CORUM

Download the dataset from the website link and save it to the reference-data folder in ComplexFinder. If not present, add a column with the header ComplexID providing a unique ID for each complex. The CORUM database contains complexes for mammalian systems; therefore, we need to pass a filter dictionary as shown below (databaseFilter). You can pass any column of the database as the key, and the target values for which we want to filter as a list. The parameter databaseEntrySplitString gives the split string by which the UniProt identifiers (or any other feature IDs matching your input data) of a complex are separated.

```python
ComplexFinder(
    ...
    databaseFilter = {'Organism': ["Human"]},
    databaseIDColumn = "subunits(UniProt IDs)",
    databaseEntrySplitString = ";",
    databaseFileName = "CORUM.txt" # depends on how you save the CORUM database
    ).run(...)
```

### Complex Portal

Go to the Complex Portal website and download the database for the utilized organism (save it as HUMAN_COMPLEX_PORTAL.txt).

```python
ComplexFinder(
    databaseFileName = "HUMAN_COMPLEX_PORTAL.txt", # depends on how you save the Complex Portal database
    databaseIDColumn = "Expanded participant list",
    databaseEntrySplitString = "|",
    databaseFilter = {}
    ).run(...)
```

### hu.MAP 2.0

The hu.MAP 2.0 database has recently been published and is available at this link.

```python
ComplexFinder(
    databaseFileName = "humap2.txt", # depends on how you save the hu.MAP 2.0 database
    databaseIDColumn = "subunits(UniProt IDs)", # requires renaming
    databaseEntrySplitString = ";",
    databaseFilter = {"Confidence" : [1,2,3,4]} # example: filter for specific complex confidence levels
    ).run(...)
```

## Grouping of Replicates

The grouping parameter in ComplexFinder is used to group files into replicates. Assume that we have 4 files, 2 KO and 2 WT, which we put together in the folder "./data". The grouping will be used to calculate pairwise statistics between fitted peaks, as well as the complex prediction and protein-protein interaction prediction summaries.

```python
pathToFiles = os.path.join(".","data")
ComplexFinder(
    grouping = {
        "WT" : ["D3_WT_01.txt","D3_WT_02.txt"],
        "KO" : ["D3_KO_01.txt","D3_KO_02.txt"]
    }
    ).run(pathToFiles)
```

## ComplexFinder Output

Find below an overview of the extensive output of ComplexFinder.

### Using ComplexFinder without protein connectivity

* Using raw signal profiles
* Using distance metrics
* Using just the apex distance

Please respect the respective licenses for the different databases.

## Parameters

Find below parameters to set. The default is given in brackets after the parameter name.

The sklearn library is used for predictions. Please check its comprehensive documentation for more details and for the construction of a grid search dict.

## Database Quality

For the prediction of protein-protein interactions, the quality and size of the database are of importance.

As a quick test, we performed predictions using 2000 randomly selected features of dataset D1 and switched the class labels (interactor vs. non-interactor) of the database to train the classifier. We observed that the number of predicted protein-protein interactions was strongly reduced after a label switch of more than 5% of the features. We used the CORUM human database for interactions. This highlights that the complexes in the database need to describe the complexome in the measured dataset accurately. The gold standard is therefore the usage of a complex database that was experimentally validated, which is sadly often not possible due to the workload.
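
A minimal sketch of such a label-switch test (a hypothetical helper, not part of ComplexFinder): flip the interactor/non-interactor class for a given fraction of the training pairs before retraining the classifier.

```python
import numpy as np

def switch_labels(y, fraction=0.05, seed=42):
    """Return a copy of a 0/1 class-label vector with `fraction` of entries flipped."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    idx = rng.choice(y.size, size=int(fraction * y.size), replace=False)
    y[idx] = 1 - y[idx]  # flip the selected class labels
    return y
```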

## Using SILAC - TMT Peak-Centric Quantification

in preparation

ComplexFinder allows peak-centric quantification using different quantification strategies.

### TMT

TMT allows for multiplexing in complexome experiments by labeling peptides with different tags that can be distinguished by different reporter ions using LC-MS/MS. The results that are required (for example, from a MaxQuant analysis) are:

For each peak in the samples, ComplexFinder will extract the TMT intensities and will aggregate the fractions covered by the FWHM using a given function. By default, the sum is used, but this can be changed to the mean as well (TMTPoolMethod = "sum"). The data in the quantification files (feature IDs x TMT intensities for each fraction) are not transformed at all. Therefore, if you use the mean, performing a log2 transformation before averaging might be advisable. You can do this by setting the parameter transformQuantDataBy = "log2". The available options are ["log2","ln",None], None being the default, which will use the provided values.
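
For example, to average log2-transformed TMT intensities over the FWHM instead of summing raw values:

```python
ComplexFinder(
    TMTPoolMethod = "mean",        # aggregate TMT intensities over the FWHM using the mean
    transformQuantDataBy = "log2"  # log2-transform quantification data before averaging
    ).run(...)
```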

### SILAC-TMT

A combination of SILAC and TMT allows either for extended multiplexing (2 SILAC channels x 10-plex TMT = 20 samples) or for following an incorporation kinetic. To this end, cells are grown on SILAC media (for example, heavy) for several passages, leading to fully labelled cells. Then, the media is exchanged to light media and the cells start incorporating light amino acids into newly synthesized proteins. This enables the determination of incorporation rates / turnover rates. When combining TMT and SILAC, the light-channel peptides + TMT represent the SILAC incorporation and the heavy channel shows the breakdown of proteins. In proliferating cells, the increase in biomass (cell growth) has to be considered.

Please note that at the moment only two SILAC channels are supported.

Schematic representation of SILAC-TMT quantification.

The general strategy in ComplexFinder for peak-centric quantification is shown in the figure above. For detected peaks, the FWHM is determined and the TMT intensities are summed over the respective fractions. Of note, for very small peaks this might be a single fraction, which is prevented by default. It can be allowed by setting the parameter allowSingleFractionQuant to True.

```python
ComplexFinder(allowSingleFractionQuant = True).run(...)
```

Figure: Quantification strategy using TMT or SILAC-TMT experimental designs, and how to design the quantFiles parameter. In SILAC-TMT experimental designs, two quantification result files are required, indexed by HEAVY and LIGHT.

We recommend putting the signal profiles in a folder (in the figure: myCoolAnalysis) and adding the files. Create a new folder within myCoolAnalysis called 'q', in which you add the quantification data. If you put the quantification txt files in the same folder as the ones for analysis, ComplexFinder will treat them as signal profiles and will try to fit model peaks to them.
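
A hypothetical sketch of this setup, assuming the quantFiles parameter takes the two result files indexed by HEAVY and LIGHT as described above (the exact layout and the file names are illustrative, not a confirmed API):

```python
import os

ComplexFinder(
    quantFiles = {
        "HEAVY" : os.path.join(".","myCoolAnalysis","q","heavy_TMT.txt"),  # hypothetical file name
        "LIGHT" : os.path.join(".","myCoolAnalysis","q","light_TMT.txt")   # hypothetical file name
    }
    ).run(os.path.join(".","myCoolAnalysis"))
```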

To calculate the fit parameters for a first-order kinetic, we have to provide more information; otherwise, the output will contain the TMT intensities for each peak, indicated by heavy or light. ComplexFinder expects raw TMT intensities (not log2).
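
For context, a first-order incorporation kinetic describes the newly synthesized (light) fraction as 1 - exp(-k * t) for turnover rate k. A generic sketch of fitting such a kinetic (not ComplexFinder code; the time points and measured fractions are made up):

```python
import numpy as np
from scipy.optimize import curve_fit

def incorporation(t, k):
    # first-order label incorporation: fraction of newly synthesized protein
    return 1.0 - np.exp(-k * t)

t = np.array([0.0, 4.0, 8.0, 24.0])                 # hypothetical time points (h)
light_fraction = np.array([0.0, 0.18, 0.33, 0.71])  # hypothetical measured light fractions

(k_fit,), _ = curve_fit(incorporation, t, light_fraction, p0=[0.05])
half_life = np.log(2) / k_fit  # protein half-life in hours
```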

## Frequently Asked Questions

The required format for the database is a tab-delimited txt file. The file must contain the columns ComplexID and ComplexName. Additionally, the pipeline requires a column with the feature IDs (the same IDs as in the provided co-elution/migration data) of a complex, divided by a ";". The easiest approach might be to check the default parameters, which can be used to load the CORUM database. If you want to use ComplexFinder without a database, check out the FAQ (How to run the pipeline without a database?) below.
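
A minimal sketch of a database file in the expected format (the UniProt IDs and complex names are made up; "subunits(UniProt IDs)" mirrors the default CORUM ID column from above):

```python
import pandas as pd

db = pd.DataFrame({
    "ComplexID"             : [1, 2],
    "ComplexName"           : ["Complex A", "Complex B"],
    "subunits(UniProt IDs)" : ["P12345;Q67890", "O11111;P22222;Q33333"],
})
db.to_csv("reference-data/MyDatabase.txt", sep="\t", index=False)  # tab-delimited txt
```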

The built-in peak models are from the lmfit package.

```python
# in the Signal module
# import from lmfit
from lmfit import models

# line 208: gets the model based on the string
model = getattr(models, basis_func['type'])(prefix=prefix)
```

Therefore, you can provide any string that matches a model name in the lmfit package. Please note that only peak parameters and constraints are implemented and tested for the Gaussian, Lorentzian, and Skewed Gaussian models. So if your fit does not work, you may want to check the following function of the Signal.py module.

```python
def _addParams(self,modelParams,prefix,peakIdx,i):
        """
        Adds parameters to the modelParams object

        Parameters
        ----------

        modelParams : 
            modelParams object. Returned by model.make_params() (lmfit package) 
            Documentation: https://lmfit.github.io/lmfit-py/model.html

        prefix : str
            Prefix for the model (e.g. peak), defaults to f'm{i}_'

        peakIdx : int
            Array index at which the peak was detected in the signal array self.Y 

        i : int 
            Index of the detected model

        Returns
        -------
        None

        """
        self._addParam(modelParams,
                        name=prefix+'amplitude',
                        max = self.Y[peakIdx[i]] * 9,
                        value = self.Y[peakIdx[i]] * 2,
                        min = self.Y[peakIdx[i]] * 1.2)

        self._addParam(modelParams,
                        name=prefix+'sigma', 
                        value = 0.255,
                        min = 0.01, 
                        max = 2.5)

        self._addParam(modelParams,
                        name=prefix+'center', 
                        value = peakIdx[i],
                        min = peakIdx[i] - 0.2, 
                        max = peakIdx[i] + 0.2)

        ## enter other model params here; you may have to change the min and max
        ## for the other parameters as well to get a nice fit. 
```

Please note that you also have to alter the functions _getHeight and _getFWHM for your peak models. You can check the equations here.
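
For reference, for lmfit's area-parameterized Gaussian model these two quantities follow standard relations. The sketch below shows the equations only; the function names _getHeight and _getFWHM belong to Signal.py, but these implementations are illustrative.

```python
import numpy as np

def gaussian_fwhm(sigma):
    # FWHM of a Gaussian: 2 * sqrt(2 * ln 2) * sigma, i.e. about 2.3548 * sigma
    return 2.0 * np.sqrt(2.0 * np.log(2.0)) * sigma

def gaussian_height(amplitude, sigma):
    # peak height of lmfit's area-parameterized Gaussian: amplitude / (sigma * sqrt(2 * pi))
    return amplitude / (sigma * np.sqrt(2.0 * np.pi))
```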

## Future Directions

In the future, we would like to implement the following features:

## Requirements

The following Python packages are required to run the scripts (from the requirements.txt file):

## Citation / Publication

If you found the usage of this pipeline helpful, please consider citing the following paper. We highly appreciate any acknowledgement.

Info on publication status: Paper submitted.

## Contact

Please use the GitHub Issues functionality of this repository to report bugs. Nevertheless, you can contact us via e-mail if you have any questions or feature requests for the pipeline.