Finds complexes from Blue-Native and SEC fractionation analyzed by Liquid Chromatography coupled to Mass Spectrometry. In principle, it works with any separation technique resulting in co-elution signal profiles. To avoid license issues and the accumulation of old database files, please first download the database of choice (see below, Download Protein-Protein Interaction Data).
We list here several features that we will implement in the next versions. If checked, they are already available but might still be experimental.
- [ ] Test a DeepLearning Implementation
- [ ] AlignedUMAP for stable complexes
## Workflow
For thousands of features (peptides/proteins), a signal was measured over different fractions. The applied technique separates protein clusters from each other. This package aims to do several things:
* signal processing, including filtering and smoothing (see the sketch below the figure).
* if more than one replicate is analysed, the profiles over fractions will be aligned.
* identification of protein-protein interactions.
* identification of protein clusters using dimensional reduction and density-based clustering.
![Signal processing and protein-protein interaction prediction](/img/workflow.png)
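As a toy illustration of the smoothing step listed above, the sketch below smooths a single co-elution profile with a Savitzky-Golay filter via scipy. This is an assumption for illustration, not necessarily the filter ComplexFinder applies internally:

```python
import numpy as np
from scipy.signal import savgol_filter

# hypothetical co-elution profile of one protein over 40 fractions
profile = np.random.rand(40)

# Savitzky-Golay smoothing: window of 5 fractions, 2nd-order polynomial
smoothed = savgol_filter(profile, window_length=5, polyorder=2)
```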
Importantly, ComplexFinder can also be utilized to analyze the data without prior knowledge of protein connectivity (e.g. a positive database). In this case, there are two options:
* using the raw profile signal intensities
* using distances between profile pairs
which are then subjected to dimensional reduction and HDBSCAN clustering. Importantly, when using the raw profile intensities, the derived UMAP representation is aligned using the top N correlated features between samples (e.g. the same protein across all samples).
As a next step, we want to identify clusters of proteins with predicted interactions. To this end, we use the interaction probability matrix obtained by the random forest classifier. We then compute the UMAP embedding and apply HDBSCAN clustering. Again, we use the CORUM database to quantify the clustering result using the v-measure. Both techniques, UMAP and HDBSCAN, are run over a parameter grid to cycle through different options and find the best clustering.
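A minimal sketch of this embed-cluster-score loop, assuming the umap-learn and hdbscan packages; the function name, grid values, and reference labels below are illustrative, not ComplexFinder internals:

```python
from itertools import product

import hdbscan
import umap
from sklearn.metrics import v_measure_score

def bestClustering(X, referenceLabels, nNeighborsGrid=(5, 15), minClusterSizeGrid=(3, 5)):
    """Cycle through a small parameter grid, keep the clustering with the highest v-measure."""
    bestLabels, bestScore = None, -1.0
    for nNeighbors, minClusterSize in product(nNeighborsGrid, minClusterSizeGrid):
        # dimensional reduction on the probability / distance matrix X
        embedding = umap.UMAP(n_neighbors=nNeighbors, min_dist=0.1).fit_transform(X)
        # density-based clustering on the embedding
        clusterLabels = hdbscan.HDBSCAN(min_cluster_size=minClusterSize).fit_predict(embedding)
        # score against known complex memberships (e.g. CORUM)
        score = v_measure_score(referenceLabels, clusterLabels)
        if score > bestScore:
            bestLabels, bestScore = clusterLabels, score
    return bestLabels, bestScore
```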
![Quantification](/img/quantDetails.png)
When using the raw signal intensities or the distance metrics, those data are subjected to dimensional reduction (UMAP) and clustering (HDBSCAN). Noteworthy, other clustering algorithms are available and can be utilized; HDBSCAN is, however, the default.
## Depositing Data analyzed using ComplexFinder
If you analyzed your data using ComplexFinder, we highly recommend uploading the data alongside the raw-file deposition at mass spectrometry repositories such as PRIDE / ProteomeXchange or similar. The params.json file, which is written to the results folder, is of particular interest in order to reproduce the data analysis. Of note, if you upload the complete result folder, other users will be able to analyze these data using the plotting utilities of ComplexFinder.
## Installation
Download the zip file containing the source code from GitHub.
Navigate to the folder in terminal/command line tool.
On Mac / Linux:
```bash
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
For Windows users:
```bash
py -m venv env
.\env\Scripts\activate
pip3 install -r requirements.txt
```
## Usage Example
After downloading and extracting the package, you can find example data in the example-data folder. To run the analysis, you can enter:
```python
import pandas as pd
from src.main import ComplexFinder

X = pd.read_table("./example-data/SILAC_01.txt", sep = "\t") # loading tab-delimited txt file
ComplexFinder(analysisName = "ExampleRun_01").run(X)
```
You can also pass a folder path to run. This will result in the analysis of each txt file in the folder.
```python
import os
from src.main import ComplexFinder

folderPath = os.path.join(".", "<my folder>")
ComplexFinder().run(folderPath)
```
Additionally, you can pass a list of dataset paths to run, as shown below. However, we recommend copying the required datasets into a separate folder. Results are saved by default in the results folder within the ComplexFinder folder.
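For example (the file names are placeholders):

```python
ComplexFinder().run(["./data/D1_01.txt", "./data/D1_02.txt"])
```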
To run the ComplexFinder pipeline, you have to provide a positive set of protein-protein interactions. Below, we provide examples and specific settings for frequently used databases of protein-protein interactions.
Download the dataset from the website link and save it to the reference-data folder in ComplexFinder. If not present, add a column with the header ComplexID providing a unique ID for each complex. The CORUM database contains complexes for mammalian systems; therefore, we need to pass a filter dictionary as shown below (databaseFilter). You can pass any column of the database as the key, and the target values for which we want to filter as a list. The parameter databaseEntrySplitString gives the split string by which the UniProt identifiers (or any other feature ID matching your input data) of complexes are separated.
```python
ComplexFinder(
    ...
    databaseFilter = {'Organism': ["Human"]},
    databaseIDColumn = "subunits(UniProt IDs)",
    databaseEntrySplitString = ";",
    databaseFileName = "CORUM.txt" # depends on how you save the CORUM database
).run(...)
```
Go to the Complex Portal website and download the database (save it as HUMAN_COMPLEX_PORTAL.txt) for the utilized organism.
```python
ComplexFinder(
    databaseFileName = "HUMAN_COMPLEX_PORTAL.txt", # depends on how you save the Complex Portal database
    databaseIDColumn = "Expanded participant list",
    databaseEntrySplitString = "|",
    databaseFilter = {}
).run(...)
```
hu.MAP 2.0 has recently been published and is available at this link.
```python
ComplexFinder(
    databaseFileName = "humap2.txt", # depends on how you save the hu.MAP 2.0 database
    databaseIDColumn = "subunits(UniProt IDs)", # requires renaming
    databaseEntrySplitString = ";",
    databaseFilter = {"Confidence": [1, 2, 3, 4]} # example to filter for a specific complex confidence
).run(...)
```
The grouping parameter in ComplexFinder is used to group files, so that replicates are analyzed together. Assume we have four files, two KO and two WT, which we put together in the folder "./data". The grouping will be used to calculate pairwise statistics between fitted peaks, as well as the complex prediction and protein-protein interaction prediction summaries.
```python
import os
from src.main import ComplexFinder

pathToFiles = os.path.join(".", "data")
ComplexFinder(
    grouping = {
        "WT" : ["D3_WT_01.txt", "D3_WT_02.txt"],
        "KO" : ["D3_KO_01.txt", "D3_KO_02.txt"]
    }
).run(pathToFiles)
```
Find below an overview of the extensive output of ComplexFinder.
Please respect the respective license for the different databases.
Find below parameters to set. The default is given in brackets after the parameter name.
```python
# random forest grid search
RF_GRID_SEARCH = {
    'max_depth': [70],
    'max_features': ['sqrt', 'auto'],
    'min_samples_leaf': [2, 3, 4],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [600]
}
```
The scikit-learn library is used for predictions. Please check its comprehensive documentation for more details and for the construction of a grid-search dict.
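For orientation, a grid such as RF_GRID_SEARCH maps directly onto scikit-learn's GridSearchCV. This standalone sketch (not ComplexFinder's internal code) shows how such a grid would drive the search:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# X_train: pairwise profile-distance features; y_train: interactor (1) vs non-interactor (0)
gridSearch = GridSearchCV(RandomForestClassifier(), param_grid=RF_GRID_SEARCH, cv=3)
# gridSearch.fit(X_train, y_train)
# gridSearch.best_params_ then holds the winning parameter combination
```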
For the prediction of protein-protein interactions the quality and size of the database is of importance.
As a quick test, we performed predictions using 2000 randomly selected features of dataset D1 and switched the class labels (interactor vs. non-interactor) of the database used to train the classifier. We observed that the number of predicted protein-protein interactions was strongly reduced after switching the labels of more than 5% of the features. We used the human CORUM database for interactions. This highlights that the complexes in the database need to describe the complexome in the measured dataset accurately. The gold standard is therefore the usage of a complex database that was experimentally validated, which is sadly often not possible due to the workload.
in preparation
ComplexFinder allows peak-centric quantification using different quantification strategies.
TMT allows for multiplexing in complexome experiments by labeling peptides with different tags that can be distinguished by their reporter ions using LC-MS/MS. Therefore, the required results (for example, from a MaxQuant analysis) are:
For each peak in the samples, ComplexFinder will extract the TMT intensities and aggregate the fractions covered by the FWHM using a given function. By default, the sum is used, but this can be changed to the mean as well (TMTPoolMethod = "sum"). The data in the quantification files (feature IDs x TMT intensities for each fraction) are not transformed at all. Therefore, if you use the mean, performing a log2 transformation before averaging might be advisable. You can do this by setting the parameter transformQuantDataBy = "log2". The available options are ["log2", "ln", None], None being the default, which will use the provided values.
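For example, to average log2-transformed intensities instead of summing the raw values, using the two parameters introduced above:

```python
ComplexFinder(
    TMTPoolMethod = "mean",        # aggregate TMT intensities across the FWHM by the mean
    transformQuantDataBy = "log2"  # log2-transform quantification data before averaging
).run(...)
```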
A combination of SILAC and TMT allows either for extended multiplexing (2 SILAC channels x 10plex TMT = 20 samples) or for following an incorporation kinetic. To this end, cells are grown on SILAC media (for example, heavy) for several passages, leading to fully labeled cells. Then, the media is exchanged to light media and the cells start incorporating light amino acids into newly synthesized proteins. This enables the determination of incorporation rates / turnover rates. When combining TMT and SILAC, the light-channel peptides + TMT represent the SILAC incorporation and the heavy channel shows the breakdown of proteins. In proliferating cells, the increase in biomass (cell growth) has to be considered.
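For orientation, such an incorporation kinetic is commonly modeled as first order. The generic sketch below (with made-up time points and ratios, not ComplexFinder's internal fitting code) fits a rate constant k, which in proliferating cells lumps together degradation and dilution by growth:

```python
import numpy as np
from scipy.optimize import curve_fit

def incorporation(t, k):
    """Fraction of light (newly synthesized) protein after the media switch."""
    return 1.0 - np.exp(-k * t)

# t: time points in hours; y: measured light/(light+heavy) ratios (illustrative values)
t = np.array([0, 4, 8, 24, 48], dtype=float)
y = np.array([0.0, 0.15, 0.28, 0.62, 0.85])
(k_fit,), _ = curve_fit(incorporation, t, y, p0=[0.05])
```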
Please note that at the moment only two SILAC channels are supported.
The general strategy in ComplexFinder for peak-centric quantification is shown in the figure above. For detected peaks, the FWHM is determined and the TMT intensities are summed over the respective fractions. Of note, for very small peaks this might be a single fraction, which is prevented by default. This can be allowed by setting the parameter allowSingleFractionQuant to True.
```python
ComplexFinder(allowSingleFractionQuant = True).run(...)
```
Figure. Quantification strategy using TMT or SILAC-TMT experimental designs. In SILAC-TMT experimental designs, two quantification result files are required, indexed by HEAVY and LIGHT.
We recommend putting the signal profiles in a folder (in the figure: myCoolAnalysis) and adding the files. Create a new folder within myCoolAnalysis called 'q', in which you add the quantification data. If you put the quantification txt files in the same folder as the ones for analysis, ComplexFinder will treat them as signal profiles and will try to fit model peaks to them etc.
To calculate the fit parameters for a single-order kinetic, we have to provide more information; otherwise, the output will contain the TMT intensities for each peak, indicated by heavy or light. ComplexFinder expects raw TMT intensities (not log2) for
I get the error message: no positive hits found in database. What does it mean?
Please check the class argument databaseFilter of type dict. For example, the default is
```python
databaseFilter = {'Organism': ["Human"]}
```
This means that the database is filtered on column 'Organism' using "Human" as the search string.
The required format for the database is a tab-delimited txt file. The file must contain the columns ComplexID and ComplexName. Additionally, the pipeline requires a column with the feature IDs (the same IDs as in the provided co-elution/migration data) of a complex, separated by a ";". The easiest approach might be to check the default parameters, which can be used to load the CORUM database. If you want to use ComplexFinder without a database, check out the FAQ (How to run the pipeline without a database?) below.
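A minimal sketch of writing such a file with pandas; the complex names and UniProt IDs below are made up:

```python
import pandas as pd

db = pd.DataFrame({
    "ComplexID"             : [1, 2],
    "ComplexName"           : ["Hypothetical complex A", "Hypothetical complex B"],
    "subunits(UniProt IDs)" : ["P12345;Q67890", "O11111;P22222;Q33333"],  # IDs joined by ";"
})
db.to_csv("reference-data/MyDatabase.txt", sep="\t", index=False)
```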
The built-in peak models are from the lmfit package.
```python
# in the Signal module
# import from lmfit
from lmfit import models

# line 208, gets the model based on the string
model = getattr(models, basis_func['type'])(prefix=prefix)
```
Therefore, you can provide any string that matches a model name in the lmfit package. Please note that peak parameters and constraints are only implemented and tested for the Gaussian, Lorentzian and Skewed Gaussian models. So if your fit does not work, you may want to check the following function of the Signal.py class module.
```python
def _addParams(self, modelParams, prefix, peakIdx, i):
    """
    Adds parameters to the modelParams object

    Parameters
    ----------
    modelParams :
        modelParams object. Returned by model.make_params() (lmfit package)
        Documentation: https://lmfit.github.io/lmfit-py/model.html

    prefix : str
        Prefix for the model (e.g. peak), defaults to 'm{}_'.format(i)

    peakIdx : int
        Array index at which the peak was detected in the signal array self.Y

    i : int
        Index of detected models

    Returns
    -------
    None
    """
    self._addParam(modelParams,
                   name = prefix + 'amplitude',
                   max = self.Y[peakIdx[i]] * 9,
                   value = self.Y[peakIdx[i]] * 2,
                   min = self.Y[peakIdx[i]] * 1.2)

    self._addParam(modelParams,
                   name = prefix + 'sigma',
                   value = 0.255,
                   min = 0.01,
                   max = 2.5)

    self._addParam(modelParams,
                   name = prefix + 'center',
                   value = peakIdx[i],
                   min = peakIdx[i] - 0.2,
                   max = peakIdx[i] + 0.2)

    ## enter other model params here; you may have to change the min and max
    ## for the other parameters as well to get a nice fit.
```
Please note that you also have to alter the functions _getHeight and _getFWHM for your peak models. You can check the equations here.
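For reference, for lmfit's Gaussian model (where amplitude is the peak area) the height and FWHM follow the standard formulas sketched below; other peak shapes need their own equations:

```python
import numpy as np

def gaussianHeight(amplitude, sigma):
    # peak height of a Gaussian parameterized by area (amplitude) and sigma
    return amplitude / (sigma * np.sqrt(2 * np.pi))

def gaussianFWHM(sigma):
    # full width at half maximum of a Gaussian, approx. 2.3548 * sigma
    return 2 * np.sqrt(2 * np.log(2)) * sigma
```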
In the future, we would like to implement the following features:
- [ ] Test various peak models for better performance.
The following Python packages are required to run the scripts (from the requirements.txt file).
If you found this pipeline helpful, please consider citing the following paper. We highly appreciate any acknowledgement.
Info on publication status: Paper submitted.
Of note, please use the GitHub Issues functionality of this repository to report bugs. Nevertheless, you can contact us via e-mail if you have any questions or requests for feature functions of the pipeline.