Open antortjim opened 6 years ago
Dear @antortjim - thank you for your issue. I will look at it next week (can't do right now), but I definitely like the idea to make moFF
and MSnbase
talk to each other.
Dear @lgatto, thanks for answering so fast!
In the meantime I am trying out https://github.com/statOmics/MSqRob (which depends on MSnBase for the data structures), and I implemented a custom L-M minimisation with Python's scipy. I am not very sure it is correct though, but I can share it.
Cheers Antonio
Dear @antortjim - thank you for your issue. I meant to get back to you already a long time ago but than forgot. It would be great to hear more about your L-M minimisation script.
Dear @lgatto - thanks for looking back into this! I guess it's summer ;)
I don't have time this week as I am submitting my Master thesis on Monday, but next week I have plenty to try get it to work. I am excited to at least attempt to contribute. I will explain it here step by step.
Imports the required modules including the scipy.optimize.root()
function that brings the LM optimizer.
Read in input arguments with argparse
:
# Given a tsv file containing protein ratios, this script performs LM optimization to reach peptide intensities across samples
# Fractions are aggregated
# Ports the equations from http://www.mcponline.org/content/13/9/2513.full.pdf+html
import pandas as pd
import numpy as np
from scipy.optimize import root
from tqdm import tqdm
import argparse
import os.path
import itertools
import subprocess
parser = argparse.ArgumentParser()
parser.add_argument("--exp_design", required=True, help = "Path to table indicating the experimental design")
parser.add_argument("--moff", required=True, help = "Path to peptide_summary_intensity_moFF_run.tab, the moFF output")
parser.add_argument("--n_peps", required=False, help="Set number of peptides to use in the normalization for debugging purposes", type=int)
arguments = vars(parser.parse_args())
output_moff = pd.read_csv(arguments["moff"], sep = "\t")
experimental_design = pd.read_csv(arguments["exp_design", sep = "\t")
experimental_design.sort_values(by = ["Group", "Replicate"], inplace=True)
-Map every column in the moFF file (one per RAW file) to a sample i.e H1-L3 in the exp_rep_row
dict.
-Store every possible pairwise-combination of samples in the exp_rep_combinations
list.
# A list of experiment + replicate combinations (H1, L1, H2, L2, etc)
exp_rep = experimental_design["Experiment"].map(str) + "_" + experimental_design["Replicate"].map(str)
# Dictionary storing the indices of each experiment_replicate in the experimental design
# i.e exp_rep_row[H_1] is a list whith indices leading to the columns that belong to the fractions of H_1 (er_indices)
exp_rep_row = {e_r: np.where(exp_rep == e_r)[0] for e_r in exp_rep}
# Make it a list so that the len can be computed a priori
exp_rep_combinations = list(itertools.combinations(np.unique(exp_rep), 2))
Use the first n_peps
peptides if it is not None
n = output_moff.shape[0]
print("Found {} peptides".format(n))
if arguments["n_peps"] is not None:
n = arguments["n_peps"]
print("Minimising for {} peptides".format(n))
xics_values = output_moff.iloc[:n, 2:].values
Define the equation to be minimised as a function of N, the normalizing factors.
It returns a list H
which is equivalent to the result of the inner summatorial in eq 2 of the paper, i.e the outer summatorial is not necessary to run the minimiser.
def fun_xic_norm(N):
# XICs for peptide P (all samples and fractions are in the same row)
# I_{P,A} will be the sum of the product of XIC*N where condition = A
# I_{P,A} is stored under key A of the I_P dictionary
# H_P is the sum of the square of the absolute value of the logarithm of the ratio between all possible combinations of conditions
# The combination is taken into account if neither I_PA nor I_PB are 0, otherwise it is ignored
normalized_xics = xics_values * N
# EQUATION 1 IN THE PAPER
# Create a dictionary I with a key for each experiment_replicate
# The value of the key is set to the result of extracting the xic for samples belonging to e_r
# and making a colSum. (np.sum(x, axis=1))
# The value is then an array of length equal to the number of peptides where each ith cell
# is the sum of the normalized intensities for peptide i in e_r
I = {e_r: np.sum(normalized_xics[:, er_indices], axis=1) for e_r, er_indices in exp_rep_row.items()}
# EQUATION 2 IN THE PAPER
# For every combination, take the logarithm of the intensity ratio for all peptides
# Take the absolute value, and square it
# Each value in the array is now a summand in the eq 2 of the paper
# The array has shape number of combinations x number of peptides i.e
# the i,j cell will contain the square log peptide intensity ratio for peptide j in combination i.
# Sum all the ratios column wise (i.e for the same combination)
# We get an H for each comparison.
H = np.sum(np.array([np.square(np.abs(np.log(I[c[0]] / I[c[1]]))) for c in exp_rep_combinations]), axis = 0)
# Remove non valid values
H = H[~np.isnan(H)]
H = H[np.isfinite(H)]
return H
Constrain the values that the normalizing factors can take to lower
and upper
#https://stackoverflow.com/questions/43995862/python-multivariate-non-linear-solver-with-constraints?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
def constrainedFunction(x, f, lower, upper, minIncr=0.001):
x = np.asarray(x)
lower = np.asarray(lower)
upper = np.asarray(upper)
xBorder = np.where(x<lower, lower, x)
xBorder = np.where(x>upper, upper,
[lm_norm.py.zip](https://github.com/lgatto/MSnbase/files/2245189/lm_norm.py.zip)
xBorder)
fBorder = f(xBorder)
distFromBorder = (np.sum(np.where(x<lower, lower-x, 0.)) + np.sum(np.where(x>upper, x-upper, 0.)))
return (fBorder + (fBorder + np.where(fBorder>0, minIncr, -minIncr))*distFromBorder)
Find the root (=minimise) using the lm method (Levenberg–Marquardt), constraining each root to the 0-20 interval. The starting value is init_value = 1
The result is a list of n_samples
(number of RAW files) numbers where every ith number is the normalizing factor for the ith RAW file. It is saved to normalization_factors.txt
.
Code not here: They can be used to multiply each column in the moFF file and then sum those representing the different fractions of the same sample. The final result is a new table with the number of columns equal to the number of conditions and replicates, 6 in this dataset (H1, H2, H3, L1, L2, L3,).
def main(minimize=True):
n_samples = xics_values.shape[1]
x0 = np.full(n_samples, init_value)
if minimize:
# find the normalization fators that minimise fun_xic_norm
N = root(constrainedFunction, x0=x0, method="lm", args=(fun_xic_norm, np.array([0,]*n_samples), np.array([20,]*n_samples))).x
print("Saving norm factors to file")
np.savetxt(fname="normalization_factors.txt", X=N, fmt="%10.5f", delimiter="\t")
else:
N = x0
if __name__ == "__main__":
init_value = 1
main(True)
main(False)
Attached a zip with the full Python script. lm_norm.py.zip
I forgot to say that I don't think it is correct as I don't get the right results. See my second comment here https://github.com/statOmics/MSqRob/issues/49 MSqRob does a pretty good job by considering the fractions a random effect. But there's still many unquantified E. coli proteins.
Hi there!!
First of all, thanks for developing an open source tool in R that can handle such complex datasets as those recurrent in proteomics and mass spec.
I am working with the MaxLFQ dataset (paper: http://www.mcponline.org/content/13/9/2513.long) where the developers of the MaxQuant suite mixed the HeLa and E.coli proteomes in 1:1 and 1:3 ratios. One of the crucial steps in their pipeline is the delayed normalization of XICs (extracted ion currents i.e MS1 intensities of the cluster feature detected in the apex intensity extraction step) across fractions of the same sample (for a total of 24 fractions) using a Levenberg-Marquandt minimization approach. The normalization factors used to sum up the peptide intensities across fractions are then set to the values that minimise the overall proteome variance, so called in the paper H(N). It's a nice approach to the problem they formulated in the paragraph below:
I've had a look at the MSnbase documentation, and the issue of sample fractions is mentioned in two sections:
https://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-demo.html#2_data_structure_and_content regarding how to load multiple spectra files in either
mzData
,mzXML
andmzML
formats usingreadMSData()
https://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-demo.html#13_combining_msnset_instances regarding how to combine multiple runs of the same sample i.e technical replicates if I understood correctly.
However, in none of them is this specific issue of cross-fraction XIC normalization addressed.
Is there any readily available function in the package that could read a table similar to the moFF output:
where rows represent peptides and columns represent
Enabling this feature would make it possible to connect moFF output to MSnbase in a straightforward way in datasets where sample fractionation across runs is present.
Thank you very much for your help beforehand
Cheers, Antonio