lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/
124 stars 50 forks source link

Cross-fraction XIC normalization #344

Open antortjim opened 6 years ago

antortjim commented 6 years ago

Hi there!!

First of all, thanks for developing an open source tool in R that can handle such complex datasets as those recurrent in proteomics and mass spec.

I am working with the MaxLFQ dataset (paper: http://www.mcponline.org/content/13/9/2513.long) where the developers of the MaxQuant suite mixed the HeLa and E.coli proteomes in 1:1 and 1:3 ratios. One of the crucial steps in their pipeline is the delayed normalization of XICs (extracted ion currents i.e MS1 intensities of the cluster feature detected in the apex intensity extraction step) across fractions of the same sample (for a total of 24 fractions) using a Levenberg-Marquandt minimization approach. The normalization factors used to sum up the peptide intensities across fractions are then set to the values that minimise the overall proteome variance, so called in the paper H(N). It's a nice approach to the problem they formulated in the paragraph below:

A major challenge of label-free quantification with prefractionation is that separate sample processing inevitably introduces differences in the fractions to be compared. In principle, correct normalization of each fraction can eliminate this error. However, the total peptide ion signals, necessary in order to perform normalization of the LC MS/MS runs of each fraction, are spread over several adjacent runs. Therefore one cannot sum up the peptide ion signals before one knows the normalization coefficients for each fraction.

I've had a look at the MSnbase documentation, and the issue of sample fractions is mentioned in two sections:

  1. https://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-demo.html#2_data_structure_and_content regarding how to load multiple spectra files in either mzData, mzXML and mzML formats using readMSData()

  2. https://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-demo.html#13_combining_msnset_instances regarding how to combine multiple runs of the same sample i.e technical replicates if I understood correctly.

However, in none of them is this specific issue of cross-fraction XIC normalization addressed.

Is there any readily available function in the package that could read a table similar to the moFF output:

Protein(s) Sequence CondtionA_Fraction1 ConditionA_Fraction2
PXXXX Peptide 1 XIC1_A_1 XIC1_A_2
PYYYY Peptide 2 XIC2_A_1 XIC2_A_2
... ... ... ...

where rows represent peptides and columns represent

  1. the inferred precursor protein
  2. the peptide sequence
  3. and all after it: XICs for each individual run (defined by a combination of condition/treatment, replicate and fraction)

Enabling this feature would make it possible to connect moFF output to MSnbase in a straightforward way in datasets where sample fractionation across runs is present.

Thank you very much for your help beforehand

Cheers, Antonio

lgatto commented 6 years ago

Dear @antortjim - thank you for your issue. I will look at it next week (can't do right now), but I definitely like the idea to make moFF and MSnbase talk to each other.

antortjim commented 6 years ago

Dear @lgatto, thanks for answering so fast!

In the meantime I am trying out https://github.com/statOmics/MSqRob (which depends on MSnBase for the data structures), and I implemented a custom L-M minimisation with Python's scipy. I am not very sure it is correct though, but I can share it.

Cheers Antonio

lgatto commented 6 years ago

Dear @antortjim - thank you for your issue. I meant to get back to you already a long time ago but than forgot. It would be great to hear more about your L-M minimisation script.

antortjim commented 6 years ago

Dear @lgatto - thanks for looking back into this! I guess it's summer ;)

I don't have time this week as I am submitting my Master thesis on Monday, but next week I have plenty to try get it to work. I am excited to at least attempt to contribute. I will explain it here step by step.

Imports the required modules including the scipy.optimize.root() function that brings the LM optimizer. Read in input arguments with argparse:

# Given a tsv file containing protein ratios, this script performs LM optimization to reach peptide intensities across samples
# Fractions are aggregated
# Ports the equations from http://www.mcponline.org/content/13/9/2513.full.pdf+html

import pandas as pd
import numpy as np
from scipy.optimize import root
from tqdm import tqdm
import argparse
import os.path
import itertools
import subprocess
parser = argparse.ArgumentParser()
parser.add_argument("--exp_design", required=True, help = "Path to table indicating the experimental design")
parser.add_argument("--moff", required=True, help = "Path to peptide_summary_intensity_moFF_run.tab, the moFF output")
parser.add_argument("--n_peps", required=False, help="Set number of peptides to use in the normalization for debugging purposes", type=int)
arguments = vars(parser.parse_args())

output_moff = pd.read_csv(arguments["moff"], sep = "\t")
experimental_design = pd.read_csv(arguments["exp_design", sep = "\t")
experimental_design.sort_values(by = ["Group", "Replicate"], inplace=True)

-Map every column in the moFF file (one per RAW file) to a sample i.e H1-L3 in the exp_rep_rowdict. -Store every possible pairwise-combination of samples in the exp_rep_combinations list.

# A list of experiment + replicate combinations (H1, L1, H2, L2, etc)
exp_rep = experimental_design["Experiment"].map(str) + "_" + experimental_design["Replicate"].map(str)
# Dictionary storing the indices of each experiment_replicate in the experimental design
# i.e exp_rep_row[H_1] is a list whith indices leading to the columns that belong to the fractions of H_1 (er_indices)
exp_rep_row = {e_r: np.where(exp_rep == e_r)[0] for e_r in exp_rep}

# Make it a list so that the len can be computed a priori
exp_rep_combinations = list(itertools.combinations(np.unique(exp_rep), 2))

Use the first n_peps peptides if it is not None

n = output_moff.shape[0]
print("Found {} peptides".format(n))
if arguments["n_peps"] is not None:
    n = arguments["n_peps"]

print("Minimising for {} peptides".format(n))

xics_values = output_moff.iloc[:n, 2:].values

Define the equation to be minimised as a function of N, the normalizing factors. It returns a list H which is equivalent to the result of the inner summatorial in eq 2 of the paper, i.e the outer summatorial is not necessary to run the minimiser.

def fun_xic_norm(N):
    # XICs for peptide P (all samples and fractions are in the same row)
    # I_{P,A} will be the sum of the product of XIC*N where condition = A
    # I_{P,A} is stored under key A of the I_P dictionary
    # H_P is the sum of the square of the absolute value of the logarithm of the ratio between all possible combinations of conditions
    # The combination is taken into account if neither I_PA nor I_PB are 0, otherwise it is ignored
    normalized_xics = xics_values * N

    # EQUATION 1 IN THE PAPER
    # Create a dictionary I with a key for each experiment_replicate
    # The value of the key is set to the result of extracting the xic for samples belonging to e_r
    # and making a colSum. (np.sum(x, axis=1))
    # The value is then an array of length equal to the number of peptides where each ith cell
    # is the sum of the normalized intensities for peptide i in e_r
    I = {e_r: np.sum(normalized_xics[:, er_indices], axis=1) for e_r, er_indices in exp_rep_row.items()}

    # EQUATION 2 IN THE PAPER
    # For every combination, take the logarithm of the intensity ratio for all peptides
    # Take the absolute value, and square it
    # Each value in the array is now a summand in the eq 2 of the paper
    # The array has shape number of combinations x number of peptides i.e
    # the i,j cell will contain the square log peptide intensity ratio for peptide j in combination i.
    # Sum all the ratios column wise (i.e for the same combination)
    # We get an H for each comparison.
    H = np.sum(np.array([np.square(np.abs(np.log(I[c[0]] / I[c[1]]))) for c in exp_rep_combinations]), axis = 0)
    # Remove non valid values
    H = H[~np.isnan(H)]
    H = H[np.isfinite(H)]
    return H

Constrain the values that the normalizing factors can take to lower and upper

#https://stackoverflow.com/questions/43995862/python-multivariate-non-linear-solver-with-constraints?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
def constrainedFunction(x, f, lower, upper, minIncr=0.001):
     x = np.asarray(x)
     lower = np.asarray(lower)
     upper = np.asarray(upper)
     xBorder = np.where(x<lower, lower, x)
     xBorder = np.where(x>upper, upper,
[lm_norm.py.zip](https://github.com/lgatto/MSnbase/files/2245189/lm_norm.py.zip)
 xBorder)
     fBorder = f(xBorder)
     distFromBorder = (np.sum(np.where(x<lower, lower-x, 0.)) + np.sum(np.where(x>upper, x-upper, 0.)))
     return (fBorder + (fBorder + np.where(fBorder>0, minIncr, -minIncr))*distFromBorder)

Find the root (=minimise) using the lm method (Levenberg–Marquardt), constraining each root to the 0-20 interval. The starting value is init_value = 1 The result is a list of n_samples (number of RAW files) numbers where every ith number is the normalizing factor for the ith RAW file. It is saved to normalization_factors.txt.

Code not here: They can be used to multiply each column in the moFF file and then sum those representing the different fractions of the same sample. The final result is a new table with the number of columns equal to the number of conditions and replicates, 6 in this dataset (H1, H2, H3, L1, L2, L3,).

def main(minimize=True):

    n_samples = xics_values.shape[1]
    x0 = np.full(n_samples, init_value)
    if minimize:
        # find the normalization fators that minimise fun_xic_norm
        N = root(constrainedFunction, x0=x0, method="lm", args=(fun_xic_norm, np.array([0,]*n_samples), np.array([20,]*n_samples))).x
        print("Saving norm factors to file")
        np.savetxt(fname="normalization_factors.txt", X=N, fmt="%10.5f", delimiter="\t")

    else:
        N = x0

if __name__ == "__main__":
    init_value = 1
    main(True)
    main(False)

Attached a zip with the full Python script. lm_norm.py.zip

antortjim commented 6 years ago

I forgot to say that I don't think it is correct as I don't get the right results. See my second comment here https://github.com/statOmics/MSqRob/issues/49 MSqRob does a pretty good job by considering the fractions a random effect. But there's still many unquantified E. coli proteins.