lgatto / synapter

Label-free data analysis pipeline for optimal identification and quantitation
https://lgatto.github.io/synapter/
4 stars 2 forks source link

intensity normalisation by LOWESS #116

Closed pavel-shliaha closed 7 years ago

pavel-shliaha commented 8 years ago

Apparently a lot of label-free algorithms these days have a normalisation functionality for intensity throughout the run, since there is a systematic drift of intensities at different timepoints of acquisition: i.e. for example that at 10min of quant run all peptides are 0.9 or 1.2 in intensity than in identification run

to counter it I suggest ab approach similar to the one in isoquant (see below):

image

I think we should do the following:

1) during master construction, peptides being added to the master should be automatically intensity adjusted. I.e. when a peptide is added to master not only its retention time, but also its intensity is adjusted according to the master acquisition. 2) during identification transfer from master, a separate column is created with a value to multiply by to correct for intensity. Then AFTER saturation correction a user can multiply the exprs() values by the values in the correction column.

I think it is important to do that, since it will decrease CVs of quantitation and so increase synapter attractiveness to potential users.

sgibb commented 8 years ago

Sry, for being unresponsive for such a long time.

I need some clarifications:

  1. During master creation: How to correct the intensities, here? Currently we just ignore them. BTW: this will also affect the fragment library (where we use some kind of mean intensity).
  2. What is an "average peak cluster intensity"?
  3. Would it be possible to create the LOWESS model after saturation correction (or the step before; we could introduce some bias because we already filtered a lot of stuff, on the other hand these unidentified peptides produce a lot of noise)?
pavel-shliaha commented 8 years ago

most important point: THINK ABOUT DRIFT IN INTENSITY IN THE EXACT SAME WAY YOU THINK ABOUT DRIFT IN RETENTION TIME

To answer you questions:

2) average peak cluster intensity is average intensity within a given period of time, e.g.: you can find that all peptides at timepoint 10.5-11min have an average ratio of 0.9 to the master. This can happen if there was smth wrong with the spray at this particular part of chromatogramme. This is conceptually identical to the drift in retention time and I think it is very easy to implement because you can completely reuse the retention time correction functionality, but plot intensity_master/intensity_quantitation_run against retention time of master.

1) as with retention time correction when creating master you need to adjust intensities before adding new ids to master

3) you should not correct intensities before detector saturation correction, but after, hence I suggest we simply create a new column and save this information there

sgibb commented 8 years ago

Currently we don't store any peptide intensity information in the master object (just the precursor.inten). The master is created from the final peptide files. The following intensity columns would be available: peptide.MatchedProductsSumInten, peptide.RelIntensity, precursor.inten, precursor.calcInten

Is peptide.MatchedProducsSumInten the correct column?

Or do we need to import PEP3D files as well?

sgibb commented 8 years ago

Will that working at all (especially for this dataset)? The Kuharev dataset contains 5 % E. coli and 30 % Yeast in sample A and 20 % E.coli and 15% Yeast in sample B. If we select sample A as master and B as slave (quant) run we expect an intensity ratio of 1/4 for E.coli and 2/1 in Yeast. In the worst case we have the same number of proteins indentified for both and lowess would pick an average of the both (1.25) and will destroy the whole intensity information. Could that be?

Here you can see an example plot using S130423_05 (master; A) and S130423_14 (slave; B) (using precursor.inten as intensity, peptide.MatchedProductsSumInten doesn't change this pattern), the red line is the lowess fit, the other lines are the theoretical ratios:

rplot001

rplot002

Did I understand everything wrong? Or do we have to ensure that just technical replicates (A and A; B and B) are normalized against each other?

pavel-shliaha commented 8 years ago

Exactly; normalisation will only work if the majority of the proteins is at ratio 1:1, this is not true for samples B and A, hence they should not be normalised against each other.

so the master should also be created separately for A and B if the functionality is to be used.

sgibb commented 8 years ago

Sorry I made a copy and paste error (the LOWESS line in the image above is the retention time vs. diff(retention time model). Here is the correct image:

rplot001

As we discussed in our skype chat it could work for A and B samples if the majority of the intensity values have a ratio master/slave around 1. In this image you could see that 65 % human could be enough.

Nevertheless after I implemented this feature we should have a look at two different workflows:

Just for completeness the same model for sample A (S130423_05) vs sample A (S130423_07): rplot001

The same samples after applying the correction: rplot002

BTW: What happens around retention time = 145 min?

pavel-shliaha commented 8 years ago

Thanks! Will play around with this over the weekend. Wanted to ask you to log2 ratios but you've done it already I presume?

Also around 145 min you see bad spray, probably.

sgibb commented 8 years ago

Thanks! Will play around with this over the weekend.

It is not ready yet. Currently it just corrects the intensities during master creation. There is no correction for ident/quant. I drop you a note when you can start testing.

I am wondering where to do the ident/quant intensity correction. It should be possible to run it as additional step after the whole synpater workflow because we store the precursor.inten, precursor.retT columns from the identification run and add the Counts column from the quantitation run. But is that reasonable? @pavel-shliaha Do you have any suggestions?

Wanted to ask you to log2 ratios but you've done it already I presume?

Already done.

There is a problem for the user: the current master creation is a black box. The user hasn't any chance to see a plot showing the influence of the span/maxDeltaRt argument (the above plots are handmade).

Do we need more plots? Do we need a spanRet and spanInten argument to use different span for the two loess models?

pavel-shliaha commented 8 years ago

I am wondering where to do the ident/quant intensity correction. It should be possible to run it as additional step after the whole synpater workflow because we store the precursor.inten, precursor.retT columns from the identification run and add the Counts column from the quantitation run. But is that reasonable? @pavel-shliaha Do you have any suggestions?

during master creation, intensity should be corrected automatically, BUT during master to quant identification transfer I would prefer to just create a column which a user can then use in MSnBase, since it allows correction after detector saturation correction.

sgibb commented 8 years ago

I understand but this column could be created at any time, or? I just not sure when it is reasonable (e.g. create the column after findEMRTs or don't create this column in synapter at all but on the final MSnSet (because we have all the necessary information in the fData).

Currently I think the easiest would be to do this at MSnSet level (at least for testing and playing around).

pavel-shliaha commented 8 years ago

I think this should be created when an explicit command is called, after merging the commonly identified peptides (just like modelling retention time, it will allow to keep track of the pipeline).

sgibb commented 8 years ago

Now a basic version of intensity correction is implemented.

Just run synapter as usual. Before you transform the synapter object into a MSnSet call modelIntensity. This will add a new column intensityCorrectionFactor to the MatchedEMRTs data.frame. After as(x, "MSnSet") you can access this column via fData(msnset)$intensityCorrectionFactor.

After combining multiple MSnSet objects and applying saturation correction just run correctIntensity(msnset) to multiply the intensity values by the correction factor, e.g.:

# synapter workflow
# ...

modelIntensity(synapterObj)

msnset <- as(synapterObj, "MSnSet")

## do stuff with msnset

correctedMSnSet <- correctIntensities(msnset)

I am somehow puzzled whether it is correct to multiply/divide by correction factor. According to the loess model and the input variables it should be multiplication (but in my test cases it seems to make the results worse and division looks more correct). I am currently too tired to follow this (I am on night shift this week). @pavel-shliaha if you see that division is correct I am very sorry but it is just a change of one sign (/ vs *) in MSnbase-extension.R:.correctIntensity.

There is still a lot of stuff missing:

If your tests are promising I will add the documentation and we can discuss the other points.

P.S.: Please note that you have to regenerate the master as well (there the intensity model is applied without additional user input (using the retention time span)).

sgibb commented 8 years ago

The current calculation for master is:

mergedPeptideData$intenRratio <- log2(mergedPeptideData$precursor.inten.master / 
                                      mergedPeptideData$precursor.inten.slave)

intModel <- loessModel(mergedPeptideData$precursor.retT.slave,
                           mergedPeptideData$intenRatio, span=span)

slaves[[i]]$IdentPeptideData$precursor.inten <-
      slaves[[i]]$IdentPeptideData$precursor.inten *
      2^(predict(intModel, slaves[[i]]$IdentPeptideData$precursor.retT))

for quant vs master (ident):

ratio <- log2(.self$MergedFeatures$precursor.inten.ident / 
              .self$MergedFeatures$precursor.inten.quant)
intModel <- loessModel(.self$MergedFeatures$precursor.retT.ident, ratio, span = span)

.self$MatchedEMRTs$intensityCorrectionFactor <- 
    2^(predict(intModel, .self$MatchedEMRTs$precursor.retT))

Counts * intensityCorrectionFactor
pavel-shliaha commented 8 years ago

cool is there a function to visualise this (like the figures shown above?)

pavel-shliaha commented 8 years ago

hey guys it works! I get lower CVs with normalisation when LOWESS correction is used than when I use normalisation post run. However synapter complains that correctIntensities does not exist instead I used:

exprs (combMSNSet) <- exprs (combMSNSet) * 
    (2^fData (combMSNSet)[, grep("intensityCorrectionFactor", colnames (fData (combMSNSet)))])

@sgibb could you please have a look at that?

sgibb commented 8 years ago

The name of the function is: correctIntensity (note the y instead of the ies at the end).

While I am happy to hear that it seems working I guess the effect is due to the master intensity correction. Because your approach should produce a lot of wrong values. The intensityCorrectionFactor column contains the intensity correction factor (meaning it is already 2^x). The correct way of doing it:

i <- grep("intensityCorrectionFactor", fvarLabels(msnset))
exprs(msnset) <- exprs(msnset) * as.matrix(fData(msnset)[,  i, drop = FALSE])

As stated above I am not completely sure about the * operator (maybe we have to use /). I am going to add plotting functions. These will show wrongly used * operators.

sgibb commented 8 years ago

So, there is a new function plotIntensity (it is nearly identical to plotRt):

s130423_07 rds_003 s130423_07 rds_004 s130423_07 rds_005 s130423_08 rds_003 s130423_08 rds_004 s130423_08 rds_005

I used a span=0.05 (default for rt alignment) but it seems too flexible. (the raw/corrected (red/blue) plots are not available in synapter and created from the MergedFeatures data.frame; and correctIntensity correctly uses * instead of / )

pavel-shliaha commented 8 years ago

getting the following error:

synapterAnalysis <- readRDS ("Y:/RAW/pvs22/_QTOF_DATA_data3/synapter2paper/kuharev2015/synapter2_intensity/output/UDMSE//S130423_05//synapterAnalysis.RDS")
modelIntensity(synapterAnalysis )

Error in model.frame.default(formula = y ~ x) : invalid type (NULL) for variable 'y'

can you please have a look?

sgibb commented 8 years ago

14 days ago I decided to store calculate the intensity ratio intenRatio during the merging process (in the same manner as deltaRt is stored: https://github.com/lgatto/synapter/commit/6e128f1b381361f89bf0eb5daf5c30bd6e46cb1f). So it seems that your synapterAnalysis was created before and intenRatio is indeed missing. Please try to recreate synapterAnalysis with the latest synapter.

Before we create the final manuscript we have to ensure that we are using the most recent synapter version for all scripts and figures!

pavel-shliaha commented 8 years ago

The plotIntensity() works with the new synapter version after recreating synapter objects. But there are two problems:

1) when I run plotIntensity (synapterAnalysis, what = "model")

I get the following error: Error in plot.window(...) : need finite 'ylim' values In addition: Warning messages: 1: In min(x) : no non-missing arguments to min; returning Inf 2: In max(x) : no non-missing arguments to max; returning -Inf

there is a plot with model above so I suppose it is an issue of this particular analysis. @sgibb I have shared the file with you through google drive

2) is it possible to increase points transperancy on the plots? Shown below is what I got and I think the plot is not very clear since all the points are on top of each other

synapter_intensity

sgibb commented 7 years ago

I think 1. is a similar/the same problem as you described in: https://github.com/lgatto/synapter/issues/39#issuecomment-248502081 and solved in: https://github.com/lgatto/synapter/issues/39#issuecomment-248504276

Regarding the colour of the points: Currently we use an alpha (transparency) of 0.5 (1 means no transparency and 0 no colour). IMHO we should not decrease alpha (increase transparency) because it makes the colours very blurry. But the user can set the colour:

alpha <- c(1.0, 0.5, 0.1)
for (a in alpha) {
  plotRt(s, what="data", col=rgb(red=1, green=0, blue=0, alpha=a))
  title(paste("Alpha:", a))
}

rplot001 rplot002 rplot003