lgatto / synapter

Label-free data analysis pipeline for optimal identification and quantitation

https://lgatto.github.io/synapter/

4 stars 2 forks source link

Correct saturation by only quantifying on unsaturated isotopes in all samples under comparison #39

Closed pavel-shliaha closed 7 years ago

pavel-shliaha commented 11 years ago

The prototype of suggested function:

requantify <- function (synapterObj , satThreshold, minIsotopes)

synapterObj - list of synapter objects satThreshold - intensity over which accurate ion recording is not possible, due to saturation minIsotopes - minimum number of isotopes accepted for requantitation

For every peptide the function will find common unsaturated ions from supplied synapter objects, and requqntify EMRT using these ions only. satThreshold is used to determine which ions saturate. minIsotopes is the minimum number of isotopes that are sufficient to requantify (i.e. sometimes only few isotopes will be seen in all samples under saturation and over LOD)

Important: the information on peptide isotopes is not presented in current synapter objects! Pep3D file that is loaded in synapter is filtered so that for each EMRT only one isotope is preserved. This means that synapter object will have to be modified to include this information. Current consensus is that a list should be created each element of which represents a single EMRT and contains information on ion intensities from those EMRTs

pavel-shliaha commented 9 years ago

Laurent asked me to create a comment here:

I have had a look at the data and noticed that the saturation does not affect fragment quantities and lower abundance isotope. I would suggest the following easy to implement soultion:

1) when you load the P3D files keep only a single line per EMRT, but also record isotopic distribution of each EMRT. For example: write z2i0_1000_z2i1_500_z2i2_100_z3i0_200 which corresponds to a list of ions: charge 2 isotope with C13 = 0 has intensity of 1000, charge 2 isotope C13 = 1 with intensity of 500, charge 2 isotope C13 = 1 with intensity of 100, charge 3 isotope C13 = 0 with intensity of 200. Kepp this line for every EMRT

2) when matching fragments write the same but in form fragment_intensity, for example: b19_1000_y3_500.

Once you have applied a function of combining individual synapter analysis into MSnsets, you can apply a command of correcting saturation either on fragments or isotopes, by saying quantify on all fragments or isotopes which intensity is below certain threshold. Note that this is done AFTER several synapter analysis are combined into an MSnSet.

sgibb commented 9 years ago

As we discussed on skype we will follow the first strategy (quantify on peptides). Now the isotopic distribution is stored in the Synapter object and the resulting MSnSet:

## isotopicDistr is formated: ion_z,ion_iso,ion_counts
tail(cbind(exprs(m), fData(m)["isotopicDistr"]))
#                   Synapter1                                                        isotopicDistr
# IVDTWR                 1198                                               2,0,838;2,1,360;1,0,30
# ESASHETPTDLRPVIPR       555                                        3,0,275;3,1,280;1,0,56;1,1,43
# GYDLPDEVFDENEK          775                                       2,0,331;2,1,293;2,2,151;1,0,37
# IFGLEK                64706 1,0,42448;1,1,17358;1,2,2956;1,3,417;2,0,1037;2,1,490;1,0,118;1,1,83
# LATMNIDIDLTSGGK         414                                              3,0,218;3,1,196;1,0,122
# QNLFNNTVK              1266                                       2,0,703;2,1,399;2,2,164;1,0,33

How we realize the requantification? Do we simple sum up all counts below a threshold, e.g. for the last peptide: satThreshold = 500; newCount = 399 + 164 + 33 (596)?

pavel-shliaha commented 9 years ago

To do requantification synapter needs to first select all ions that are present in all the runs under saturation threshold. It then sums the intensity of these ions for every run, which is the new quantitation value. For example:

run1: 2,0,838;2,1,360;1,0,30 run2: 2,0,8000;2,1,4000;1,0,30

saturation threshold: 5000

there is 1 isotope present in all runs below saturation threshold: 2.1. so the new quantitation value is now 360 + 30: 4000 + 30

sgibb commented 9 years ago

So we just sum up common isotopes. For your example run 1 will be (360+30)/(4000+30) = 0.097 and run 2 will be 1.0? (these values are no counts anymore)

What do we do if there are

no isotopic distribution information avaiable (e.g. rescued EMRTs)
no counts below the saturation threshold
no common isotopes?

pavel-shliaha commented 9 years ago

1) this will have to be loaded unfortunately from the P3D file 2) there will always be isotopes below saturation threshold. THe number of identified isotopes is expected to be correlated with peptide intensity. The more there is of a peptide, the more isotopes it will have over a saturation threshold. I.e. for high intensity peptides (above saturation threshold), there will be more isotopes recorded than for a low intensity peptide and these isotopes will be below saturation threshold 3) this situation is more complicated. Consider this: the peptide is identified in two runs in the first run it is high intensity and has 5 isotopes identified:

z2i0 - 10.000 z2i1 - 8000 z2i3 - 5000 z2i4 - 2000 z2i5 - 1000

for the second run the peptide intensity is low and it has the following distribution

z2i0 - 5000 z2i1 - 2000

saturation threshold is 6000. Now this becomes a problem since there are no peptide ions that are common that are below saturation threshold. The only option here is to predict the correct intensity of ions above saturation threshold, by using ions below saturation threshold in run 1. There is a simple function that can generate isotopic distribution and then we say given that

z2i3 - 5000 z2i4 - 2000 z2i5 - 1000, we should actually see z2i0 at 15000 and z2i1 at 10000. It should not be much work since the function is very simple (it is based on sample). See IsotopicDistributionN {OrgMassSpecR} for an example

sgibb commented 9 years ago

1) this will have to be loaded unfortunately from the P3D file

That's a pain! I will do it later if the whole method works as expected.

Regarding 3.: In theory it sounds simple (like always) but here I used the last three peptides of my example dataset and try to calculate the original counts using theoretical proportions of the isotopes (ignoring any saturation). As you can see the calculated values are sometimes very good and sometimes (especially for the less abundant isotopes) very bad.

# I changed the isotopic distribution format for easier parsing, order is equal
GYDLPDEVFDENEK      775                                       2_0:331;2_1:293;2_2:151;1_0:37
IFGLEK            64706 1_0:42448;1_1:17358;1_2:2956;1_3:417;2_0:1037;2_1:490;1_0:118;1_1:83
QNLFNNTVK          1266                                       2_0:703;2_1:399;2_2:164;1_0:33

Using the BRAIN package (which offers a similar function as OrgMassSpecR::IsotopicDistributionN but is a lot of faster) I get the following results:

library("BRAIN")
(props <- calculateIsotopicProbabilities(getAtomsFromSeq("GYDLPDEVFDENEK"), nrPeaks=3))
# [1] 0.3958128 0.3450117 0.1722195
c("2_0"=331, "2_1"=293, "2_2"=151)/props
#      2_0      2_1      2_2 
# 836.2540 849.2465 876.7883 
# original counts are 775

(props <- calculateIsotopicProbabilities(getAtomsFromSeq("IFGLEK"), nrPeaks=4))
# [1] 0.65718493 0.26507158 0.06427314 0.01155455
c("1_0"=42448, "1_1"=17358, "1_2"=2956, "1_3"=417)/props
#      1_0      1_1      1_2      1_3 
# 64590.65 65484.20 45991.22 36089.69
# original counts are 64706

(props <- calculateIsotopicProbabilities(getAtomsFromSeq("QNLFNNTVK"), nrPeaks=3))
# [1] 0.5474634 0.3145201 0.1056640
c("2_0"=703, "2_1"=399, "2_2"=164)/props
#      2_0      2_1      2_2 
# 1284.104 1268.599 1552.090
# original counts are 1266

If I don't know the true counts (because I assume they are affected by saturation) I would tent to use the less abundant isotopes (because it is less likely that they are saturated) to calculate the "original" counts. But as shown above these less abundant isotopes results in very wired counts. Slowly I doubt that it would be possible to estimate the true counts in a reasonable way.

BTW: Would it be save to sum up the counts of equal isotopes of different charge states? (e.g. 1_1 + 2_1 + 3_1) If that would be possible (as I assume) the results will differ even more.

pavel-shliaha commented 9 years ago

I understand the difficulty and it is expected. Perhaps it is indeed going to be complicated to calculate the intensity of saturated ions based on non-saturated ions, due to the fact that there is a more significant noise contribution to signal for less intense isotopes.

Also perhaps the situation outlined in my previous post is extremely rare and we dont even need this calculation. So I would prefer to have 2 separate functions: 1st - requantify with lower intensity isotopes, 2nd - predict intensity of higher abundant isotopes and requantify then.

The bottom line is I can only test whether this functionality is useful once it is available to me. Put it in synapter and I will test it using the saturation dataset we have,

sgibb commented 9 years ago

Ok, I just finished the first attempt to realize this feature. It is still clumsy but I hope it is enough to start experimenting with it.

default workflow

## start a new, clean R session
## install devtools if you haven't done it before
install.packages("devtools")
library("devtools")

## get the latest synapter with requantification support
install_github("lgatto/synapter@saturation")
library("synapter")

## load synapterdata example files
path <- system.file("extdata", package="synapterdata")

identfiles <- list.files(path, pattern="^HDMSe", full.names=TRUE) 
quantfiles <- list.files(path, pattern="^MSe.*IA", full.names=TRUE) 
pep3dfiles <- list.files(path, pattern="Pep3DAMRT", full.names=TRUE) 
fasta <- unzip(file.path(path, "EcoliK12_enolase_UPSsimga_NB.fasta.zip"))

set.seed(1)
s <- vector(mode="list", length=length(identfiles))

## run standard synapter workflow
system.time(
for (i in seq(along=s)) {
  s[[i]] <- Synapter(list(identpeptide=identfiles[i],
                          quantpeptide=quantfiles[i],
                          quantpep3d=pep3dfiles[i],
                          fasta=fasta))
  filterUniqueDbPeptides(s[[i]])
  filterIdentPepScore(s[[i]], method="BH", fdr=0.05)
  filterQuantPepScore(s[[i]], method="BH", fdr=0.05)
  filterIdentProtFpr(s[[i]], fpr=0.05)
  filterQuantProtFpr(s[[i]], fpr=0.05)
  filterIdentPpmError(s[[i]], ppm=20)
  filterQuantPpmError(s[[i]], ppm=20)
  mergePeptides(s[[i]])
  modelRt(s[[i]], span=0.05)
  searchGrid(s[[i]])
  setBestGridParams(s[[i]])
  findEMRTs(s[[i]])
})

## create sample names
sets <- paste0("ups", sub("^.*([0-9]{2,})fmol_.*([0-9]{2,})_IA.*", 
                          "\\1.\\2", basename(identfiles)))

## create msnsets
ms <- lapply(s, as, Class="MSnSet")
## update sampleNames/fvarLables of the msnsets
ms <- mapply(function(m, s) {
              sampleNames(m) <- s;
              m <- updateFvarLabels(m, s)
              m
}, m=ms, s=sets)

## combine msnsets
system.time(
  msc <- Reduce(combine, ms)
)

# ignore peptides with no isotopicDistr data for now (rescued)
#i <- which(rowSums(!is.na(fData(msc)[, grep("isotopicDistr", fvarLabels(msc))])) == ncol(msc))
#
#msc <- msc[i]

Nothing new yet.

requantification

The next step would be the requantification. There is a new function called requantify with three arguments (the MSnSet; saturationThreshold: the maximal unsaturated counts, method: could be "sum": the sum of the common isotopes below the saturationThreshold; "th.mean"/"th.median"/"th.weighted.mean": use the mean/median/weighted.median of the isotopes below the saturationThreshold multiplied with their theoretical probability of occurrence; the latter need the BRAIN package installed (call ìnstall.packages("BRAIN") before)).

summed <- requantify(msc, saturationThreshold=Inf, method="sum")
thmean <- requantify(msc, saturationThreshold=Inf, method="th.mean")
thmedian <- requantify(msc, saturationThreshold=Inf, method="th.median")
thwmean <- requantify(msc, saturationThreshold=Inf, method="th.weighted.mean")

synapter-plgs agreement

Another new function is the synapterPlgsAgreement discussed in https://github.com/lgatto/synapter/issues/73.

msc <- synapterPlgsAgreement(msc)
cols <- grepl("synapterPlgsAgreement|nIdentified|nAgree|nDisagree", fvarLabels(msc))
head(fData(msc)[, cols])

What is missing/what to do next:

@pavel-shliaha has to test the correctness and usefulness of these functions.
There is no support for rescued EMRTs. As @pavel-shliaha mentioned before we have to read the isotopic distribution information from the ident pep3d file. I will add this if @pavel-shliaha told me that the functions produce useful results.
There is not any documentation of these functions yet (no vignette; not even a manual page).
We have to rethink about the new function/column/argument names.
Maybe a cleaner rewrite of this "quick hack" is needed.
The functions should be methods.

@pavel-shliaha I am looking forward to your feedback.

lgatto commented 9 years ago

A little technical comment/question regarding

The functions should be methods

Do you mean requantify? Why so? Using methods when functions work well enough or there is no requirement for dispatching on different inputs is not worth it. (I have been using methods when functions where fit for purpose too often in the past)

sgibb commented 9 years ago

@lgatto: Yes, I have seen that you change some methods to functions in some of your packages. What is the reason for that? Is there any drawback of S4 methods (except that dispatching needs some time and debugging is sometimes harder)?

pavel-shliaha commented 8 years ago

cannot find requantify function. Where does it live?

pavel-shliaha commented 8 years ago

requantify(msc, saturationThreshold=Inf, method="sum") function does not work very well.

1) easy to fix: if requantification is impossible can you not introduce NAs,but just leave the same quant values. Also introduce an additionl column "requantification" where you put TRUE, if requantification was succeful and FALSE if it was not. The reason is a user might not want to loose the data if requantification was not complete.

2) do I understand correctly that you re just removing the isotopes that are above saturation threshold.

e.g.

peptide in run 1 isotopes: 100, 10, 1 peptide in run 2 isotopes: 1e4, 1e3,1e2

if saturation threshold is 1e3, do you jut remove the ions above that and get

peptide in run 1: 100 + 10 + 1 peptide in run 2: 1e2

if so than this should not be the case. You have to remove the isotopes that are above saturation threshold in one of the runs from all of the runs. So the result should be:

peptide in run : 1 peptide in run 2: 1e2

sgibb commented 8 years ago

1.) Is that really expected? I wouldn't expect that there are any original values in the MSnSet after requantification. Do you really think we should implement it the way you suggested? If the user want to mix up original and requantified values he could easily do so:

r <- requantify(msc, saturationThreshold=1e3, method="sum")
notNA <- !is.na(exprs(r))
exprs(msc)[notNA] <- exprs(r)[notNA]

2.) This is not true. The current behaviour is already the way you want it to be. Default workflow:

library("devtools")
load_all("synapter")
path <- system.file("extdata", package="synapterdata")

identfiles <- list.files(path, pattern="^HDMSe", full.names=TRUE)
quantfiles <- list.files(path, pattern="^MSe.*IA", full.names=TRUE)
pep3dfiles <- list.files(path, pattern="Pep3DAMRT", full.names=TRUE)
fasta <- unzip(file.path(path, "EcoliK12_enolase_UPSsimga_NB.fasta.zip"))

set.seed(1)
s <- vector(mode="list", length=length(identfiles))

## run standard synapter workflow
system.time(
for (i in seq(along=s)) {
  s[[i]] <- Synapter(list(identpeptide=identfiles[i],
                          quantpeptide=quantfiles[i],
                          quantpep3d=pep3dfiles[i],
                          fasta=fasta))
  filterUniqueDbPeptides(s[[i]])
  filterIdentPepScore(s[[i]], method="BH", fdr=0.05)
  filterQuantPepScore(s[[i]], method="BH", fdr=0.05)
  filterIdentProtFpr(s[[i]], fpr=0.05)
  filterQuantProtFpr(s[[i]], fpr=0.05)
  filterIdentPpmError(s[[i]], ppm=20)
  filterQuantPpmError(s[[i]], ppm=20)
  mergePeptides(s[[i]])
  modelRt(s[[i]], span=0.05)
  searchGrid(s[[i]])
  setBestGridParams(s[[i]])
  findEMRTs(s[[i]])
})

## create msnsets
ms <- lapply(s, as, Class="MSnSet")

## combine msnsets
system.time(
  msc <- Reduce(combine, ms)
)

To show the selected isotopes I use some internal functions here:

f <- fData(msc)[, grep("isotopicDistr", fvarLabels(msc))]
x <- .splitIsotopicDistr(unlist(f[22, ]))
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_01
#    1_0    1_1    1_2    1_3    2_0    2_1    2_2    2_3    2_4
#  16684   6747   1885    430 137906  55855  15305   3352    549
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_03
#    1_0    1_1    1_2    1_3    2_0    2_1    2_2    2_3    2_4
#  10610   4355   1409    190 136850  57326  16114   1718    325
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_05
#    1_0    1_1    1_2    1_3    2_0    2_1    2_2    2_3    2_4
#   9983   4544    746    148 136386  56244  15337   3329    645
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_01
#    1_0    1_1    1_2    1_3    2_0    2_1    2_2    2_3    2_4
#  12127   5041    990    177 148808  60998  16693   3656    577
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_03
#    1_0    1_1    1_2    1_3    2_0    2_1    2_2    2_3    2_4    2_5
#   8180   3325    927    205 133856  54500  15275   3102    498     81
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_05
#    1_0    1_1    1_2    1_3    2_0    2_1    2_2    2_3    2_4
#  10893   4425   1365    276 141491  58884  16458   3489    345
#
.commonIsotopes(x, saturationThreshold=1e2)
# character(0)
.commonIsotopes(x, saturationThreshold=1e3)
# [1] "1_3" "2_4"

# 1_2 in run 1 is the only run where 1_2 > 1800 and all 1_2 isotopes are filtered
.commonIsotopes(x, saturationThreshold=1.8e3)
# [1] "1_3" "2_4"
.commonIsotopes(x, saturationThreshold=1.9e3)
# [1] "1_2" "1_3" "2_4"

Question: The current behaviour has some disadvantages. E.g. the third EMRT was seen in 5 of 6 runs with a good isotopic distribution. Because we require that the isotopes must be present in all runs (even when the EMRT was not observed at all), we could not find any common isotopes (because the first run has not any isotope) and could not requantify this EMRT:

x <- .splitIsotopicDistr(unlist(f[3, ]))
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_01
# <NA>
#   NA
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_03
#   1_0   1_1   1_2   1_3   2_0   2_1   2_2   2_3   2_4
# 10256  6073  1915   450 80564 48499 16732  4034   852
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_05
#   1_0   1_1   1_2   1_3   2_0   2_1   2_2   2_3   2_4   2_5
#  9710  5588  1699   426 86293 46817 14378  3842   847   264
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_01
#   1_0   1_1   1_2   1_3   2_0   2_1   2_2   2_3   2_4   2_5
#  9443  5068  1716   423 87367 47123 16806  4317   992   130
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_03
#   1_0   1_1   1_2   1_3   2_0   2_1   2_2   2_3   2_4   2_5
# 10260  5859  1842   428 79717 46234 15919  3935  1026   138
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_05
#   1_0   1_1   1_2   1_3   2_0   2_1   2_2   2_3   2_4   2_5
#  8439  5036  1629   347 68177 41550 14343  3415   751   122
#
.commonIsotopes(x, saturationThreshold=Inf)
# character(0)

I would suggest to ignore NA runs (where the EMRT was not observed) and requantify all the other runs (so in this example 2 runs out of 6 would be enough for requantification).

pavel-shliaha commented 8 years ago

I would actually like to suggest the following method to deal with the problem:

imagine we have a peptide at 10, 20, 50, 100 and 500 fmol (this is actually what we see in the organelle proteomics datasets a lot), the problem is that 10 and 500 fmol do not have any isotopes in common that are not over sat threshold in 500, but are still present at 10, so correction is not possible. 50 fmol however has common isotopes with both 10 and 500fmol, so we could quantify 10 vs 50 and 50 vs 500 to solve this problem. Thus my suggestion is:

1) nominate a run where all isotopes are under sat threshold (reference run). If there is several such runs than we could either:

use the run with the overall peptide highest intensity (e.g. if we have 20 and 50, take 50)
use all runs and find an average

2) requantify only those runs that have ions over sat threshold, i.e. leave 10 as it is, but requantify only 500.

sgibb commented 8 years ago

at 1) I would suggest to let the user choose a reference run and don't try to find a run automatically. (If we try to find a run automatically it could happen that the reference run differs across peptides. Or we have to analyse all runs at once.) at 2) that's sounds reasonable. Additionally I would suggest to ignore NA runs.

sgibb commented 8 years ago

As discussed in our last skype call I add a new requantify method: "reference".

It looks for an unsaturated run (none of the isotopes is higher than the saturation threshold) with the highest summed intensity value and use this as reference. A run with intensity values above the saturation threshold would be corrected as follows:

The unsaturated intensities are divided by their counterparts of the reference run. We get a scale factor for each isotope.
We calculate the mean scale factor.
We estimate the (unsaturated) intensities by multiplying the scale factor by the intensities of the reference run.

Example:

reference= "1_0:2000;1_1:200;1_2:20;1_3:2;"
satrun   = "1_0:6000;1_1:1000;1_2:100;1_3:10;1_4:1"

satThreshold = 5000

# 1. step
satrun[2:4]/reference[2:4] # 1000/200, 100/20, 10/2
# 5, 5, 5

# 2. step
scaleFactor = mean(c(5, 5, 5)) # 5

# 3. step 
unsatInt = reference[1]*scaleFactor # 10000
correctedInt = sum(10000, 1000, 100, 10, 1) # 11111

A more complete example (ref is run 3 and sat is run 4)

f <- data.frame(isotopicDistr.run1="1_0:1000;1_1:100;1_2:10;1_3:1;",
                isotopicDistr.run2="1_0:800;1_1:100;1_2:10;1_3:1;",
                isotopicDistr.run3="1_0:2000;1_1:200;1_2:20;1_3:2;",
                isotopicDistr.run4="1_0:6000;1_1:1000;1_2:100;1_3:10;1_4:1",
                stringsAsFactors=FALSE)

synapter:::.requantifyReferenceRun(f, saturationThreshold=Inf)
# isotopicDistr.run1 isotopicDistr.run2 isotopicDistr.run3 isotopicDistr.run4
#               1111                911               2222               7111
synapter:::.requantifyReferenceRun(f, saturationThreshold=5000)
# isotopicDistr.run1 isotopicDistr.run2 isotopicDistr.run3 isotopicDistr.run4
#               1111                911               2222              11111

The user visible function (msc object from the example code some comments above):

refcorrected <- requantify(msc, saturationThreshold=5000, method="reference")

sgibb commented 8 years ago

We need to review all saturation correction algorithms. Even "sum" has its bugs.

"Sum" (maybe others too) should get an option to decide whether we want to remove unsaturated isotopes that are below the threshold but not present in all runs (current behaviour) or if we want to keep them:

run1 z1:10, z2:5, z3:1; run2 z2:5
threshold: 20: run1: 15 (or 16, z3 is missing in run2), run2: 5
threshold: 8: run1: 5 (or 6?), run2: 5

sgibb commented 8 years ago

A question regarding reference run requantification:

We have the following situation:

reference run: z1_1: 100, z1_2: 10, z1_3
run500mol:     z1_1: 2000, z1_2: 1000, z1_3: 500, z2_1: 100, z2_2: 10

If the saturation threshold is 400, the reference run and run500mol have not any unsaturated isotope in common. How we handle this?

Sounds strange? But happens at least 2 times in your example data set, e.g. for GPFPIIV (second row, second plot):

ions_reference6

pavel-shliaha commented 8 years ago

"If the saturation threshold is 400, the reference run and run500mol have not any unsaturated isotope in common. How we handle this?"

you cant hadle this by reference approach. First some definitions so we are on the same page. Assuming we are talking about a particular peptide: SATURATED runs are those runs that have isotopes over saturation threshold. UNSATURATED runs are those runs that have isotopes below certain threshold. REFERENCE run is unsaturated run with the highest intensity.

The idea of SUM method is to use common isotopes below saturation threshold for requantification. However you will run into the problem there are no common isotopes between certain runs. This can be solved partly by reference method that does not require common isotopes between all runs, but requires common isotopes only between reference run and saturated runs. However there still can be situations in which there are no isotopes in common beow saturation theshold between reference and quantitation run, in which case requantification is impossible.

sgibb commented 8 years ago

@pavel-shliaha asked me to look at some wired results of the detector saturation correction using the reference method.

DDPHACYSTVFDK

99.92 % of the signal after reference requantification is from the 500 fmol samples. Why this happen? If you look at the fData(x)$isotopicDistr columns you can see that there are not any information about the isotopic distribution available for all samples except the 500 fmol ones and a single 75 fmol.

As fData(x)$synapterPlgsAgreement shows there was no transfer because the grid search could not match ident and quant run. IHMO no requantification could solve this. What to do? Would it be a good solution not to requantificated any peptides where some sample could not matched in the grid search?

load_all("../../synapter/")
combMSnSet <- readRDS(file.path("result", "combMSnSet.RDS"))

combScheme <- list("25" = 6:7, "50" = 11:13, "75" = 14:16,
                   "100" = 1:3, "250" = 4:5, "500" = 8:10)
j <- unlist(combScheme)
i <- which(featureNames(combMSnSet) == "DDPHACYSTVFDK")
dd <- combMSnSet[i]

fvarLabels(dd) <- gsub("29022012_Ecoli_spikes_", "", fvarLabels(dd))
iso <- fData(dd)[grepl("isotopicDistr", fvarLabels(dd))][j]
agree <- fData(dd)[grepl("synapterPlgsAgreement", fvarLabels(dd))][j]
rownames(iso) <- rownames(agree) <- NULL

e <- exprs(dd)[, j]
r <- exprs(requantify(dd, method="reference", sat=1e5))[, j]
rownames(e) <- rownames(r) <- NULL

e
#  29022012_Ecoli_spikes_25fmol_04  29022012_Ecoli_spikes_25fmol_06
#                            14090                            13879
#  29022012_Ecoli_spikes_50fmol_02  29022012_Ecoli_spikes_50fmol_04
#                            41539                            28718
#  29022012_Ecoli_spikes_50fmol_07  29022012_Ecoli_spikes_75fmol_02
#                            47001                            47370
#  29022012_Ecoli_spikes_75fmol_04  29022012_Ecoli_spikes_75fmol_06
#                            53378                            41788
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
#                            44478                            77878
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
#                            70682                           159945
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
#                           124565                           185455
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
#                           235531                           239742

r
#  29022012_Ecoli_spikes_25fmol_04  29022012_Ecoli_spikes_25fmol_06
#                                0                                0
#  29022012_Ecoli_spikes_50fmol_02  29022012_Ecoli_spikes_50fmol_04
#                                0                                0
#  29022012_Ecoli_spikes_50fmol_07  29022012_Ecoli_spikes_75fmol_02
#                                0                                0
#  29022012_Ecoli_spikes_75fmol_04  29022012_Ecoli_spikes_75fmol_06
#                                0                              518
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
#                                0                                0
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
#                                0                                0
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
#                                0                           185455
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
#                           235531                           239742

iso
#   isotopicDistr.25fmol_04 isotopicDistr.25fmol_06 isotopicDistr.50fmol_02
# 1                    <NA>                    <NA>                    <NA>
#   isotopicDistr.50fmol_04 isotopicDistr.50fmol_07 isotopicDistr.75fmol_02
# 1                    <NA>                    <NA>                    <NA>
#   isotopicDistr.75fmol_04 isotopicDistr.75fmol_06 isotopicDistr.100fmol_02
# 1                    <NA>         2_0:321;2_1:197                     <NA>
#   isotopicDistr.100fmol_04 isotopicDistr.100fmol_06 isotopicDistr.200fmol_02
# 1                     <NA>                     <NA>                     <NA>
#   isotopicDistr.200fmol_04
# 1                     <NA>
#                                                                                                         isotopicDistr.500fmol_02
# 1 2_0:40003;2_1:32600;2_2:18058;2_3:6621;2_4:1735;2_5:421;2_6:127;3_0:32190;3_1:25740;3_2:18706;3_3:6798;3_4:1939;3_5:435;3_6:82
#                                                                                                         isotopicDistr.500fmol_04
# 1 2_0:53185;2_1:45888;2_2:26425;2_3:9358;2_4:2456;2_5:531;2_6:119;3_0:36456;3_1:29678;3_2:20958;3_3:7640;3_4:2300;3_5:469;3_6:68
#                                                                                                  isotopicDistr.500fmol_06
# 1 2_0:53234;2_1:46184;2_2:24468;2_3:8270;2_4:2146;2_5:463;3_0:38756;3_1:34138;3_2:21667;3_3:7529;3_4:2267;3_5:493;3_6:127

agree
#   synapterPlgsAgreement.25fmol_04 synapterPlgsAgreement.25fmol_06
# 1            no_synapter_transfer            no_synapter_transfer
#   synapterPlgsAgreement.50fmol_02 synapterPlgsAgreement.50fmol_04
# 1            no_synapter_transfer            no_synapter_transfer
#   synapterPlgsAgreement.50fmol_07 synapterPlgsAgreement.75fmol_02
# 1            no_synapter_transfer            no_synapter_transfer
#   synapterPlgsAgreement.75fmol_04 synapterPlgsAgreement.75fmol_06
# 1            no_synapter_transfer            no_synapter_transfer
#   synapterPlgsAgreement.100fmol_02 synapterPlgsAgreement.100fmol_04
# 1             no_synapter_transfer             no_synapter_transfer
#   synapterPlgsAgreement.100fmol_06 synapterPlgsAgreement.200fmol_02
# 1             no_synapter_transfer             no_synapter_transfer
#   synapterPlgsAgreement.200fmol_04 synapterPlgsAgreement.500fmol_02
# 1             no_synapter_transfer                            agree
#   synapterPlgsAgreement.500fmol_04 synapterPlgsAgreement.500fmol_06
# 1                            agree                            agree

IGEEYISDLDQLR

Same problem as above but just for one 500 fmol sample.


e
#  29022012_Ecoli_spikes_25fmol_04  29022012_Ecoli_spikes_25fmol_06
#                            17933                            24056
#  29022012_Ecoli_spikes_50fmol_02  29022012_Ecoli_spikes_50fmol_04
#                            61520                            55124
#  29022012_Ecoli_spikes_50fmol_07  29022012_Ecoli_spikes_75fmol_02
#                            55565                            74681
#  29022012_Ecoli_spikes_75fmol_04  29022012_Ecoli_spikes_75fmol_06
#                            77651                            74222
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
#                            89107                           113455
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
#                           105244                           225785
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
#                           215329                           214903
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
#                           252642                           280309

r
#  29022012_Ecoli_spikes_25fmol_04  29022012_Ecoli_spikes_25fmol_06
#                          17933.0                          24056.0
#  29022012_Ecoli_spikes_50fmol_02  29022012_Ecoli_spikes_50fmol_04
#                          61520.0                          55124.0
#  29022012_Ecoli_spikes_50fmol_07  29022012_Ecoli_spikes_75fmol_02
#                          55565.0                          74681.0
#  29022012_Ecoli_spikes_75fmol_04  29022012_Ecoli_spikes_75fmol_06
#                          77651.0                          74222.0
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
#                          89107.0                         113455.0
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
#                         105244.0                         225785.0
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
#                         215329.0                              0.0
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
#                         252642.0                         318415.1

iso
#                      isotopicDistr.25fmol_04
# 1 2_0:8013;2_1:6519;2_2:2411;2_3:749;2_4:241
#                                       isotopicDistr.25fmol_06
# 1 2_0:10824;2_1:8347;2_2:3532;2_3:861;3_0:197;3_1:175;3_2:120
#                                 isotopicDistr.50fmol_02
# 1 2_0:27650;2_1:21582;2_2:9207;2_3:2400;2_4:502;2_5:179
#                                                isotopicDistr.50fmol_04
# 1 2_0:23662;2_1:19309;2_2:8549;2_3:2520;2_4:674;2_5:244;3_0:108;3_1:58
#                                 isotopicDistr.50fmol_07
# 1 2_0:24159;2_1:19193;2_2:8596;2_3:2628;2_4:802;2_5:187
#                                  isotopicDistr.75fmol_02
# 1 2_0:32999;2_1:26262;2_2:11456;2_3:3084;2_4:654;2_5:226
#                                         isotopicDistr.75fmol_04
# 1 2_0:34799;2_1:27407;2_2:11170;2_3:3322;2_4:727;3_0:144;3_1:82
#                                                  isotopicDistr.75fmol_06
# 1 2_0:32917;2_1:25482;2_2:11430;2_3:3056;2_4:830;2_5:227;3_0:170;3_1:110
#                                                isotopicDistr.100fmol_02
# 1 2_0:39872;2_1:30827;2_2:13170;2_3:3771;2_4:972;2_5:236;3_0:188;3_1:71
#                                          isotopicDistr.100fmol_04
# 1 2_0:50963;2_1:39987;2_2:16566;2_3:4480;2_4:1097;3_0:228;3_1:134
#                                  isotopicDistr.100fmol_06
# 1 2_0:46435;2_1:38035;2_2:15497;2_3:4051;2_4:1019;2_5:207
#                                         isotopicDistr.200fmol_02
# 1 2_0:97225;2_1:80863;2_2:35200;2_3:9747;2_4:2216;2_5:455;2_6:79
#                                                                                  isotopicDistr.200fmol_04
# 1 2_0:91501;2_1:76143;2_2:32803;2_3:10120;2_4:2229;2_5:504;2_6:143;3_0:653;3_1:560;3_2:430;3_3:160;3_4:83
#   isotopicDistr.500fmol_02
# 1                     <NA>
#                                                                                  isotopicDistr.500fmol_04
# 1 1_0:401;1_1:372;2_0:94256;2_1:83636;2_2:51768;2_3:17209;2_4:3779;2_5:807;2_6:104;3_0:128;3_1:100;3_2:82
#                                                                                    isotopicDistr.500fmol_06
# 1 1_0:413;1_1:379;2_0:103359;2_1:95753;2_2:56364;2_3:18257;2_4:4074;2_5:751;2_6:150;3_0:361;3_1:259;3_2:189

agree
#   synapterPlgsAgreement.25fmol_04 synapterPlgsAgreement.25fmol_06
# 1                           agree                           agree
#   synapterPlgsAgreement.50fmol_02 synapterPlgsAgreement.50fmol_04
# 1                           agree                           agree
#   synapterPlgsAgreement.50fmol_07 synapterPlgsAgreement.75fmol_02
# 1                           agree                           agree
#   synapterPlgsAgreement.75fmol_04 synapterPlgsAgreement.75fmol_06
# 1                           agree                           agree
#   synapterPlgsAgreement.100fmol_02 synapterPlgsAgreement.100fmol_04
# 1                            agree                            agree
#   synapterPlgsAgreement.100fmol_06 synapterPlgsAgreement.200fmol_02
# 1                            agree                            agree
#   synapterPlgsAgreement.200fmol_04 synapterPlgsAgreement.500fmol_02
# 1                            agree             no_synapter_transfer
#   synapterPlgsAgreement.500fmol_04 synapterPlgsAgreement.500fmol_06
# 1                            agree                            agree

BTW: the "sum" requantification isn't working for these both examples (resulting in zero intensity for all samples).

pavel-shliaha commented 8 years ago

Thanks for having a look into it. In both cases it is the artifact of plotting requantified results, not of requantification itself. What I mean is that the code to plot requantification is just for the paper it is not a part of basic synapter functionality (unless we want to give visualisaion tool to the user?) About the peptides:

1) for "DDPHACYSTVFDK" there is a plot before ("none") requantifiction and after. If there is no transfer then where does the "none" plot come from? 2) for "IGEEYISDLDQLR" when making the plots IGNORE the peptide measurements in which transfer did not happen, i.e. the 500fmol point should be mean (c (225785.0, 318415.1)), not mean (c (225785.0, 318415.1, 0)). I am still confused why you loose a measurement even if you do not transfer by synapter (I always run code with "rescue" option).

sgibb commented 8 years ago

The "none" plot comes from the "rescued" identification data. Because there is no match in the grid search we could not transfer quantitation data. In the current approach we just have the isotopic distribution for the quantitation data (because we don't read the ident pep3d data yet).

It would be a big pain but it seems we need to import the ident pep3d data as well.

sgibb commented 8 years ago

Sorry, my fault. It seems that the rescueEMRTs method rescues just the count data. That's why we lose the isotopic distribution information here.

sgibb commented 8 years ago

It is not as easy as I thought. rescueEMRTs copies data from the merged data (ident peptide data and quant peptide data; used for the lowess model) to the matched data (results of the grid search). But the peptide data don't contain any isotopic distribution information. That's why rescueEMRTs has to look up at least the isotopic distribution in the pep3D data.

sgibb commented 8 years ago

After fixing #108 the isotopicDistr column is rescued as well and the figures look better (but even now the "sum" seems to be more sufficient than the "reference" method):

ions_reference2 ions_reference3

pavel-shliaha commented 8 years ago

@sgibb does the saturation threshold represent intensity for an ion (lives in isotopic distribution column of fData) or for the peptide (lives in expression table)? It has to be consistent in all types of saturation correction.

sgibb commented 8 years ago

If the intensity of an ion is higher than saturationThreshold it is considered as saturated (for all methods).

pavel-shliaha commented 8 years ago

@sgibb can you also clarify what happend to data that does not need to be corrected? E.g. if saturation threshold is 10k and there is two peptides:

1) should not be corrected (i.e. no ions over 10k in the peptide) 2) should be corrected but cannot be corrected (i.e. there are no ions below 10k to correct)

pavel-shliaha commented 8 years ago

It seems synapter introduces non 0 (and non NA) values into expression matrix, when it corrects saturation by th methods). E.g.

I think this happens because correction works with the isotopicDistr columns and they contain data even if the correpsonding sample has NA in expression table.

Could we please make it so, that if a sample had an NA in expression data for a particular peptide it will still be NA after correction?

pavel-shliaha commented 8 years ago

saturation should be corrected focusing on particular ions (i.e. particular isotope and charge state), i.e. that synapter:::.sumIsotopes() should be removed and the function should be modified to work with individual charge states.

sgibb commented 8 years ago

Overwritting missing values (NAs) is fixed.
Why? And how should we predict the Counts (sum of all ions from a single ion (isotope + charge state)? Using the sequence information we can predict the distribution of the isotopes but not the different charge states (how to know that there are 2+, 3+ and 4+ in the example below and that 4+ could be neglected). Using the information of just one charge state underestimates the counts:

## use the same example as before (without any saturation threshold)
poi <- combMSNset[grep("^LAMQEFMILPVGAANFR$", featureNames(combMSNset)),]
iso <- fData(poi)[, grep("isotopicDistr", colnames(fData(combMSNset)))]
## just use the first run because there are no missing values
m <- synapter:::.isotopicDistr2matrix(iso)[1,]
m
#   2_0   2_1   2_2   2_3   2_4   2_5   2_6   3_0   3_1   3_2   3_3   3_4   3_5
# 44772 52207 36735 15228  4524  1147   244 41226 44926 36857 21123  9240  2494
#   3_6   3_7   4_0   4_1   4_2
#   324    93    71   122    84

## find the isotope number
iIsotope <- as.numeric(gsub("^[0-9]+_([0-9]+)$", "\\1", colnames(x))) + 1L
iIsotope
#  [1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 8 1 2 3
nIsotopes <- max(iIsotope)
nIsotopes
# [1] 8

## calculate theoretical isotopic distribution
props <- BRAIN::calculateIsotopicProbabilities(
  BRAIN::getAtomsFromSeq("LAMQEFMILPVGAANFR"), nrPeaks=nIsotopes)
props
# [1] 0.3040745888 0.3230766302 0.2112500172 0.1026649967 0.0403559650 0.0134161742
# [7] 0.0038789509 0.0009942712
m/props
#          2_0          2_1          2_2          2_3          2_4          2_5
# 1.472402e+05 1.615932e+05 1.738935e+05 1.483271e+05 1.121024e+05 8.549382e+04
#          2_6          3_0          3_1          3_2          3_3          3_4
# 6.290361e+04 4.146354e+07 1.477466e+05 1.140813e+05 9.999052e+04 9.000146e+04
#          3_5          3_6          3_7          4_0          4_1          4_2
# 6.180003e+04 2.414995e+04 2.397555e+04 7.140909e+04 4.012173e+02 2.600002e+02

## estimated counts (nearly 1/3 of the true counts)
mean(m/props[iIsotope])
# [1] 118852
median(m/props[iIsotope])
# [1] 137317.7

## using the current sum approach
s <- synapter:::.sumIsotopes(t(m))
s
#          0     1     2     3     4    5   6  7
# [1,] 86069 97255 73676 36351 13764 3641 568 93

## estimated counts (even underestimated but nearly the same as the true counts)
mean(s/props)
# [1] 267417.1
median(s/props)
# [1] 292040

exprs(poi)[1]
# [1] 311417

sgibb commented 8 years ago

An alternative would be to estimate the mean/median/weighted.mean counts for each charge state individually and sum them up afterwards, e.g.:

charge <- as.numeric(gsub("^([0-9]+)_[0-9]+$", "\\1", colnames(x)))
thmean <- tapply(m/props[iIsotope], charge, mean)
thmean
#           2           3           4
# 127364.8301 155846.7826    336.2493
sum(thmean)
# [1] 283547.9
thmed <- tapply(m/props[iIsotope], charge, median)
thmed
#           2           3           4
# 147240.1892 156763.8922    377.6194
sum(thmed)
# [1] 304381.7

pavel-shliaha commented 8 years ago

actually now that I think about it, the best way is to use all ions below saturation thsehold individually without minding the charge state.So if you have

1_0, 1_1, 1_2 2_0, 2_1, 2_2

and 2_0 is below the threshold, use 1_0, 1_1, 1_2, 2_1, 2_2, just as you dd before, but dont sum isotopes across charge states

pavel-shliaha commented 8 years ago

@sgibb another request (sorry). I think there is another problem, we need to only repredict intensity of the peptide onlyin the runs in which it is saturated. It seems that there is a lot of noise when repredicting from low intensity peptide. I.e. if there is a situation: peptide 1 is 5e3 in run 1 and 5e5 in run 2 and saturation threshold is 1e4, then 5e3 should not be touched (no ions over saturation threshold) and 5e5 should be requantified since it does have ions over saturation threshold.

PS1: Note that noise from reprediction is additive, i.e. if there is an error repredicting 5e3 and an error repredicting 5e5, then the ratio might have additive error. I think that this is shy we have so much "noise" in Kuharev's data

PS2: if a user does want to repredict all, then he can use requantifyAll = TRUE

sgibb commented 8 years ago

@pavel-shliaha IMHO this is already working as you want it:

msn <- readRDS("refCombMSNSet.RDS")[1, 1:2]
## store and rewrote fData (MSnbase subsetting doesn't remove the other
## isotopicDistr columns)
iso <- fData(msn)[, c("isotopicDistr.S130423_05", "isotopicDistr.S130423_07")]
fData(msn) <- iso
iso
# LAMQEFMILPVGAANFR 2_0:44772;2_1:52207;2_2:36735;2_3:15228;2_4:4524;2_5:1147;2_6:244;3_0:41226;3_1:44926;3_2:36857;3_3:21123;3_4:9240;3_5:2494;3_6:324;3_7:93;4_0:71;4_1:122;4_2:84
#                                                                                   isotopicDistr.S130423_07
# LAMQEFMILPVGAANFR 2_0:37377;2_1:39573;2_2:25620;2_3:10465;3_0:34237;3_1:34126;3_2:30939;3_3:19062;3_4:6166

m <- synapter:::.isotopicDistr2matrix(iso)
satTh <- 5e4

## just one ion in the run 5 is saturated
m > satTh
#                            2_0   2_1   2_2   2_3   2_4   2_5   2_6   3_0   3_1
# isotopicDistr.S130423_05 FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# isotopicDistr.S130423_07 FALSE FALSE FALSE FALSE    NA    NA    NA FALSE FALSE
#                            3_2   3_3   3_4   3_5   3_6   3_7   4_0   4_1   4_2
# isotopicDistr.S130423_05 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# isotopicDistr.S130423_07 FALSE FALSE FALSE    NA    NA    NA    NA    NA    NA

exprs(msn)
#                   S130423_05 S130423_07
# LAMQEFMILPVGAANFR     311417     237565

## requantify all (even those runs that are not saturated)
msnT <- requantify(msn, method="th.mean", saturationThreshold=satTh, requantifyAll=TRUE)
exprs(msnT)
#                   S130423_05 S130423_07
# LAMQEFMILPVGAANFR   277843.1   257783.3

## requantify just saturated once (please not run 7 was not affacted)
msnF <- requantify(msn, method="th.mean", saturationThreshold=satTh, requantifyAll=FALSE)
exprs(msnF)
#                   S130423_05 S130423_07
# LAMQEFMILPVGAANFR   277843.1     237565

pavel-shliaha commented 8 years ago

@sgibb I think I finally understand what the problem might be:

when you perform combination of th table you use all charge states: e.g.

what this results in is that you predict 2_0 using 1_0 and 1_1, but you should only be using charge state 2, not 1. I understand that the idea of th is that:

1) there is a theoretical peptide distribution: lets call it thDistr 2) real peptide distribution is the theoretical distribution multiplied by a certain constant K, so observed distribution is: obsDistr = thDistr * K 3) you calculate K from unsaturated ions K = obsDistr[unsat]/thDistr[unsat] and then you multiply it by theoretical proportion of saturated ions obsDistr[sat] = thDistr[sat] * K 4) however K is ONLY constant within 1 charge state. E.g. K for charge 1 is mean (c(2655, 2249)) and K for charge 2 is mean (c(52799, 31883)) = 42341.41. But you do mean (c(2655, 2249, 52799, 31883)) = 22396.91. So you multiply obsDistr[sat]* 22396.91 instead of obsDistr[sat]* 42341.41. This results in a paradox with less intensity when you perform calculation then when you started and all the noise we see.

sgibb commented 8 years ago

I think you now understand the algorithm but your point 4) is wrong.

Just to ensure that we talking about the same piece of code. The following code is the same I send you yesterday:

library("synapter")

combMSNset <- readRDS("refCombMSNSet.RDS")

## fetch peptide of interest
poi <- combMSNset[grep("^MTDLDLAGK$", featureNames(combMSNset)),]
## grep isotopic distribution
iso <- fData(poi)[, grep("isotopicDistr", fvarLabels(combMSNset))]
## turn into a matrix
m <- synapter:::.isotopicDistr2matrix(iso)
m
#        1_0  1_1   2_0   2_1   2_2  1_2  2_3 1_3 2_4
#  [1,] 1511  629 30859 14763  3525   NA   NA  NA  NA
#  [2,] 1166  470 31846 16087  4644  184   NA  NA  NA
#  [3,] 1172  441 30856 15004  3296  191  319  NA  NA
#  [4,]  371  155 25275 11440  3737   NA   NA  NA  NA
#  [5,]  950  349 22497 11292  2408   NA   NA  NA  NA
#  [6,] 8390 3343 75227 51491 22659 1148   NA 203  NA
#  [7,] 7569 2843 72229 51891 25683  849   NA 170  NA
#  [8,] 8689 3326 66247 48676 23591  869   NA 286  NA
#  [9,] 8593 3078 56700 41174 20213  976   NA  NA  NA
# [10,] 8451 2956 56708 41278 20537  878 3640 158 880

m <- synapter:::.isotopicDistr2matrix(f)
m

saturationThreshold <- 3e4

# which ions are below threshold?
unsat <- m < saturationThreshold
unsat
#        1_0  1_1   2_0   2_1  2_2  1_2  2_3  1_3  2_4
#  [1,] TRUE TRUE FALSE  TRUE TRUE   NA   NA   NA   NA
#  [2,] TRUE TRUE FALSE  TRUE TRUE TRUE   NA   NA   NA
#  [3,] TRUE TRUE FALSE  TRUE TRUE TRUE TRUE   NA   NA
#  [4,] TRUE TRUE  TRUE  TRUE TRUE   NA   NA   NA   NA
#  [5,] TRUE TRUE  TRUE  TRUE TRUE   NA   NA   NA   NA
#  [6,] TRUE TRUE FALSE FALSE TRUE TRUE   NA TRUE   NA
#  [7,] TRUE TRUE FALSE FALSE TRUE TRUE   NA TRUE   NA
#  [8,] TRUE TRUE FALSE FALSE TRUE TRUE   NA TRUE   NA
#  [9,] TRUE TRUE FALSE FALSE TRUE TRUE   NA   NA   NA
# [10,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE

# multiplie ion matrix by unsat matrix (saturated ions are treated as zero)
m <- m * unsat
m
#        1_0  1_1   2_0   2_1   2_2  1_2  2_3 1_3 2_4
#  [1,] 1511  629     0 14763  3525   NA   NA  NA  NA
#  [2,] 1166  470     0 16087  4644  184   NA  NA  NA
#  [3,] 1172  441     0 15004  3296  191  319  NA  NA
#  [4,]  371  155 25275 11440  3737   NA   NA  NA  NA
#  [5,]  950  349 22497 11292  2408   NA   NA  NA  NA
#  [6,] 8390 3343     0     0 22659 1148   NA 203  NA
#  [7,] 7569 2843     0     0 25683  849   NA 170  NA
#  [8,] 8689 3326     0     0 23591  869   NA 286  NA
#  [9,] 8593 3078     0     0 20213  976   NA  NA  NA
# [10,] 8451 2956     0     0 20537  878 3640 158 880

# find charge and isotope number from column names
cn <- as.numeric(unlist(strsplit(colnames(m), "_", fixed=TRUE)))
sel <- as.logical(seq_along(cn) %% 2L)

charges <- cn[sel]
# [1] 1 1 2 2 2 1 2 1 2

## add + 1L because indices in R start with 1
iIsotopes <- cn[!sel] + 1L
# [1] 1 2 1 2 3 3 4 4 5

## find maximal number of isotopes
nIsotopes <- max(iIsotopes)
# [1] 5

## use BRAIN to caluclation isotopic distribution for 5 isotopes
probs <- BRAIN::calculateIsotopicProbabilities(
  BRAIN::getAtomsFromSeq("MTDLDLAGK"), nrPeaks=nIsotopes)
# [1] 0.569071397 0.279605131 0.110559250 0.031619035 0.007398698

## divide each count by its corresponding proportion
th <- t(t(m)/probs[iIsotopes])
#              1_0        1_1      2_0      2_1       2_2       1_2       2_3
#  [1,]  2655.2029  2249.6011     0.00 52799.46  31883.36        NA        NA
#  [2,]  2048.9520  1680.9420     0.00 57534.71  42004.63  1664.266        NA
#  [3,]  2059.4955  1577.2243     0.00 53661.39  29812.07  1727.580  10088.86
#  [4,]   651.9393   554.3532 44414.46 40914.84  33800.88        NA        NA
#  [5,]  1669.3863  1248.1888 39532.83 40385.53  21780.18        NA        NA
#  [6,] 14743.3170 11956.1468     0.00     0.00 204948.93 10383.573        NA
#  [7,] 13300.6158 10167.9107     0.00     0.00 232300.78  7679.140        NA
#  [8,] 15268.7344 11895.3468     0.00     0.00 213378.80  7860.039        NA
#  [9,] 15100.0385 11008.3817     0.00     0.00 182825.05  8827.846        NA
# [10,] 14850.5092 10572.0521     0.00     0.00 185755.60  7941.443 115120.53
#            1_3      2_4
#  [1,]       NA       NA
#  [2,]       NA       NA
#  [3,]       NA       NA
#  [4,]       NA       NA
#  [5,]       NA       NA
#  [6,] 6420.183       NA
#  [7,] 5376.508       NA
#  [8,] 9045.184       NA
#  [9,]       NA       NA
# [10,] 4996.990 118939.8

## replace the zeros (saturated ions) by NA
th[th == 0L] <- NA_real_
th
#              1_0        1_1      2_0      2_1       2_2       1_2       2_3
#  [1,]  2655.2029  2249.6011       NA 52799.46  31883.36        NA        NA
#  [2,]  2048.9520  1680.9420       NA 57534.71  42004.63  1664.266        NA
#  [3,]  2059.4955  1577.2243       NA 53661.39  29812.07  1727.580  10088.86
#  [4,]   651.9393   554.3532 44414.46 40914.84  33800.88        NA        NA
#  [5,]  1669.3863  1248.1888 39532.83 40385.53  21780.18        NA        NA
#  [6,] 14743.3170 11956.1468       NA       NA 204948.93 10383.573        NA
#  [7,] 13300.6158 10167.9107       NA       NA 232300.78  7679.140        NA
#  [8,] 15268.7344 11895.3468       NA       NA 213378.80  7860.039        NA
#  [9,] 15100.0385 11008.3817       NA       NA 182825.05  8827.846        NA
# [10,] 14850.5092 10572.0521       NA       NA 185755.60  7941.443 115120.53
#            1_3      2_4
#  [1,]       NA       NA
#  [2,]       NA       NA
#  [3,]       NA       NA
#  [4,]       NA       NA
#  [5,]       NA       NA
#  [6,] 6420.183       NA
#  [7,] 5376.508       NA
#  [8,] 9045.184       NA
#  [9,]       NA       NA
# [10,] 4996.990 118939.8

## th.mean: calculate mean for each charge state
r <- t(MSnbase:::utils.applyColumnwiseByGroup(t(th), groupBy = charges,
                                              colMeans, na.rm=TRUE))
#                1         2
#  [1,]  2452.4020  42341.41
#  [2,]  1798.0533  49769.67
#  [3,]  1788.1001  31187.44
#  [4,]   603.1462  39710.06
#  [5,]  1458.7876  33899.51
#  [6,] 10875.8049 204948.93
#  [7,]  9131.0438 232300.78
#  [8,] 11017.3261 213378.80
#  [9,] 11645.4220 182825.05
# [10,]  9590.2486 139938.66

## (this MSnbase function is equivalent to (but the former is more general):
r <- cbind(rowMeans(th[, c(1:2, 6, 8)], na.rm=TRUE),
           rowMeans(th[, c(3:5, 7, 9)], na.rm=TRUE))
#             [,1]      [,2]
#  [1,]  2452.4020  42341.41
#  [2,]  1798.0533  49769.67
#  [3,]  1788.1001  31187.44
#  [4,]   603.1462  39710.06
#  [5,]  1458.7876  33899.51
#  [6,] 10875.8049 204948.93
#  [7,]  9131.0438 232300.78
#  [8,] 11017.3261 213378.80
#  [9,] 11645.4220 182825.05
# [10,]  9590.2486 139938.66

## finally sum the charge states to get the final counts
r <- rowSums(r, na.rm=TRUE)
#  [1]  44793.81  51567.72  32975.54  40313.21  35358.30 215824.73 241431.82
#  [8] 224396.12 194470.47 149528.91
rAllT <- r

## if requantifyAll != TRUE: find all runs without any saturation
runUnsat <- which(rowSums(!unsat, na.rm=TRUE) == 0L)
# [1] 4 5
## and replace their theoritical values with the original experimental ones
r[runUnsat] <- rowSums(m[runUnsat, , drop=FALSE], na.rm=TRUE)
rAllF <- r
#  [1]  44793.81  51567.72  32975.54  40978.00  37496.00 215824.73 241431.82
#  [8] 224396.12 194470.47 149528.91

rAllF == rAllT
#  [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

## compare to original algorithm
all.equal(rAllT, as.vector(exprs(requantify(poi, method="th.mean", saturationThreshold=saturationThreshold, requantifyAll=TRUE))))
# [1] TRUE

all.equal(rAllF, as.vector(exprs(requantify(poi, method="th.mean", saturationThreshold=saturationThreshold, requantifyAll=FALSE))))
# [1] TRUE

Now regarding your points:

1) there is a theoretical peptide distribution: lets call it thDistr

Correct, we use the BRAIN package to get this distribution from the peptide sequence (code is taken from above):

## use BRAIN to caluclation isotopic distribution for 5 isotopes
probs <- BRAIN::calculateIsotopicProbabilities(
  BRAIN::getAtomsFromSeq("MTDLDLAGK"), nrPeaks=nIsotopes)
# [1] 0.569071397 0.279605131 0.110559250 0.031619035 0.007398698

2) real peptide distribution is the theoretical distribution multiplied by a certain constant K, so observed distribution is: obsDistr = thDistr * K 3) you calculate K from unsaturated ions K = obsDistr[unsat]/thDistr[unsat] and then you multiply it by theoretical proportion of saturated ions obsDistr[sat] = thDistr[sat] * K

More or less correct, but I would say it in a slightly different way: The exists the true counts/intensity (ti) for a peptide but we didn't know it. ti is the sum of the isotope intensities (ii): ti = ii_1_0 + ii_1_1 + ii_2_0 + ii_2_1 + ii_... + ii_maxCharge_maxIsotopes We try to use the observed distribution to get ti. In a perfect experiment without bias the observed distribution would be the same as the theoretical one. That's why we estimate ti by the ratio of observed and theoretical distribution: ti = obsDistr/thDistr but we do that for each ion and charge state individually: ti = mean(ii_1_0/thDistr_0, ii_1_1/thDistr_1) + mean(ii_2_0/thDistr_0, ii_2_1/thDistr_1, ...) + ...

(that is IMHO exactly what you want to do in your 4) point; see below; you point 4 was correct for the code one week ago, which was indeed not correct)

Find the relevant code in the following lines:

## divide each count by its corresponding proportion
th <- t(t(m)/probs[iIsotopes])
#              1_0        1_1      2_0      2_1       2_2       1_2       2_3
#  [1,]  2655.2029  2249.6011     0.00 52799.46  31883.36        NA        NA
#  [2,]  2048.9520  1680.9420     0.00 57534.71  42004.63  1664.266        NA
#  [3,]  2059.4955  1577.2243     0.00 53661.39  29812.07  1727.580  10088.86
#  [4,]   651.9393   554.3532 44414.46 40914.84  33800.88        NA        NA
#  [5,]  1669.3863  1248.1888 39532.83 40385.53  21780.18        NA        NA
#  [6,] 14743.3170 11956.1468     0.00     0.00 204948.93 10383.573        NA
#  [7,] 13300.6158 10167.9107     0.00     0.00 232300.78  7679.140        NA
#  [8,] 15268.7344 11895.3468     0.00     0.00 213378.80  7860.039        NA
#  [9,] 15100.0385 11008.3817     0.00     0.00 182825.05  8827.846        NA
# [10,] 14850.5092 10572.0521     0.00     0.00 185755.60  7941.443 115120.53
#            1_3      2_4
#  [1,]       NA       NA
#  [2,]       NA       NA
#  [3,]       NA       NA
#  [4,]       NA       NA
#  [5,]       NA       NA
#  [6,] 6420.183       NA
#  [7,] 5376.508       NA
#  [8,] 9045.184       NA
#  [9,]       NA       NA
# [10,] 4996.990 118939.8

4) however K is ONLY constant within 1 charge state. E.g. K for charge 1 is mean (c(2655, 2249)) and K for charge 2 is mean (c(52799, 31883)) = 42341.41. ...

As you could see below, that are exactly the same numbers and the same approach we use currently:

r <- cbind(rowMeans(th[, c(1:2, 6, 8)], na.rm=TRUE),
           rowMeans(th[, c(3:5, 7, 9)], na.rm=TRUE))
#             [,1]      [,2]
#  [1,]  2452.4020  42341.41
#  [2,]  1798.0533  49769.67
#  [3,]  1788.1001  31187.44
#  [4,]   603.1462  39710.06
#  [5,]  1458.7876  33899.51
#  [6,] 10875.8049 204948.93
#  [7,]  9131.0438 232300.78
#  [8,] 11017.3261 213378.80
#  [9,] 11645.4220 182825.05
# [10,]  9590.2486 139938.66

## finally sum the charge states to get the final counts
r <- rowSums(r, na.rm=TRUE)
#  [1]  44793.81  51567.72  32975.54  40313.21  35358.30 215824.73 241431.82
#  [8] 224396.12 194470.47 149528.91

pavel-shliaha commented 8 years ago

@sgibb yeah sorry, I agree. I misread the algorithm

sgibb commented 8 years ago

Just for the record: We tried different packages for the theoretical isotopic distribution. They are all more or less the same but of course different from the observed one:

BRAIN::calculateIsotopicProbabilities(BRAIN::getAtomsFromSeq("MTDLDLAGK"), nrPeaks=5)
# [1] 0.569071397 0.279605131 0.110559250 0.031619035 0.007398698

a <- BRAIN::getAtomsFromSeq("MTDLDLAGK")
form <- paste0(names(a), a, collapse="")
Rdisop::getIsotope(Rdisop::getMolecule(form, z = 0, maxisotopes = 5))[2,]
# [1] 0.562914475 0.284686413 0.112407334 0.032384339 0.007607439

set.seed(1234)
org <- OrgMassSpecR::IsotopicDistributionN("MTDLDLAGK", 0)$intensity
org/sum(org)
# [1] 0.5894 0.2685 0.1047 0.0286 0.0072 0.0014 0.0002

# charge 1
m[, c(1:2, 6, 7)]/rowSums(m[, c(1:2, 6, 7)], na.rm=TRUE)
#                                1_0       1_1        1_2       2_3
# isotopicDistr.S130423_05 0.7060748 0.2939252         NA        NA
# isotopicDistr.S130423_07 0.6406593 0.2582418 0.10109890        NA
# isotopicDistr.S130423_09 0.5520490 0.2077249 0.08996703 0.1502591
# isotopicDistr.S130423_11 0.7053232 0.2946768         NA        NA
# isotopicDistr.S130423_13 0.7313318 0.2686682         NA        NA
# isotopicDistr.S130423_06 0.6513469 0.2595295 0.08912352        NA
# isotopicDistr.S130423_08 0.6721428 0.2524643 0.07539295        NA
# isotopicDistr.S130423_10 0.6744024 0.2581496 0.06744800        NA
# isotopicDistr.S130423_12 0.6794497 0.2433779 0.07717245        NA
# isotopicDistr.S130423_14 0.5306750 0.1856201 0.05513344 0.2285714

# charge 2
m[, c(3:5, 7, 9)]/rowSums(m[, c(3:5, 7, 9)], na.rm=TRUE)
#                                2_0       2_1        2_2         2_3         2_4
# isotopicDistr.S130423_05 0.6278918 0.3003846 0.07172360          NA          NA
# isotopicDistr.S130423_07 0.6057021 0.3059703 0.08832760          NA          NA
# isotopicDistr.S130423_09 0.6236685 0.3032643 0.06661950 0.006447701          NA
# isotopicDistr.S130423_11 0.6248146 0.2828043 0.09238109          NA          NA
# isotopicDistr.S130423_13 0.6215156 0.3119596 0.06652485          NA          NA
# isotopicDistr.S130423_06 0.5036050 0.3447050 0.15169002          NA          NA
# isotopicDistr.S130423_08 0.4821599 0.3463949 0.17144516          NA          NA
# isotopicDistr.S130423_10 0.4782693 0.3514157 0.17031491          NA          NA
# isotopicDistr.S130423_12 0.4801545 0.3486751 0.17117041          NA          NA
# isotopicDistr.S130423_14 0.4608795 0.3354762 0.16690913 0.029583154 0.007151971

It seems that the higher isotopes are less abundant as they should be? @pavel-shliaha Could there be something similar to saturation detection on the lower border?

pavel-shliaha commented 8 years ago

Reference quantitation method was introduced to minimise the number of peptides that could not be requantified by sum method. By defenition of reference method it is clear that it should requantify the same amount or more of saturated peptides. However working with Kuhrev dataset there are 80 peptides that are successfully requantified by sum, but whih have all NAs in the reference method. These peptides are below: the code is available in:

"Y:\RAW\pvs22_QTOF_DATA_data3\synapter2paper\kuharev2015\bugs_investigation\for_bug_investigation_requant_reference"

[1] "FGLSLVR" [5] "FNVLASQPADFDR" [9] "VFLENVIR" [13] "VNFAMNVGK" [17] "QIVFEIPSETH" [21] "VTVQSLDVVR" [25] "LQTMVSHFTIDPSR" [29] "AYGPGIEPTGNMVK" [33] "EANNFLWPFK" [37] "GTVTDFPGFDER" [41] "LSCFAQTVSPAEK" [45] "LGDLPTSGQIR" [49] "IFVDEGPSMK" [53] "TYSYLTPDLWK" [57] "TLAQEVLGTTK" [61] "LLCGLLAER" [65] "LITPAVVSER" [69] "IEFELYDNVVPK" [73] "GSFPK" [77] "TVAVAGNPTDWK" "VPTANVSVVDLTCR" "LGANAILGVSMAAAR" "LAATIAQLPDQIGAK"
"LVSWYDNETGYSNK" "GSVNPECTLAQLGAAK" "AGAGHSNTLQVSTV"
"ISGLIYEETR" "LVNELTEFAK" "GADFLVTEVENGGSLGSK"
"AEWALR" "FVPSKPMCVEAFSEYPPLGR" "YQVTVIDAPGHR"
"ADLMLYVSK" "ALEEAGAEVEVK" "VPLPPLTEER"
"IWDSTDALELK" "VVLDDKDYFLFR" "IPFLVK"
"LLLDV" "FSVSPVVR" "ITSVNVGGMAFR"
"VLDEVVVDNFDQK" "VLDTGGPISVPVGR" "YSIGQPFVYPR"
"QINFGGTVETVYER" "AAIEDGWVPGK" "VGFFNPIASEK"
"VLAVGDGIAR" "ILFDYSK" "GFGFITPEDGSK"
"VNVYGGAVALGHPLGCSGAR" "FPAIIYGGK" "IWEEGSDEVLVK"
"AFFANPVLTGAVDK" "GTLEDPNLFIR" "LPQLGIEFSGPGAK"
"LVEALCAEHQINLIK" "LPEAGEDLELK" "LLDAQLSTGGIVDPSK"
"IPVLEQELVR" "YQAFTQADLTNLR" "LQAFSAAIESCNK"
"SLVNTYQEILK" "AINQGGLTSVAVR" "GFGFGQGAGALVHSE"
"FEASQVTEK" "QATFEEMIAR" "LCSLFYTNEEVAK"
"FGPIVSASLEK" "VLEGGIVIEDR" "IVDVVGTLSR"
"ILTEPNASITVQYK" "TSGTLEMNLK" "LGLLTETELAK"
"VATEFSETAPATLK" "VVPVP" "EASGVFDDLVR"
"FATLGIDPPK" "LAAQEVELK" "EGVLLSK"

sgibb commented 8 years ago

I don't check all peptides but there are at least the following problems in the current implementation:

"reference" looks for a an unsaturated run as reference. If there is none it will return NA, e.g. happens for "FGLSLVR" (we could easily change to the run with the highest intensity regardless of staturation or the highest number of unsaturated ions).
"reference" looks for common unsaturated ion between reference and the current run. If there is at least one run that has no unsaturated ions in common it will return NA for all runs, e.g. happens for "VPTANVSVVDLTCR" (we could easily change this to return just NA for the run without any common ions).

pavel-shliaha commented 8 years ago

you need to change that to look for a run that has a maximum number of unsaturated isotopes (not completely unsaturated).
this is inveitable, but do I understand correctly that sum in this situation would do the same.

pavel-shliaha commented 8 years ago

about theoretical methods. It is interesting, but it appears that there is a systematic intensity bias against lower intensity ions. To invetigate this I have taken all peptides that have all ions below saturation threshold in Kuharev's dataset and compared the observed proportion of an isotopes among all isotopes and then emulated what I would expect theoretically using BRAIN package. Shown below is a plot that has

1) x axis log2 of intensity for an ion 2) y axis log2 of ratio of observed proportion for an isotope against predicted, again only peptides having all ions below saturation threshold have been concidered.

what it shows is that there is a systematic bias against low intensity isotopes, introducing systematic error in requantification

lgatto commented 8 years ago

This could probably be accounted for using a similar approach than what we do for rt adjustment: regress with lowess and use it to correct/predict after correction for saturation.

pavel-shliaha commented 8 years ago

@lgatto thanks for the suggestion. @sgibb as we discussed yesterday the sum and refenrece detector saturation models as implemented now do not allow to do top3 quantitation. This happens since we simply through out isotopes over saturation threshold, decreasing intensities of the most abundant peptides. I propose a solution analogous to the one we use in theoretical modeling for isotopes for detector saturation: but in this case intensities of the peptides in the runs with detector saturation are modelled on intensities of peptides with ions under saturation threshold using ratios predicted by requantification. Assume we have 6 runs and there is a peptide: 3 times its under detector saturation and 3 times above. We use sum saturation correction and it requantifies as it does now. We then 1) convert the new requantified peptide intensities to PROPORTIONS x/sum(x). 2) calculate a CORRECTION FACTOR which is the ratio of intensities of peptides below saturation BEFORE requantitation to their PROPORTIONS, CORRECTION FACTOR is mean, median or weighted mean of these rations. We then multiply PROPORTIONS of saturated peptides by CORRECTION FACTOR . There should be an option to only do this requantifiaction for peptides for which there are isotopes over saturation threshold, leaving all peptide intensities for which there are no isotopes over saturation threshold intact.

sgibb commented 8 years ago

you need to change that to look for a run that has a maximum number of unsaturated isotopes (not completely unsaturated).

done.

this is inveitable, but do I understand correctly that sum in this situation would do the same.

yes, but there was a difference between reference and sum because I used na.rm=TRUE in sum. That results in counts that were 0 instead of NA. Is now fixed as well.

@lgatto I would assume that the lowess approach would just slightly improve the situation because the deviations are very large and nearly equally distributed on both sides of the horizontal zero line.

@pavel-shliaha Am I right that you ask me to implement the following:

prop <- exprs(MSnSetPostRequant)/rowSums(exprs(MSnSetPostRequant))
correctionFactor <- exprs(MSnSetPreRequant)/prop
exprs(MSnSetPostRequant) <- mean(correctionFactor) * prop

To be honest this sounds very weird. Why top3 should work for this rescaled values but not for requantified. (I am aware that the sum method removes the high intense ions and top3 could not be valid anymore (or becomes middle3 or low3)).

pavel-shliaha commented 8 years ago

@lgatto I agree with sebastian, I think there might be too much spread to accurately reauqntify even if the lowess lines overlapped with 0 horisontal lines

@sgibb 1)yes that is what I would like to do. But I would like only peptides with ions over saturation to be corrected

2) I can explain why top3 will not work if you simply remove high intense ions. The reason is that top3 reflects quantities difference between proteins WITHIN a run (i.e. absolute quantitation). This information might be sometimes useful, e.g. when you are interested what proteins are most abundant in your runs. Removing high intensity isotopes disproportinately penalises most abundant proteins. Also some people (like Kuharev) like to perform top3 to then convert this to relative protein abundances BETWEEN runs. They will then complain that algorithm distorts the values as well. I think top3 is not a good way to convert the peptide quantitation to protein quantitation,there are much better ways, but for now we need to make top3 also possible to be able to compare our results with Kuharev's

sgibb commented 8 years ago

@pavel-shliaha So there is a rescaleForTop3 function now for you to test. If it is useful and correct I will change the name and add some documentation. You use this function as follows:

newMSnSet <- rescaleForTop3(originalMSnSet, requantifiedMSnSet,
                            saturationThreshold, onlyForSaturatedRuns=TRUE)

(it uses the mean correction factor; we could provide more methods later.)

pavel-shliaha commented 8 years ago

@sgibb still too peptides are requantified by sum but not by reference: "FFQELR" "ESNEITIIINPYRETVCFSVEPVK". Can u have a look?

sgibb commented 8 years ago

It is the other way around: both are requantified by reference but not by sum (what makes absolutely sense to me):

## fetch peptide of interest
poi <- combMSNset[grep("^FFQELR$", featureNames(combMSNset)),]
## grep isotopic distribution
iso <- fData(poi)[, grep("isotopicDistr", fvarLabels(combMSNset))]

m <- synapter:::.isotopicDistr2matrix(iso)
m
#                            1_0   1_1  1_2 1_3 1_4  2_0  2_1 2_2 2_3
# isotopicDistr.S130423_05   632   307   NA  NA  NA 6099 2445 696  NA
# isotopicDistr.S130423_07    NA    NA   NA  NA  NA   NA   NA  NA  NA
# isotopicDistr.S130423_09 37909 16706 2711 501 174 8502 3463 677 125
# isotopicDistr.S130423_11    NA    NA   NA  NA  NA   NA   NA  NA  NA
# isotopicDistr.S130423_13    NA    NA   NA  NA  NA   NA   NA  NA  NA
# isotopicDistr.S130423_06    NA    NA   NA  NA  NA   NA   NA  NA  NA
# isotopicDistr.S130423_08    NA    NA   NA  NA  NA 5538 1675  NA  NA
# isotopicDistr.S130423_10    NA    NA   NA  NA  NA 4842 1439 484  NA
# isotopicDistr.S130423_12    NA    NA   NA  NA  NA 4128 1625 371  NA
# isotopicDistr.S130423_14    NA    NA   NA  NA  NA 3386 1290 349  NA

exprs(requantify(poi, method="sum", saturationThreshold=3e4))
#        S130423_05 S130423_07 S130423_09 S130423_11 S130423_13 S130423_06
# FFQELR         NA         NA         NA         NA         NA         NA
#        S130423_08 S130423_10 S130423_12 S130423_14
# FFQELR         NA         NA         NA         NA
exprs(requantify(poi, method="reference", saturationThreshold=3e4))
#        S130423_05 S130423_07 S130423_09 S130423_11 S130423_13 S130423_06
# FFQELR      10179         NA      70768         NA         NA         NA
#        S130423_08 S130423_10 S130423_12 S130423_14
# FFQELR         NA       6765       6124         NA