Closed pavel-shliaha closed 7 years ago
Laurent asked me to create a comment here:
I have had a look at the data and noticed that the saturation does not affect fragment quantities and lower abundance isotope. I would suggest the following easy to implement soultion:
1) when you load the P3D files keep only a single line per EMRT, but also record isotopic distribution of each EMRT. For example: write z2i0_1000_z2i1_500_z2i2_100_z3i0_200 which corresponds to a list of ions: charge 2 isotope with C13 = 0 has intensity of 1000, charge 2 isotope C13 = 1 with intensity of 500, charge 2 isotope C13 = 1 with intensity of 100, charge 3 isotope C13 = 0 with intensity of 200. Kepp this line for every EMRT
2) when matching fragments write the same but in form fragment_intensity, for example: b19_1000_y3_500.
Once you have applied a function of combining individual synapter analysis into MSnsets, you can apply a command of correcting saturation either on fragments or isotopes, by saying quantify on all fragments or isotopes which intensity is below certain threshold. Note that this is done AFTER several synapter analysis are combined into an MSnSet.
As we discussed on skype we will follow the first strategy (quantify on peptides). Now the isotopic distribution is stored in the Synapter object and the resulting MSnSet:
## isotopicDistr is formated: ion_z,ion_iso,ion_counts
tail(cbind(exprs(m), fData(m)["isotopicDistr"]))
# Synapter1 isotopicDistr
# IVDTWR 1198 2,0,838;2,1,360;1,0,30
# ESASHETPTDLRPVIPR 555 3,0,275;3,1,280;1,0,56;1,1,43
# GYDLPDEVFDENEK 775 2,0,331;2,1,293;2,2,151;1,0,37
# IFGLEK 64706 1,0,42448;1,1,17358;1,2,2956;1,3,417;2,0,1037;2,1,490;1,0,118;1,1,83
# LATMNIDIDLTSGGK 414 3,0,218;3,1,196;1,0,122
# QNLFNNTVK 1266 2,0,703;2,1,399;2,2,164;1,0,33
How we realize the requantification? Do we simple sum up all counts below a threshold, e.g. for the last peptide: satThreshold = 500
; newCount = 399 + 164 + 33
(596)?
To do requantification synapter needs to first select all ions that are present in all the runs under saturation threshold. It then sums the intensity of these ions for every run, which is the new quantitation value. For example:
run1: 2,0,838;2,1,360;1,0,30 run2: 2,0,8000;2,1,4000;1,0,30
saturation threshold: 5000
there is 1 isotope present in all runs below saturation threshold: 2.1. so the new quantitation value is now 360 + 30: 4000 + 30
So we just sum up common isotopes. For your example run 1 will be (360+30)/(4000+30) = 0.097 and run 2 will be 1.0? (these values are no counts anymore)
What do we do if there are
1) this will have to be loaded unfortunately from the P3D file 2) there will always be isotopes below saturation threshold. THe number of identified isotopes is expected to be correlated with peptide intensity. The more there is of a peptide, the more isotopes it will have over a saturation threshold. I.e. for high intensity peptides (above saturation threshold), there will be more isotopes recorded than for a low intensity peptide and these isotopes will be below saturation threshold 3) this situation is more complicated. Consider this: the peptide is identified in two runs in the first run it is high intensity and has 5 isotopes identified:
z2i0 - 10.000 z2i1 - 8000 z2i3 - 5000 z2i4 - 2000 z2i5 - 1000
for the second run the peptide intensity is low and it has the following distribution
z2i0 - 5000 z2i1 - 2000
saturation threshold is 6000. Now this becomes a problem since there are no peptide ions that are common that are below saturation threshold. The only option here is to predict the correct intensity of ions above saturation threshold, by using ions below saturation threshold in run 1. There is a simple function that can generate isotopic distribution and then we say given that
z2i3 - 5000 z2i4 - 2000 z2i5 - 1000, we should actually see z2i0 at 15000 and z2i1 at 10000. It should not be much work since the function is very simple (it is based on sample). See IsotopicDistributionN {OrgMassSpecR} for an example
1) this will have to be loaded unfortunately from the P3D file
That's a pain! I will do it later if the whole method works as expected.
Regarding 3.: In theory it sounds simple (like always) but here I used the last three peptides of my example dataset and try to calculate the original counts using theoretical proportions of the isotopes (ignoring any saturation). As you can see the calculated values are sometimes very good and sometimes (especially for the less abundant isotopes) very bad.
# I changed the isotopic distribution format for easier parsing, order is equal
GYDLPDEVFDENEK 775 2_0:331;2_1:293;2_2:151;1_0:37
IFGLEK 64706 1_0:42448;1_1:17358;1_2:2956;1_3:417;2_0:1037;2_1:490;1_0:118;1_1:83
QNLFNNTVK 1266 2_0:703;2_1:399;2_2:164;1_0:33
Using the BRAIN
package (which offers a similar function as OrgMassSpecR::IsotopicDistributionN
but is a lot of faster) I get the following results:
library("BRAIN")
(props <- calculateIsotopicProbabilities(getAtomsFromSeq("GYDLPDEVFDENEK"), nrPeaks=3))
# [1] 0.3958128 0.3450117 0.1722195
c("2_0"=331, "2_1"=293, "2_2"=151)/props
# 2_0 2_1 2_2
# 836.2540 849.2465 876.7883
# original counts are 775
(props <- calculateIsotopicProbabilities(getAtomsFromSeq("IFGLEK"), nrPeaks=4))
# [1] 0.65718493 0.26507158 0.06427314 0.01155455
c("1_0"=42448, "1_1"=17358, "1_2"=2956, "1_3"=417)/props
# 1_0 1_1 1_2 1_3
# 64590.65 65484.20 45991.22 36089.69
# original counts are 64706
(props <- calculateIsotopicProbabilities(getAtomsFromSeq("QNLFNNTVK"), nrPeaks=3))
# [1] 0.5474634 0.3145201 0.1056640
c("2_0"=703, "2_1"=399, "2_2"=164)/props
# 2_0 2_1 2_2
# 1284.104 1268.599 1552.090
# original counts are 1266
If I don't know the true counts (because I assume they are affected by saturation) I would tent to use the less abundant isotopes (because it is less likely that they are saturated) to calculate the "original" counts. But as shown above these less abundant isotopes results in very wired counts. Slowly I doubt that it would be possible to estimate the true counts in a reasonable way.
BTW: Would it be save to sum up the counts of equal isotopes of different charge states? (e.g. 1_1 + 2_1 + 3_1) If that would be possible (as I assume) the results will differ even more.
I understand the difficulty and it is expected. Perhaps it is indeed going to be complicated to calculate the intensity of saturated ions based on non-saturated ions, due to the fact that there is a more significant noise contribution to signal for less intense isotopes.
Also perhaps the situation outlined in my previous post is extremely rare and we dont even need this calculation. So I would prefer to have 2 separate functions: 1st - requantify with lower intensity isotopes, 2nd - predict intensity of higher abundant isotopes and requantify then.
The bottom line is I can only test whether this functionality is useful once it is available to me. Put it in synapter and I will test it using the saturation dataset we have,
Ok, I just finished the first attempt to realize this feature. It is still clumsy but I hope it is enough to start experimenting with it.
## start a new, clean R session
## install devtools if you haven't done it before
install.packages("devtools")
library("devtools")
## get the latest synapter with requantification support
install_github("lgatto/synapter@saturation")
library("synapter")
## load synapterdata example files
path <- system.file("extdata", package="synapterdata")
identfiles <- list.files(path, pattern="^HDMSe", full.names=TRUE)
quantfiles <- list.files(path, pattern="^MSe.*IA", full.names=TRUE)
pep3dfiles <- list.files(path, pattern="Pep3DAMRT", full.names=TRUE)
fasta <- unzip(file.path(path, "EcoliK12_enolase_UPSsimga_NB.fasta.zip"))
set.seed(1)
s <- vector(mode="list", length=length(identfiles))
## run standard synapter workflow
system.time(
for (i in seq(along=s)) {
s[[i]] <- Synapter(list(identpeptide=identfiles[i],
quantpeptide=quantfiles[i],
quantpep3d=pep3dfiles[i],
fasta=fasta))
filterUniqueDbPeptides(s[[i]])
filterIdentPepScore(s[[i]], method="BH", fdr=0.05)
filterQuantPepScore(s[[i]], method="BH", fdr=0.05)
filterIdentProtFpr(s[[i]], fpr=0.05)
filterQuantProtFpr(s[[i]], fpr=0.05)
filterIdentPpmError(s[[i]], ppm=20)
filterQuantPpmError(s[[i]], ppm=20)
mergePeptides(s[[i]])
modelRt(s[[i]], span=0.05)
searchGrid(s[[i]])
setBestGridParams(s[[i]])
findEMRTs(s[[i]])
})
## create sample names
sets <- paste0("ups", sub("^.*([0-9]{2,})fmol_.*([0-9]{2,})_IA.*",
"\\1.\\2", basename(identfiles)))
## create msnsets
ms <- lapply(s, as, Class="MSnSet")
## update sampleNames/fvarLables of the msnsets
ms <- mapply(function(m, s) {
sampleNames(m) <- s;
m <- updateFvarLabels(m, s)
m
}, m=ms, s=sets)
## combine msnsets
system.time(
msc <- Reduce(combine, ms)
)
# ignore peptides with no isotopicDistr data for now (rescued)
#i <- which(rowSums(!is.na(fData(msc)[, grep("isotopicDistr", fvarLabels(msc))])) == ncol(msc))
#
#msc <- msc[i]
Nothing new yet.
The next step would be the requantification. There is a new function called requantify
with three arguments (the MSnSet
; saturationThreshold
: the maximal unsaturated counts, method
: could be "sum"
: the sum of the common isotopes below the saturationThreshold
; "th.mean"
/"th.median"
/"th.weighted.mean"
: use the mean
/median
/weighted.median
of the isotopes below the saturationThreshold
multiplied with their theoretical probability of occurrence; the latter need the BRAIN
package installed (call ìnstall.packages("BRAIN")
before)).
summed <- requantify(msc, saturationThreshold=Inf, method="sum")
thmean <- requantify(msc, saturationThreshold=Inf, method="th.mean")
thmedian <- requantify(msc, saturationThreshold=Inf, method="th.median")
thwmean <- requantify(msc, saturationThreshold=Inf, method="th.weighted.mean")
Another new function is the synapterPlgsAgreement
discussed in https://github.com/lgatto/synapter/issues/73.
msc <- synapterPlgsAgreement(msc)
cols <- grepl("synapterPlgsAgreement|nIdentified|nAgree|nDisagree", fvarLabels(msc))
head(fData(msc)[, cols])
What is missing/what to do next:
@pavel-shliaha I am looking forward to your feedback.
A little technical comment/question regarding
The functions should be methods
Do you mean requantify
? Why so? Using methods when functions work well enough or there is no requirement for dispatching on different inputs is not worth it. (I have been using methods when functions where fit for purpose too often in the past)
@lgatto: Yes, I have seen that you change some methods to functions in some of your packages. What is the reason for that? Is there any drawback of S4 methods (except that dispatching needs some time and debugging is sometimes harder)?
cannot find requantify function. Where does it live?
requantify(msc, saturationThreshold=Inf, method="sum") function does not work very well.
1) easy to fix: if requantification is impossible can you not introduce NAs,but just leave the same quant values. Also introduce an additionl column "requantification" where you put TRUE, if requantification was succeful and FALSE if it was not. The reason is a user might not want to loose the data if requantification was not complete.
2) do I understand correctly that you re just removing the isotopes that are above saturation threshold.
e.g.
peptide in run 1 isotopes: 100, 10, 1 peptide in run 2 isotopes: 1e4, 1e3,1e2
if saturation threshold is 1e3, do you jut remove the ions above that and get
peptide in run 1: 100 + 10 + 1 peptide in run 2: 1e2
if so than this should not be the case. You have to remove the isotopes that are above saturation threshold in one of the runs from all of the runs. So the result should be:
peptide in run : 1 peptide in run 2: 1e2
1.) Is that really expected? I wouldn't expect that there are any original values in the MSnSet
after requantification. Do you really think we should implement it the way you suggested? If the user want to mix up original and requantified values he could easily do so:
r <- requantify(msc, saturationThreshold=1e3, method="sum")
notNA <- !is.na(exprs(r))
exprs(msc)[notNA] <- exprs(r)[notNA]
2.) This is not true. The current behaviour is already the way you want it to be. Default workflow:
library("devtools")
load_all("synapter")
path <- system.file("extdata", package="synapterdata")
identfiles <- list.files(path, pattern="^HDMSe", full.names=TRUE)
quantfiles <- list.files(path, pattern="^MSe.*IA", full.names=TRUE)
pep3dfiles <- list.files(path, pattern="Pep3DAMRT", full.names=TRUE)
fasta <- unzip(file.path(path, "EcoliK12_enolase_UPSsimga_NB.fasta.zip"))
set.seed(1)
s <- vector(mode="list", length=length(identfiles))
## run standard synapter workflow
system.time(
for (i in seq(along=s)) {
s[[i]] <- Synapter(list(identpeptide=identfiles[i],
quantpeptide=quantfiles[i],
quantpep3d=pep3dfiles[i],
fasta=fasta))
filterUniqueDbPeptides(s[[i]])
filterIdentPepScore(s[[i]], method="BH", fdr=0.05)
filterQuantPepScore(s[[i]], method="BH", fdr=0.05)
filterIdentProtFpr(s[[i]], fpr=0.05)
filterQuantProtFpr(s[[i]], fpr=0.05)
filterIdentPpmError(s[[i]], ppm=20)
filterQuantPpmError(s[[i]], ppm=20)
mergePeptides(s[[i]])
modelRt(s[[i]], span=0.05)
searchGrid(s[[i]])
setBestGridParams(s[[i]])
findEMRTs(s[[i]])
})
## create msnsets
ms <- lapply(s, as, Class="MSnSet")
## combine msnsets
system.time(
msc <- Reduce(combine, ms)
)
To show the selected isotopes I use some internal functions here:
f <- fData(msc)[, grep("isotopicDistr", fvarLabels(msc))]
x <- .splitIsotopicDistr(unlist(f[22, ]))
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_01
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
# 16684 6747 1885 430 137906 55855 15305 3352 549
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_03
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
# 10610 4355 1409 190 136850 57326 16114 1718 325
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_05
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
# 9983 4544 746 148 136386 56244 15337 3329 645
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_01
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
# 12127 5041 990 177 148808 60998 16693 3656 577
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_03
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4 2_5
# 8180 3325 927 205 133856 54500 15275 3102 498 81
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_05
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
# 10893 4425 1365 276 141491 58884 16458 3489 345
#
.commonIsotopes(x, saturationThreshold=1e2)
# character(0)
.commonIsotopes(x, saturationThreshold=1e3)
# [1] "1_3" "2_4"
# 1_2 in run 1 is the only run where 1_2 > 1800 and all 1_2 isotopes are filtered
.commonIsotopes(x, saturationThreshold=1.8e3)
# [1] "1_3" "2_4"
.commonIsotopes(x, saturationThreshold=1.9e3)
# [1] "1_2" "1_3" "2_4"
Question: The current behaviour has some disadvantages. E.g. the third EMRT was seen in 5 of 6 runs with a good isotopic distribution. Because we require that the isotopes must be present in all runs (even when the EMRT was not observed at all), we could not find any common isotopes (because the first run has not any isotope) and could not requantify this EMRT:
x <- .splitIsotopicDistr(unlist(f[3, ]))
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_01
# <NA>
# NA
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_03
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4
# 10256 6073 1915 450 80564 48499 16732 4034 852
#
# $isotopicDistr.MSe_101111_25fmol_UPS1_in_Ecoli_05
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4 2_5
# 9710 5588 1699 426 86293 46817 14378 3842 847 264
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_01
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4 2_5
# 9443 5068 1716 423 87367 47123 16806 4317 992 130
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_03
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4 2_5
# 10260 5859 1842 428 79717 46234 15919 3935 1026 138
#
# $isotopicDistr.MSe_111111_50fmol_UPS1_in_Ecoli_05
# 1_0 1_1 1_2 1_3 2_0 2_1 2_2 2_3 2_4 2_5
# 8439 5036 1629 347 68177 41550 14343 3415 751 122
#
.commonIsotopes(x, saturationThreshold=Inf)
# character(0)
I would suggest to ignore NA runs (where the EMRT was not observed) and requantify all the other runs (so in this example 2 runs out of 6 would be enough for requantification).
I would actually like to suggest the following method to deal with the problem:
imagine we have a peptide at 10, 20, 50, 100 and 500 fmol (this is actually what we see in the organelle proteomics datasets a lot), the problem is that 10 and 500 fmol do not have any isotopes in common that are not over sat threshold in 500, but are still present at 10, so correction is not possible. 50 fmol however has common isotopes with both 10 and 500fmol, so we could quantify 10 vs 50 and 50 vs 500 to solve this problem. Thus my suggestion is:
1) nominate a run where all isotopes are under sat threshold (reference run). If there is several such runs than we could either:
2) requantify only those runs that have ions over sat threshold, i.e. leave 10 as it is, but requantify only 500.
at 1) I would suggest to let the user choose a reference run and don't try to find a run automatically. (If we try to find a run automatically it could happen that the reference run differs across peptides. Or we have to analyse all runs at once.)
at 2) that's sounds reasonable. Additionally I would suggest to ignore NA
runs.
As discussed in our last skype call I add a new requantify method: "reference"
.
It looks for an unsaturated run (none of the isotopes is higher than the saturation threshold) with the highest summed intensity value and use this as reference. A run with intensity values above the saturation threshold would be corrected as follows:
Example:
reference= "1_0:2000;1_1:200;1_2:20;1_3:2;"
satrun = "1_0:6000;1_1:1000;1_2:100;1_3:10;1_4:1"
satThreshold = 5000
# 1. step
satrun[2:4]/reference[2:4] # 1000/200, 100/20, 10/2
# 5, 5, 5
# 2. step
scaleFactor = mean(c(5, 5, 5)) # 5
# 3. step
unsatInt = reference[1]*scaleFactor # 10000
correctedInt = sum(10000, 1000, 100, 10, 1) # 11111
A more complete example (ref is run 3 and sat is run 4)
f <- data.frame(isotopicDistr.run1="1_0:1000;1_1:100;1_2:10;1_3:1;",
isotopicDistr.run2="1_0:800;1_1:100;1_2:10;1_3:1;",
isotopicDistr.run3="1_0:2000;1_1:200;1_2:20;1_3:2;",
isotopicDistr.run4="1_0:6000;1_1:1000;1_2:100;1_3:10;1_4:1",
stringsAsFactors=FALSE)
synapter:::.requantifyReferenceRun(f, saturationThreshold=Inf)
# isotopicDistr.run1 isotopicDistr.run2 isotopicDistr.run3 isotopicDistr.run4
# 1111 911 2222 7111
synapter:::.requantifyReferenceRun(f, saturationThreshold=5000)
# isotopicDistr.run1 isotopicDistr.run2 isotopicDistr.run3 isotopicDistr.run4
# 1111 911 2222 11111
The user visible function (msc
object from the example code some comments above):
refcorrected <- requantify(msc, saturationThreshold=5000, method="reference")
We need to review all saturation correction algorithms. Even "sum" has its bugs.
"Sum" (maybe others too) should get an option to decide whether we want to remove unsaturated isotopes that are below the threshold but not present in all runs (current behaviour) or if we want to keep them:
run1 z1:10, z2:5, z3:1; run2 z2:5
threshold: 20: run1: 15 (or 16, z3 is missing in run2), run2: 5
threshold: 8: run1: 5 (or 6?), run2: 5
A question regarding reference run requantification:
We have the following situation:
reference run: z1_1: 100, z1_2: 10, z1_3
run500mol: z1_1: 2000, z1_2: 1000, z1_3: 500, z2_1: 100, z2_2: 10
If the saturation threshold is 400, the reference run and run500mol have not any unsaturated isotope in common. How we handle this?
Sounds strange? But happens at least 2 times in your example data set, e.g. for GPFPIIV
(second row, second plot):
"If the saturation threshold is 400, the reference run and run500mol have not any unsaturated isotope in common. How we handle this?"
you cant hadle this by reference approach. First some definitions so we are on the same page. Assuming we are talking about a particular peptide: SATURATED runs are those runs that have isotopes over saturation threshold. UNSATURATED runs are those runs that have isotopes below certain threshold. REFERENCE run is unsaturated run with the highest intensity.
The idea of SUM method is to use common isotopes below saturation threshold for requantification. However you will run into the problem there are no common isotopes between certain runs. This can be solved partly by reference method that does not require common isotopes between all runs, but requires common isotopes only between reference run and saturated runs. However there still can be situations in which there are no isotopes in common beow saturation theshold between reference and quantitation run, in which case requantification is impossible.
@pavel-shliaha asked me to look at some wired results of the detector saturation correction using the reference
method.
99.92 % of the signal after reference requantification is from the 500 fmol samples. Why this happen? If you look at the fData(x)$isotopicDistr
columns you can see that there are not any information about the isotopic distribution available for all samples except the 500 fmol ones and a single 75 fmol.
As fData(x)$synapterPlgsAgreement
shows there was no transfer because the grid search could not match ident and quant run. IHMO no requantification could solve this.
What to do? Would it be a good solution not to requantificated any peptides where some sample could not matched in the grid search?
load_all("../../synapter/")
combMSnSet <- readRDS(file.path("result", "combMSnSet.RDS"))
combScheme <- list("25" = 6:7, "50" = 11:13, "75" = 14:16,
"100" = 1:3, "250" = 4:5, "500" = 8:10)
j <- unlist(combScheme)
i <- which(featureNames(combMSnSet) == "DDPHACYSTVFDK")
dd <- combMSnSet[i]
fvarLabels(dd) <- gsub("29022012_Ecoli_spikes_", "", fvarLabels(dd))
iso <- fData(dd)[grepl("isotopicDistr", fvarLabels(dd))][j]
agree <- fData(dd)[grepl("synapterPlgsAgreement", fvarLabels(dd))][j]
rownames(iso) <- rownames(agree) <- NULL
e <- exprs(dd)[, j]
r <- exprs(requantify(dd, method="reference", sat=1e5))[, j]
rownames(e) <- rownames(r) <- NULL
e
# 29022012_Ecoli_spikes_25fmol_04 29022012_Ecoli_spikes_25fmol_06
# 14090 13879
# 29022012_Ecoli_spikes_50fmol_02 29022012_Ecoli_spikes_50fmol_04
# 41539 28718
# 29022012_Ecoli_spikes_50fmol_07 29022012_Ecoli_spikes_75fmol_02
# 47001 47370
# 29022012_Ecoli_spikes_75fmol_04 29022012_Ecoli_spikes_75fmol_06
# 53378 41788
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
# 44478 77878
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
# 70682 159945
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
# 124565 185455
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
# 235531 239742
r
# 29022012_Ecoli_spikes_25fmol_04 29022012_Ecoli_spikes_25fmol_06
# 0 0
# 29022012_Ecoli_spikes_50fmol_02 29022012_Ecoli_spikes_50fmol_04
# 0 0
# 29022012_Ecoli_spikes_50fmol_07 29022012_Ecoli_spikes_75fmol_02
# 0 0
# 29022012_Ecoli_spikes_75fmol_04 29022012_Ecoli_spikes_75fmol_06
# 0 518
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
# 0 0
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
# 0 0
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
# 0 185455
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
# 235531 239742
iso
# isotopicDistr.25fmol_04 isotopicDistr.25fmol_06 isotopicDistr.50fmol_02
# 1 <NA> <NA> <NA>
# isotopicDistr.50fmol_04 isotopicDistr.50fmol_07 isotopicDistr.75fmol_02
# 1 <NA> <NA> <NA>
# isotopicDistr.75fmol_04 isotopicDistr.75fmol_06 isotopicDistr.100fmol_02
# 1 <NA> 2_0:321;2_1:197 <NA>
# isotopicDistr.100fmol_04 isotopicDistr.100fmol_06 isotopicDistr.200fmol_02
# 1 <NA> <NA> <NA>
# isotopicDistr.200fmol_04
# 1 <NA>
# isotopicDistr.500fmol_02
# 1 2_0:40003;2_1:32600;2_2:18058;2_3:6621;2_4:1735;2_5:421;2_6:127;3_0:32190;3_1:25740;3_2:18706;3_3:6798;3_4:1939;3_5:435;3_6:82
# isotopicDistr.500fmol_04
# 1 2_0:53185;2_1:45888;2_2:26425;2_3:9358;2_4:2456;2_5:531;2_6:119;3_0:36456;3_1:29678;3_2:20958;3_3:7640;3_4:2300;3_5:469;3_6:68
# isotopicDistr.500fmol_06
# 1 2_0:53234;2_1:46184;2_2:24468;2_3:8270;2_4:2146;2_5:463;3_0:38756;3_1:34138;3_2:21667;3_3:7529;3_4:2267;3_5:493;3_6:127
agree
# synapterPlgsAgreement.25fmol_04 synapterPlgsAgreement.25fmol_06
# 1 no_synapter_transfer no_synapter_transfer
# synapterPlgsAgreement.50fmol_02 synapterPlgsAgreement.50fmol_04
# 1 no_synapter_transfer no_synapter_transfer
# synapterPlgsAgreement.50fmol_07 synapterPlgsAgreement.75fmol_02
# 1 no_synapter_transfer no_synapter_transfer
# synapterPlgsAgreement.75fmol_04 synapterPlgsAgreement.75fmol_06
# 1 no_synapter_transfer no_synapter_transfer
# synapterPlgsAgreement.100fmol_02 synapterPlgsAgreement.100fmol_04
# 1 no_synapter_transfer no_synapter_transfer
# synapterPlgsAgreement.100fmol_06 synapterPlgsAgreement.200fmol_02
# 1 no_synapter_transfer no_synapter_transfer
# synapterPlgsAgreement.200fmol_04 synapterPlgsAgreement.500fmol_02
# 1 no_synapter_transfer agree
# synapterPlgsAgreement.500fmol_04 synapterPlgsAgreement.500fmol_06
# 1 agree agree
Same problem as above but just for one 500 fmol sample.
e
# 29022012_Ecoli_spikes_25fmol_04 29022012_Ecoli_spikes_25fmol_06
# 17933 24056
# 29022012_Ecoli_spikes_50fmol_02 29022012_Ecoli_spikes_50fmol_04
# 61520 55124
# 29022012_Ecoli_spikes_50fmol_07 29022012_Ecoli_spikes_75fmol_02
# 55565 74681
# 29022012_Ecoli_spikes_75fmol_04 29022012_Ecoli_spikes_75fmol_06
# 77651 74222
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
# 89107 113455
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
# 105244 225785
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
# 215329 214903
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
# 252642 280309
r
# 29022012_Ecoli_spikes_25fmol_04 29022012_Ecoli_spikes_25fmol_06
# 17933.0 24056.0
# 29022012_Ecoli_spikes_50fmol_02 29022012_Ecoli_spikes_50fmol_04
# 61520.0 55124.0
# 29022012_Ecoli_spikes_50fmol_07 29022012_Ecoli_spikes_75fmol_02
# 55565.0 74681.0
# 29022012_Ecoli_spikes_75fmol_04 29022012_Ecoli_spikes_75fmol_06
# 77651.0 74222.0
# 29022012_Ecoli_spikes_100fmol_02 29022012_Ecoli_spikes_100fmol_04
# 89107.0 113455.0
# 29022012_Ecoli_spikes_100fmol_06 29022012_Ecoli_spikes_200fmol_02
# 105244.0 225785.0
# 29022012_Ecoli_spikes_200fmol_04 29022012_Ecoli_spikes_500fmol_02
# 215329.0 0.0
# 29022012_Ecoli_spikes_500fmol_04 29022012_Ecoli_spikes_500fmol_06
# 252642.0 318415.1
iso
# isotopicDistr.25fmol_04
# 1 2_0:8013;2_1:6519;2_2:2411;2_3:749;2_4:241
# isotopicDistr.25fmol_06
# 1 2_0:10824;2_1:8347;2_2:3532;2_3:861;3_0:197;3_1:175;3_2:120
# isotopicDistr.50fmol_02
# 1 2_0:27650;2_1:21582;2_2:9207;2_3:2400;2_4:502;2_5:179
# isotopicDistr.50fmol_04
# 1 2_0:23662;2_1:19309;2_2:8549;2_3:2520;2_4:674;2_5:244;3_0:108;3_1:58
# isotopicDistr.50fmol_07
# 1 2_0:24159;2_1:19193;2_2:8596;2_3:2628;2_4:802;2_5:187
# isotopicDistr.75fmol_02
# 1 2_0:32999;2_1:26262;2_2:11456;2_3:3084;2_4:654;2_5:226
# isotopicDistr.75fmol_04
# 1 2_0:34799;2_1:27407;2_2:11170;2_3:3322;2_4:727;3_0:144;3_1:82
# isotopicDistr.75fmol_06
# 1 2_0:32917;2_1:25482;2_2:11430;2_3:3056;2_4:830;2_5:227;3_0:170;3_1:110
# isotopicDistr.100fmol_02
# 1 2_0:39872;2_1:30827;2_2:13170;2_3:3771;2_4:972;2_5:236;3_0:188;3_1:71
# isotopicDistr.100fmol_04
# 1 2_0:50963;2_1:39987;2_2:16566;2_3:4480;2_4:1097;3_0:228;3_1:134
# isotopicDistr.100fmol_06
# 1 2_0:46435;2_1:38035;2_2:15497;2_3:4051;2_4:1019;2_5:207
# isotopicDistr.200fmol_02
# 1 2_0:97225;2_1:80863;2_2:35200;2_3:9747;2_4:2216;2_5:455;2_6:79
# isotopicDistr.200fmol_04
# 1 2_0:91501;2_1:76143;2_2:32803;2_3:10120;2_4:2229;2_5:504;2_6:143;3_0:653;3_1:560;3_2:430;3_3:160;3_4:83
# isotopicDistr.500fmol_02
# 1 <NA>
# isotopicDistr.500fmol_04
# 1 1_0:401;1_1:372;2_0:94256;2_1:83636;2_2:51768;2_3:17209;2_4:3779;2_5:807;2_6:104;3_0:128;3_1:100;3_2:82
# isotopicDistr.500fmol_06
# 1 1_0:413;1_1:379;2_0:103359;2_1:95753;2_2:56364;2_3:18257;2_4:4074;2_5:751;2_6:150;3_0:361;3_1:259;3_2:189
agree
# synapterPlgsAgreement.25fmol_04 synapterPlgsAgreement.25fmol_06
# 1 agree agree
# synapterPlgsAgreement.50fmol_02 synapterPlgsAgreement.50fmol_04
# 1 agree agree
# synapterPlgsAgreement.50fmol_07 synapterPlgsAgreement.75fmol_02
# 1 agree agree
# synapterPlgsAgreement.75fmol_04 synapterPlgsAgreement.75fmol_06
# 1 agree agree
# synapterPlgsAgreement.100fmol_02 synapterPlgsAgreement.100fmol_04
# 1 agree agree
# synapterPlgsAgreement.100fmol_06 synapterPlgsAgreement.200fmol_02
# 1 agree agree
# synapterPlgsAgreement.200fmol_04 synapterPlgsAgreement.500fmol_02
# 1 agree no_synapter_transfer
# synapterPlgsAgreement.500fmol_04 synapterPlgsAgreement.500fmol_06
# 1 agree agree
BTW: the "sum" requantification isn't working for these both examples (resulting in zero intensity for all samples).
Thanks for having a look into it. In both cases it is the artifact of plotting requantified results, not of requantification itself. What I mean is that the code to plot requantification is just for the paper it is not a part of basic synapter functionality (unless we want to give visualisaion tool to the user?) About the peptides:
1) for "DDPHACYSTVFDK" there is a plot before ("none") requantifiction and after. If there is no transfer then where does the "none" plot come from? 2) for "IGEEYISDLDQLR" when making the plots IGNORE the peptide measurements in which transfer did not happen, i.e. the 500fmol point should be mean (c (225785.0, 318415.1)), not mean (c (225785.0, 318415.1, 0)). I am still confused why you loose a measurement even if you do not transfer by synapter (I always run code with "rescue" option).
The "none" plot comes from the "rescued" identification data. Because there is no match in the grid search we could not transfer quantitation data. In the current approach we just have the isotopic distribution for the quantitation data (because we don't read the ident pep3d data yet).
It would be a big pain but it seems we need to import the ident pep3d data as well.
Sorry, my fault. It seems that the rescueEMRTs
method rescues just the count data. That's why we lose the isotopic distribution information here.
It is not as easy as I thought. rescueEMRTs
copies data from the merged data (ident peptide data and quant peptide data; used for the lowess model) to the matched data (results of the grid search). But the peptide data don't contain any isotopic distribution information. That's why rescueEMRTs
has to look up at least the isotopic distribution in the pep3D data.
After fixing #108 the isotopicDistr
column is rescued as well and the figures look better (but even now the "sum" seems to be more sufficient than the "reference" method):
@sgibb does the saturation threshold represent intensity for an ion (lives in isotopic distribution column of fData) or for the peptide (lives in expression table)? It has to be consistent in all types of saturation correction.
If the intensity of an ion is higher than saturationThreshold
it is considered as saturated (for all methods).
@sgibb can you also clarify what happend to data that does not need to be corrected? E.g. if saturation threshold is 10k and there is two peptides:
1) should not be corrected (i.e. no ions over 10k in the peptide) 2) should be corrected but cannot be corrected (i.e. there are no ions below 10k to correct)
It seems synapter introduces non 0 (and non NA) values into expression matrix, when it corrects saturation by th methods). E.g.
I think this happens because correction works with the isotopicDistr columns and they contain data even if the correpsonding sample has NA in expression table.
Could we please make it so, that if a sample had an NA in expression data for a particular peptide it will still be NA after correction?
saturation should be corrected focusing on particular ions (i.e. particular isotope and charge state), i.e. that synapter:::.sumIsotopes() should be removed and the function should be modified to work with individual charge states.
NA
s) is fixed.## use the same example as before (without any saturation threshold)
poi <- combMSNset[grep("^LAMQEFMILPVGAANFR$", featureNames(combMSNset)),]
iso <- fData(poi)[, grep("isotopicDistr", colnames(fData(combMSNset)))]
## just use the first run because there are no missing values
m <- synapter:::.isotopicDistr2matrix(iso)[1,]
m
# 2_0 2_1 2_2 2_3 2_4 2_5 2_6 3_0 3_1 3_2 3_3 3_4 3_5
# 44772 52207 36735 15228 4524 1147 244 41226 44926 36857 21123 9240 2494
# 3_6 3_7 4_0 4_1 4_2
# 324 93 71 122 84
## find the isotope number
iIsotope <- as.numeric(gsub("^[0-9]+_([0-9]+)$", "\\1", colnames(x))) + 1L
iIsotope
# [1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 8 1 2 3
nIsotopes <- max(iIsotope)
nIsotopes
# [1] 8
## calculate theoretical isotopic distribution
props <- BRAIN::calculateIsotopicProbabilities(
BRAIN::getAtomsFromSeq("LAMQEFMILPVGAANFR"), nrPeaks=nIsotopes)
props
# [1] 0.3040745888 0.3230766302 0.2112500172 0.1026649967 0.0403559650 0.0134161742
# [7] 0.0038789509 0.0009942712
m/props
# 2_0 2_1 2_2 2_3 2_4 2_5
# 1.472402e+05 1.615932e+05 1.738935e+05 1.483271e+05 1.121024e+05 8.549382e+04
# 2_6 3_0 3_1 3_2 3_3 3_4
# 6.290361e+04 4.146354e+07 1.477466e+05 1.140813e+05 9.999052e+04 9.000146e+04
# 3_5 3_6 3_7 4_0 4_1 4_2
# 6.180003e+04 2.414995e+04 2.397555e+04 7.140909e+04 4.012173e+02 2.600002e+02
## estimated counts (nearly 1/3 of the true counts)
mean(m/props[iIsotope])
# [1] 118852
median(m/props[iIsotope])
# [1] 137317.7
## using the current sum approach
s <- synapter:::.sumIsotopes(t(m))
s
# 0 1 2 3 4 5 6 7
# [1,] 86069 97255 73676 36351 13764 3641 568 93
## estimated counts (even underestimated but nearly the same as the true counts)
mean(s/props)
# [1] 267417.1
median(s/props)
# [1] 292040
exprs(poi)[1]
# [1] 311417
An alternative would be to estimate the mean
/median
/weighted.mean
counts for each charge state individually and sum them up afterwards, e.g.:
charge <- as.numeric(gsub("^([0-9]+)_[0-9]+$", "\\1", colnames(x)))
thmean <- tapply(m/props[iIsotope], charge, mean)
thmean
# 2 3 4
# 127364.8301 155846.7826 336.2493
sum(thmean)
# [1] 283547.9
thmed <- tapply(m/props[iIsotope], charge, median)
thmed
# 2 3 4
# 147240.1892 156763.8922 377.6194
sum(thmed)
# [1] 304381.7
actually now that I think about it, the best way is to use all ions below saturation thsehold individually without minding the charge state.So if you have
1_0, 1_1, 1_2 2_0, 2_1, 2_2
and 2_0 is below the threshold, use 1_0, 1_1, 1_2, 2_1, 2_2, just as you dd before, but dont sum isotopes across charge states
@sgibb another request (sorry). I think there is another problem, we need to only repredict intensity of the peptide onlyin the runs in which it is saturated. It seems that there is a lot of noise when repredicting from low intensity peptide. I.e. if there is a situation: peptide 1 is 5e3 in run 1 and 5e5 in run 2 and saturation threshold is 1e4, then 5e3 should not be touched (no ions over saturation threshold) and 5e5 should be requantified since it does have ions over saturation threshold.
PS1: Note that noise from reprediction is additive, i.e. if there is an error repredicting 5e3 and an error repredicting 5e5, then the ratio might have additive error. I think that this is shy we have so much "noise" in Kuharev's data
PS2: if a user does want to repredict all, then he can use requantifyAll = TRUE
@pavel-shliaha IMHO this is already working as you want it:
msn <- readRDS("refCombMSNSet.RDS")[1, 1:2]
## store and rewrote fData (MSnbase subsetting doesn't remove the other
## isotopicDistr columns)
iso <- fData(msn)[, c("isotopicDistr.S130423_05", "isotopicDistr.S130423_07")]
fData(msn) <- iso
iso
# LAMQEFMILPVGAANFR 2_0:44772;2_1:52207;2_2:36735;2_3:15228;2_4:4524;2_5:1147;2_6:244;3_0:41226;3_1:44926;3_2:36857;3_3:21123;3_4:9240;3_5:2494;3_6:324;3_7:93;4_0:71;4_1:122;4_2:84
# isotopicDistr.S130423_07
# LAMQEFMILPVGAANFR 2_0:37377;2_1:39573;2_2:25620;2_3:10465;3_0:34237;3_1:34126;3_2:30939;3_3:19062;3_4:6166
m <- synapter:::.isotopicDistr2matrix(iso)
satTh <- 5e4
## just one ion in the run 5 is saturated
m > satTh
# 2_0 2_1 2_2 2_3 2_4 2_5 2_6 3_0 3_1
# isotopicDistr.S130423_05 FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# isotopicDistr.S130423_07 FALSE FALSE FALSE FALSE NA NA NA FALSE FALSE
# 3_2 3_3 3_4 3_5 3_6 3_7 4_0 4_1 4_2
# isotopicDistr.S130423_05 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# isotopicDistr.S130423_07 FALSE FALSE FALSE NA NA NA NA NA NA
exprs(msn)
# S130423_05 S130423_07
# LAMQEFMILPVGAANFR 311417 237565
## requantify all (even those runs that are not saturated)
msnT <- requantify(msn, method="th.mean", saturationThreshold=satTh, requantifyAll=TRUE)
exprs(msnT)
# S130423_05 S130423_07
# LAMQEFMILPVGAANFR 277843.1 257783.3
## requantify just saturated once (please not run 7 was not affacted)
msnF <- requantify(msn, method="th.mean", saturationThreshold=satTh, requantifyAll=FALSE)
exprs(msnF)
# S130423_05 S130423_07
# LAMQEFMILPVGAANFR 277843.1 237565
@sgibb I think I finally understand what the problem might be:
when you perform combination of th table you use all charge states: e.g.
what this results in is that you predict 2_0 using 1_0 and 1_1, but you should only be using charge state 2, not 1. I understand that the idea of th is that:
1) there is a theoretical peptide distribution: lets call it thDistr 2) real peptide distribution is the theoretical distribution multiplied by a certain constant K, so observed distribution is: obsDistr = thDistr * K 3) you calculate K from unsaturated ions K = obsDistr[unsat]/thDistr[unsat] and then you multiply it by theoretical proportion of saturated ions obsDistr[sat] = thDistr[sat] * K 4) however K is ONLY constant within 1 charge state. E.g. K for charge 1 is mean (c(2655, 2249)) and K for charge 2 is mean (c(52799, 31883)) = 42341.41. But you do mean (c(2655, 2249, 52799, 31883)) = 22396.91. So you multiply obsDistr[sat]* 22396.91 instead of obsDistr[sat]* 42341.41. This results in a paradox with less intensity when you perform calculation then when you started and all the noise we see.
I think you now understand the algorithm but your point 4) is wrong.
Just to ensure that we talking about the same piece of code. The following code is the same I send you yesterday:
library("synapter")
combMSNset <- readRDS("refCombMSNSet.RDS")
## fetch peptide of interest
poi <- combMSNset[grep("^MTDLDLAGK$", featureNames(combMSNset)),]
## grep isotopic distribution
iso <- fData(poi)[, grep("isotopicDistr", fvarLabels(combMSNset))]
## turn into a matrix
m <- synapter:::.isotopicDistr2matrix(iso)
m
# 1_0 1_1 2_0 2_1 2_2 1_2 2_3 1_3 2_4
# [1,] 1511 629 30859 14763 3525 NA NA NA NA
# [2,] 1166 470 31846 16087 4644 184 NA NA NA
# [3,] 1172 441 30856 15004 3296 191 319 NA NA
# [4,] 371 155 25275 11440 3737 NA NA NA NA
# [5,] 950 349 22497 11292 2408 NA NA NA NA
# [6,] 8390 3343 75227 51491 22659 1148 NA 203 NA
# [7,] 7569 2843 72229 51891 25683 849 NA 170 NA
# [8,] 8689 3326 66247 48676 23591 869 NA 286 NA
# [9,] 8593 3078 56700 41174 20213 976 NA NA NA
# [10,] 8451 2956 56708 41278 20537 878 3640 158 880
m <- synapter:::.isotopicDistr2matrix(f)
m
saturationThreshold <- 3e4
# which ions are below threshold?
unsat <- m < saturationThreshold
unsat
# 1_0 1_1 2_0 2_1 2_2 1_2 2_3 1_3 2_4
# [1,] TRUE TRUE FALSE TRUE TRUE NA NA NA NA
# [2,] TRUE TRUE FALSE TRUE TRUE TRUE NA NA NA
# [3,] TRUE TRUE FALSE TRUE TRUE TRUE TRUE NA NA
# [4,] TRUE TRUE TRUE TRUE TRUE NA NA NA NA
# [5,] TRUE TRUE TRUE TRUE TRUE NA NA NA NA
# [6,] TRUE TRUE FALSE FALSE TRUE TRUE NA TRUE NA
# [7,] TRUE TRUE FALSE FALSE TRUE TRUE NA TRUE NA
# [8,] TRUE TRUE FALSE FALSE TRUE TRUE NA TRUE NA
# [9,] TRUE TRUE FALSE FALSE TRUE TRUE NA NA NA
# [10,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
# multiplie ion matrix by unsat matrix (saturated ions are treated as zero)
m <- m * unsat
m
# 1_0 1_1 2_0 2_1 2_2 1_2 2_3 1_3 2_4
# [1,] 1511 629 0 14763 3525 NA NA NA NA
# [2,] 1166 470 0 16087 4644 184 NA NA NA
# [3,] 1172 441 0 15004 3296 191 319 NA NA
# [4,] 371 155 25275 11440 3737 NA NA NA NA
# [5,] 950 349 22497 11292 2408 NA NA NA NA
# [6,] 8390 3343 0 0 22659 1148 NA 203 NA
# [7,] 7569 2843 0 0 25683 849 NA 170 NA
# [8,] 8689 3326 0 0 23591 869 NA 286 NA
# [9,] 8593 3078 0 0 20213 976 NA NA NA
# [10,] 8451 2956 0 0 20537 878 3640 158 880
# find charge and isotope number from column names
cn <- as.numeric(unlist(strsplit(colnames(m), "_", fixed=TRUE)))
sel <- as.logical(seq_along(cn) %% 2L)
charges <- cn[sel]
# [1] 1 1 2 2 2 1 2 1 2
## add + 1L because indices in R start with 1
iIsotopes <- cn[!sel] + 1L
# [1] 1 2 1 2 3 3 4 4 5
## find maximal number of isotopes
nIsotopes <- max(iIsotopes)
# [1] 5
## use BRAIN to caluclation isotopic distribution for 5 isotopes
probs <- BRAIN::calculateIsotopicProbabilities(
BRAIN::getAtomsFromSeq("MTDLDLAGK"), nrPeaks=nIsotopes)
# [1] 0.569071397 0.279605131 0.110559250 0.031619035 0.007398698
## divide each count by its corresponding proportion
th <- t(t(m)/probs[iIsotopes])
# 1_0 1_1 2_0 2_1 2_2 1_2 2_3
# [1,] 2655.2029 2249.6011 0.00 52799.46 31883.36 NA NA
# [2,] 2048.9520 1680.9420 0.00 57534.71 42004.63 1664.266 NA
# [3,] 2059.4955 1577.2243 0.00 53661.39 29812.07 1727.580 10088.86
# [4,] 651.9393 554.3532 44414.46 40914.84 33800.88 NA NA
# [5,] 1669.3863 1248.1888 39532.83 40385.53 21780.18 NA NA
# [6,] 14743.3170 11956.1468 0.00 0.00 204948.93 10383.573 NA
# [7,] 13300.6158 10167.9107 0.00 0.00 232300.78 7679.140 NA
# [8,] 15268.7344 11895.3468 0.00 0.00 213378.80 7860.039 NA
# [9,] 15100.0385 11008.3817 0.00 0.00 182825.05 8827.846 NA
# [10,] 14850.5092 10572.0521 0.00 0.00 185755.60 7941.443 115120.53
# 1_3 2_4
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
# [4,] NA NA
# [5,] NA NA
# [6,] 6420.183 NA
# [7,] 5376.508 NA
# [8,] 9045.184 NA
# [9,] NA NA
# [10,] 4996.990 118939.8
## replace the zeros (saturated ions) by NA
th[th == 0L] <- NA_real_
th
# 1_0 1_1 2_0 2_1 2_2 1_2 2_3
# [1,] 2655.2029 2249.6011 NA 52799.46 31883.36 NA NA
# [2,] 2048.9520 1680.9420 NA 57534.71 42004.63 1664.266 NA
# [3,] 2059.4955 1577.2243 NA 53661.39 29812.07 1727.580 10088.86
# [4,] 651.9393 554.3532 44414.46 40914.84 33800.88 NA NA
# [5,] 1669.3863 1248.1888 39532.83 40385.53 21780.18 NA NA
# [6,] 14743.3170 11956.1468 NA NA 204948.93 10383.573 NA
# [7,] 13300.6158 10167.9107 NA NA 232300.78 7679.140 NA
# [8,] 15268.7344 11895.3468 NA NA 213378.80 7860.039 NA
# [9,] 15100.0385 11008.3817 NA NA 182825.05 8827.846 NA
# [10,] 14850.5092 10572.0521 NA NA 185755.60 7941.443 115120.53
# 1_3 2_4
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
# [4,] NA NA
# [5,] NA NA
# [6,] 6420.183 NA
# [7,] 5376.508 NA
# [8,] 9045.184 NA
# [9,] NA NA
# [10,] 4996.990 118939.8
## th.mean: calculate mean for each charge state
r <- t(MSnbase:::utils.applyColumnwiseByGroup(t(th), groupBy = charges,
colMeans, na.rm=TRUE))
# 1 2
# [1,] 2452.4020 42341.41
# [2,] 1798.0533 49769.67
# [3,] 1788.1001 31187.44
# [4,] 603.1462 39710.06
# [5,] 1458.7876 33899.51
# [6,] 10875.8049 204948.93
# [7,] 9131.0438 232300.78
# [8,] 11017.3261 213378.80
# [9,] 11645.4220 182825.05
# [10,] 9590.2486 139938.66
## (this MSnbase function is equivalent to (but the former is more general):
r <- cbind(rowMeans(th[, c(1:2, 6, 8)], na.rm=TRUE),
rowMeans(th[, c(3:5, 7, 9)], na.rm=TRUE))
# [,1] [,2]
# [1,] 2452.4020 42341.41
# [2,] 1798.0533 49769.67
# [3,] 1788.1001 31187.44
# [4,] 603.1462 39710.06
# [5,] 1458.7876 33899.51
# [6,] 10875.8049 204948.93
# [7,] 9131.0438 232300.78
# [8,] 11017.3261 213378.80
# [9,] 11645.4220 182825.05
# [10,] 9590.2486 139938.66
## finally sum the charge states to get the final counts
r <- rowSums(r, na.rm=TRUE)
# [1] 44793.81 51567.72 32975.54 40313.21 35358.30 215824.73 241431.82
# [8] 224396.12 194470.47 149528.91
rAllT <- r
## if requantifyAll != TRUE: find all runs without any saturation
runUnsat <- which(rowSums(!unsat, na.rm=TRUE) == 0L)
# [1] 4 5
## and replace their theoritical values with the original experimental ones
r[runUnsat] <- rowSums(m[runUnsat, , drop=FALSE], na.rm=TRUE)
rAllF <- r
# [1] 44793.81 51567.72 32975.54 40978.00 37496.00 215824.73 241431.82
# [8] 224396.12 194470.47 149528.91
rAllF == rAllT
# [1] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## compare to original algorithm
all.equal(rAllT, as.vector(exprs(requantify(poi, method="th.mean", saturationThreshold=saturationThreshold, requantifyAll=TRUE))))
# [1] TRUE
all.equal(rAllF, as.vector(exprs(requantify(poi, method="th.mean", saturationThreshold=saturationThreshold, requantifyAll=FALSE))))
# [1] TRUE
Now regarding your points:
1) there is a theoretical peptide distribution: lets call it
thDistr
Correct, we use the BRAIN
package to get this distribution from the peptide sequence (code is taken from above):
## use BRAIN to caluclation isotopic distribution for 5 isotopes
probs <- BRAIN::calculateIsotopicProbabilities(
BRAIN::getAtomsFromSeq("MTDLDLAGK"), nrPeaks=nIsotopes)
# [1] 0.569071397 0.279605131 0.110559250 0.031619035 0.007398698
2) real peptide distribution is the theoretical distribution multiplied by a certain constant K, so observed distribution is:
obsDistr = thDistr * K
3) you calculate K from unsaturated ions K = obsDistr[unsat]/thDistr[unsat] and then you multiply it by theoretical proportion of saturated ions obsDistr[sat] = thDistr[sat] * K
More or less correct, but I would say it in a slightly different way:
The exists the true counts/intensity (ti
) for a peptide but we didn't know it. ti
is the sum of the isotope intensities (ii
): ti = ii_1_0 + ii_1_1 + ii_2_0 + ii_2_1 + ii_... + ii_maxCharge_maxIsotopes
We try to use the observed distribution to get ti
. In a perfect experiment without bias the observed distribution would be the same as the theoretical one. That's why we estimate ti
by the ratio of observed and theoretical distribution: ti = obsDistr/thDistr
but we do that for each ion and charge state individually: ti = mean(ii_1_0/thDistr_0, ii_1_1/thDistr_1) + mean(ii_2_0/thDistr_0, ii_2_1/thDistr_1, ...) + ...
(that is IMHO exactly what you want to do in your 4) point; see below; you point 4 was correct for the code one week ago, which was indeed not correct)
Find the relevant code in the following lines:
## divide each count by its corresponding proportion
th <- t(t(m)/probs[iIsotopes])
# 1_0 1_1 2_0 2_1 2_2 1_2 2_3
# [1,] 2655.2029 2249.6011 0.00 52799.46 31883.36 NA NA
# [2,] 2048.9520 1680.9420 0.00 57534.71 42004.63 1664.266 NA
# [3,] 2059.4955 1577.2243 0.00 53661.39 29812.07 1727.580 10088.86
# [4,] 651.9393 554.3532 44414.46 40914.84 33800.88 NA NA
# [5,] 1669.3863 1248.1888 39532.83 40385.53 21780.18 NA NA
# [6,] 14743.3170 11956.1468 0.00 0.00 204948.93 10383.573 NA
# [7,] 13300.6158 10167.9107 0.00 0.00 232300.78 7679.140 NA
# [8,] 15268.7344 11895.3468 0.00 0.00 213378.80 7860.039 NA
# [9,] 15100.0385 11008.3817 0.00 0.00 182825.05 8827.846 NA
# [10,] 14850.5092 10572.0521 0.00 0.00 185755.60 7941.443 115120.53
# 1_3 2_4
# [1,] NA NA
# [2,] NA NA
# [3,] NA NA
# [4,] NA NA
# [5,] NA NA
# [6,] 6420.183 NA
# [7,] 5376.508 NA
# [8,] 9045.184 NA
# [9,] NA NA
# [10,] 4996.990 118939.8
4) however K is ONLY constant within 1 charge state. E.g. K for charge 1 is mean (c(2655, 2249)) and K for charge 2 is mean (c(52799, 31883)) = 42341.41. ...
As you could see below, that are exactly the same numbers and the same approach we use currently:
r <- cbind(rowMeans(th[, c(1:2, 6, 8)], na.rm=TRUE),
rowMeans(th[, c(3:5, 7, 9)], na.rm=TRUE))
# [,1] [,2]
# [1,] 2452.4020 42341.41
# [2,] 1798.0533 49769.67
# [3,] 1788.1001 31187.44
# [4,] 603.1462 39710.06
# [5,] 1458.7876 33899.51
# [6,] 10875.8049 204948.93
# [7,] 9131.0438 232300.78
# [8,] 11017.3261 213378.80
# [9,] 11645.4220 182825.05
# [10,] 9590.2486 139938.66
## finally sum the charge states to get the final counts
r <- rowSums(r, na.rm=TRUE)
# [1] 44793.81 51567.72 32975.54 40313.21 35358.30 215824.73 241431.82
# [8] 224396.12 194470.47 149528.91
@sgibb yeah sorry, I agree. I misread the algorithm
Just for the record: We tried different packages for the theoretical isotopic distribution. They are all more or less the same but of course different from the observed one:
BRAIN::calculateIsotopicProbabilities(BRAIN::getAtomsFromSeq("MTDLDLAGK"), nrPeaks=5)
# [1] 0.569071397 0.279605131 0.110559250 0.031619035 0.007398698
a <- BRAIN::getAtomsFromSeq("MTDLDLAGK")
form <- paste0(names(a), a, collapse="")
Rdisop::getIsotope(Rdisop::getMolecule(form, z = 0, maxisotopes = 5))[2,]
# [1] 0.562914475 0.284686413 0.112407334 0.032384339 0.007607439
set.seed(1234)
org <- OrgMassSpecR::IsotopicDistributionN("MTDLDLAGK", 0)$intensity
org/sum(org)
# [1] 0.5894 0.2685 0.1047 0.0286 0.0072 0.0014 0.0002
# charge 1
m[, c(1:2, 6, 7)]/rowSums(m[, c(1:2, 6, 7)], na.rm=TRUE)
# 1_0 1_1 1_2 2_3
# isotopicDistr.S130423_05 0.7060748 0.2939252 NA NA
# isotopicDistr.S130423_07 0.6406593 0.2582418 0.10109890 NA
# isotopicDistr.S130423_09 0.5520490 0.2077249 0.08996703 0.1502591
# isotopicDistr.S130423_11 0.7053232 0.2946768 NA NA
# isotopicDistr.S130423_13 0.7313318 0.2686682 NA NA
# isotopicDistr.S130423_06 0.6513469 0.2595295 0.08912352 NA
# isotopicDistr.S130423_08 0.6721428 0.2524643 0.07539295 NA
# isotopicDistr.S130423_10 0.6744024 0.2581496 0.06744800 NA
# isotopicDistr.S130423_12 0.6794497 0.2433779 0.07717245 NA
# isotopicDistr.S130423_14 0.5306750 0.1856201 0.05513344 0.2285714
# charge 2
m[, c(3:5, 7, 9)]/rowSums(m[, c(3:5, 7, 9)], na.rm=TRUE)
# 2_0 2_1 2_2 2_3 2_4
# isotopicDistr.S130423_05 0.6278918 0.3003846 0.07172360 NA NA
# isotopicDistr.S130423_07 0.6057021 0.3059703 0.08832760 NA NA
# isotopicDistr.S130423_09 0.6236685 0.3032643 0.06661950 0.006447701 NA
# isotopicDistr.S130423_11 0.6248146 0.2828043 0.09238109 NA NA
# isotopicDistr.S130423_13 0.6215156 0.3119596 0.06652485 NA NA
# isotopicDistr.S130423_06 0.5036050 0.3447050 0.15169002 NA NA
# isotopicDistr.S130423_08 0.4821599 0.3463949 0.17144516 NA NA
# isotopicDistr.S130423_10 0.4782693 0.3514157 0.17031491 NA NA
# isotopicDistr.S130423_12 0.4801545 0.3486751 0.17117041 NA NA
# isotopicDistr.S130423_14 0.4608795 0.3354762 0.16690913 0.029583154 0.007151971
It seems that the higher isotopes are less abundant as they should be? @pavel-shliaha Could there be something similar to saturation detection on the lower border?
Reference quantitation method was introduced to minimise the number of peptides that could not be requantified by sum method. By defenition of reference method it is clear that it should requantify the same amount or more of saturated peptides. However working with Kuhrev dataset there are 80 peptides that are successfully requantified by sum, but whih have all NAs in the reference method. These peptides are below: the code is available in:
"Y:\RAW\pvs22_QTOF_DATA_data3\synapter2paper\kuharev2015\bugs_investigation\for_bug_investigation_requant_reference"
[1] "FGLSLVR" "VPTANVSVVDLTCR" "LGANAILGVSMAAAR" "LAATIAQLPDQIGAK"
[5] "FNVLASQPADFDR" "LVSWYDNETGYSNK" "GSVNPECTLAQLGAAK" "AGAGHSNTLQVSTV"
[9] "VFLENVIR" "ISGLIYEETR" "LVNELTEFAK" "GADFLVTEVENGGSLGSK"
[13] "VNFAMNVGK" "AEWALR" "FVPSKPMCVEAFSEYPPLGR" "YQVTVIDAPGHR"
[17] "QIVFEIPSETH" "ADLMLYVSK" "ALEEAGAEVEVK" "VPLPPLTEER"
[21] "VTVQSLDVVR" "IWDSTDALELK" "VVLDDKDYFLFR" "IPFLVK"
[25] "LQTMVSHFTIDPSR" "LLLDV" "FSVSPVVR" "ITSVNVGGMAFR"
[29] "AYGPGIEPTGNMVK" "VLDEVVVDNFDQK" "VLDTGGPISVPVGR" "YSIGQPFVYPR"
[33] "EANNFLWPFK" "QINFGGTVETVYER" "AAIEDGWVPGK" "VGFFNPIASEK"
[37] "GTVTDFPGFDER" "VLAVGDGIAR" "ILFDYSK" "GFGFITPEDGSK"
[41] "LSCFAQTVSPAEK" "VNVYGGAVALGHPLGCSGAR" "FPAIIYGGK" "IWEEGSDEVLVK"
[45] "LGDLPTSGQIR" "AFFANPVLTGAVDK" "GTLEDPNLFIR" "LPQLGIEFSGPGAK"
[49] "IFVDEGPSMK" "LVEALCAEHQINLIK" "LPEAGEDLELK" "LLDAQLSTGGIVDPSK"
[53] "TYSYLTPDLWK" "IPVLEQELVR" "YQAFTQADLTNLR" "LQAFSAAIESCNK"
[57] "TLAQEVLGTTK" "SLVNTYQEILK" "AINQGGLTSVAVR" "GFGFGQGAGALVHSE"
[61] "LLCGLLAER" "FEASQVTEK" "QATFEEMIAR" "LCSLFYTNEEVAK"
[65] "LITPAVVSER" "FGPIVSASLEK" "VLEGGIVIEDR" "IVDVVGTLSR"
[69] "IEFELYDNVVPK" "ILTEPNASITVQYK" "TSGTLEMNLK" "LGLLTETELAK"
[73] "GSFPK" "VATEFSETAPATLK" "VVPVP" "EASGVFDDLVR"
[77] "TVAVAGNPTDWK" "FATLGIDPPK" "LAAQEVELK" "EGVLLSK"
I don't check all peptides but there are at least the following problems in the current implementation:
"reference"
looks for a an unsaturated run as reference. If there is none it will return NA
, e.g. happens for "FGLSLVR"
(we could easily change to the run with the highest intensity regardless of staturation or the highest number of unsaturated ions)."reference"
looks for common unsaturated ion between reference and the current run. If there is at least one run that has no unsaturated ions in common it will return NA
for all runs, e.g. happens for "VPTANVSVVDLTCR"
(we could easily change this to return just NA
for the run without any common ions).about theoretical methods. It is interesting, but it appears that there is a systematic intensity bias against lower intensity ions. To invetigate this I have taken all peptides that have all ions below saturation threshold in Kuharev's dataset and compared the observed proportion of an isotopes among all isotopes and then emulated what I would expect theoretically using BRAIN package. Shown below is a plot that has
1) x axis log2 of intensity for an ion 2) y axis log2 of ratio of observed proportion for an isotope against predicted, again only peptides having all ions below saturation threshold have been concidered.
what it shows is that there is a systematic bias against low intensity isotopes, introducing systematic error in requantification
This could probably be accounted for using a similar approach than what we do for rt adjustment: regress with lowess
and use it to correct/predict after correction for saturation.
@lgatto thanks for the suggestion. @sgibb as we discussed yesterday the sum and refenrece detector saturation models as implemented now do not allow to do top3 quantitation. This happens since we simply through out isotopes over saturation threshold, decreasing intensities of the most abundant peptides. I propose a solution analogous to the one we use in theoretical modeling for isotopes for detector saturation: but in this case intensities of the peptides in the runs with detector saturation are modelled on intensities of peptides with ions under saturation threshold using ratios predicted by requantification. Assume we have 6 runs and there is a peptide: 3 times its under detector saturation and 3 times above. We use sum saturation correction and it requantifies as it does now. We then 1) convert the new requantified peptide intensities to PROPORTIONS x/sum(x). 2) calculate a CORRECTION FACTOR which is the ratio of intensities of peptides below saturation BEFORE requantitation to their PROPORTIONS, CORRECTION FACTOR is mean, median or weighted mean of these rations. We then multiply PROPORTIONS of saturated peptides by CORRECTION FACTOR . There should be an option to only do this requantifiaction for peptides for which there are isotopes over saturation threshold, leaving all peptide intensities for which there are no isotopes over saturation threshold intact.
- you need to change that to look for a run that has a maximum number of unsaturated isotopes (not completely unsaturated).
done.
- this is inveitable, but do I understand correctly that sum in this situation would do the same.
yes, but there was a difference between reference and sum because I used na.rm=TRUE
in sum.
That results in counts that were 0 instead of NA. Is now fixed as well.
@lgatto I would assume that the lowess
approach would just slightly improve the situation because the deviations are very large and nearly equally distributed on both sides of the horizontal zero line.
@pavel-shliaha Am I right that you ask me to implement the following:
prop <- exprs(MSnSetPostRequant)/rowSums(exprs(MSnSetPostRequant))
correctionFactor <- exprs(MSnSetPreRequant)/prop
exprs(MSnSetPostRequant) <- mean(correctionFactor) * prop
To be honest this sounds very weird. Why top3 should work for this rescaled values but not for requantified. (I am aware that the sum method removes the high intense ions and top3 could not be valid anymore (or becomes middle3 or low3)).
@lgatto I agree with sebastian, I think there might be too much spread to accurately reauqntify even if the lowess lines overlapped with 0 horisontal lines
@sgibb 1)yes that is what I would like to do. But I would like only peptides with ions over saturation to be corrected
2) I can explain why top3 will not work if you simply remove high intense ions. The reason is that top3 reflects quantities difference between proteins WITHIN a run (i.e. absolute quantitation). This information might be sometimes useful, e.g. when you are interested what proteins are most abundant in your runs. Removing high intensity isotopes disproportinately penalises most abundant proteins. Also some people (like Kuharev) like to perform top3 to then convert this to relative protein abundances BETWEEN runs. They will then complain that algorithm distorts the values as well. I think top3 is not a good way to convert the peptide quantitation to protein quantitation,there are much better ways, but for now we need to make top3 also possible to be able to compare our results with Kuharev's
@pavel-shliaha So there is a rescaleForTop3
function now for you to test. If it is useful and correct I will change the name and add some documentation. You use this function as follows:
newMSnSet <- rescaleForTop3(originalMSnSet, requantifiedMSnSet,
saturationThreshold, onlyForSaturatedRuns=TRUE)
(it uses the mean correction factor; we could provide more methods later.)
@sgibb still too peptides are requantified by sum but not by reference: "FFQELR" "ESNEITIIINPYRETVCFSVEPVK". Can u have a look?
It is the other way around: both are requantified by reference but not by sum (what makes absolutely sense to me):
## fetch peptide of interest
poi <- combMSNset[grep("^FFQELR$", featureNames(combMSNset)),]
## grep isotopic distribution
iso <- fData(poi)[, grep("isotopicDistr", fvarLabels(combMSNset))]
m <- synapter:::.isotopicDistr2matrix(iso)
m
# 1_0 1_1 1_2 1_3 1_4 2_0 2_1 2_2 2_3
# isotopicDistr.S130423_05 632 307 NA NA NA 6099 2445 696 NA
# isotopicDistr.S130423_07 NA NA NA NA NA NA NA NA NA
# isotopicDistr.S130423_09 37909 16706 2711 501 174 8502 3463 677 125
# isotopicDistr.S130423_11 NA NA NA NA NA NA NA NA NA
# isotopicDistr.S130423_13 NA NA NA NA NA NA NA NA NA
# isotopicDistr.S130423_06 NA NA NA NA NA NA NA NA NA
# isotopicDistr.S130423_08 NA NA NA NA NA 5538 1675 NA NA
# isotopicDistr.S130423_10 NA NA NA NA NA 4842 1439 484 NA
# isotopicDistr.S130423_12 NA NA NA NA NA 4128 1625 371 NA
# isotopicDistr.S130423_14 NA NA NA NA NA 3386 1290 349 NA
exprs(requantify(poi, method="sum", saturationThreshold=3e4))
# S130423_05 S130423_07 S130423_09 S130423_11 S130423_13 S130423_06
# FFQELR NA NA NA NA NA NA
# S130423_08 S130423_10 S130423_12 S130423_14
# FFQELR NA NA NA NA
exprs(requantify(poi, method="reference", saturationThreshold=3e4))
# S130423_05 S130423_07 S130423_09 S130423_11 S130423_13 S130423_06
# FFQELR 10179 NA 70768 NA NA NA
# S130423_08 S130423_10 S130423_12 S130423_14
# FFQELR NA 6765 6124 NA
The prototype of suggested function:
requantify <- function (synapterObj , satThreshold, minIsotopes)
synapterObj - list of synapter objects satThreshold - intensity over which accurate ion recording is not possible, due to saturation minIsotopes - minimum number of isotopes accepted for requantitation
For every peptide the function will find common unsaturated ions from supplied synapter objects, and requqntify EMRT using these ions only. satThreshold is used to determine which ions saturate. minIsotopes is the minimum number of isotopes that are sufficient to requantify (i.e. sometimes only few isotopes will be seen in all samples under saturation and over LOD)
Important: the information on peptide isotopes is not presented in current synapter objects! Pep3D file that is loaded in synapter is filtered so that for each EMRT only one isotope is preserved. This means that synapter object will have to be modified to include this information. Current consensus is that a list should be created each element of which represents a single EMRT and contains information on ion intensities from those EMRTs