Closed mlocardpaulet closed 1 month ago
OK, I get it now. Stupid of me not to see it. The ones with extra low signal are the non-phospho:
But some should disappear instead of having negative values. What do you think?
Also, here the signal difference is way too high. But I suspect that it could be corrected with a parameter. EnrichmentLoss maybe? (Here it is 0.25; it may not be the same example as above...)
The EnrichmentLoss parameter divides the total signal of all (!) peptides into enriched and non-enriched.
In your picture, the few phosphorylated peptides get 75% of the total intensity and the many non-modified ones get the remaining 25%. As there are many more non-modified peptides, each of them ends up with a much smaller share, which is why the difference becomes so drastic.
Not sure what would be better: a) taking the total intensities like here, or b) making it proportional to the number of peptides, such that a phosphorylated peptide is on average that much larger than a non-modified one.
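To make the two options concrete, here is a minimal Python sketch, not PhosFake code. It assumes linear-scale intensities, equal intensity for every peptide within a group, and made-up peptide counts; the function names and the `ratio` parameter are hypothetical.

```python
import math


def split_total(total, n_phos, n_nonmod, nonmod_share):
    """Option a): the non-modified group gets `nonmod_share` of the total."""
    per_phos = total * (1 - nonmod_share) / n_phos
    per_nonmod = total * nonmod_share / n_nonmod
    return per_phos, per_nonmod


def split_by_ratio(total, n_phos, n_nonmod, ratio):
    """Option b): fix the per-peptide ratio phospho / non-modified, keep the total."""
    per_nonmod = total / (n_phos * ratio + n_nonmod)
    return per_nonmod * ratio, per_nonmod


# Hypothetical numbers: few phosphorylated peptides, many non-modified ones.
total = 1_000_000.0
n_phos, n_nonmod = 100, 10_000

a_phos, a_nonmod = split_total(total, n_phos, n_nonmod, nonmod_share=0.25)
b_phos, b_nonmod = split_by_ratio(total, n_phos, n_nonmod, ratio=4.0)

# Per-peptide log2 difference between the two groups under each option.
print("a) log2 difference:", math.log2(a_phos / a_nonmod))
print("b) log2 difference:", math.log2(b_phos / b_nonmod))
```

With option a) the per-peptide difference depends on how many peptides sit in each group, whereas with option b) it is set directly by the parameter.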
Hum... I see. This signal split is a bit weird, isn't it? It makes it difficult to decide on a value for EnrichmentLoss. I naively thought that it was a loss of signal applied to the non-phosphorylated peptides.
Here is how the same plot looks on one data set that I have: It is not so divided (and actually here, only a few phosphorylated peptides are found in the non-phospho group, but due to other things...).
Maybe this parameter could shift the distribution of non-phosphorylated peptides to the left? I mean, this is what you do already 😅, but we could give the value of the shift instead of a value to split the data set?
Also, I think that there may be something weird with EnrichmentEfficiency. I only use values ≥ 0.8, but you can see in these plots that there are not many phosphorylated peptides. In any case, fewer than 80%.
The idea of enrichment would be to enhance the signals of the phosphorylated peptides. This is why I kept the total intensity of all peptides. Fixing the difference of averages between phosphorylated and non-phosphorylated peptides would be easier to use, as one can then "predict" the outcome from the parameter. Should I do that?
When you use the 80%, can it be that you in general do not have many phosphorylated peptides?
In experimental data, we calculate enrichment as number of phosphorylated peptides / total number of peptides, and in general we get 0.7 to 0.95. So we really "lose" non-phosphorylated peptides. I think that we should do the same.
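A minimal Python sketch of that definition, assuming enrichment = phosphorylated / total and assuming the least intense non-modified peptides are the ones dropped. This is an illustration, not the PhosFake implementation; both function names are hypothetical.

```python
import math


def n_nonmod_to_keep(n_phos, efficiency):
    """Largest non-modified count with n_phos / (n_phos + kept) >= efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    # Small epsilon guards against floating-point results such as 1.9999999999999996.
    return math.floor(n_phos * (1 - efficiency) / efficiency + 1e-9)


def enrich(peptides, efficiency):
    """peptides: list of (is_phospho, intensity) tuples.

    Keeps all phosphorylated peptides and only the most intense
    non-modified ones, so that the target enrichment is reached.
    """
    phos = [p for p in peptides if p[0]]
    nonmod = sorted((p for p in peptides if not p[0]),
                    key=lambda p: p[1], reverse=True)
    keep = n_nonmod_to_keep(len(phos), efficiency)
    return phos + nonmod[:keep]


# Hypothetical toy data: 8 phosphorylated and 10 non-modified peptides.
peptides = [(True, 5.0)] * 8 + [(False, float(i)) for i in range(1, 11)]
enriched = enrich(peptides, efficiency=0.8)
fraction = sum(1 for p in enriched if p[0]) / len(enriched)
print(len(enriched), fraction)
```

If nothing is removed, the fraction in this toy set is 8/18, which would explain seeing far fewer than 80% phosphorylated peptides even with a high parameter value.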
Your last point is a good one. I'll look into it.
Maybe we can use the parameter as differences of averages between phosphorylated and non-phosphorylated peptides, and then have a step to remove the less abundant ones? These would naturally be non-phosphorylated.
I assumed the less abundant ones would be removed in the MSRun part anyway. Otherwise, we would need to set the threshold somewhere.
Yes, but I suspect that the loss may have to be different for enriched and non-enriched. My feeling is that we should lose more for the enriched. And right now, there is only one parameter. Do you want to discuss this in a short meeting?
To be honest from what I see, I think that we could kick off all the non-phosphorylated peptidoforms that are on the left side of the plots that I sent in earlier messages. This would mimic better what I see in experimental data.
... And maybe we should consider removing some phosphorylated peptides from the non-enriched.
I now updated PhosFake to change the averages (all modified vs all non-modified) and then adjust everything back to the full total intensity. Not sure about removing modified peptides from the non-enriched. That comes back to the question of whether we want to decrease the signal of modified proteoforms.
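The idea of "change the averages, then adjust back to the total" can be sketched like this in Python. It is only an illustration under assumptions (linear-scale intensities, the mean ratio as the parameter) and not the actual PhosFake code; `set_mean_ratio` and `target_ratio` are hypothetical names.

```python
def set_mean_ratio(phos, nonmod, target_ratio):
    """Rescale so that mean(phos) / mean(nonmod) == target_ratio
    while the summed intensity of all peptides stays unchanged."""
    total = sum(phos) + sum(nonmod)
    mean_phos = sum(phos) / len(phos)
    mean_nonmod = sum(nonmod) / len(nonmod)

    # Step 1: scale the non-modified group to reach the target ratio of means.
    f = mean_phos / (target_ratio * mean_nonmod)
    nonmod_scaled = [x * f for x in nonmod]

    # Step 2: one common factor restores the original total; it leaves the ratio intact.
    g = total / (sum(phos) + sum(nonmod_scaled))
    return [x * g for x in phos], [x * g for x in nonmod_scaled]


# Hypothetical toy intensities.
phos = [100.0, 200.0, 300.0]
nonmod = [50.0, 100.0, 150.0, 200.0]
new_phos, new_nonmod = set_mean_ratio(phos, nonmod, target_ratio=4.0)

ratio = (sum(new_phos) / len(new_phos)) / (sum(new_nonmod) / len(new_nonmod))
print(ratio, sum(new_phos) + sum(new_nonmod))
```

Because both steps are multiplicative, the shape of each group's distribution is kept on a log scale; only the distance between the two groups changes.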
Thanks a lot, I'll try it out. I don't agree with your second statement. The way I see it, removing modified peptides (to a certain extent) from the non-enriched would mimic the lower flyability of the modified peptides. No? But you can wait and see what I get from the modified pipeline and we'll decide :)
Good point, and I honestly do not know how relevant this is in the end. There is a discrepancy due to e.g. lower ionization of phospho-peptides.
Also quite an interesting question: do we see fewer modified proteoforms due to lower flyability or due to lower abundance?
Maybe we can test this with PhosFake?
When I try the latest version, here is what I get (we are getting closer): distribution of non-enriched peptidoform intensities (blue is from experimental data, salmon is simulation; I'll write the parameters below):
For the phospho-enriched, we still have the issue that too many non-phosphorylated peptides are present with a low intensity (first high bar in the histogram; vertical group on the left of the scatter plot). I think that we should kick them out, but since we only filter based on quantiles, this won't work.
Here is the scatter plot with the values in enriched and non-enriched. Basically, I'd kick out all the ones in the group on the left.
I still have to figure out the best parameters for enrichment. We clearly need to increase the noise.
input_file -- ../../data/output/SimulatedDataSets/Real_data_visualisation//outputMSRun_900271b8758aebe3214902aade3a1ea6.RData
NumCond -- 9
NumReps -- 3
PathToFasta -- uniprotkb_mus_musculus_AND_reviewed_tru_2024_07_11.fasta
PathToProteinList -- NA
FracModProt -- 1
PropModPerProt -- 1
PTMTypes -- ph
PTMTypesDist -- 0.8
PTMTypesMass -- 79.9663
PTMMultipleLambda -- 0.1
ModifiableResidues -- c("S", "T", "Y")
ModifiableResiduesDistr -- c(0.86, 0.13, 0.01)
RemoveNonModFormFrac -- 0.8
paramProteoformAb -- 4
PropMissedCleavages -- 0.2
PercExpressedProt -- 0.25
QuantNoise -- 0.3
DiffRegFrac -- 0.1
DiffRegMax -- 3
UserInputFoldChanges_NumRegProteoforms -- NA
UserInputFoldChanges_RegulationFC -- NA
ThreshNAProteoform -- 0.005
AbsoluteQuanMean -- 7
AbsoluteQuanSD -- 2
ThreshNAQuantileProt -- 0.01
QuantColnames -- C_1_R_1|C_1_R_2|C_1_R_3|C_2_R_1|C_2_R_2|C_2_R_3|C_3_R_1|C_3_R_2|C_3_R_3|C_4_R_1|C_4_R_2|C_4_R_3|C_5_R_1|C_5_R_2|C_5_R_3|C_6_R_1|C_6_R_2|C_6_R_3|C_7_R_1|C_7_R_2|C_7_R_3|C_8_R_1|C_8_R_2|C_8_R_3|C_9_R_1|C_9_R_2|C_9_R_3
Enzyme -- trypsin.strict
PropMissedCleavages -- 0.2
MaxNumMissedCleavages -- 3
PepMinLength -- 7
PepMaxLength -- 50
LeastAbundantLoss -- 0.005
EnrichmentLoss -- 0.5
EnrichmentEfficiency -- 0.98
EnrichmentNoise -- 0.05
PercDetectability -- 0.08
PercDetectedVal -- 0.95
WeightDetectVal -- 0.05
MSNoise -- 0.1
WrongIDs -- 0.01
WrongLocalizations -- 0
MaxNAPerPep -- 100
ID -- outputMSRun_900271b8758aebe3214902aade3a1ea6
This does not look bad indeed! I assume the distributions of the enriched and non-enriched fractions of the experimental data look very similar anyway?
I have now changed the parameter so that non-modified peptides are removed until the given enrichment is reached.
Here is an MSRun that was generated to mimic experimental data. I see negative values in the output. Also, I see that the distribution of the enriched quantities is bimodal. I am not sure why. Here you can see a plot where each point is a peptidoform: the vertical axis is the quantity in the non-enriched table (condition 8, replicate 3), the horizontal axis is the quantity in the enriched table of the same sample. I have not identified what is going on.
outputMSRun_abd85b169eadc9d1af1765a62876b4ee.RData.zip