Negative values in `AfterMSRun$Enriched`

mlocardpaulet commented 1 month ago

Here is an MSRun that was generated to mimic experimental data. I see negative values in the output. Also, the distribution of the enriched quantities is bimodal, and I am not sure why. Here you can see a plot where each point is a peptidoform: the vertical axis is the quantity in the non-enriched table (condition 8, replicate 3) and the horizontal axis is the quantity in the enriched table of the same sample. I have not identified what is going on.

outputMSRun_abd85b169eadc9d1af1765a62876b4ee.RData.zip

mlocardpaulet commented 1 month ago

OK, I get it now. Stupid of me not to see it. The ones with extra low signal are the non-phospho:

But some should disappear instead of having negative values. What do you think? Also, here the signal difference is wayyyy too high. But I suspect that it could be corrected with a parameter. EnrichmentLoss maybe? (here it is 0.25 -- it may not be the same example as above...)

veitveit commented 1 month ago

The EnrichmentLoss divides the total signal of all (!) peptides into enriched and non-enriched.

In your picture, the few phosphorylated peptides get 75% of the total intensity and the many non-modified ones get the 25%. As there are many more non-modified peptides, the difference becomes so drastic.
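
To make the arithmetic concrete, here is a minimal Python sketch (not PhosFake's R code). The peptide counts and the total intensity are made-up numbers, and `split_total_signal` is a hypothetical helper. It only assumes, as described above, that EnrichmentLoss is the share of the total signal left to the non-modified peptides.

```python
def split_total_signal(total, n_mod, n_nonmod, enrichment_loss):
    """Split a total intensity between modified and non-modified peptides.

    Assumption: enrichment_loss is the share of the total signal given to
    the non-modified peptides; the remainder goes to the modified ones.
    Returns the mean intensity per peptide in each group.
    """
    mean_mod = total * (1.0 - enrichment_loss) / n_mod
    mean_nonmod = total * enrichment_loss / n_nonmod
    return mean_mod, mean_nonmod


# Made-up example: 100 modified and 900 non-modified peptides.
mean_mod, mean_nonmod = split_total_signal(
    total=1_000_000.0, n_mod=100, n_nonmod=900, enrichment_loss=0.25
)
ratio = mean_mod / mean_nonmod
print(f"mean modified:     {mean_mod:.1f}")
print(f"mean non-modified: {mean_nonmod:.1f}")
print(f"per-peptide ratio: {ratio:.1f}")
```

With these illustrative counts, a 3:1 split of the total turns into a 27-fold difference per peptide, because the smaller share is spread over nine times as many peptides.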

Not sure what would be better: a) taking the total intensities, as is done here, or b) making it proportional to the number of peptides, so that a phosphorylated peptide is on average that much larger than a non-modified one.

mlocardpaulet commented 1 month ago

Hmm... I see. This signal split is a bit weird, isn't it? It makes it difficult to decide on a value for EnrichmentLoss. I naively thought that it was a loss of signal applied to the non-phosphorylated peptides.

Here is how the same plot looks on one data set that I have: It is not so divided (and actually here, only a few phosphorylated peptides are found in the non phospho, but due to other things...).

mlocardpaulet commented 1 month ago

Maybe this parameter could shift the distribution of non-phosphorylated peptides to the left? I mean, this is what you do already 😅, but we could give the value of the shift instead of a value to split the data set?

mlocardpaulet commented 1 month ago

Also, I think that there may be something weird with EnrichmentEfficiency. I only use values ≥ 0.8, but you can see in these plots that there are not many phosphorylated peptides. Certainly fewer than 80%.

veitveit commented 1 month ago

The idea of enrichment would be to enhance the signals of the phosphorylated peptides. This is why I kept the total intensity of all peptides. Fixing the difference of averages between phosphorylated and non-phosphorylated peptides would be easier to use, as one can then "predict" the outcome from the parameter. Should I do that?

When you use the 80%, can it be that you in general do not have many phosphorylated peptides?

mlocardpaulet commented 1 month ago

In experimental data, we calculate enrichment as number of phosphorylated peptides / total number of peptides, and in general we get 0.7 to 0.95. So we really "lose" non-phosphorylated peptides. I think that we should do the same.
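
As a small illustration of that ratio, here is a Python sketch (not PhosFake code). `enrichment_efficiency` is a hypothetical helper and the flags are made-up data: each peptide identified in the enriched sample is marked as phosphorylated or not.

```python
def enrichment_efficiency(is_phospho):
    """Fraction of identified peptides that are phosphorylated.

    is_phospho: iterable of booleans, one per peptide identified in the
    enriched sample (True = phosphorylated).
    """
    flags = list(is_phospho)
    if not flags:
        raise ValueError("no peptides given")
    return sum(flags) / len(flags)


# Made-up example: 17 phosphorylated peptides out of 20 identified.
flags = [True] * 17 + [False] * 3
efficiency = enrichment_efficiency(flags)
print(efficiency)
```

The value depends only on which peptides are still present, not on their intensities, which is why lowering the signal of non-phosphorylated peptides without removing them does not change it.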

Your last point is a good one. I'll look into it.

mlocardpaulet commented 1 month ago

Maybe we can use the parameter as the difference of averages between phosphorylated and non-phosphorylated peptides, and then have a step to remove the less abundant ones? These would naturally be non-phosphorylated.

veitveit commented 1 month ago

I assumed the less abundant ones would be removed anyway in the MSRun part. Otherwise, we would need to set the threshold somewhere.

mlocardpaulet commented 1 month ago

Yes, but I suspect that the loss may have to be different for enriched and non-enriched. My feeling is that we should lose more for the enriched. And right now, there is only one parameter. Do you want to discuss this in a short meeting?

mlocardpaulet commented 1 month ago

To be honest from what I see, I think that we could kick off all the non-phosphorylated peptidoforms that are on the left side of the plots that I sent in earlier messages. This would mimic better what I see in experimental data.

mlocardpaulet commented 1 month ago

... And maybe we should consider removing some phosphorylated peptides from the non-enriched.

veitveit commented 1 month ago

I now updated PhosFake to change the averages (all modified vs. all non-modified) and then adjust everything back to the full total intensity. Not sure about removing modified peptides from the non-enriched. That comes back to the question of whether we want to decrease the signal of modified proteoforms.
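
A rough Python sketch of that idea (shift the group averages, then rescale to the original total) is given here for discussion; it is not the actual PhosFake implementation. It assumes linear-scale intensities, which the thread does not state, and `shift_and_rescale` and `fold` are hypothetical names.

```python
def shift_and_rescale(intensities, is_modified, fold):
    """Lower non-modified intensities by `fold`, then restore the total.

    Assumptions: intensities are on a linear scale, and `fold` is the
    desired ratio by which non-modified peptides are lowered relative to
    modified ones.
    """
    total_before = sum(intensities)
    shifted = [
        x if mod else x / fold
        for x, mod in zip(intensities, is_modified)
    ]
    # Rescale every value so the summed intensity is unchanged.
    scale = total_before / sum(shifted)
    return [x * scale for x in shifted]


# Made-up example: one modified and three non-modified peptides.
before = [100.0, 100.0, 100.0, 100.0]
flags = [True, False, False, False]
after = shift_and_rescale(before, flags, fold=4.0)
print(after)
print(sum(after))
```

In this version the parameter directly sets the per-peptide ratio between the two groups, regardless of how many peptides each group contains, while the summed intensity stays the same.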

mlocardpaulet commented 1 month ago

Thanks a lot, I'll try it out. I don't agree with your second statement. The way I see it, removing modified peptides (to a certain extent) from the non-enriched would mimic the lower flyability of the modified peptides. No? But you can wait and see what I get from the modified pipeline and we'll decide :)

veitveit commented 1 month ago

Good point, and I honestly do not know how relevant this is in the end. There is a discrepancy due to e.g. lower ionization of phospho-peptides.

Also a quite interesting question: do we see fewer modified proteoforms due to the lower flyability or due to lower abundance?

mlocardpaulet commented 1 month ago

Maybe we can test this with PhosFake?

When I try the last version, here is what I get (we are getting closer): distribution of non-enriched peptidoforms intensity (blue is from experimental data, salmon is simulation -- I'll write the parameters below):

For the phospho-enriched, we still have the issue that too many non-phosphorylated peptides are present with a low intensity (first high bar in the histogram; vertical group on the left of the scatter plot). I think that we should kick them off, but since we only filter based on quantiles, this won't work.

Here is the scatter plot with the values in enriched and non-enriched. Basically, I'd kick out all the ones in the group on the left.

I still have to figure out the best parameters for enrichment. Clearly need to increase the noise.

input_file -- ../../data/output/SimulatedDataSets/Real_data_visualisation//outputMSRun_900271b8758aebe3214902aade3a1ea6.RData 
NumCond -- 9 
NumReps -- 3 
PathToFasta -- uniprotkb_mus_musculus_AND_reviewed_tru_2024_07_11.fasta 
PathToProteinList -- NA 
FracModProt -- 1 
PropModPerProt -- 1 
PTMTypes -- ph 
PTMTypesDist -- 0.8 
PTMTypesMass -- 79.9663 
PTMMultipleLambda -- 0.1 
ModifiableResidues -- c("S", "T", "Y") 
ModifiableResiduesDistr -- c(0.86, 0.13, 0.01) 
RemoveNonModFormFrac -- 0.8 
paramProteoformAb -- 4 
PropMissedCleavages -- 0.2 
PercExpressedProt -- 0.25 
QuantNoise -- 0.3 
DiffRegFrac -- 0.1 
DiffRegMax -- 3 
UserInputFoldChanges_NumRegProteoforms -- NA 
UserInputFoldChanges_RegulationFC -- NA 
ThreshNAProteoform -- 0.005 
AbsoluteQuanMean -- 7 
AbsoluteQuanSD -- 2 
ThreshNAQuantileProt -- 0.01 
QuantColnames -- C_1_R_1|C_1_R_2|C_1_R_3|C_2_R_1|C_2_R_2|C_2_R_3|C_3_R_1|C_3_R_2|C_3_R_3|C_4_R_1|C_4_R_2|C_4_R_3|C_5_R_1|C_5_R_2|C_5_R_3|C_6_R_1|C_6_R_2|C_6_R_3|C_7_R_1|C_7_R_2|C_7_R_3|C_8_R_1|C_8_R_2|C_8_R_3|C_9_R_1|C_9_R_2|C_9_R_3 
Enzyme -- trypsin.strict 
PropMissedCleavages -- 0.2 
MaxNumMissedCleavages -- 3 
PepMinLength -- 7 
PepMaxLength -- 50 
LeastAbundantLoss -- 0.005 
EnrichmentLoss -- 0.5 
EnrichmentEfficiency -- 0.98 
EnrichmentNoise -- 0.05 
PercDetectability -- 0.08 
PercDetectedVal -- 0.95 
WeightDetectVal -- 0.05 
MSNoise -- 0.1 
WrongIDs -- 0.01 
WrongLocalizations -- 0 
MaxNAPerPep -- 100 
ID -- outputMSRun_900271b8758aebe3214902aade3a1ea6
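
On the point above that a quantile-based filter cannot remove the low-intensity group: here is a minimal Python sketch with made-up bimodal data (not PhosFake code, and not the real simulation output). `drop_lowest_fraction` is a hypothetical stand-in for a quantile filter; the fraction 0.005 is taken from LeastAbundantLoss in the list above, while the 300/700 group sizes are invented.

```python
def drop_lowest_fraction(values, fraction):
    """Drop the lowest `fraction` of the values (simple quantile-style filter)."""
    n_drop = int(round(len(values) * fraction))
    return sorted(values)[n_drop:]


# Made-up bimodal data: 300 low-intensity values (the group on the left)
# and 700 high-intensity values.
low = [1.0 + 0.001 * i for i in range(300)]
high = [10.0 + 0.001 * i for i in range(700)]

kept = drop_lowest_fraction(low + high, fraction=0.005)
n_low_left = sum(v < 5.0 for v in kept)
print(len(kept), n_low_left)
```

With these numbers the filter drops only 5 of the 1000 values, so 295 of the 300 low-intensity values remain. Removing the whole group would need either a much larger fraction or a filter that targets the non-phosphorylated peptides directly.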

veitveit commented 1 month ago

This does not look bad indeed! I assume the distributions of the enriched and non-enriched fractions of the experimental data look very similar anyway?

veitveit commented 1 month ago

I now changed the parameter to remove non-modified peptides to reach the given enrichment.
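
For readers of this thread, a minimal Python sketch of what "removing non-modified peptides to reach a given enrichment" could look like. It is not the PhosFake implementation: `trim_to_enrichment` is a hypothetical helper, the data are made up, and the choice to drop the lowest-intensity non-modified peptides first is an assumption based on the suggestion earlier in the thread.

```python
def trim_to_enrichment(peptides, target):
    """Remove non-modified peptides until modified / total >= target.

    peptides: list of (intensity, is_modified) tuples.
    Assumption: the lowest-intensity non-modified peptides are removed first.
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    mod = [p for p in peptides if p[1]]
    nonmod = sorted(
        (p for p in peptides if not p[1]), key=lambda p: p[0], reverse=True
    )
    n_keep = len(nonmod)
    # Drop one non-modified peptide at a time until the target is met.
    while n_keep > 0 and len(mod) / (len(mod) + n_keep) < target:
        n_keep -= 1
    return mod + nonmod[:n_keep]


# Made-up example: 8 modified peptides and 12 non-modified ones.
peptides = [(100.0, True)] * 8 + [(float(i), False) for i in range(1, 13)]
trimmed = trim_to_enrichment(peptides, target=0.8)
n_mod = sum(1 for _, is_mod in trimmed if is_mod)
print(len(trimmed), n_mod / len(trimmed))
```

Here 10 of the 12 non-modified peptides are removed, leaving 8 modified out of 10 peptides, i.e. an enrichment of 0.8.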

veitveit / PhosFake

Negative values in `AfterMSRun$Enriched` #14