very strange performance function behaviour

pavel-shliaha commented 8 years ago

The performance and performance2 functions behave very strangely. For the data in:

setwd("Y:\\RAW\\pvs22_QTOF_DATA_data3\\synapter2paper\\kuharev2015\\bugs_investigation\\S130423_09")

"Y" is \prot-filesvr1

synapterAnalysis <- readRDS("synapterAnalysis.RDS")

performance(synapterAnalysis)
(S) Synapter: 12813 EMRTs uniquely matched.
(I) Ident: 24081 peptides.
(Q) Quant: 11501 peptides.
Enrichment (S/Q): 11.41%
Overlap:
   Q    S   QS
4075 5297 7426

performance2(synapterAnalysis)
          na.counts
id.source  FALSE  TRUE
  rescue    2294     0
  transfer 12813  8974

then I use

plotFragmentMatchingPerformance(synapterAnalysis)

[Image: capture (screenshot of the plotFragmentMatchingPerformance plot)]

As you can see, there are only ca. 6,000 uniquely matched EMRTs, not 12813.

sgibb commented 8 years ago

While looking at this I found something strange in the performance function:

## Ident peptides
I <- object$IdentPeptideData$precursor.leID
uI <- unique(I)

## Quant peptides
Q <- object$QuantPeptideData$precursor.leID
uQ <- unique(Q)

w <- c(length(setdiff(uQ, uS)),
       length(setdiff(uS, uQ)),
       length(intersect(uS, uQ)))
names(w) <- c("Q", "S", "QS")

What do the differences/intersections of precursor.leIDs tell us? Isn't that useless? E.g. if we use a master file, we regenerate the precursor.leIDs as 1:nrow(x). The same precursor.leID in different runs means nothing, or am I wrong?

Shouldn't we do something like:

intersect(quant$peptide.seq, ident$peptide.seq)

sgibb commented 8 years ago

Sorry, my fault. I missed the most important line:

## synapter results
S <- object$MatchedEMRTs[object$MatchedEMRTs$matchedEMRTs == 1,
                                     "spectrumID"]
uS <- unique(S)

The spectrumID corresponds to the quant$precursor.leID so it seems all right.
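
To make the overlap logic concrete, here is a minimal sketch with made-up IDs. The vectors S and Q are toy values, not taken from the data above. It only illustrates how setdiff and intersect yield the Q, S and QS counts once uS is defined from the matched spectrumIDs:

```r
## Toy IDs only; in synapter the real values come from
## object$MatchedEMRTs and object$QuantPeptideData.
S <- c(1, 2, 2, 3, 7)     # spectrumIDs of uniquely matched EMRTs
Q <- c(2, 3, 4, 5, 6)     # quant precursor.leIDs

uS <- unique(S)
uQ <- unique(Q)

w <- c(length(setdiff(uQ, uS)),    # ids only in the quant run
       length(setdiff(uS, uQ)),    # ids only in the synapter result
       length(intersect(uS, uQ)))  # ids present in both
names(w) <- c("Q", "S", "QS")
print(w)
```

Because spectrumID and quant$precursor.leID refer to the same id space, comparing them this way is meaningful, unlike comparing precursor.leIDs across unrelated runs.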

pavel-shliaha commented 8 years ago

Tested it with the new synapter version. It does not seem to work.

sgibb commented 8 years ago

@pavel-shliaha sorry if this was misleading. I haven't fixed anything yet.

pavel-shliaha commented 8 years ago

no probs, sorry

sgibb commented 8 years ago

OK, the problem may be that I didn't document the plotFragmentMatchingPerformance function properly. The EMRTs are categorised into unique-true/unique-false (or non-unique-true/non-unique-false) according to the precursor.leID.quant column in the MergedFeatures data.frame: same id == true match, otherwise false match. I have now changed this in the following commit: https://github.com/lgatto/synapter/commit/874797256762ffa95c16c4e69cab83fd107630d1 ; same id == true match, different id == false match, no quant id available == no-quant-id (this case was treated as a false match before). That's because we rely on the PLGS identification to decide whether it is a true or false match. In the current example the MergedFeatures data.frame has 9917 EMRTs.
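
As a rough illustration of the three categories after that commit, here is a minimal sketch on toy data. The data.frame, its column names and the classifyMatch helper are all hypothetical and are not the real MergedFeatures layout or synapter code:

```r
## Hypothetical toy data: one row per matched EMRT.
matches <- data.frame(
  matched.quant.id = c(10, 11, 12, 13),  # id picked by the grid search
  plgs.quant.id    = c(10, 99, NA, 13)   # id from the PLGS identification
)

## Hypothetical helper: same id == true, different id == false,
## missing PLGS id == no-quant-id.
classifyMatch <- function(matched, plgs) {
  ifelse(is.na(plgs), "no-quant-id",
         ifelse(matched == plgs, "true", "false"))
}

matches$type <- classifyMatch(matches$matched.quant.id,
                              matches$plgs.quant.id)
print(table(matches$type))
```

Before the change, a row like the third one (no PLGS quant id) would have been counted as a false match.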

There are 6165 EMRTs that are uniquely matched (by the grid search) to the same quant.id as PLGS did (== unique-true; blue dots/line, left panel). Additionally, there are 2244 EMRTs that are matched, among others, to the same quant.id as PLGS did (== non-unique-true; blue dots/line, right panel). These come with 3417 other matches (red dots/line, right panel); we would wrongly classify around 1500 of them as true matches if we accepted a delta in the number of common peaks of 0. 461 EMRTs are matched differently from the PLGS result (== unique-false; red dots/line, left panel).

performance reports 12813 uniquely matched EMRTs because we apply the fragment matching filter rules to all EMRTs (not only to those in MergedFeatures). We use the plotFragmentMatchingPerformance function to estimate the error we would make if we filtered by a specific threshold, based on the "ground truth" we have in the MergedFeatures data.frame. That's why performance is expected to report a higher number of unique EMRTs than plotFragmentMatchingPerformance.
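
The difference between the two counts can be sketched on toy data. All values and column names below are made up for illustration; in.merged stands in for "this EMRT is part of MergedFeatures and therefore has a PLGS reference":

```r
## Toy data: ten EMRTs, only some of which are in the merged subset.
emrts <- data.frame(
  id           = 1:10,
  unique.match = c(TRUE, TRUE, TRUE, FALSE, TRUE,
                   TRUE, FALSE, TRUE, TRUE, TRUE),
  in.merged    = c(TRUE, TRUE, FALSE, TRUE, FALSE,
                   FALSE, TRUE, TRUE, FALSE, FALSE)
)

## performance-style count: unique matches among all EMRTs
nAll <- sum(emrts$unique.match)

## plot-style count: unique matches restricted to the merged subset
nMerged <- sum(emrts$unique.match & emrts$in.merged)

cat("all EMRTs:", nAll, " merged subset:", nMerged, "\n")
```

Since the second count is taken over a subset of the rows used for the first, it can never be larger, which matches the 12813 versus ca. 6,000 observation in this issue.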

pavel-shliaha commented 8 years ago

Sorry, Sebastian, my bad. I now understand that what happens actually makes sense! I will resume work on the synapter paper shortly.

lgatto / synapter

very strange performance function behaviour #110