Closed mdnestor closed 3 years ago
The problem appears when there is an ambiguity in PTM assignment. In this case there are going to be two or more SpectrumIdentificationItem elements within a single SpectrumIdentificationResult element. Each SpectrumIdentificationItem will have everything the same, except that it points to a different peptide_ref. The peptides those peptide_ref values point to differ in the positions of their PTMs. If we don't distinguish these identification items by peptide_ref, then all PTMs collapse onto the same SpectrumIdentificationResult. We can provide a code example where the absence of peptide_ref creates a problem.
Here is an example of the issue in some example data where the sequence "GTQGATAGASSELDASK" appears in two separate identification results, with one of the results containing two distinct spectrum identification items.
library(mzR)
library(MSnID)
library(dplyr)
mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID")
mzRidentObj <- openIDfile(mzids)
x <- modifications(mzRidentObj) %>%
  MSnID:::factor_to_str_converter() %>%
  filter(sequence == "GTQGATAGASSELDASK")
print(x)
This output corresponds to 3 separate spectrum identification items within 2 spectrum identification results. But when counting the number of modifications, rows 1-6 are lumped together as if they belonged to a single spectrum identification item, because they share the same spectrumID and sequence.
y <- x %>%
  group_by(spectrumID, sequence) %>%
  summarise(modification = paste(mass, ' (', location, ')', sep = '', collapse = ', ')) %>%
  select(spectrumID, sequence, modification)
print(y)
The proposed solution is to return the peptide_ref as well. (For the purpose of this example the column is created manually.) For combining with PSM info downstream, this column is also added to the PSM info function.
y <- x
y$peptide_ref <- c(rep("Pep_GTQGATAGASS+80ELDASK", 3),
                   rep("Pep_GTQGATAGAS+80SELDASK", 3),
                   rep("Pep_GTQGATAGASS+80ELDASK", 3))
print(y)
Now results can be grouped by peptide_ref as well:
y <- y %>%
  group_by(spectrumID, sequence, peptide_ref) %>%
  summarise(modification = paste(mass, ' (', location, ')', sep = '', collapse = ', ')) %>%
  select(spectrumID, sequence, modification, peptide_ref)
print(y)
Hi, the diff looks good to me. The new peptide_ref is at a specific position. Is there any danger that some 3rd party code addresses columns by position, e.g. object[,7], and that this change shifts the positions of the following columns by one? Or can we assume that nobody is so stupid as to do positional/numbered access to the object? Do we need another review by e.g. @lgatto?
Yours, Steffen
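The concern about positional access can be illustrated with a toy data frame. This is only a minimal sketch: the column names, their order, and the values below are made up for illustration and are not the actual layout returned by mzR.

```r
# Hypothetical "old" layout: three columns, no peptide reference.
old <- data.frame(
  spectrumID = c("scan=1", "scan=1"),
  sequence   = c("GTQGATAGASSELDASK", "GTQGATAGASSELDASK"),
  mass       = c(79.97, 79.97),
  stringsAsFactors = FALSE
)

# Hypothetical "new" layout: peptideRef is inserted as the third column,
# which shifts every later column one position to the right.
new <- data.frame(
  spectrumID = old$spectrumID,
  sequence   = old$sequence,
  peptideRef = c("Pep_A", "Pep_B"),   # made-up reference values
  mass       = old$mass,
  stringsAsFactors = FALSE
)

# Positional access silently returns a different column after the change.
by_position_old <- old[, 3]   # the mass column
by_position_new <- new[, 3]   # now the peptideRef column

# Name-based access does not depend on where the column sits.
by_name_old <- old[, "mass"]
by_name_new <- new[, "mass"]

print(identical(by_name_old, by_name_new))          # TRUE
print(identical(by_position_old, by_position_new))  # FALSE
```

Code that selects columns by name keeps working wherever the new column is placed; only code that relies on numeric positions is affected.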
I have one minor comment, related to the style guide (with no intention of starting a flame war :-)): all other names use camel case, whereas peptide_ref uses snake case. Irrespective of personal preferences, I would suggest using peptideRef for consistency.
Otherwise, I am ok - thank you @mdnestor @vladpetyuk
@lgatto I just changed the column name to peptideRef - flame war averted!
If the ordering of columns could be an issue as @sneumann suggested then I could put the peptideRef column last. Less natural but maybe safer.
No need for me. Teaches people to use columns by name? Yours Steffen
I blame Android for the brevity and typos
I have pushed to Bioconductor. This feature should become available in mzR 2.23.1 in 24 hours or so.
Suggested by @vladpetyuk for implementing mzR as a backend in the Bioconductor package MSnID.
The issue is that the functions getModInfo and getPsmInfo don't provide enough information to group by spectrum identification item (SII), similar to issue #155. This change adds the peptide_ref column to the output of both functions, which allows grouping by SII.
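To show why having the column in both outputs matters downstream, here is a minimal sketch with hand-built data frames. The tables `psm_info` and `mod_info`, the scan label, the rank values and the single modification per item are all made up for illustration; they only mimic the idea of PSM and modification tables and are not real mzR output. The column is called peptideRef here, following the rename agreed in the discussion.

```r
# Toy PSM table: one spectrum with two candidate identification items
# that differ only in where the modification sits.
psm_info <- data.frame(
  spectrumID = c("scan=10", "scan=10"),
  peptideRef = c("Pep_GTQGATAGASS+80ELDASK", "Pep_GTQGATAGAS+80SELDASK"),
  rank       = c(1, 2),
  stringsAsFactors = FALSE
)

# Toy modification table: one modification row per identification item.
mod_info <- data.frame(
  spectrumID = c("scan=10", "scan=10"),
  sequence   = c("GTQGATAGASSELDASK", "GTQGATAGASSELDASK"),
  peptideRef = c("Pep_GTQGATAGASS+80ELDASK", "Pep_GTQGATAGAS+80SELDASK"),
  mass       = c(79.97, 79.97),   # illustrative value
  location   = c(11, 10),
  stringsAsFactors = FALSE
)

# Joining on spectrumID alone pairs every PSM with every modification row
# of the same spectrum, so the two items can no longer be told apart.
ambiguous <- merge(psm_info, mod_info, by = "spectrumID")

# Joining on spectrumID and peptideRef keeps each modification with the
# identification item it belongs to.
resolved <- merge(psm_info, mod_info, by = c("spectrumID", "peptideRef"))

print(nrow(ambiguous))  # 4
print(nrow(resolved))   # 2
```

The same key pair (spectrumID plus peptideRef) is what the grouping in the example above relies on.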