Add peptide_ref column to modification info and PSM info

mdnestor commented 3 years ago

Suggested by @vladpetyuk for implementing mzR as a backend in the Bioconductor package MSnID.

The issue is the functions getModInfo and getPsmInfo don't provide enough information to group by spectrum identification item (SII), similar to issue #155. This change adds the peptide_ref column to the output of both functions which allows grouping by SII.

vladpetyuk commented 3 years ago

The problem appear when there is an ambiguity in PTM assignment. In this case there are going to be two or more SpectrumIdentificationItem elements within SpectrumIdentificationResult element. Each SpectrumIdentificationItem will have everything the same, except pointing to different peptide_ref. Those peptide_ref elements will have different position of PTMs. If we don't distinguish these identification item by peptide_ref, then all PTMs collapse onto the same SpectrumIdentificationResult. We can provide a code example, where the absence of peptide_ref creates a problem.

mdnestor commented 3 years ago

Here is an example of the issue in some example data where the sequence "GTQGATAGASSELDASK" appears in two separate identification results, with one of the results containing two distinct spectrum identification items.

library(mzR)
library(MSnID)
library(dplyr)

mzids <- system.file("extdata","phospho.mzid.gz",package="MSnID")
mzRidentObj <- openIDfile(mzids)

x <- modifications(mzRidentObj) %>%
    MSnID:::factor_to_str_converter() %>%
    filter(sequence=="GTQGATAGASSELDASK")
print(x)

This output corresponds to 3 separate spectrum identification times within 2 spectrum identification results. But when counting the number of modifications, rows 1-6 are lumped together as a single spectrum identification result.

y <- x %>%
  group_by(spectrumID, sequence) %>%
  summarise(modification = paste(mass,' (',location,')',sep='',collapse=', ')) %>%
  select(spectrumID,sequence,modification)
print(y)

The proposed solution is to return the peptide_ref as well. (For the purpose of this example the column is created manually.) For combining with PSM info downstream, this column is also added to the PSM info function.

y <- x
y$peptide_ref <- c(rep("Pep_GTQGATAGASS+80ELDASK",3),
                   rep("Pep_GTQGATAGAS+80SELDASK",3),
                   rep("Pep_GTQGATAGASS+80ELDASK",3))
print(y)

Now results can be grouped by peptide_ref as well:

y <- y %>%
  group_by(spectrumID, sequence, peptide_ref) %>%
  summarise(modification = paste(mass,' (',location,')',sep='',collapse=', ')) %>%
  select(spectrumID,sequence,modification, peptide_ref)
print(y)

sneumann commented 3 years ago

Hi, the diff looks good to me. The new peptide_ref is at a specific position. Any danger that some 3rd party code would address columns by position object[,7] and this change moves positions for the following columns by one ? Or can we assume that nobody is so stupid to do positional/numbered access to the object ? Do we need another review by e.g. @lgatto ? Yours, Steffen

lgatto commented 3 years ago

I have one minor comment, related to the style guide (with no intention of flame war :-) is that all names use camel case, and that peptide_ref uses snake case. Irrespective of personal preferences, I would suggest to use peptideRef for consistency.

lgatto commented 3 years ago

Otherwise, I am ok - thank you @mdnestor @vladpetyuk

mdnestor commented 3 years ago

@lgatto I just changed the column name to peptideRef - flame war averted!

If the ordering of columns could be an issue as @sneumann suggested then I could put the peptideRef column last. Less natural but maybe safer.

sneumann commented 3 years ago

No need for me. Teaches people to use columns by name? Yours Steffen

I blame Android for the brevity and typos

lgatto commented 3 years ago

I have pushed to pushed to Bioconductor. This features should become available in mzR 2.23.1 in 24 hours or so.

sneumann / mzR

Add peptide_ref column to modification info and PSM info #227