lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

error in "addIdentificationData" #584

Closed ibphuangchen closed 1 year ago

ibphuangchen commented 1 year ago

Hi,

I found that for some MS identification results, the "addIdentificationData" function does not work. As quoted from the help page:

If after filtering, more then one PSM per spectrum are still present, these are combined (reduced, see reduce,data.frame-method) into a single row and separated by a semi-colon. This has as side-effect that feature variables that are being reduced are converted to characters. See the reduce manual page for examples.

It looks to me like there is a data frame join step that uses the acquisition.number column in the spectra data (numeric) and the acquisitionNum column in the identification data (which is now character, with many entries containing ";"). As a result, these two columns cannot be matched to each other.

I encounter this problem whenever more than one PSM per spectrum remains after filtering, which is a common case for the identification results from many search engines. Could you please fix this issue?
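
To make the mismatch concrete, here is a minimal sketch in plain Python (not R, and not MSnbase's actual code). The `reduce_psms` helper is a hypothetical stand-in for the reduce step described in the help page, the `spectrumID` grouping column is an assumption, and all values are made up. The join compares keys as strings, so the failure shown comes from the ";" in the reduced key rather than from the numeric versus character difference alone.

```python
def reduce_psms(rows, key):
    """Collapse rows sharing `key` into one row.

    Every other column is converted to a string and joined with ';',
    mimicking the side effect quoted from the help page.
    """
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row)

    reduced = []
    for key_value, group in grouped.items():
        merged = {key: key_value}
        for column in group[0]:
            if column != key:
                merged[column] = ";".join(str(r[column]) for r in group)
        reduced.append(merged)
    return reduced


# Hypothetical identification data: two PSMs for the second spectrum.
psms = [
    {"spectrumID": "scan=1", "acquisitionNum": 1, "sequence": "PEPTIDEK"},
    {"spectrumID": "scan=2", "acquisitionNum": 2, "sequence": "AAAK"},
    {"spectrumID": "scan=2", "acquisitionNum": 2, "sequence": "CCCK"},
]

# Hypothetical spectra data with a numeric acquisition number.
spectra = [{"acquisition.number": n} for n in (1, 2, 3)]

reduced = reduce_psms(psms, "spectrumID")

# Join on the acquisition number, comparing keys as strings.
lookup = {row["acquisitionNum"]: row for row in reduced}
matched = [
    s["acquisition.number"]
    for s in spectra
    if str(s["acquisition.number"]) in lookup
]

print([row["acquisitionNum"] for row in reduced])  # ['1', '2;2']
print(matched)  # [1]
```

Spectrum 2 has identifications, but its reduced key "2;2" no longer equals "2", so the join drops it.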

lgatto commented 1 year ago

Hi @ibphuangchen - thank you for getting in touch. I would suggest looking at the Spectra package, and more specifically at the joinSpectraData() function. Feel free to open an issue there to follow up as needed.

ibphuangchen commented 1 year ago

Thank you @lgatto. BTW, are you going to fix this issue in MSnbase?

lgatto commented 1 year ago

No, I won't make any changes to MSnbase. I think what you describe should be discussed/addressed differently - why are there multiple matches to one spectrum? This shouldn't really happen (unless you have chimeric matches); which one should be selected? I'm not sure this is an issue to be handled when adding the PSM data to the raw spectra; it should rather be addressed beforehand. I'm open to discussing this.

ibphuangchen commented 1 year ago

Hi @lgatto - your comments make very good sense to me. The problem I had was that when mzR is used to open the identification file (.pep.xml), the resulting data frame, for whatever reason, duplicates a peptide with N modifications (e.g. 57.02147 for cysteine, with N cysteines in the peptide) into N rows. Each row carries only a single modification with a specific "modMass" and "modLocation", whereas in reality all of these modifications are present on the peptide together. The same thing happens with MSnbase::readMzIdData.
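
One way to handle this before joining, sketched in plain Python rather than R: group the per-modification rows back into one row per PSM and keep the modifications as lists. The `collapse_modifications` helper, the `spectrumID` and `sequence` identifier columns, and the example rows are assumptions for illustration only; "modMass" and "modLocation" are the columns named above. This is not something mzR or MSnbase provides.

```python
def collapse_modifications(rows, id_columns, mod_columns):
    """Merge rows that describe the same PSM into a single row.

    Rows are grouped by the values in `id_columns`; the values in
    `mod_columns` are gathered into lists, one entry per original row.
    """
    collapsed = {}
    for row in rows:
        psm_key = tuple(row[c] for c in id_columns)
        if psm_key not in collapsed:
            entry = {c: row[c] for c in id_columns}
            for c in mod_columns:
                entry[c] = []
            collapsed[psm_key] = entry
        for c in mod_columns:
            collapsed[psm_key][c].append(row[c])
    return list(collapsed.values())


# Hypothetical rows: the first peptide has two modified cysteines and
# therefore appears twice, once per modification.
rows = [
    {"spectrumID": "scan=7", "sequence": "ACDCK", "modMass": 57.02147, "modLocation": 2},
    {"spectrumID": "scan=7", "sequence": "ACDCK", "modMass": 57.02147, "modLocation": 4},
    {"spectrumID": "scan=8", "sequence": "LCK", "modMass": 57.02147, "modLocation": 2},
]

psms = collapse_modifications(
    rows,
    id_columns=["spectrumID", "sequence"],
    mod_columns=["modMass", "modLocation"],
)

for psm in psms:
    print(psm["spectrumID"], psm["sequence"], psm["modLocation"])
```

After collapsing, each spectrum has a single row again, so a later join on the spectrum key does not have to deal with duplicated PSMs.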