lgatto / MSnbase

Base Classes and Functions for Mass Spectrometry and Proteomics
http://lgatto.github.io/MSnbase/

Error with readMzIdData on Comet output #596

Closed: 524D closed this issue 8 months ago

524D commented 8 months ago

Hi,

readMzIdData fails with the following error when reading mzIdentML (mzid) data produced by Comet:

Error in (function (cond)  : 
  error in evaluating the argument 'X' in selecting a method for function 'lapply': bad lexical cast: source type value could not be interpreted as target

After bisecting the mzID file, it appears there are two problems:

  1. Comet adds the line <cvParam cvRef="PSI-MS" accession="MS:1002500" name="peptide passes threshold" value="false" /> to the peptide scores. readMzIdData apparently does not accept the value "false", although it seems valid according to the schema.
  2. Comet puts multiple results for the same spectrum in separate SpectrumIdentificationResult tags. This seems to be invalid and I will issue a bug report for Comet. It would still be nice if readMzIdData could be a bit more permissive though.
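Both problems can be checked for in a file before it is handed to readMzIdData. The following is a minimal sketch, not part of MSnbase or Comet: it uses only the Python standard library, the function name `check_mzid` is hypothetical, and the embedded sample is a made-up fragment written for illustration (it is not the attached test data). It assumes that results for the same spectrum share both `spectrumID` and `spectraData_ref`.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import StringIO


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def check_mzid(source):
    """Scan an mzIdentML document (path or file object) for the two problems.

    Returns a pair:
      - the 'value' attributes of every cvParam with accession MS:1002500
      - the sorted (spectraData_ref, spectrumID) pairs that occur in more
        than one SpectrumIdentificationResult element
    """
    root = ET.parse(source).getroot()
    threshold_values = []
    seen = Counter()
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "cvParam" and elem.get("accession") == "MS:1002500":
            threshold_values.append(elem.get("value"))
        elif name == "SpectrumIdentificationResult":
            key = (elem.get("spectraData_ref", ""), elem.get("spectrumID", ""))
            seen[key] += 1
    duplicated = sorted(key for key, count in seen.items() if count > 1)
    return threshold_values, duplicated


# Illustrative fragment only: two results for the same spectrum, one of them
# carrying the "peptide passes threshold" cvParam.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SpectrumIdentificationList id="SIL_1">
    <SpectrumIdentificationResult id="SIR_1" spectrumID="scan=1" spectraData_ref="SD_1">
      <SpectrumIdentificationItem id="SII_1" rank="1" passThreshold="true">
        <cvParam cvRef="PSI-MS" accession="MS:1002500" name="peptide passes threshold" value="false" />
      </SpectrumIdentificationItem>
    </SpectrumIdentificationResult>
    <SpectrumIdentificationResult id="SIR_2" spectrumID="scan=1" spectraData_ref="SD_1">
      <SpectrumIdentificationItem id="SII_2" rank="2" passThreshold="true" />
    </SpectrumIdentificationResult>
  </SpectrumIdentificationList>
</MzIdentML>"""

threshold_values, duplicated = check_mzid(StringIO(SAMPLE))
print("MS:1002500 values:", threshold_values)
print("spectra with more than one result element:", duplicated)
```

A file that returns two empty lists contains neither of the constructs described above. That does not guarantee readMzIdData will read it.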

For demonstration, the attached ZIP file contains a minimized mzID file with only two peptides, plus a manually edited file in which the above problems are fixed.

testdata.zip

I'm using R version 4.3.1, MSnbase version 2.26.0, and Bioconductor version 3.17 on Windows 10.

Best, Rob

524D commented 8 months ago

In the latest (currently unreleased) version of Comet, problem 2 is fixed and the problematic tag described in problem 1 has been removed. Output produced by that Comet version can be read successfully with readMzIdData. Problem 1 still appears to be a bug in readMzIdData, but in practice everything works correctly with the fixed Comet version, and output from the unfixed version will probably never be readable. Therefore I'm closing this issue.

lgatto commented 8 months ago

Hi @524D - thank you very much for following up.

For problem 1, the cause might be that the underlying XML schema that readMzIdData() uses, which comes from ProteoWizard via mzR, isn't recent enough.

I would suggest you look at the PSMatch package for working with identification data. The PSM() constructor is the replacement for readMzIdData(). By default it won't fix problem 1, as both make use of mzR, but you might want to try the other (slower) backend, which is based on the mzID package instead. The PSMatch package should provide the existing MSnbase functionality (let me know if anything is missing) and more.