sneumann / mzR

This is the git repository matching the Bioconductor package mzR: parser for netCDF, mzXML, mzData and mzML files (mass spectrometry data)

Better handling of mzid files #168

Open lgatto opened 6 years ago

lgatto commented 6 years ago

This issue is related to #136, which is caused by a bug in the C++ RcppIndent::getScores() function: it counts the number of cvParams on the first SpectrumIdentificationItem, but then crashes if a later one has more. For example

[...]
            <cvParam accession="MS:1001171" name="Mascot:score" cvRef="PSI-MS" value="16.24" />
            <cvParam accession="MS:1001172" name="Mascot:expectation value" cvRef="PSI-MS" value="0.186044143910214" />
            <cvParam accession="MS:1001363" name="peptide unique to one protein" cvRef="PSI-MS" />
          </SpectrumIdentificationItem>
[...]

and then there are additional cvParams before the closing SpectrumIdentificationResult, which also get counted.

[...]
            <cvParam accession="MS:1001171" name="Mascot:score" cvRef="PSI-MS" value="16.24" />
            <cvParam accession="MS:1001172" name="Mascot:expectation value" cvRef="PSI-MS" value="0.186044143910214" />
            <cvParam accession="MS:1001363" name="peptide unique to one protein" cvRef="PSI-MS" />
          </SpectrumIdentificationItem>
          <cvParam accession="MS:1001371" name="Mascot:identity threshold"  cvRef="PSI-MS" value="26" />
          <cvParam accession="MS:1001370" name="Mascot:homology threshold"  cvRef="PSI-MS" value="21" />
          <cvParam accession="MS:1001030" name="number of peptide seqs compared to each spectrum"  cvRef="PSI-MS" value="477" />
          <cvParam accession="MS:1001114" name="retention time(s)"  cvRef="PSI-MS" value="1699.214471" unitAccession="UO:0000010" unitName="second" unitCvRef="UO" />
          <cvParam accession="MS:1000796" name="spectrum title"  cvRef="PSI-MS" value="HeLa_180129_25c15W_r1.14170.14170.2 File:&quot;HeLa_180129_25c15W_r1.raw&quot;, NativeID:&quot;controllerType=0 controllerNumber=1 scan=14170&quot;" />
        </SpectrumIdentificationResult>
[...]

But more generally, it also assumes that the cvParams in each SpectrumIdentificationItem are in the same order, which I am not sure is guaranteed.

Ideally, RcppIndent::getScores() should ignore the additional cvParams that sit outside of SpectrumIdentificationItem, and possibly check the order/names of the cvParams inside each item.
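
To make the first point concrete, here is a minimal sketch in Python (not the package's C++, and not how mzR does it) of counting only the cvParams that are direct children of a SpectrumIdentificationItem. The fragment is a shortened, namespace-free version of the excerpt above; the `id` values and the helper name `item_cv_params` are made up for illustration, and a real mzid file declares an XML namespace that the lookups would need to account for.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the excerpt above. The id attributes are
# invented and the other SpectrumIdentificationItem attributes are left out.
FRAGMENT = """
<SpectrumIdentificationResult id="SIR_1">
  <SpectrumIdentificationItem id="SII_1_1">
    <cvParam accession="MS:1001171" name="Mascot:score" cvRef="PSI-MS" value="16.24" />
    <cvParam accession="MS:1001172" name="Mascot:expectation value" cvRef="PSI-MS" value="0.186044143910214" />
    <cvParam accession="MS:1001363" name="peptide unique to one protein" cvRef="PSI-MS" />
  </SpectrumIdentificationItem>
  <cvParam accession="MS:1001371" name="Mascot:identity threshold" cvRef="PSI-MS" value="26" />
  <cvParam accession="MS:1001370" name="Mascot:homology threshold" cvRef="PSI-MS" value="21" />
</SpectrumIdentificationResult>
""".strip()


def item_cv_params(item):
    """Return only the cvParams that are direct children of one item."""
    # findall() with a bare tag name matches direct children only.
    return item.findall("cvParam")


result = ET.fromstring(FRAGMENT)
item = result.find("SpectrumIdentificationItem")

# Direct children of the item: the three score-related cvParams.
n_item = len(item_cv_params(item))

# Every cvParam anywhere below the result, including the two that follow
# the closing SpectrumIdentificationItem tag.
n_all = len(list(result.iter("cvParam")))

print(n_item, n_all)
```

The two counts differ (3 versus 5 for this fragment), which is the kind of mismatch described above when the result-level cvParams are included.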

lgatto commented 6 years ago

> But more generally, it also assumes that the cvParams in each SpectrumIdentificationItem are in the same order, which I am not sure is guaranteed.

Indeed, I got confirmation that there is no guarantee that the order is maintained.
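
Since the order cannot be relied upon, one option is to align the values by accession rather than by position. The following is only a sketch of that idea in Python, not a proposal for the actual C++ code: the fragment, the `id` values, the second item's values, and the helper `score_table` are all invented for illustration, and the namespace of a real mzid file is ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two items with the same cvParams in a different
# order, and one cvParam that only the second item has. Values are made up.
FRAGMENT = """
<SpectrumIdentificationResult id="SIR_1">
  <SpectrumIdentificationItem id="SII_1_1">
    <cvParam accession="MS:1001171" name="Mascot:score" cvRef="PSI-MS" value="16.24" />
    <cvParam accession="MS:1001172" name="Mascot:expectation value" cvRef="PSI-MS" value="0.186" />
  </SpectrumIdentificationItem>
  <SpectrumIdentificationItem id="SII_1_2">
    <cvParam accession="MS:1001363" name="peptide unique to one protein" cvRef="PSI-MS" />
    <cvParam accession="MS:1001172" name="Mascot:expectation value" cvRef="PSI-MS" value="0.5" />
    <cvParam accession="MS:1001171" name="Mascot:score" cvRef="PSI-MS" value="12.1" />
  </SpectrumIdentificationItem>
</SpectrumIdentificationResult>
""".strip()


def score_table(result):
    """Align item-level cvParam values by accession instead of by position.

    Returns (columns, rows). A cell is None when the item lacks that cvParam
    and "" when the cvParam is present but carries no value attribute.
    """
    per_item = [
        {p.get("accession"): p.get("value", "") for p in item.findall("cvParam")}
        for item in result.findall("SpectrumIdentificationItem")
    ]
    # Union of all accessions, in order of first appearance.
    columns = []
    for params in per_item:
        for accession in params:
            if accession not in columns:
                columns.append(accession)
    rows = [[params.get(accession) for accession in columns] for params in per_item]
    return columns, rows


columns, rows = score_table(ET.fromstring(FRAGMENT))
print(columns)
for row in rows:
    print(row)
```

Keying on the accession means that an item with extra or reordered cvParams simply fills different columns, instead of overrunning a size fixed by the first item.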