levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

mzML reader for RT ambiguous units #56

Closed: mwang87 closed this issue 2 years ago

mwang87 commented 2 years ago

It is not clear how to distinguish seconds from minutes in the retention time for scans in mzML files. The unit is specified in the mzML file itself.

Here the unitName is "second":

<scanList count="1">
  <cvParam cvRef="MS" accession="MS:1000795" name="no combination" value=""/>
  <scan instrumentConfigurationRef="_x0031_">
    <cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="2.01" unitCvRef="UO" unitAccession="UO:0000010" unitName="second"/>
    <scanWindowList count="1">
      <scanWindow>
        <cvParam cvRef="MS" accession="MS:1000501" name="scan window lower limit" value="410.530883789063" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
        <cvParam cvRef="MS" accession="MS:1000500" name="scan window upper limit" value="410.530883789063" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
      </scanWindow>
    </scanWindowList>
  </scan>
</scanList>

However, when iterating through the spectra with the mzML reader, the scan comes back as a data structure like this:

{'instrumentConfigurationRef': '_x0031_', 'scanWindowList': {'count': 1, 'scanWindow': [{'scan window lower limit': 410.530883789063, 'scan window upper limit': 410.530883789063}]}, 'scan start time': 2.01}

This includes the scan start time, but not the unit.

mobiusklein commented 2 years ago

Hello,

It's likely that this is one of those features that I failed to document in Sphinx (or perhaps at all). Whenever pyteomics parses a cvParam with a unit, instead of converting the value of the param into a plain primitive like float or str, it will be converted into a unitfloat or unitstr which has an extra attribute unit_info. The unit_info will contain the unit name, or its accession code if the name is omitted.

I wanted to include the name of the unit in the repr, but that broke some libraries so it is only in the _repr_pretty_ hook used by IPython.
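
As a rough illustration of the idea (not the actual pyteomics implementation), a float subclass can carry a unit_info attribute while keeping the ordinary float repr. The class name UnitFloat below is hypothetical and only mimics the behaviour described above; the _repr_pretty_ method is the hook that IPython's pretty printer looks for.

```python
class UnitFloat(float):
    """Hypothetical stand-in for a float that remembers its unit.

    This is a sketch of the concept only, not the pyteomics class.
    """

    def __new__(cls, value, unit_info=None):
        obj = super().__new__(cls, value)
        # Attach the unit name (or accession) as plain instance data.
        obj.unit_info = unit_info
        return obj

    def _repr_pretty_(self, p, cycle):
        # IPython's pretty printer calls this hook; repr() is left untouched
        # so code that parses or compares the plain repr keeps working.
        text = float.__repr__(self)
        if self.unit_info:
            text += " " + self.unit_info
        p.text(text)


rt = UnitFloat(0.004935, "minute")
print(repr(rt))        # the plain float repr, with no unit in it
print(rt.unit_info)    # the unit travels with the value as an attribute
print(type(rt * 60))   # in this sketch, arithmetic yields a plain float
```

Because the value is still a float, it can be used anywhere a number is expected; the unit is only visible if you ask for unit_info or view the value in IPython.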

Given the XML:

<scanList count="1">
  <cvParam cvRef="MS" accession="MS:1000795" name="no combination" value=""/>
  <scan>
    <cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="0.004935" unitCvRef="UO" unitAccession="UO:0000031" unitName="minute"/>
    <cvParam cvRef="MS" accession="MS:1000512" name="filter string" value="FTMS + p ESI Full ms [200.00-2000.00]"/>
    <cvParam cvRef="MS" accession="MS:1000616" name="preset scan configuration" value="1"/>
    <cvParam cvRef="MS" accession="MS:1000927" name="ion injection time" value="68.227485656738" unitCvRef="UO" unitAccession="UO:0000028" unitName="millisecond"/>
    <scanWindowList count="1">
      <scanWindow>
        <cvParam cvRef="MS" accession="MS:1000501" name="scan window lower limit" value="200.0" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
        <cvParam cvRef="MS" accession="MS:1000500" name="scan window upper limit" value="2000.0" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
      </scanWindow>
    </scanWindowList>
  </scan>
</scanList>

You get the following dict:

{'count': 1,
 'scan': [{'scanWindowList': {'count': 1,
    'scanWindow': [{'scan window lower limit': 200.0 m/z,
      'scan window upper limit': 2000.0 m/z}]},
   'scan start time': 0.004935 minute,
   'filter string': 'FTMS + p ESI Full ms [200.00-2000.00]',
   'preset scan configuration': 1.0,
   'ion injection time': 68.227485656738 millisecond}],
 'no combination': ''}

To access the unit, you might write something like this:

>>> scan['scanList']['scan'][0]['scan start time'].unit_info
'minute'
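
Since different files may report the scan start time in different units, one option is to normalize everything to seconds using unit_info. The sketch below is an assumption about how you might do that, not part of pyteomics: to_seconds, SECONDS_PER_UNIT and DemoUnitFloat are hypothetical names, and the table only covers the unit names and UO accessions that appear in this thread.

```python
# Conversion factors keyed by unit name and by UO accession, because
# unit_info may hold the accession when the unit name is omitted.
SECONDS_PER_UNIT = {
    "second": 1.0, "UO:0000010": 1.0,
    "minute": 60.0, "UO:0000031": 60.0,
    "millisecond": 0.001, "UO:0000028": 0.001,
}


def to_seconds(value):
    """Convert a unit-carrying time value to a plain float in seconds."""
    unit = getattr(value, "unit_info", None)
    if unit is None:
        # Refuse to guess: a bare number gives no hint about its unit.
        raise ValueError("value carries no unit_info; cannot infer its time unit")
    if unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported time unit: {unit!r}")
    return float(value) * SECONDS_PER_UNIT[unit]


class DemoUnitFloat(float):
    """Hypothetical stand-in for a parsed value, so the demo runs on its own."""

    def __new__(cls, value, unit_info=None):
        obj = super().__new__(cls, value)
        obj.unit_info = unit_info
        return obj


print(to_seconds(DemoUnitFloat(2.01, "second")))
print(to_seconds(DemoUnitFloat(0.5, "UO:0000031")))
```

With a real file, the same helper should accept the value returned by scan['scanList']['scan'][0]['scan start time'] directly, since it only relies on the unit_info attribute.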

mwang87 commented 2 years ago

Awesome, will give it a try! Was doing a double parse with pymzml and it was not a good time.

mobiusklein commented 2 years ago

@mwang87 I added documentation to describe how units are handled at https://pyteomics.readthedocs.io/en/latest/data.html#unit-handling. Does this sufficiently describe them for your purposes?

mwang87 commented 2 years ago

@mobiusklein This is great. This would have cleared it up the first time around (no worries, my own documentation is not great!).

But overall, it worked like a charm. Thanks so much for being awesome!

Best,

Ming