levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

The issue of secondary spectrum #115

Closed morestart closed 1 year ago

morestart commented 1 year ago

If the mzML file contains a "spectrumList" element, the data can be parsed, but if the element is "chromatogramList", the data cannot be parsed.

morestart commented 1 year ago

Data example:

<run id="WD_x0020_tip-YP-SD116-CH3CH2OH-01NH4OH-QX-EPA-MRM-1" defaultInstrumentConfigurationRef="IC1" startTimeStamp="2023-05-19T15:42:21Z" defaultSourceFileRef="MSScan.bin">
      <chromatogramList count="35" defaultDataProcessingRef="pwiz_Reader_Agilent_conversion">
        <chromatogram index="0" id="TIC" defaultArrayLength="2381">
          <cvParam cvRef="MS" accession="MS:1000235" name="total ion current chromatogram" value=""/>
          <binaryDataArrayList count="3">
            <binaryDataArray encodedLength="18548">
              <cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" value=""/>
              <cvParam cvRef="MS" accession="MS:1000574" name="zlib compression" value=""/>
              <cvParam cvRef="MS" accession="MS:1000595" name="time array" value="" unitCvRef="UO" unitAccession="UO:0000031" unitName="minute"/>
              <binary></binary>
            </binaryDataArray>
mobiusklein commented 1 year ago

It's not clear how you're trying to use the library. If you can share a snippet of code, that would help us understand what's not working as expected.

My guess at what you are doing: by default, the MzML class iterates over spectrum elements. To access chromatograms, you need to ask for them explicitly by calling the iterfind method:

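from pyteomics.mzml import MzML

reader = MzML("path/to/file.mzML")

# Iterating over the reader itself yields spectra only;
# chromatograms have to be requested by element name.
for chrom in reader.iterfind("chromatogram"):
    print(chrom)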
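
To illustrate why iterating over spectra comes back empty for a file like this one, here is a minimal sketch that uses only the standard library rather than pyteomics. The inline document and the helper `element_ids` are made up for illustration; only the mzML namespace and the element names come from the format itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by mzML documents.
MZML_NS = "{http://psi.hupo.org/ms/mzml}"

# Hypothetical, heavily trimmed document: a run with chromatograms and no spectra.
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo">
    <chromatogramList count="2">
      <chromatogram index="0" id="TIC" defaultArrayLength="3"/>
      <chromatogram index="1" id="SRM 1" defaultArrayLength="3"/>
    </chromatogramList>
  </run>
</mzML>"""


def element_ids(xml_text, tag):
    """Return the id attribute of every element with the given local tag name."""
    root = ET.fromstring(xml_text)
    return [el.get("id") for el in root.iter(MZML_NS + tag)]


spectrum_ids = element_ids(SAMPLE, "spectrum")
chromatogram_ids = element_ids(SAMPLE, "chromatogram")

print(spectrum_ids)      # [] because there is no spectrumList to walk
print(chromatogram_ids)  # ['TIC', 'SRM 1']
```

A reader that looks only for spectrum elements finds nothing here, which is why the chromatogram element has to be named explicitly.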
morestart commented 1 year ago

You can find this file at https://github.com/morestart/TAFA-LAMS/blob/master/assets/WD%20tip-YP-SD116-CH3CH2OH-01NH4OH-QX-EPA-MRM-1.mzML

Your code does not work. I got this error: TypeError: only size-1 arrays can be converted to Python scalars

levitsky commented 1 year ago

Hi @morestart, the link doesn't seem to work. Maybe the repo is private?

morestart commented 1 year ago

Sorry, the repository is public now.

levitsky commented 1 year ago

Thank you! Full traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 for chrom in f.iterfind("chromatogram"):
      2     print(chrom)

File ~/py/pyteomics/pyteomics/xml.py:1239, in Iterfind.__next__(self)
   1237 if self._iterator is None:
   1238     self._iterator = self._make_iterator()
-> 1239 return next(self._iterator)

File ~/py/pyteomics/pyteomics/xml.py:1306, in IndexedIterfind._yield_from_index(self)
   1304 def _yield_from_index(self):
   1305     for key in self._task_map_iterator():
-> 1306         yield self.parser.get_by_id(key, **self.config)

File ~/py/pyteomics/pyteomics/auxiliary/file_helpers.py:84, in _keepstate_method.<locals>.wrapped(self, *args, **kwargs)
     82 self.seek(0)
     83 try:
---> 84     return func(self, *args, **kwargs)
     85 finally:
     86     self.seek(position)

File ~/py/pyteomics/pyteomics/xml.py:1151, in IndexedXML.get_by_id(self, elem_id, id_key, element_type, **kwargs)
   1149 except (KeyError, AttributeError, etree.LxmlError):
   1150     elem = self._find_by_id_reset(elem_id, id_key=id_key)
-> 1151 data = self._get_info_smart(elem, **kwargs)
   1152 return data

File ~/py/pyteomics/pyteomics/mzml.py:327, in MzML._get_info_smart(self, element, **kw)
    323     info = self._get_info(element,
    324             recursive=(rec if rec is not None else False),
    325             **kwargs)
    326 else:
--> 327     info = self._get_info(element,
    328             recursive=(rec if rec is not None else True),
    329             **kwargs)
    330 if 'binary' in info and isinstance(info, dict):
    331     info = self._handle_binary(info, **kwargs)

File ~/py/pyteomics/pyteomics/xml.py:433, in XML._get_info(self, element, **kwargs)
    431 else:
    432     if cname not in schema_info['lists']:
--> 433         info[cname] = self._get_info_smart(child, ename=cname, **kwargs)
    434     else:
    435         info.setdefault(cname, []).append(
    436             self._get_info_smart(child, ename=cname, **kwargs))

File ~/py/pyteomics/pyteomics/mzml.py:327, in MzML._get_info_smart(self, element, **kw)
    323     info = self._get_info(element,
    324             recursive=(rec if rec is not None else False),
    325             **kwargs)
    326 else:
--> 327     info = self._get_info(element,
    328             recursive=(rec if rec is not None else True),
    329             **kwargs)
    330 if 'binary' in info and isinstance(info, dict):
    331     info = self._handle_binary(info, **kwargs)

File ~/py/pyteomics/pyteomics/xml.py:436, in XML._get_info(self, element, **kwargs)
    433                 info[cname] = self._get_info_smart(child, ename=cname, **kwargs)
    434             else:
    435                 info.setdefault(cname, []).append(
--> 436                     self._get_info_smart(child, ename=cname, **kwargs))
    437 else:
    438     # handle the case where we do not want to unpack all children, but
    439     # *Param tags are considered part of the current entity, semantically
    440     for child in self._find_immediate_params(element, **kwargs):

File ~/py/pyteomics/pyteomics/mzml.py:339, in MzML._get_info_smart(self, element, **kw)
    337 for k in intkeys:
    338     if k in info:
--> 339         info[k] = int(info[k])
    340 return info

TypeError: only size-1 arrays can be converted to Python scalars

The offending key is "ms level"; it appears that in this file "ms level" is a non-standard data array.

morestart commented 1 year ago

Any solutions?

levitsky commented 1 year ago

I suppose we can just add an exception handling clause around this statement. The latest commit in master implements this.
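
The idea can be sketched roughly as follows. This is not the actual commit: the helper name `coerce_int_keys`, the default `intkeys` tuple, and the exceptions caught are assumptions, and a plain list stands in for the numpy array so that the example needs only the standard library.

```python
def coerce_int_keys(info, intkeys=("ms level",)):
    """Convert selected values to int, leaving values that cannot be converted untouched."""
    for key in intkeys:
        if key in info:
            try:
                info[key] = int(info[key])
            except (TypeError, ValueError):
                # The value is not a scalar, e.g. an array stored under
                # a name normally used for a scalar; keep it as it is.
                pass
    return info


# Usual case: the value is a scalar read from a cvParam.
scalar_case = coerce_int_keys({"ms level": "2"})

# Case from this file: an array is stored under the same key.
array_case = coerce_int_keys({"ms level": [1.0, 2.0, 2.0]})

print(scalar_case)  # {'ms level': 2}
print(array_case)   # {'ms level': [1.0, 2.0, 2.0]}
```

The scalar is still converted, while the array passes through unchanged instead of raising.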

morestart commented 1 year ago

I suppose we can just add an exception handling clause around this statement. The latest commit in master implements this.

Yes, I updated my code to match your commit, and it works. Thanks for your work!

nkitagawa-venn commented 1 year ago

I suppose we can just add an exception handling clause around this statement. The latest commit in master implements this.

@levitsky I came across this issue today as well. Another way to fix this might be to change info[k] = int(info[k]) into info[k] = info[k].astype(int), which would more directly get across the transformation you are trying to apply. What do you think?

mobiusklein commented 1 year ago

The majority of values in info are not numpy objects, so astype isn't available for them. This particular case is being caused by the expectation that ms level is referring to the CV term MS:1000511, which is a scalar value, but it is being used as a non-standard array name (if the custom array were named ms level array, this wouldn't be an issue).

Binary data arrays, even non-standard ones, have their types explicitly given by a CV term already, so they don't need to be coerced the way other cvParam values and attribute values are.
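
As a rough illustration of that last point, the sketch below decodes a binary data array using only the type and compression terms attached to it, such as the `64-bit float` and `zlib compression` cvParams in the example file. It is a standard-library stand-in, not pyteomics code: the helper `decode_binary`, its arguments, and the round-trip data are assumptions, and it relies on mzML storing arrays as base64-encoded little-endian values.

```python
import base64
import struct
import zlib


def decode_binary(text, cv_names):
    """Decode base64 array text according to the CV term names attached to it."""
    raw = base64.b64decode(text)
    if "zlib compression" in cv_names:
        raw = zlib.decompress(raw)
    if "64-bit float" in cv_names:
        code, size = "d", 8
    elif "32-bit float" in cv_names:
        code, size = "f", 4
    else:
        raise ValueError("unsupported or missing data type term")
    count = len(raw) // size
    # "<" selects little-endian byte order.
    return list(struct.unpack("<%d%s" % (count, code), raw))


# Round trip with made-up values, encoded the same way as the arrays in the file.
values = [0.0, 0.5, 1.25]
encoded = base64.b64encode(zlib.compress(struct.pack("<3d", *values))).decode("ascii")
decoded = decode_binary(encoded, {"64-bit float", "zlib compression", "time array"})

print(decoded)  # [0.0, 0.5, 1.25]
```

The element type comes entirely from the CV terms, so no later `int(...)` style coercion is needed for values decoded this way.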

nkitagawa-venn commented 1 year ago

Thank you @mobiusklein for the explanation!