levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Unrecoverable case with non-standard data array missing non-standard data array declaration in mzML #5

Closed. mobiusklein closed this issue 4 years ago

mobiusklein commented 4 years ago

Trying to parse a Waters Apex3D-generated mzML file crashes with a misleading error:

/site-packages/pyteomics/mzml.pyc in _handle_binary(self, info, **kwargs)
    237         dtype = self._determine_array_dtype(info)
    238         compressed = self._determine_compression(info)
--> 239         name = self._detect_array_name(info)
    240         binary = info.pop('binary')
    241         if not self.decode_binary:
/site-packages/pyteomics/mzml.pyc in _detect_array_name(self, info)
    137         candidates = []
    138         for k in info:
--> 139             if k.endswith(' array') and not info[k]:
    140                 if NON_STANDARD_DATA_ARRAY == k:
    141                     is_non_standard = True

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This happens because one of the data arrays is neither declared as a non-standard data array nor given a userParam whose name ends in "array", so our recovery strategy does not apply. As a result, the name for that array is returned as "binary".

<binaryDataArrayList count="4">
<binaryDataArray encodedLength="32" dataProcessingRef="ProteinLynx_Global_Server_data_processing">
<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<cvParam cvRef="MS" accession="MS:1000514" name="m/z array" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z" />
<binary>AAAAYC5hbEAAAABgVd9hQAAAAIDEwGxA</binary>
</binaryDataArray>
<binaryDataArray encodedLength="32" dataProcessingRef="ProteinLynx_Global_Server_data_processing">
<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<cvParam cvRef="MS" accession="MS:1000515" name="intensity array" unitCvRef="MS" unitAccession="MS:1000131" unitName="number of counts" />
<binary>AAAAAADAcEAAAAAAAFBxQAAAAAAA4H1A</binary>
</binaryDataArray>
<binaryDataArray encodedLength="32" dataProcessingRef="ProteinLynx_Global_Server_data_processing">
<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<cvParam cvRef="MS" accession="MS:1000516" name="charge array" />
<binary>AAAAAAAA8D8AAAAAAADwPwAAAAAAAPA/</binary>
</binaryDataArray>
<binaryDataArray encodedLength="32" dataProcessingRef="ProteinLynx_Global_Server_data_processing">
<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<cvParam cvRef="MS" accession="MS:1000595" name="time array" />
<binary>AAAAYDxf1T8AAAAg6rXTPwAAAOBl5tM/</binary>
</binaryDataArray>
<binaryDataArray encodedLength="32" dataProcessingRef="ProteinLynx_Global_Server_data_processing">
<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<userParam name="drift time in bins" type="float" />
<binary>AAAAoG2QOEAAAAAg6Ms3QAAAAMAeTjhA</binary>
</binaryDataArray>
</binaryDataArrayList>
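For reference, the payloads above can be inspected with the standard library alone. This is a minimal sketch, assuming the declared "64-bit float" and "no compression" parameters and the little-endian byte order that mzML uses; the helper name `decode_float64` is made up for illustration and is not part of pyteomics.

```python
import base64
import struct


def decode_float64(payload):
    """Decode an uncompressed, base64-encoded run of little-endian doubles."""
    raw = base64.b64decode(payload)
    count = len(raw) // 8  # 8 bytes per 64-bit float
    return struct.unpack('<%dd' % count, raw)


# Payloads copied from the "charge array" and "drift time in bins" entries above.
charge = decode_float64("AAAAAAAA8D8AAAAAAADwPwAAAAAAAPA/")
drift = decode_float64("AAAAoG2QOEAAAAAg6Ms3QAAAAMAeTjhA")

print(charge)      # (1.0, 1.0, 1.0)
print(len(drift))  # 3
```

The last array decodes just like the others; only its name is missing from the controlled vocabulary, which is what trips up the name detection.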

Because the info dict now contains a "binary" key, the next time this object goes through _get_info_smart, the whole spectrum is passed to _handle_binary again, and that is when the error is raised.
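The second pass can be reproduced in isolation. The sketch below is not pyteomics code: `FakeArray` is a hypothetical stand-in that mimics how a decoded numpy array refuses to be used as a boolean, and the dicts assume that valueless cvParams show up as empty strings. Only the `k.endswith(' array') and not info[k]` check is taken from the traceback.

```python
class FakeArray:
    """Hypothetical stand-in for a decoded numpy array (ambiguous truth value)."""

    def __init__(self, values):
        self.values = list(values)

    def __bool__(self):
        if len(self.values) > 1:
            raise ValueError("The truth value of an array with more than "
                             "one element is ambiguous.")
        return bool(self.values and self.values[0])


def undecoded_array_keys(info):
    """Apply the same check as line 139 of the traceback to every key."""
    return [k for k in info if k.endswith(' array') and not info[k]]


# First pass: the array name is a bare parameter, the payload is still text.
first_pass = {'64-bit float': '', 'm/z array': '', 'binary': 'AAAA'}
# Second pass: the named array was already decoded, and the unnamed one
# was stored under "binary", which triggers the re-entry.
second_pass = {'m/z array': FakeArray([1.0, 2.0, 3.0]),
               'binary': FakeArray([4.0, 5.0, 6.0])}

print(undecoded_array_keys(first_pass))
try:
    undecoded_array_keys(second_pass)
except ValueError as exc:
    print("second pass fails:", exc)
```

On the first pass the check sees an empty string and behaves as intended; on the second pass it evaluates `not` on an already-decoded array, which is where the misleading ValueError comes from.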

There are a lot of comments on _detect_array_name, one of which explicitly states that returning "binary" signals special handling elsewhere: https://github.com/levitsky/pyteomics/blob/master/pyteomics/mzml.py#L157-L165. I think I added something around this to support a different Waters-generated mzML file three or four years ago, but without the malformed file from then, I don't know what I was trying to recover from. I'm going to work a bit harder on accepting any valid parameter name as the array name.
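To make the idea of "taking any valid parameter name" concrete, here is a rough sketch of one possible fallback order. It is an illustration under stated assumptions, not the change that was merged: `guess_array_name` and `IGNORED_KEYS` are hypothetical names, the ignore list is an incomplete guess at the keys that describe encoding or XML attributes rather than content, and the info dict is assumed to map parameter names to values (empty string when the parameter has no value).

```python
NON_STANDARD_DATA_ARRAY = 'non-standard data array'

# Hypothetical, incomplete list of keys that describe how the data is
# stored rather than what it contains.
IGNORED_KEYS = {
    'binary', 'encodedLength', 'dataProcessingRef', 'arrayLength',
    '64-bit float', '32-bit float', '64-bit integer', '32-bit integer',
    'no compression', 'zlib compression', NON_STANDARD_DATA_ARRAY,
}


def guess_array_name(info):
    """Pick a name for a binary data array, falling back step by step."""
    # 1. A standard "... array" term wins.
    for key in info:
        if key.endswith(' array') and key != NON_STANDARD_DATA_ARRAY:
            return key
    # 2. A declared non-standard array may carry its name as the value.
    if info.get(NON_STANDARD_DATA_ARRAY):
        return info[NON_STANDARD_DATA_ARRAY]
    # 3. Otherwise accept any remaining parameter name, e.g. a userParam.
    for key in info:
        if key not in IGNORED_KEYS:
            return key
    # 4. Nothing usable: keep the old sentinel.
    return 'binary'


drift_info = {'encodedLength': '32', '64-bit float': '', 'no compression': '',
              'drift time in bins': '', 'binary': 'AAAAoG2QOEAAAAAg6Ms3QAAAAMAeTjhA'}
print(guess_array_name(drift_info))  # drift time in bins
```

With a fallback like step 3, the fifth array in the snippet would be stored under "drift time in bins" instead of "binary", so the re-entry described above would not be triggered for this file.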

levitsky commented 4 years ago

Thank you. I think I follow the logic and it works for me with your snippet. Merging.

P.S. I wonder how soon we broke that return 'binary' path after adding it :)