levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

RE #94

Closed mafreitas closed 1 year ago

mafreitas commented 1 year ago

I am trying to read a Waters file that was converted by ProteoWizard and supplied by a client. Each binary data array is named through a reference to a param group rather than by an inline cvParam. When I try to read the file, I get:

pyteomics\mzml.py:210: UserWarning: Multiple options for naming binary array after no valid name found: ['ref']

Is there an option to read the param group info for the binary data arrays that I am missing?

Snippets from the mzML file are below for reference.

        <referenceableParamGroup id="mz_params">
            <cvParam accession="MS:1000514" cvRef="MS" name="m/z array"
                unitAccession="MS:1000040" unitCvRef="MS" unitName="m/z"/>
            <cvParam accession="MS:1000523" cvRef="MS" name="64-bit float"/>
            <cvParam accession="MS:1000576" cvRef="MS" name="no compression"/>
        </referenceableParamGroup>
        <referenceableParamGroup id="int_params">
            <cvParam accession="MS:1000515" cvRef="MS"
                name="intensity array" unitAccession="MS:1000131"
                unitCvRef="MS" unitName="number of counts"/>
            <cvParam accession="MS:1000523" cvRef="MS" name="64-bit float"/>
            <cvParam accession="MS:1000576" cvRef="MS" name="no compression"/>
        </referenceableParamGroup>
        <referenceableParamGroup id="charge_params">
            <cvParam accession="MS:1000516" cvRef="MS" name="charge array"/>
            <cvParam accession="MS:1000523" cvRef="MS" name="64-bit float"/>
            <cvParam accession="MS:1000576" cvRef="MS" name="no compression"/>
        </referenceableParamGroup>
    </referenceableParamGroupList>
                <binaryDataArrayList count="3">
                    <binaryDataArray dataProcessingRef="PLGS_processing" encodedLength="26848">
                        <referenceableParamGroupRef ref="mz_params"/>
                        <binary> ... </binary>
                    </binaryDataArray>
                    <binaryDataArray dataProcessingRef="PLGS_processing" encodedLength="26848">
                        <referenceableParamGroupRef ref="int_params"/>
                        <binary> ... </binary>
                    </binaryDataArray>
                    <binaryDataArray dataProcessingRef="PLGS_processing" encodedLength="0">
                        <referenceableParamGroupRef ref="charge_params"/>
                        <binary> ... </binary>
                    </binaryDataArray>
                </binaryDataArrayList>

levitsky commented 1 year ago

It seems to me that _detect_array_name does not know how to do this. My understanding is that we need to

However, I may be misunderstanding something entirely. @mobiusklein would you be able to take a look?

mobiusklein commented 1 year ago

Really, we ought to handle referenceableParamGroupRef the same way we handle cvParam. I see the path forward. Give me 30 minutes, a bandsaw, a left-handed spanner, and a sprig of thyme, and we'll see what breaks.
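
To illustrate the idea of treating a referenceableParamGroupRef like the cvParam entries it points to, here is a minimal standard-library sketch. It is not the pyteomics implementation: the helper name `resolved_param_names` and the trimmed-down, namespace-free document are assumptions made for illustration (real mzML files declare a default XML namespace, which this sketch ignores).

```python
import xml.etree.ElementTree as ET

# Trimmed-down, namespace-free stand-in for the snippets in this issue
# (an assumption for illustration; real mzML uses a default namespace).
DOC = """
<mzML>
  <referenceableParamGroupList>
    <referenceableParamGroup id="mz_params">
      <cvParam accession="MS:1000514" cvRef="MS" name="m/z array"/>
      <cvParam accession="MS:1000523" cvRef="MS" name="64-bit float"/>
    </referenceableParamGroup>
    <referenceableParamGroup id="int_params">
      <cvParam accession="MS:1000515" cvRef="MS" name="intensity array"/>
      <cvParam accession="MS:1000523" cvRef="MS" name="64-bit float"/>
    </referenceableParamGroup>
  </referenceableParamGroupList>
  <binaryDataArrayList count="2">
    <binaryDataArray encodedLength="0">
      <referenceableParamGroupRef ref="mz_params"/>
      <binary/>
    </binaryDataArray>
    <binaryDataArray encodedLength="0">
      <referenceableParamGroupRef ref="int_params"/>
      <binary/>
    </binaryDataArray>
  </binaryDataArrayList>
</mzML>
"""


def resolved_param_names(root):
    """Return, per binaryDataArray, the cvParam names after resolving group refs."""
    # Index every param group by its id so references can be looked up.
    groups = {g.get("id"): g for g in root.iter("referenceableParamGroup")}
    result = []
    for array in root.iter("binaryDataArray"):
        # Start with cvParams written directly inside the array, if any.
        names = [p.get("name") for p in array.findall("cvParam")]
        # Then splice in the cvParams of each referenced group.
        for ref in array.findall("referenceableParamGroupRef"):
            group = groups[ref.get("ref")]
            names.extend(p.get("name") for p in group.findall("cvParam"))
        result.append(names)
    return result


root = ET.fromstring(DOC)
names = resolved_param_names(root)
print(names)
```

Once the references are expanded, each array carries a recognizable name ("m/z array", "intensity array"), which is the information the warning above reports as missing.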

mafreitas commented 1 year ago

If you need a file, I can set up a shared folder on Google.

mobiusklein commented 1 year ago

Short-term solution that works on master: when you create your parser, pass retrieve_refs=True.
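
A minimal sketch of that workaround, assuming the parser is created through `pyteomics.mzml.MzML` and that the resolved arrays appear under the keys "m/z array" and "intensity array". The helper name `iter_arrays` and the file path in the usage note are hypothetical.

```python
def iter_arrays(path):
    """Yield (m/z array, intensity array) pairs for each spectrum in an mzML file."""
    # Imported here so that defining this helper does not itself require pyteomics.
    from pyteomics import mzml

    # retrieve_refs=True asks the parser to resolve referenceableParamGroupRef
    # elements, so each binary array can be named from its param group.
    with mzml.MzML(path, retrieve_refs=True) as reader:
        for spectrum in reader:
            # .get() is used because the key names are an assumption of this sketch.
            yield spectrum.get("m/z array"), spectrum.get("intensity array")
```

It could then be used as `for mz, intensity in iter_arrays("converted_waters.mzML"): ...`, where the file name is a placeholder.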

The longer-term solution, which won't require changes to your code (but could be backwards incompatible), will come in an upcoming PR.

mobiusklein commented 1 year ago

@mafreitas please take a look at PR #95 and see if that branch solves your issue for you. I was able to download your example files and read them without issue.

mafreitas commented 1 year ago

I can confirm that it does work. Thank you.