exposing more flexibility in `_mzml._build_offset_list_non_indexed`

Hi @griquelme - first, thanks for writing tidyms, it's a great package. I have been searching for a usable python library to build with.

The issue I am currently having is in parsing my mzml files. My LCMS stack has an Agilent LC (1290 I) and a Thermo MS (Orbitrap QE). As a consequence of this setup, the raw files and mzml files that are written by my system contain additional spectrum elements containing UV/VIS, PDA, and pressure data (in addition to the expected M/Z and intensity data).

When _mzml._build_offset_list_non_indexed searches for spectrum offsets the spectrum_regex identifies these non-MS data spectra. As a result, the returned spectrum_offset_list contains too many elements. In my test data it returns 5891 offsets. This is the sum of the offsets associated with my recorded MS scans (2891) and the offsets associated with my PDA detector data (3000).

This causes an error when data gets read by fileio.MSData.get_spectrum. In my case, this error is produced because the parser doesn't find a string name for the wavelength data array from the PDA. Below is a print out of the data from the spectrum iterator. It fails when it encounters the None name for the PDA wavelength data.

{'ms_level': 1, 'polarity': 1, 'time': 960.1218000000001, 'mz': array([  99.00540736,   99.00564096,   99.00587457, ..., 1515.13452207,
       1515.14850722, 1515.16249256]), 'spint': array([0., 0., 0., ..., 0., 0., 0.]), 'is_centroid': False}
{'time': 0.40000000002, None: array([190., 192., 194., 196., 198., 200., 202., 204., 206., 208., 210.,
       212., 214., 216., 218., ......]), 'spint': array([ 2.18286e+05, -4.90926e+05, -3.30683e+05, -3.09083e+05,
       -5.18884e+05, -5.05399e+05, -2.69533e+05, -2.26293e+05,
       -1.50526e+05, -1.45612e+05, -1.12733e+05, -1.12920e+05,
       -8.19990e+04,..........
       -1.27050e+04, -1.08800e+03, -1.57300e+04,  2.60700e+03,
       -1.47880e+04, -8.19600e+03,  3.62600e+03, -7.22200e+03,
        6.90000e+03, -1.40910e+04, -4.10000e+02, -9.73000e+03,
       -1.78600e+03,  2.84300e+03]), 'is_centroid': False}

My current workaround is just to alter the spectrum_regex definition

# spectrum_regex = re.compile("<spectrum .[^(><.)]+>")  
spectrum_regex = re.compile('<spectrum index=\"[0-9]+\" id="controllerType=0')

Helpfully, Thermo appears to write anything that isn't Thermo as controllerType=4 so I can just exclude the Agilent PDA data.

Ultimately, it would help to be able to specify this regex or pass additional parameters to avoid this situation. I am happy to submit a PR, but need a little help figuring out how the situations work with indexed and non-indexed mzml files. I am fairly unfamiliar with the mzml format.

Hello @wdwvt1, thank you very much for submitting this issue. It seems that there is a limitation in how the mzML parser works.I implemented it originally as a minimal parser just to be able to read MS scans and chromatograms, and I did not considered PDA data.

Unfortunately, I think that the solution that you are proposing will not work because for example, in mzML files generated from Waters instruments measurements the spectrum tag looks like this:

<spectrum index="0" id="function=1 process=0 scan=100" defaultArrayLength="4969" dataProcessingRef="pwiz_Reader_Waters_conversion">

I think that the best approach would be to explore the mzML specification in depth before trying to define a workaround for this problem. I can work on it in the following days if you provide me with an mzML file that can be used to reproduce the error that you are getting.

Also, what tool are you using to convert your files to mzML? If your code is running the function _mzml._build_offset_list_non_indexed it would seem that you are creating non-indexed mzML data. In general it is recommended to create indexed mzML files as they already include the index. Can you check if the error still persists when using an indexed mzML file? btw the msconvert utility from Proteowizard allows you to create indexed mzML files.

Hi @griquelme - the issue happens as well when _mzml._build_offset_list_indexed is used. My mzml files are actually indexed, but to figure out where the error was happening I was forcing the code into using the non indexed version (just forcing is_indexed to return False).

I use the Thermo RawFileParser for my data extraction. You can see a reference to another issue here where the maintainer of that code helps me dig into how Thermo is writing the data from the Agilent machine.

Unfortunately, I think that the solution that you are proposing will not work because for example, in mzML files generated from Waters instruments measurements the spectrum tag looks like this:

I definitely understand there are a lot of formats to support and I am sure they all write data in their own way. One of the most frustrating parts of the LCMS ecosystem is how little documentation there is for vendor formats :/. If I can help I would be happy to.

My idea about exposing the parser would just be in maybe optional keywords passed to Assay. Based on your notebook examples, would something like allowing the user to pass a spectrum_regex string to the constructor of Assay work? It would default to the code you currently have, but would allow a special case regex to be passed.


data_path = "/mnt/d/pca-project-mzml/untargeted-assay/"
assay = ms.Assay(
    assay_path="pca-untargeted",
    data_path=data_path,
    sample_metadata="data/sample-metadata.csv",
    spectrum_regex="whatever"

)

Yeah I totally agree with you that there is very little documentation available for vendor specific formats. It would be really good if you can help me with this :).

I will look later at the other issue that you linked, but I think that the solution to this problem would be to work only with the MZMLReader object, as the Assay object should not care about how the data is read, as it only interacts with the MSData class.

Maybe another option to try is to see if the Spectrum tag that contains PDA data has MS level information. For example, most spectra data in a mzML file contains this tag:

<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>

This can be checked during the creation of the index to avoid indexing data that is not associated with an MS scan. Can you check if this information is available for PDA data in the mzML that you are using?

griquelme / tidyms

exposing more flexibility in `_mzml._build_offset_list_non_indexed` #6