Open wdwvt1 opened 1 year ago
Hello @wdwvt1, thank you very much for submitting this issue. It seems that there is a limitation in how the mzML parser works. I implemented it originally as a minimal parser just to be able to read MS scans and chromatograms, and I did not consider PDA data.
Unfortunately, I think that the solution that you are proposing will not work because for example, in mzML files generated from Waters instruments measurements the spectrum tag looks like this:
<spectrum index="0" id="function=1 process=0 scan=100" defaultArrayLength="4969" dataProcessingRef="pwiz_Reader_Waters_conversion">
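To illustrate why a vendor-specific pattern does not generalize, here is a minimal sketch that splits the spectrum id into key/value pairs. The helper `parse_native_id` is hypothetical (it is not part of tidyms), and the Thermo-style id is an illustrative example based on the `controllerType` field mentioned in this thread.

```python
import re


def parse_native_id(native_id: str) -> dict[str, str]:
    """Split an id such as "key=value key=value" into a dict.

    Hypothetical helper for illustration only, not part of tidyms.
    """
    return dict(re.findall(r"(\w+)=(\S+)", native_id))


# Waters-style id, copied from the spectrum tag above.
waters = parse_native_id("function=1 process=0 scan=100")

# Thermo-style id (illustrative example, assumed layout).
thermo = parse_native_id("controllerType=4 controllerNumber=1 scan=1")

# The Waters id has no controllerType key, so a filter built on that
# key cannot tell MS scans from detector data in Waters files.
print(waters)
print("controllerType" in waters, "controllerType" in thermo)
```

Any filter keyed on one vendor's id fields would therefore need a fallback for the others.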
I think that the best approach would be to explore the mzML specification in depth before trying to define a workaround for this problem. I can work on it in the following days if you provide me with an mzML file that can be used to reproduce the error that you are getting.
Also, what tool are you using to convert your files to mzML? If your code is running the function _mzml._build_offset_list_non_indexed it would seem that you are creating non-indexed mzML data. In general it is recommended to create indexed mzML files as they already include the index. Can you check if the error still persists when using an indexed mzML file? btw the msconvert utility from Proteowizard allows you to create indexed mzML files.
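A quick way to check whether a file is indexed, sketched under the assumption that an indexed mzML file wraps its content in an `indexedmzML` root element near the start of the file. The function names below are hypothetical and this is not how tidyms itself performs the check.

```python
def head_looks_indexed(head: bytes) -> bool:
    """Return True if the first bytes of a file contain the indexedmzML root.

    Hypothetical helper, not the tidyms implementation.
    """
    return b"<indexedmzML" in head


def file_looks_indexed(path: str, probe_bytes: int = 4096) -> bool:
    """Read only the start of the file and look for the indexed wrapper."""
    with open(path, "rb") as handle:
        return head_looks_indexed(handle.read(probe_bytes))


# Tiny made-up headers used only to exercise the check.
indexed_head = b'<?xml version="1.0"?>\n<indexedmzML xmlns="...">\n<mzML>'
plain_head = b'<?xml version="1.0"?>\n<mzML xmlns="...">'
print(head_looks_indexed(indexed_head), head_looks_indexed(plain_head))
```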
Hi @griquelme - the issue happens as well when _mzml._build_offset_list_indexed is used. My mzml files are actually indexed, but to figure out where the error was happening I was forcing the code into using the non-indexed version (just forcing is_indexed to return False).
I use the Thermo RawFileParser for my data extraction. You can see a reference to another issue here where the maintainer of that code helps me dig into how Thermo is writing the data from the Agilent machine.
Unfortunately, I think that the solution that you are proposing will not work because for example, in mzML files generated from Waters instruments measurements the spectrum tag looks like this:
I definitely understand there are a lot of formats to support and I am sure they all write data in their own way. One of the most frustrating parts of the LCMS ecosystem is how little documentation there is for vendor formats :/. If I can help I would be happy to.
My idea about exposing the parser would just be in maybe optional keywords passed to Assay.
Based on your notebook examples, would something like allowing the user to pass a spectrum_regex string to the constructor of Assay work? It would default to the code you currently have, but would allow a special case regex to be passed.
import tidyms as ms

data_path = "/mnt/d/pca-project-mzml/untargeted-assay/"
assay = ms.Assay(
    assay_path="pca-untargeted",
    data_path=data_path,
    sample_metadata="data/sample-metadata.csv",
    spectrum_regex="whatever",  # proposed keyword, does not exist yet
)
Yeah, I totally agree with you that there is very little documentation available for vendor-specific formats. It would be really good if you could help me with this :).
I will look later at the other issue that you linked, but I think that the solution to this problem would be to work only with the MZMLReader object, as the Assay object should not care about how the data is read, as it only interacts with the MSData class.
Maybe another option to try is to see if the spectrum tag that contains PDA data has MS level information. For example, most spectra in an mzML file contain this tag:
<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
This can be checked during the creation of the index to avoid indexing data that is not associated with an MS scan. Can you check if this information is available for PDA data in the mzML that you are using?
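A minimal sketch of that check, assuming the spectrum element is available as a standalone XML string. The function `has_ms_level` and the two sample fragments are made up for illustration; the real index builder may work on raw byte offsets instead of parsed elements.

```python
import xml.etree.ElementTree as ET

MS_LEVEL_ACCESSION = "MS:1000511"  # accession of the "ms level" cvParam


def has_ms_level(spectrum_xml: str) -> bool:
    """Return True if the spectrum element carries an "ms level" cvParam.

    Hypothetical helper for illustration, not part of tidyms.
    """
    element = ET.fromstring(spectrum_xml)
    for node in element.iter():
        # Drop a namespace prefix such as "{http://...}cvParam" if present.
        tag = node.tag.rsplit("}", 1)[-1]
        if tag == "cvParam" and node.get("accession") == MS_LEVEL_ACCESSION:
            return True
    return False


# Made-up fragments: one MS scan and one spectrum without MS level info.
ms_scan = (
    '<spectrum index="0" id="scan=1" defaultArrayLength="2">'
    '<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>'
    "</spectrum>"
)
other_scan = (
    '<spectrum index="1" id="scan=2" defaultArrayLength="2">'
    '<cvParam cvRef="MS" accession="MS:1000128" name="profile spectrum" value=""/>'
    "</spectrum>"
)
print(has_ms_level(ms_scan), has_ms_level(other_scan))
```

Whether PDA spectra actually lack this cvParam depends on the converter, so this only helps if the answer to the question above is yes.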
Hi @griquelme - first, thanks for writing tidyms, it's a great package. I have been searching for a usable python library to build with.
The issue I am currently having is in parsing my mzml files. My LCMS stack has an Agilent LC (1290 I) and a Thermo MS (Orbitrap QE). As a consequence of this setup, the raw files and mzml files that are written by my system contain additional spectrum elements containing UV/VIS, PDA, and pressure data (in addition to the expected M/Z and intensity data).
When _mzml._build_offset_list_non_indexed searches for spectrum offsets the spectrum_regex identifies these non-MS data spectra. As a result, the returned spectrum_offset_list contains too many elements. In my test data it returns 5891 offsets. This is the sum of the offsets associated with my recorded MS scans (2891) and the offsets associated with my PDA detector data (3000).
This causes an error when data gets read by fileio.MSData.get_spectrum. In my case, this error is produced because the parser doesn't find a string name for the wavelength data array from the PDA. Below is a print out of the data from the spectrum iterator. It fails when it encounters the None name for the PDA wavelength data.
My current workaround is just to alter the spectrum_regex definition.
Helpfully, Thermo appears to write anything that isn't Thermo as controllerType=4 so I can just exclude the Agilent PDA data.
Ultimately, it would help to be able to specify this regex or pass additional parameters to avoid this situation. I am happy to submit a PR, but need a little help figuring out how the situations work with indexed and non-indexed mzml files. I am fairly unfamiliar with the mzml format.
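For readers who want to try the same kind of workaround, here is a rough sketch of a pattern that skips spectrum tags whose id contains controllerType=4. The pattern is an assumption written for this example; it is not the actual spectrum_regex used in tidyms, and the sample text is made up.

```python
import re

# Assumed pattern, NOT the tidyms definition: match an opening <spectrum ...>
# tag unless controllerType=4 appears anywhere inside that tag.
spectrum_regex = re.compile(r"<spectrum\b(?![^>]*controllerType=4\b)[^>]*>")

# Made-up fragment with one MS scan and one detector spectrum.
text = (
    '<spectrumList count="2">'
    '<spectrum index="0" id="controllerType=0 controllerNumber=1 scan=1">'
    "</spectrum>"
    '<spectrum index="1" id="controllerType=4 controllerNumber=1 scan=1">'
    "</spectrum>"
    "</spectrumList>"
)

# Offsets of the spectrum tags that survive the filter.
offsets = [match.start() for match in spectrum_regex.finditer(text)]
print(offsets)
```

The `\b` after `spectrum` keeps the pattern from matching `<spectrumList`, and the negative lookahead only inspects the current tag because `[^>]*` stops at the closing bracket.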