Spectrum Information - Githubissues

ghost commented 4 years ago

Requested Feature: I am trying to use pymzML as a converter and extractor for a lipidomics pipeline. The pipeline only works with profile-mode data. So far it has worked perfectly with data from Thermo Fisher instruments, because Thermo's XRawfile.dll allows direct extraction of data from a raw file into text format. The ABSciex converter only supports conversion into mzML, which is how I came across your tool.

Desired solution: Currently, the following is possible:

import pandas as pd
import pymzml

scan_rt = []
ms_rank = []
spectrum = pymzml.run.Reader(file, MS1_Precision=50e-6, MSn_Precision=500e-6)
for spec in spectrum:
    scan_rt.append(spec.scan_time[0])
    ms_rank.append(int(spec.ms_level))
    # One table of profile data points per spectrum
    datapoints = pd.DataFrame(data={"m/z": spec.mz, "intensity": spec.i})

I have been able to retrieve the above information with no problem, but I have failed to extract 'scan', 'scanWindowList', 'scanWindow', 'cvParam', or the detector type, ionization source type, collision chamber type, collision energy and precursor masses. The final output from my converter looks like this:

├───MS1
│   ├───neg
│   │   └───FTMS - p ESI Full ms [200.0000-1600.0000]
│   └───pos
│       └───FTMS + p ESI Full ms [200.0000-1600.0000]
└───MS2
    ├───neg
    │   └───FTMS - p ESI d Full ms2 200.8057@hcd20.00 [50.0000-225.0000]
    └───pos
        └───FTMS + p ESI d Full ms2 200.8057@hcd20.00 [50.0000-225.0000]

Each final subfolder contains text files, e.g. Sample1_0.2345.txt, where Sample1 is the original file name and the number after the underscore is the retention time of that specific spectrum. Each text file therefore holds all the profile data points belonging to one spectrum at one retention time. In MS1 neg, for example, there might be up to 20 thousand of those text files. In MS2 there will be fewer per subfolder, because all files in a given folder stem from one specific precursor.
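The layout described above could be sketched roughly as follows. This is only an illustration: the helper names `spectrum_path` and `write_profile`, the four-decimal retention time and the tab-separated file format are assumptions, not part of pymzML or of the pipeline described in this issue.

```python
from pathlib import Path


def spectrum_path(root, sample, ms_level, polarity, scan_label, rt):
    """Build <root>/MS<level>/<polarity>/<scan_label>/<sample>_<rt>.txt."""
    # Hypothetical helper: the scan label is used verbatim as a folder name,
    # so sanitise it first if the target file system rejects its characters.
    folder = Path(root) / f"MS{ms_level}" / polarity / scan_label
    return folder / f"{sample}_{rt:.4f}.txt"


def write_profile(path, mz_values, intensities):
    """Write one 'm/z<TAB>intensity' line per profile data point."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        for mz, intensity in zip(mz_values, intensities):
            handle.write(f"{mz}\t{intensity}\n")
```

Inside a loop over spectra, something like `write_profile(spectrum_path(out_dir, "Sample1", 1, "neg", label, rt), mz_values, intensities)` would then create one file per spectrum, with `label`, `rt` and the two arrays taken from the spectrum.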

Alternatives considered: I have also considered using pyOpenMS, but it seems that one can already extract more information from the pymzML element tree interface than with pyOpenMS. Secondly, other well-documented lipidomics software (e.g. LipidHunter) already uses pymzML, so it seemed like the most suitable tool. Finally, as you might have noticed, this approach leads to extremely large data files, and the tutorial on multiprocessing showed me that, once I have the information extracted, sorting the data into the desired format could be accelerated.

Additional context: The lipidomics tool described above is the standalone (non-web-based) version of this tool: http://www.alex123.info/

MKoesters commented 4 years ago

Hi,

Okay, I see there are several pieces of information you want to access in the mzML files, some of them being tree structures rather than single XML tags.

ScanWindow and ScanWindowList for example are tree structures like this:

<scanWindowList count="1">
    <scanWindow>
        <cvParam cvRef="MS" accession="MS:1000501" name="scan window lower limit" value="100.0" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
        <cvParam cvRef="MS" accession="MS:1000500" name="scan window upper limit" value="2295.0" unitCvRef="MS" unitAccession="MS:1000040" unitName="m/z"/>
    </scanWindow>
</scanWindowList>

Accessing the lower and upper limits in this tree works as follows:

In [5]: ms2_spec['scan window upper limit']
Out[5]: 2295.0
In [6]: ms2_spec['scan window lower limit']
Out[6]: 100.0

For precursor, a special attribute is implemented which allows you to access the precursor (or all of them if multi-notch) like this:

In [4]: ms2_spec.selected_precursors
Out[4]: [{'mz': 1123.837768554688, 'i': 9400.341796875, 'charge': 2}]

The collision energy can be accessed like this:

In [7]: ms2_spec['collision energy']
Out[7]: 35.0

For detector type, ionization source type, collision chamber type, I'll need to have a closer look.

If any of the methods mentioned above does not work for you, I'd need the mzML file you are working with (or a sample containing some MS1 and MS2 spectra, does not need to be complete). The above code snippets were generated with a mzML file from a Thermo machine, since this is mainly what I'm working with and I don't have a proper ABSciex File (except in DIA mode) to use for testing.

If all else fails, you could also use the find method of spec.element; however, this requires more coding effort, and most of what you would do that way should be covered by pymzML's interface.

I hope this helps at least a bit with your problem. If not, please tell me so I can either fix your issues or extend the interface so that it is able to do what you want to do :)
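As a rough sketch of that fallback, the standard-library ElementTree API can walk a fragment like the scanWindowList shown earlier. The helper name `scan_window` and the wrapping `<scan>` element are illustrative assumptions; with a real file one would presumably pass `spec.element` instead, and the namespace string may need adjusting to whatever the file declares.

```python
import xml.etree.ElementTree as ET

# Assumed mzML namespace; check the xmlns attribute of the actual file.
MZML_NS = "{http://psi.hupo.org/ms/mzml}"

FRAGMENT = """
<scan xmlns="http://psi.hupo.org/ms/mzml">
  <scanWindowList count="1">
    <scanWindow>
      <cvParam cvRef="MS" accession="MS:1000501" name="scan window lower limit" value="100.0"/>
      <cvParam cvRef="MS" accession="MS:1000500" name="scan window upper limit" value="2295.0"/>
    </scanWindow>
  </scanWindowList>
</scan>
""".strip()


def scan_window(element):
    """Return (lower, upper) m/z limits of the first scanWindow, or None."""
    window = element.find(f".//{MZML_NS}scanWindow")
    if window is None:
        return None
    # Map each cvParam accession to its numeric value.
    limits = {}
    for param in window.iter(f"{MZML_NS}cvParam"):
        limits[param.get("accession")] = float(param.get("value"))
    return limits.get("MS:1000501"), limits.get("MS:1000500")


element = ET.fromstring(FRAGMENT)
print(scan_window(element))
```

Looking parameters up by accession rather than by name keeps the lookup independent of how a converter spells the human-readable name.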

Best, Manuel

ghost commented 4 years ago

Hi,

Thanks for the response. The code snippets you have given are really helpful. Another question that I forgot to add: how does one extract the polarity as well? Below is a link to a repository with ABSciex (DDA) mzML files you could use for testing.

https://syncandshare.lrz.de/getlink/fi2dHLGcx4ZndHgVQBPjuKoT/

I have to admit, I do not have much experience with converting ABSciex DDA .WIFF files into mzML files. I used the following parameters from experience, but I am not sure whether they will affect anything downstream or whether they might be useful to you.

Binary encoding precision: 32 bit
Write index, Use zlib compression, TPP Compatibility: all selected
No filter selected
Filter titleMaker, parameters: <RunId>.<ScanNumber>.<ScanNumber>.<ChargeState> File"<SourcePath>".NativeID.<Id>

Thanks for the help :)

MKoesters commented 4 years ago

Hi,

Your conversion seems to be fine, no problem here.

Accessing the polarity is a bit awkward in the current version. The ion mode is basically a boolean flag, so there is no <cvParam cvRef="MS" accession="MS:1000130" name="ion mode" value='negative'/> or the other way around, just <cvParam cvRef="MS" accession="MS:1000130" name="positive scan" /> and <cvParam cvRef="MS" accession="MS:1000129" name="negative scan" />, so you need to check whether either of those exists (or check only one and, if it is absent, assume it's the other).

After merging #220 it would work like this:

In [4]: if spec['negative scan'] is True:
   ...:     polarity = 'negative'
   ...: elif spec['positive scan'] is True:
   ...:     polarity = 'positive'
   ...: else:
   ...:     polarity = 'unknown'  # there is neither a tag for positive nor negative scan mode

For your file (tested for 200414_HILIC_neg_Phospholipid_Soy-Soybean\ Phospholipid.mzML), this would also work with the current version:

In [4]: if spec['negative scan'] == '': 
   ...:     polarity = 'negative' 
   ...: elif spec['positive scan'] == '': 
   ...:     polarity = 'positive' 
   ...: else: 
   ...:     polarity = 'unknown'  # there is neither a tag for positive nor negative scan mode

However, another file I tested this with broke. It seems that a cvParam like <cvParam cvRef="MS" accession="MS:1000129" name="negative scan" value=""/> currently works, whereas <cvParam cvRef="MS" accession="MS:1000129" name="negative scan"/> does not, and both are valid. #220 will allow both cases to work.
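Until then, one way to sidestep the difference between the two forms is to look only at whether the flag is present, ignoring the value attribute. The following is a minimal sketch using the standard library; the function name `polarity_of` and the bare `<spectrum>` test elements are illustrative assumptions, and with a real file the element would presumably come from spec.element.

```python
import xml.etree.ElementTree as ET

# PSI-MS accessions for the two polarity flags.
POLARITY_BY_ACCESSION = {"MS:1000129": "negative", "MS:1000130": "positive"}


def polarity_of(element):
    """Return 'negative', 'positive' or 'unknown' for a spectrum element."""
    for param in element.iter():
        # Compare the local tag name so namespaced and plain tags both match.
        if param.tag.rsplit("}", 1)[-1] != "cvParam":
            continue
        # Only the presence of the accession matters, not the value attribute.
        polarity = POLARITY_BY_ACCESSION.get(param.get("accession"))
        if polarity is not None:
            return polarity
    return "unknown"


with_value = ET.fromstring(
    '<spectrum><cvParam cvRef="MS" accession="MS:1000129" '
    'name="negative scan" value=""/></spectrum>'
)
without_value = ET.fromstring(
    '<spectrum><cvParam cvRef="MS" accession="MS:1000130" '
    'name="positive scan"/></spectrum>'
)
```

Both `polarity_of(with_value)` and `polarity_of(without_value)` resolve to a polarity here, since neither depends on how the empty value is written.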

This afternoon I'll hopefully have time to look into the other issue of retrieving the instrument configuration.

Best, Manuel

ghost commented 4 years ago

Hi,

Thanks for the feedback. For now, everything works smoothly (I have already started testing with MS1 data). If the other parameters simply cannot be extracted, I can make do without them and add them manually, except maybe for the fragmentation type (I had wrongly labeled it as collision chamber type). At the end of the day, you have already done more than enough and I appreciate it.

ghost commented 4 years ago

Hi,

Once again, thanks for your help. I have realised that the other information I had asked to extract can be obtained by other, manual means (by reading the instrument program). I will close the issue.

Best, KSachi

pymzml / pymzML

Spectrum Information #219