mwang87 / MassQueryLanguage

The Mass Spec Query Language (MassQL) is a domain specific language meant to be a succinct way to express a query in a mass spectrometry centric fashion.
https://mwang87.github.io/MassQueryLanguage_Documentation/
MIT License

MALDI and MSQL #119

Open chasemc opened 3 years ago

chasemc commented 3 years ago

It's turning out to be quite difficult to get MALDI data into a format that works with MSQL.

Note: I don't currently have access to any vendor software to see what it can export.

My first go-to's for working with MALDI data (and what I recommend for others) are mmass (http://www.mmass.org/) and MALDIquant (https://github.com/sgibb/MALDIquant)

MALDIquant

I tried peak picking and exporting with MALDIquant, but it can't export peaks to mzML/mzXML. It can export csv/tsv, but MSQL doesn't have a parser for those. MALDIquant error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘exportMzMl’ for signature ‘"MassPeaks"’

I forced an export by converting to a "massSpectrum" object, but I get the following error from MSQL when trying to query that file:

Namespace(cache='NO', extract_json=None, extract_mzML=None, filename='/home/chase/delet/trial.mzML', original_path=None, output_file=None, parallel_query='NO', query='QUERY scaninfo(MS1DATA)')
{
    "querytype": {
        "function": "functionscaninfo",
        "datatype": "datams1data"
    },
    "conditions": [],
    "query": "QUERY scaninfo(MS1DATA)"
}
[Warning] Not index found and build_index_from_scratch is False
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "workflow/bin/msql_cmd.py", line 115, in <module>
    main()
  File "workflow/bin/msql_cmd.py", line 44, in main
    results_df = msql_engine.process_query(query, 
  File "/home/chase/Documents/github/MassQueryLanguage/msql_engine.py", line 176, in process_query
    return _evalute_variable_query(parsed_dict, input_filename, cache=cache, parallel=parallel)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_engine.py", line 252, in _evalute_variable_query
    ms1_df, ms2_df = msql_fileloading.load_data(input_filename, cache=cache)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_fileloading.py", line 41, in load_data
    ms1_df, ms2_df = _load_data_mzML2(input_filename)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_fileloading.py", line 290, in _load_data_mzML2
    rt = spec.scan_time_in_minutes()
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/spec.py", line 885, in scan_time_in_minutes
    self._scan_time, time_unit = self.scan_time
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/spec.py", line 869, in scan_time
    self._scan_time = float(scan_time_ele.attrib.get("value"))
AttributeError: 'NoneType' object has no attribute 'attrib'
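The traceback suggests the exported mzML has no scan-time element, which is plausible for MALDI spectra with no chromatographic dimension. A minimal sketch of one possible workaround on the loading side, assuming a default retention time of 0.0 is acceptable; `scan_time_or_default` is a hypothetical helper, not part of MassQL or pymzml:

```python
def scan_time_or_default(spec, default=0.0):
    """Return the scan time in minutes, or `default` when the spectrum has none.

    `spec` is expected to behave like a pymzml spectrum, i.e. expose
    scan_time_in_minutes(). The fallback value is an assumption.
    """
    try:
        return spec.scan_time_in_minutes()
    except AttributeError:
        # The traceback above shows pymzml raising AttributeError when it
        # cannot find a scan-time element in the spectrum.
        return default
```

In `_load_data_mzML2`, the line `rt = spec.scan_time_in_minutes()` could then call this helper instead, if a placeholder retention time is acceptable for the query.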

mmass

I imported an mzML spectrum, did peak picking in mmass, and then attempted an mgf export (the only options are csv or mgf). Querying that file with MSQL fails:


Namespace(cache='NO', extract_json=None, extract_mzML=None, filename='/home/chase/delet/massive.ucsd.edu/MSV000084291/MSV000081619/bs3610_a_2.mgf', original_path=None, output_file=None, parallel_query='NO', query='QUERY scaninfo(MS1DATA)')
{
    "querytype": {
        "function": "functionscaninfo",
        "datatype": "datams1data"
    },
    "conditions": [],
    "query": "QUERY scaninfo(MS1DATA)"
}
Traceback (most recent call last):
  File "workflow/bin/msql_cmd.py", line 115, in <module>
    main()
  File "workflow/bin/msql_cmd.py", line 44, in main
    results_df = msql_engine.process_query(query, 
  File "/home/chase/Documents/github/MassQueryLanguage/msql_engine.py", line 176, in process_query
    return _evalute_variable_query(parsed_dict, input_filename, cache=cache, parallel=parallel)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_engine.py", line 252, in _evalute_variable_query
    ms1_df, ms2_df = msql_fileloading.load_data(input_filename, cache=cache)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_fileloading.py", line 50, in load_data
    ms1_df, ms2_df = _load_data_mgf(input_filename)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_fileloading.py", line 89, in _load_data_mgf
    peak_dict["scan"] = spectrum.metadata["scans"]
KeyError: 'scans'
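The `KeyError` indicates that the mmass export has no `SCANS` field in its MGF headers. A minimal sketch of a tolerant lookup, assuming a running 1-based index is an acceptable substitute scan number; `scan_id` is a hypothetical helper, not the fix that was actually applied in MassQL:

```python
def scan_id(metadata, fallback_index):
    """Return the 'scans' entry from MGF metadata, or a fallback index.

    `metadata` is any dict-like object, such as the `spectrum.metadata`
    seen in the traceback. Using the fallback index is an assumption.
    """
    return metadata.get("scans", fallback_index)


# Example: number spectra 1..N when the exporter wrote no SCANS field.
example_metadata = [{"title": "a"}, {"title": "b", "scans": "7"}]
scan_ids = [scan_id(m, i) for i, m in enumerate(example_metadata, start=1)]
```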

robinschmid commented 3 years ago

Why not use imzML? It is very similar to mzML and adds imaging-specific information to it.

chasemc commented 3 years ago

I was trying to work within the confines of formats that already had parsers in MassQueryLanguage. Yesterday @mwang87 got mgf working from mmass exports. But I agree with you @robinschmid, it would also be useful to have for imaging.

Naive question: I assume imzML supports centroided data?

robinschmid commented 3 years ago

It's based on mzML and supports centroid data: https://ms-imaging.org/wp/imzml/

There is the jimzML parser and the pyimzML parser (https://github.com/alexandrovteam/pyimzML).
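For reference, a rough sketch of how spectra read through pyimzML might be flattened into per-peak rows. It relies only on the parser's `coordinates` list and `getspectrum(index)` method; the row keys (`scan`, `x`, `y`, `mz`, `i`) and the function name are assumptions for illustration, not MassQL's actual schema:

```python
def imzml_to_peak_rows(parser):
    """Flatten an imzML parser's spectra into one dict per peak.

    `parser` is expected to behave like pyimzml.ImzMLParser.ImzMLParser:
    a `coordinates` sequence of (x, y, z) tuples and a getspectrum(index)
    method returning (mz_array, intensity_array).
    """
    rows = []
    for index, coords in enumerate(parser.coordinates):
        mzs, intensities = parser.getspectrum(index)
        for mz, intensity in zip(mzs, intensities):
            rows.append({
                "scan": index + 1,  # 1-based pixel index used as a scan number (assumption)
                "x": coords[0],
                "y": coords[1],
                "mz": float(mz),
                "i": float(intensity),
            })
    return rows
```

With pyimzML installed, the parser would be created with `ImzMLParser("file.imzML")` from `pyimzml.ImzMLParser` and passed to this function.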

mwang87 commented 3 years ago

Do you all have any example queries and data you'd want to use as an example? If so then it should be reasonable to add support. I just don't have any data in that format.

robinschmid commented 3 years ago

Corinna might have some interesting imaging data with