mwang87 / MassQueryLanguage

The Mass Spec Query Language (MassQL) is a domain specific language meant to be a succinct way to express a query in a mass spectrometry centric fashion.
https://mwang87.github.io/MassQueryLanguage_Documentation/
MIT License

Potential error with mzml parser and MALDI data #92

Closed: chasemc closed this issue 3 years ago

chasemc commented 3 years ago

I haven't dug into it, but it looks like the mzML parser has issues with MALDI data when the file contains data from more than one spot. A file with a single spectrum seems to have worked, but I haven't tested further yet.

To reproduce:

I took the raw data from https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=7ce7c09a174545a4a7dfe80af25329b0 and converted it to mzML with msconvert (fresh install, default settings) (Protein_Data.zip).


Environment Setup

git clone git@github.com:mwang87/MassQueryLanguage.git

cd MassQueryLanguage

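conda create --name msql python=3.8
# activate the new environment so the packages below are installed into it
conda activate msql
pip3 install -r requirements.txt
conda install nextflow

# get data (run the script from inside workflow/test, then return to the repo root)
cd workflow/test
bash get_data.sh
cd ../..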

Test (all paths relative to the 'MassQueryLanguage' directory)


python3 workflow/bin/msql_cmd.py \
    "workflow/test/Protein_Data.mzML" \
    "QUERY scaninfo(MS1DATA)" \
    --output_file "output.tsv" \
    --parallel_query NO \
    --cache NO

Error:


/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py:57: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  warnings.warn(
Namespace(cache='NO', extract_json=None, extract_mzML=None, filename='/data/bruker_autoflex/Protein_Data.mzML', original_path=None, output_file='output.tsv', parallel_query='NO', query='QUERY scaninfo(MS1DATA)')
{
    "querytype": {
        "function": "functionscaninfo",
        "datatype": "datams1data"
    },
    "conditions": [],
    "query": "QUERY scaninfo(MS1DATA)"
}
Traceback (most recent call last):
  File "workflow/bin/msql_cmd.py", line 85, in <module>
    main()
  File "workflow/bin/msql_cmd.py", line 44, in main
    results_df = msql_engine.process_query(query, 
  File "/home/chase/Documents/github/MassQueryLanguage/msql_engine.py", line 173, in process_query
    return _evalute_variable_query(parsed_dict, input_filename, cache=cache, parallel=parallel)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_engine.py", line 236, in _evalute_variable_query
    ms1_df, ms2_df = msql_fileloading.load_data(input_filename, cache=cache)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_fileloading.py", line 40, in load_data
    ms1_df, ms2_df = _load_data_mzML(input_filename)
  File "/home/chase/Documents/github/MassQueryLanguage/msql_fileloading.py", line 252, in _load_data_mzML
    run = pymzml.run.Reader(input_filename, MS_precisions=MS_precisions)
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/run.py", line 120, in __init__
    self.info["file_object"] = self._open_file(path_or_file)
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/run.py", line 222, in _open_file
    return FileInterface(
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/file_interface.py", line 28, in __init__
    self.file_handler = self._open(path)
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/file_interface.py", line 58, in _open
    return standardMzml.StandardMzml(
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/file_classes/standardMzml.py", line 60, in __init__
    self.seek_list = self._read_extremes()
  File "/home/chase/miniconda3/envs/msql/lib/python3.8/site-packages/pymzml/file_classes/standardMzml.py", line 660, in _read_extremes
    last_scan = int(re.search(b"[0-9]*$", id_match.group("id")).group())
ValueError: invalid literal for int() with base 10: b''
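
The last frame of the traceback can be reproduced in isolation. The sketch below copies the expression from `standardMzml.py` line 660 into a small helper (the name `last_scan_number` is hypothetical) and applies it to two spectrum IDs. Both IDs are made up for illustration; the actual IDs in Protein_Data.mzML are not shown in this thread.

```python
import re


def last_scan_number(spectrum_id: bytes) -> int:
    # Same expression as the last traceback frame: take the run of digits
    # at the very end of the spectrum id and convert it to an int.
    return int(re.search(b"[0-9]*$", spectrum_id).group())


# An id that ends in digits parses without trouble.
print(last_scan_number(b"controllerType=0 controllerNumber=1 scan=42"))

# A hypothetical id that does NOT end in digits: the regex still matches,
# but only the empty string at the end, so int(b"") raises ValueError.
try:
    last_scan_number(b"hypothetical_spot_A")
except ValueError as err:
    print(err)  # invalid literal for int() with base 10: b''
```

Because `[0-9]*` may match zero characters, `re.search` never returns `None` here; the failure only surfaces in `int()`, which is why the error message shows an empty bytes literal.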

mwang87 commented 3 years ago

Oh this is super interesting, I've honestly never opened an mzML from real MALDI data. Let me take a look and worst case, we might have to pull in a different parser for this kind of data.

mwang87 commented 3 years ago

OK, yeah, I think pymzml can't handle spectrum IDs that don't end in a number. I think we'll have to switch to pyteomics instead of pymzml. However, this does work when converting to mzXML. Could you give that a go?

I gave it a try with this command:

python ./msql_cmd.py \
    test/Protein_Data.mzXML \
    "QUERY scaninfo(MS1DATA)"

and got this:

   scan   rt  mslevel            i  query_index
0     1  0.0        1  367336682.0            0
1     2  0.0        1  344296471.0            0
2     3  0.0        1  395128629.0            0
3     4  0.0        1  318670421.0            0
4     5  0.0        1  513163051.0            0
5     6  0.0        1  448719899.0            0
6     7  0.0        1  317714726.0            0
7     8  0.0        1  622034997.0            0

chasemc commented 3 years ago

Yeah, with mzXML I got the same result as you.

mwang87 commented 3 years ago

Awesome! Yeah, play around with some queries in mzXML then; hopefully it finds what you're looking for!
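
To check whether another mzML file might hit the same error before running a query, one option is to list the spectrum IDs that do not end in digits, since that is the pattern the traceback's last frame fails on. This is a minimal sketch using only the Python standard library; the helper names and the tiny in-memory sample document are hypothetical and not part of MassQL or pymzml.

```python
import io
import re
import xml.etree.ElementTree as ET


def spectrum_ids(source):
    """Yield the id attribute of every <spectrum> element in an mzML source."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop the XML namespace prefix, e.g. "{http://psi.hupo.org/ms/mzml}spectrum".
        if elem.tag.rsplit("}", 1)[-1] == "spectrum":
            yield elem.get("id", "")
            elem.clear()  # release the element's contents to keep memory low


def ids_without_trailing_number(source):
    """Return the spectrum ids that do not end in one or more digits."""
    return [sid for sid in spectrum_ids(source) if not re.search(r"[0-9]+$", sid)]


# Hypothetical two-spectrum document, for illustration only.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml"><run id="r"><spectrumList count="2">
<spectrum index="0" id="scan=1" defaultArrayLength="0"/>
<spectrum index="1" id="hypothetical_spot_A" defaultArrayLength="0"/>
</spectrumList></run></mzML>"""

print(ids_without_trailing_number(io.BytesIO(SAMPLE)))
```

`ET.iterparse` also accepts a file path, so a call such as `ids_without_trailing_number("workflow/test/Protein_Data.mzML")` should work on a real file. An empty list suggests the IDs are not the cause of the failure.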