levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Reading mztab failing #8

Closed ypriverol closed 4 years ago

ypriverol commented 4 years ago

Reading the mzTab file at

ftp://ftp.pride.ebi.ac.uk/pride/data/proteomes/RPXD018241.1/out.mzTab fails with the following error:

Traceback (most recent call last):
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/click/core.py", line 1289, in invoke
    rv.append(sub_ctx.command.invoke(sub_ctx))
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/yperez/IdeaProjects/github-repo/BDP/qccalculator/qccalculator/cli.py", line 189, in mztabreport
    report.add_mztab_files(mztab)
  File "/Users/yperez/IdeaProjects/github-repo/BDP/qccalculator/qccalculator/fullreport.py", line 26, in add_mztab_files
    mztab_tables = mztab.MzTab(mztab_file)
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/pyteomics/mztab.py", line 192, in __init__
    self._transform_tables()
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/pyteomics/mztab.py", line 298, in _transform_tables
    self.spectrum_match_table = self.spectrum_match_table.as_df('PSM_ID')
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/pyteomics/mztab.py", line 142, in as_df
    table = pd.DataFrame(data=self.rows, columns=self.header)
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/pandas/core/frame.py", line 474, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 461, in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
  File "/Users/yperez/local-apps/miniconda3/envs/qccalculator/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 500, in _list_to_arrays
    raise ValueError(e) from e
ValueError: 24 columns passed, passed data had 23 columns

levitsky commented 4 years ago

Hi! It looks like the structure of the PSM table in the file is broken: every row is missing one column. I think the extra field in the header is search_engine_score[2], since the rows contain only one search score. At least, that is where the header and the rows start to disagree:

...previous columns... search_engine search_engine_score[1] search_engine_score[2] modifications
...previous columns... [, , Percolator, 3.02] 0.492759 7-UNIMOD:21 233.198332109477
...previous columns... [, , Percolator, 3.02] 0.463187 2-UNIMOD:21 367.957131867222
...previous columns... [, , Percolator, 3.02] 0.262844 2-UNIMOD:21 537.267531616136
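
A mismatch like this can be confirmed without pandas by counting tab-separated fields. The following is a minimal sketch, not part of pyteomics: the function name `psm_column_counts` and the sample lines (including the peptide sequences) are hypothetical, and it assumes the usual mzTab layout in which the PSM header line starts with `PSH` and each PSM row starts with `PSM`.

```python
def psm_column_counts(lines):
    """Return (header_width, {row_width: number_of_rows}) for the PSM section.

    Widths exclude the leading PSH/PSM prefix field.
    """
    header_width = None
    row_widths = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] == "PSH":
            header_width = len(fields) - 1
        elif fields[0] == "PSM":
            width = len(fields) - 1
            row_widths[width] = row_widths.get(width, 0) + 1
    return header_width, row_widths


# Hypothetical, shortened lines that mimic the problem: two score columns
# in the header, but only one score value in each row.
sample = [
    "PSH\tsequence\tsearch_engine\tsearch_engine_score[1]"
    "\tsearch_engine_score[2]\tmodifications",
    "PSM\tPEPTIDEK\t[, , Percolator, 3.02]\t0.492759\t7-UNIMOD:21",
    "PSM\tSAMPLER\t[, , Percolator, 3.02]\t0.463187\t2-UNIMOD:21",
]

header_width, row_widths = psm_column_counts(sample)
print(header_width, row_widths)
```

For a real file, the lines can be passed in directly, for example `psm_column_counts(open("out.mzTab"))`. If the header width does not appear among the row widths, the table cannot be loaded into a DataFrame as is.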

According to the metadata section, search_engine_score[1] is for Comet and search_engine_score[2] is for Percolator. The search engine is reported as [, , Percolator, 3.02]. Removing either the search_engine_score[1] or the search_engine_score[2] header allows the file to be parsed successfully.
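
The header workaround can be scripted. This is only a sketch under assumptions: `drop_psh_column` is a hypothetical helper, not a pyteomics function, the sample lines are made up, and which of the two score headers to drop is a judgment call that depends on which engine actually produced the score in the rows.

```python
def drop_psh_column(lines, column_name):
    """Return a copy of the lines with column_name removed from the PSH header.

    Only the header line is changed; PSM rows and all other sections are
    passed through untouched (apart from stripped line endings).
    """
    repaired = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] == "PSH" and column_name in fields:
            fields.remove(column_name)
            repaired.append("\t".join(fields))
        else:
            repaired.append(line.rstrip("\r\n"))
    return repaired


# Hypothetical, shortened lines with one header column too many.
broken = [
    "PSH\tsequence\tsearch_engine\tsearch_engine_score[1]"
    "\tsearch_engine_score[2]\tmodifications",
    "PSM\tPEPTIDEK\t[, , Percolator, 3.02]\t0.492759\t7-UNIMOD:21",
]

fixed = drop_psh_column(broken, "search_engine_score[1]")
print(fixed[0].split("\t"))
```

Writing the repaired lines to a new file (joined with newlines) and passing that path to `mztab.MzTab`, as in the traceback above, should then get past the column-count check, provided the missing score column was the only structural problem in the file.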