levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

mzTab conversion to dataframe #42

Closed eeko-kon closed 3 years ago

eeko-kon commented 3 years ago

Hello! I am generating mzTab files with SIRIUS that contain MTD, SMH and SML lines, but when I parse them with UTF-8 encoding, only the SMH header seems to be recognized (most importantly, the SML part of the data is skipped completely). Any suggestions? This is my script:

import pandas as pd
import numpy as np
import sys
import pyteomics
from pyteomics import mztab
filename = "./wf_testing/out_sirius_test.mzTab"
sirius = pyteomics.mztab.MzTab(filename, encoding='UTF8', table_format='df')
sirius = pd.DataFrame(sirius)
print(sirius)

output:

0  PRT              Empty DataFrame
Columns: []
Index: []
1  PEP              Empty DataFrame
Columns: []
Index: []
2  PSM              Empty DataFrame
Columns: []
Index: []
3  SML     identifier chemical_formula smiles inchi_ke...

This is part of the file generated from sirius (.mzTab):

MTD mzTab-version   1.0.0
MTD mzTab-mode  null
MTD mzTab-type  null
MTD description Sirius-4.6.0
MTD smallmolecule_search_engine_score[1]    [, , SiriusScore, ]
MTD smallmolecule_search_engine_score[2]    [, , TreeScore, ]
MTD smallmolecule_search_engine_score[3]    [, , IsotopeScore, ]
MTD ms_run[1]-location  data Thermo Orbitrap ID-X/FileFiltered Std/Agnes_POS_MDNA_WGS_103_Filtered.mzML

SMH identifier  chemical_formula    smiles  inchi_key   description exp_mass_to_charge  calc_mass_to_charge charge  retention_time  taxid   species database    database_version    spectra_ref search_engine   best_search_engine_score[1] best_search_engine_score[2] best_search_engine_score[3] modifications   opt_global_adduct   opt_gobal_precursorFormula  opt_global_rank opt_global_explainedPeaks   opt_global_explainedIntensity   opt_global_median_mass_error_fragment_peaks_ppm opt_global_median_absolute_mass_error_fragment_peaks_ppm    opt_global_mass_error_precursor_ppm opt_global_compoundId   opt_global_compoundScanNumber   opt_global_featureId    opt_global_native_id
SML null    C17H25BN2O2S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    39.4614114117431    39.4614114117431    0.0 null    [M + H3N + H]+  C17H28BN3O2S    1   18  0.904773718693427   3.793037970198739   5.609411499914324   -4.038155963543013  745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C17H28BN3O2S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    39.4614114117431    39.4614114117431    0.0 null    [M + H]+    C17H28BN3O2S    1   18  0.904773718693427   3.793037970198739   5.609411499914324   -4.038155963543013  745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C17H30BN3O3S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    39.4614114117431    39.4614114117431    0.0 null    [M - H2O + H]+  C17H28BN3O2S    1   19  0.904773718693427   3.793037970198739   5.609411499914324   -5.157935070801511e04   745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C15H26FN3O2S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    35.360942231040987  35.360942231040987  0.0 null    [M + H3N + H]+  C15H29FN4O2S    2   17  0.824253005426175   1.643124931790097   7.92227946447542    6.370360784334207   745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
SML null    C15H31FN4O3S    null    null    null    349.20902642757801  null    null    123.805954255979998 null    null    null    null    null    null    35.360942231040987  35.360942231040987  0.0 null    [M - H2O + H]+  C15H29FN4O2S    2   18  0.824253005426175   1.643124931790097   7.92227946447542    -5.156894219126724e04   745 746 id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746

Best, Efi

levitsky commented 3 years ago

Hi Efi, thanks for reaching out.

I copied your excerpt, parsed it, and got the data back:

In [1]: from pyteomics import mztab

In [2]: sirius = mztab.MzTab('excerpt.mzTab')

In [3]: sirius.metadata
Out[3]: 
OrderedDict([('mzTab-version', '1.0.0'),
             ('mzTab-mode', None),
             ('mzTab-type', None),
             ('description', 'Sirius-4.6.0'),
             ('smallmolecule_search_engine_score[1]', 'SiriusScore'),
             ('smallmolecule_search_engine_score[2]', 'TreeScore'),
             ('smallmolecule_search_engine_score[3]', 'IsotopeScore'),
             ('ms_run[1]-location',
              'data Thermo Orbitrap ID-X/FileFiltered Std/Agnes_POS_MDNA_WGS_103_Filtered.mzML')])

In [4]: sirius.small_molecule_table
Out[4]: 
  identifier chemical_formula smiles inchi_key description  ...  opt_global_mass_error_precursor_ppm opt_global_compoundId opt_global_compoundScanNumber    opt_global_featureId                          opt_global_native_id
0       None     C17H25BN2O2S   None      None        None  ...                            -4.038156                   745                           746  id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
1       None     C17H28BN3O2S   None      None        None  ...                            -4.038156                   745                           746  id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
2       None     C17H30BN3O3S   None      None        None  ...                        -51579.350708                   745                           746  id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
3       None     C15H26FN3O2S   None      None        None  ...                             6.370361                   745                           746  id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746
4       None     C15H31FN4O3S   None      None        None  ...                        -51568.942191                   745                           746  id_6128946280250909851  controllerType=0 controllerNumber=1 scan=746

[5 rows x 31 columns]

I don't think you should convert the MzTab object into a dataframe. It already contains several dataframes, which you can access with attributes, such as small_molecule_table.
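
A minimal sketch of that attribute-based access, assuming the four mzTab 1.0 table attributes `protein_table`, `peptide_table`, `spectrum_match_table` and `small_molecule_table`, and assuming the tables were requested as dataframes with `table_format='df'`. The helper names `non_empty_tables` and `load_tables` are hypothetical and used only for illustration:

```python
# Section codes of the four mzTab 1.0 tables, mapped to the MzTab attribute
# that is assumed to hold each parsed table.
TABLE_ATTRIBUTES = {
    "PRT": "protein_table",
    "PEP": "peptide_table",
    "PSM": "spectrum_match_table",
    "SML": "small_molecule_table",
}


def non_empty_tables(mz):
    """Return {section code: table} for every table with at least one row.

    `mz` is expected to be a parsed MzTab object whose tables support len(),
    which is the case for pandas DataFrames (len() gives the row count).
    """
    found = {}
    for code, attribute in TABLE_ATTRIBUTES.items():
        table = getattr(mz, attribute, None)
        if table is not None and len(table) > 0:
            found[code] = table
    return found


def load_tables(path):
    """Parse an mzTab file with pyteomics and keep only the non-empty tables."""
    # Imported here so that non_empty_tables() can be used without pyteomics.
    from pyteomics import mztab

    mz = mztab.MzTab(path, encoding="UTF8", table_format="df")
    return non_empty_tables(mz)
```

For a SIRIUS export like the one above, `load_tables("./wf_testing/out_sirius_test.mzTab")["SML"]` should then give the small-molecule dataframe directly, with no extra `pd.DataFrame(...)` call.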

eeko-kon commented 3 years ago

Dear Levitsky,

Ah, I misunderstood. Thank you so much! Very easy.

Efi.