levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

MGF File Loading Error - IndexedMGF Object Empty #134

Closed CCranney closed 6 months ago

CCranney commented 6 months ago

Hi,

I am trying to load a metabolome library file in .mgf format using pyteomics. I've used it before on other library files, but in this instance the IndexedMGF object created by pyteomics is empty. For any appropriately formatted and populated .mgf file, the following generally prints a value greater than 0:

from pyteomics import mgf

lib = mgf.read('<path-to-file>')
print(len(lib))

metabolome.tar.gz

The above file is from GNPS. With this particular file, the code prints 0, even though the file contains spectra that appear to be correctly formatted. No errors are raised. I've spent a while stepping through the code and comparing this metabolome file against a working example, but I can't find why it isn't being read properly. Any chance you could take a look and see what the issue may be?

levitsky commented 6 months ago

Hi @CCranney,

This MGF file doesn't have TITLEs for its spectra, and TITLE is what is used as the spectrum ID by default. A supported alternative is to index by SCANS instead, which are present in this file, so this works:

In [1]: from pyteomics import mgf

In [2]: f = mgf.IndexedMGF(filename, index_by_scans=True)

In [3]: len(f)
Out[3]: 24563
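
To check whether a given file is in this situation before building an index, something like the following rough sketch could help. It is not part of pyteomics: `count_mgf_keys` is a hypothetical helper that reads the file as plain text and counts how many spectra carry a TITLE or SCANS line. It assumes the usual `BEGIN IONS` / `END IONS` layout with `KEY=value` header lines, and it treats an empty value (for example a bare `TITLE=`) as present.

```python
from collections import Counter


def count_mgf_keys(path, keys=("TITLE", "SCANS")):
    """Count spectra in an MGF file and how many of them have each key.

    Hypothetical diagnostic helper; it only scans text and does not
    use pyteomics at all.
    """
    counts = Counter()
    in_spectrum = False
    seen = set()
    with open(path, encoding="utf-8", errors="replace") as fh:
        for raw in fh:
            line = raw.strip()
            if line == "BEGIN IONS":
                # Start of a new spectrum block
                in_spectrum = True
                seen = set()
                counts["spectra"] += 1
            elif line == "END IONS":
                # Record each key once per spectrum
                in_spectrum = False
                counts.update(seen)
            elif in_spectrum and "=" in line:
                key = line.split("=", 1)[0].strip().upper()
                if key in keys:
                    seen.add(key)
    result = {"spectra": counts["spectra"]}
    for key in keys:
        result[key] = counts[key]
    return result
```

If the TITLE count comes back as 0 while the SCANS count matches the number of spectra, which is presumably what the file in this thread would show, passing index_by_scans=True as above is the option to try.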

CCranney commented 6 months ago

Works like a charm, thank you so much! Closing the issue.

CCranney commented 6 months ago

I'm considering the issue closed, but this could be an enhancement request: emit a warning when an otherwise normal file ends up with an empty index for this reason.
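
Until something like that exists in the library, a user-side wrapper could approximate it. The following is only a sketch of the idea, not pyteomics behavior: `open_indexed_mgf` and its `opener` parameter are hypothetical names. By default it calls `mgf.IndexedMGF` as in the reply above, and it warns when the resulting index has no entries.

```python
import warnings


def open_indexed_mgf(path, opener=None, **kwargs):
    """Open an indexed MGF reader and warn if its index is empty.

    Hypothetical wrapper. `opener` defaults to pyteomics' mgf.IndexedMGF
    and exists only so a different reader class can be substituted.
    """
    if opener is None:
        # Imported lazily so the pyteomics dependency is only needed here
        from pyteomics import mgf
        opener = mgf.IndexedMGF
    reader = opener(path, **kwargs)
    if len(reader) == 0:
        warnings.warn(
            f"No spectra were indexed in {path!r}. If the file has no "
            "TITLE lines, try index_by_scans=True.",
            UserWarning,
            stacklevel=2,
        )
    return reader
```

Calling `open_indexed_mgf('<path-to-file>')` on a file without TITLEs should then produce the warning, while `open_indexed_mgf('<path-to-file>', index_by_scans=True)` passes the keyword through to the reader unchanged.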