levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

question about ms2 reader #149

Closed colin-combe closed 2 months ago

colin-combe commented 2 months ago

hi, I having a problem reading an ms2 file, basically, i'm doing this:

from pyteomics import ms2
reader = ms2.read(path, use_index=True)
spectrum = reader[0]

should that work?

an excerpt from the file I'm trying to read is attached (file extension changed to .txt so i can upload) thanks, Colin test.txt

levitsky commented 2 months ago

Hi Colin,

yeah, that should work but this line resulted in an exception:

I   Filter  

The parser expected every line that starts with I to have at least two terms after it, forming a key-value pair. I made a change to avoid an exception when there is no value, saving it as an empty string.

colin-combe commented 2 months ago

great, thanks.

Another question - should there be one entry in reader[n] for every spectrum in the ms2 file? Or is it only a subset?

For the larger file I have, there are 15477 lines that start "S" in the file, but only 13322 items end up in the index created by the ms2 reader?

levitsky commented 2 months ago

If there are fewer spectra in the index than in the file, my guess would be that there are spectra with repeating (non-unique) spectrum IDs, in which case one replaces the other in the index. MS1 and MS2 parsers use the next word after S as the spectrum ID, assuming it is unique. In your example file, it is indeed not unique, with 687 spectra and 682 unique IDs, so the reader object only remembers 682 items.

One way to deal with this is not using indexing at all, if you don't need it. A sequential parser can go through all spectra without any issues. Another way is to redefine the field the indexing parser uses as the key for indexing, like include the whole S line. However even that can possibly repeat on rare occasions.

P.S. Is your file combined from several raw files? Just curious why the issue occurs in the first place.

colin-combe commented 2 months ago

maybe it should be considered a problem with the ms2 file? (I don't know where ms2 is defined)

levitsky commented 2 months ago

I don't have much experience with MS2 files, but from what I gathered at the time, the field used for indexing is the actual scan index, and so this problem should not normally occur, unless you do something strange like trying to produce in silico spectra, or just combining multiple files into one. I'm curious where this file came from because if my assumptions are wrong, then perhaps I should address it in Pyteomics.

Anyway, if you want to keep the indexing and have all your items, here's an example that uses the whole S line as scan ID:

In [1]: from pyteomics import ms2

In [2]: class MyMS2(ms2.IndexedMS2):
   ...:     label = r'S\s+(.*\S)'
   ...: 

In [3]: reader = MyMS2('test.txt')

In [4]: len(reader)
Out[4]: 687
colin-combe commented 2 months ago

I'm curious where this file came from because if my assumptions are wrong, then perhaps I should address it in Pyteomics.

It comes from SCOUT.

I'm told the file is a valid ms2 file (and that both the duplicate IDs and I Filter are allowed) and have been given this as a reference for ms2 - https://proteome.gs.washington.edu/software/bibliospec/v1.0/documentation/fileFormats.html#:~:text=MS2%20format%3A%20This%20is%20a,and%20the%20precursor%20m%2Fz (which i don't find super informative tbh)

Anyway, if you want to keep the indexing and have all your items, here's an example that uses the whole S line as scan ID:

Thanks for looking in to this. I haven't tested it but I expect it to work, I'll let you know if i have problems.

Would it be possible to have a release of pyteomics with the fix for I Filter? Our project uses pip to manage dependencies so a release makes things a bit easier for us, best wishes, Colin

colin-combe commented 2 months ago

an example that uses the whole S line as scan ID

i guess that could still fail (if whole S line is duplicated), but the chances are pretty small

levitsky commented 2 months ago

I've made a release (4.7.2) which is already on PyPI. As for the MS2 description and the Scout software, not much is clear to me, but the article introducing the formats states that the numbers after S are "start scan" and "end scan", followed by the precursor m/z.

colin-combe commented 2 months ago

I've tested and it works.

As for the MS2 description and the Scout software, not much is clear to me, but the article introducing the formats states that the numbers after S are "start scan" and "end scan", followed by the precursor m/z.

Maybe I'm being silly, but there is still the remote possibility that all those numbers could be the same (assuming what Scout does is allowed) and then we would miss a spectrum? If combining from multiple files is allowed maybe it would be good to change the the way MS2 parser works so that it is based only on the position in the list of spectra, not the contents of the S line?

Anyway, thanks for all your help, Colin

levitsky commented 2 months ago

there is still the remote possibility that all those numbers could be the same

Yes, I agree.

If combining from multiple files is allowed

I don't think there is an authority on whether it is allowed, however I am sure that doing so, while very easy, will cause issues in many scenarios. For example, just about every software that I know that does peptide and/or protein identification will produce an output file where it records the ID of the spectrum, the assigned sequence, and in some cases the protein ID. If the spectrum ID in MS2 file, for example, or the protein ID in the FASTA file, is not unique, then the information about the discovery is not recoverable. So, implicitly it is assumed that spectrum IDs are unique.

Scout is, as far as I understand, a kind of search engine as well, so the point above should apply to it. I can see that it can use MS2 files as input though, I'm not sure if it also produces them and whether it performs concatenation. But in general, if concatenating multiple files like this, one has to take measures to address issues that can arise in the particular workflow.

With regard to support in Pyteomics, I think we can, in theory, implement an alternative indexing mechanism that is purely positional. It would then be possible to use it with any text-based format parser that we have. Whether or not this is called for, I am really not sure based on this, because like I said,

  1. Pyteomics will rarely be the only source of problems in this scenario
  2. Pyteomics already has a sequential parser for every text format that is capable of reading all items just fine (just not index them)

I am totally open to discuss this idea, though.

colin-combe commented 2 months ago

@diogobor - whats your view on this? Do you plan to keep using files like this with Scout? Col

diogobor commented 2 months ago

@colin-combe and @levitsky the line S starts with the unique scan number for each tandem mass spectrum. S number number means the initial and the final scan number for a set of fragment ions. Usually, this number is the same if the dataset has been acquired from DDA.

@colin-combe I sent you a ms2 file that I combined different ms2s. It means that I had a mistake when the file I sent you (from Scout) didn't have a unique scan number.

In my view, @levitsky Pyteomics must consider the S line as unique scan number, and the lib can iterate over the set of spectra based on the scan number read from S lines. (and not from the index of list).

colin-combe commented 2 months ago

In my view, @levitsky Pyteomics must consider the S line as unique scan number, and the lib can iterate over the set of spectra based on the scan number read from S lines. (and not from the index of list).

@diogobor - i think that's the behaviour we have now using the subclass of ms2.IndexedMS2 from above