Hard-code key=value pairs in Uniprot headers as described in the spec

levitsky commented 1 year ago

The Uniprot documentation specifies which key-value pairs can appear in the FASTA headers and in what order:

>db|UniqueIdentifier|EntryName ProteinName OS=OrganismName OX=OrganismIdentifier [GN=GeneName ]PE=ProteinExistence SV=SequenceVersion

Up to now, the parser tried to match them loosely, relying on the assumption that = only occurs as a key-value delimiter in FASTA headers. That assumption does not hold, so it is non-trivial to tell parts of the protein name apart from arbitrary keys and values. An example of an entry that breaks the parser:

>tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) OS=Triticum aestivum OX=4565 PE=1 SV=1
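To illustrate why loose matching trips over this entry, here is a minimal sketch. It is not the previous pyteomics pattern, only a deliberately naive stand-in that treats every word before = as a key; the names `naive_keys` and `DOCUMENTED` are made up for this example.

```python
import re

header = ('tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) '
          'OS=Triticum aestivum OX=4565 PE=1 SV=1')

# Naive assumption (illustration only): any word directly before '=' is a key.
naive_keys = re.findall(r'(\w+)=', header)
print(naive_keys)  # ['VIII', 'OS', 'OX', 'PE', 'SV']

# Keys listed in the documented UniProt header format.
DOCUMENTED = {'OS', 'OX', 'GN', 'PE', 'SV'}

# Anything else came from the protein name, not from a real key=value pair.
unexpected = [k for k in naive_keys if k not in DOCUMENTED]
print(unexpected)  # ['VIII']
```

The spurious `VIII` key comes from the protein name, which is why restricting the pattern to the documented keys helps.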

This PR changes the pattern so that only the documented keys are parsed, in the order in which they are listed in the specification. This makes it possible to correctly parse some entries with = in their names, which previously raised errors. However, it may break the parsing of some other entries in the wild if they do not follow the specification closely enough.

This PR is an attempt to catch these cases. If you see this and decide to test it out and drop a comment, thank you!

P.S. If all is well, it would make sense to extend this to UniRef as well.

levitsky commented 1 year ago

Pinging @radusuciu and @hguturu here, since you have previously reported FASTA-related issues.

mobiusklein commented 1 year ago

Is there a way to make sure the order of the key=value pairs does not matter? If a key=value pair is out of order, all the pairs preceding it are lost: they end up inside the name instead of being parsed.

Using the current pattern: '^(?P<db>\\w+)\\|(?P<id>[-\\w]+)\\|(?P<entry>\\w+)\\s+(?P<name>.*?)(\\s+OS=(?P<OS>[^=]+))?(\\s+OX=(?P<OX>\\d+))?(\\s+GN=(?P<GN>\\S+))?(\\s+PE=(?P<PE>\\d))?(\\s+SV=(?P<SV>\\d+))?\\s*$'

Parsing "tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) OS=Triticum aestivum OX=4565 PE=1 SV=1"

{'db': 'tr',
 'id': 'Q9S8M8',
 'entry': 'Q9S8M8_WHEAT',
 'name': 'FRIII-2-VIII=GAMMA-gliadin (Fragment)',
 'OS': 'Triticum aestivum',
 'OX': '4565',
 'GN': None,
 'PE': '1',
 'SV': '1'}

Parsing "tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) OS=Triticum aestivum OX=4565 SV=1 PE=1"

{'db': 'tr',
 'id': 'Q9S8M8',
 'entry': 'Q9S8M8_WHEAT',
 'name': 'FRIII-2-VIII=GAMMA-gliadin (Fragment) OS=Triticum aestivum OX=4565 SV=1',
 'OS': None,
 'OX': None,
 'GN': None,
 'PE': '1',
 'SV': None}

I tried making a meta-group or-ing each key-value pattern together and then allowing the meta-group to repeat. Trying this abomination: '^(?P<db>\\w+)\\|(?P<id>[-\\w]+)\\|(?P<entry>\\w+)\\s+(?P<name>.*?)(?:(\\s+OS=(?P<OS>[^=]+))|(\\s+OX=(?P<OX>\\d+))|(\\s+GN=(?P<GN>\\S+))|(\\s+PE=(?P<PE>\\d))|(\\s+SV=(?P<SV>\\d+)))*\\s*$'

Parsing "tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) OS=Triticum aestivum OX=4565 PE=1 SV=1"

{'db': 'tr',
 'id': 'Q9S8M8',
 'entry': 'Q9S8M8_WHEAT',
 'name': 'FRIII-2-VIII=GAMMA-gliadin (Fragment)',
 'OS': 'Triticum aestivum',
 'OX': '4565',
 'GN': None,
 'PE': '1',
 'SV': '1'}

Parsing "tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) OS=Triticum aestivum OX=4565 SV=1 PE=1"

{'db': 'tr',
 'id': 'Q9S8M8',
 'entry': 'Q9S8M8_WHEAT',
 'name': 'FRIII-2-VIII=GAMMA-gliadin (Fragment)',
 'OS': 'Triticum aestivum',
 'OX': '4565',
 'GN': None,
 'PE': '1',
 'SV': '1'}
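For anyone who wants to reproduce the comparison, the two patterns quoted above can be run side by side with plain `re`. This is a sketch outside of pyteomics: the names `ORDERED`, `PERMISSIVE` and `parse` are made up here, and the patterns are written as raw strings, so the backslashes are single.

```python
import re

# Strict variant: every documented key is optional but must come in spec order.
ORDERED = re.compile(
    r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>\w+)\s+(?P<name>.*?)'
    r'(\s+OS=(?P<OS>[^=]+))?(\s+OX=(?P<OX>\d+))?(\s+GN=(?P<GN>\S+))?'
    r'(\s+PE=(?P<PE>\d))?(\s+SV=(?P<SV>\d+))?\s*$'
)

# Permissive variant: the same key groups OR-ed together and allowed to repeat.
PERMISSIVE = re.compile(
    r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>\w+)\s+(?P<name>.*?)'
    r'(?:(\s+OS=(?P<OS>[^=]+))|(\s+OX=(?P<OX>\d+))|(\s+GN=(?P<GN>\S+))'
    r'|(\s+PE=(?P<PE>\d))|(\s+SV=(?P<SV>\d+)))*\s*$'
)


def parse(pattern, header):
    """Return the named groups as a dict, or None if the header does not match."""
    m = pattern.match(header)
    return m.groupdict() if m else None


# Same entry as above, with SV and PE swapped relative to the spec.
swapped = ('tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) '
           'OS=Triticum aestivum OX=4565 SV=1 PE=1')

strict = parse(ORDERED, swapped)
loose = parse(PERMISSIVE, swapped)

print(strict['name'])  # FRIII-2-VIII=GAMMA-gliadin (Fragment) OS=Triticum aestivum OX=4565 SV=1
print(loose['name'])   # FRIII-2-VIII=GAMMA-gliadin (Fragment)
print(loose['OS'], loose['OX'], loose['PE'], loose['SV'])  # Triticum aestivum 4565 1 1
```

The permissive variant relies on Python's `re` keeping the value a named group captured in an earlier iteration of the repeated meta-group.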

Are there other examples to test this on?

levitsky commented 1 year ago

Yes, I was also looking into it and came to the same OR'ing idea. The key order is part of the spec, so my expectation is that it should be respected (I have not yet seen examples to the contrary). However, if it's not catastrophically slower, we can just go with the more permissive version.
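To look for counterexamples in real files, a small check along these lines could flag headers whose documented keys are out of spec order. It is only a sketch: `keys_in_spec_order` is a hypothetical helper, not part of pyteomics, and it assumes that a documented key is always preceded by whitespace and never occurs inside a protein name.

```python
import re

# Documented keys in the order given by the UniProt header format.
SPEC_ORDER = ['OS', 'OX', 'GN', 'PE', 'SV']
KEY_RE = re.compile(r'\s(OS|OX|GN|PE|SV)=')


def keys_in_spec_order(header):
    """True if the documented keys occur at most once each and in spec order."""
    ranks = [SPEC_ORDER.index(key) for key in KEY_RE.findall(header)]
    # Strictly increasing ranks rule out both reordering and duplicates.
    return all(a < b for a, b in zip(ranks, ranks[1:]))


in_order = ('tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) '
            'OS=Triticum aestivum OX=4565 PE=1 SV=1')
swapped = ('tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) '
           'OS=Triticum aestivum OX=4565 SV=1 PE=1')

print(keys_in_spec_order(in_order))  # True
print(keys_in_spec_order(swapped))   # False
```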

levitsky commented 1 year ago

Turns out the OR-ed version is even faster by about 25%.

The test was:

In [1]: from pyteomics import fasta

In [2]: with fasta.read('/home/lev/Downloads/fasta/sprot_human_decoy.fasta') as f:
   ...:     headers = [d for d, s in f]
   ...: 

In [3]: p = fasta.UniProtMixin()

In [4]: %%timeit
   ...: for d in headers:
   ...:     p.parser(d)
   ...: 
381 ms ± 7.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With previous version:

503 ms ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
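A rough way to compare the two regex variants without a FASTA file at hand is sketched below. It times only `re` matching, not `fasta.UniProtMixin.parser`, and uses the two example headers from this thread repeated many times as a stand-in for real data, so the numbers will not match the ones above. The names `PREFIX`, `FIELDS`, `GROUPS` and `parse_all` are made up for this sketch.

```python
import re
import timeit

# Shared prefix and per-key value patterns, taken from the patterns in this thread.
PREFIX = r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>\w+)\s+(?P<name>.*?)'
FIELDS = [('OS', r'[^=]+'), ('OX', r'\d+'), ('GN', r'\S+'),
          ('PE', r'\d'), ('SV', r'\d+')]
GROUPS = [rf'(\s+{key}=(?P<{key}>{value}))' for key, value in FIELDS]

# Strict order: each group optional, in sequence.
ordered = re.compile(PREFIX + ''.join(g + '?' for g in GROUPS) + r'\s*$')
# Permissive: groups OR-ed together, meta-group may repeat.
permissive = re.compile(PREFIX + '(?:' + '|'.join(GROUPS) + ')*' + r'\s*$')

# Stand-in data: the two example headers from this thread, repeated.
headers = [
    'tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) '
    'OS=Triticum aestivum OX=4565 PE=1 SV=1',
    'tr|Q9S8M8|Q9S8M8_WHEAT FRIII-2-VIII=GAMMA-gliadin (Fragment) '
    'OS=Triticum aestivum OX=4565 SV=1 PE=1',
] * 250


def parse_all(pattern):
    """Match every header once with the given compiled pattern."""
    for header in headers:
        pattern.match(header)


timings = {}
for label, pattern in [('ordered', ordered), ('permissive', permissive)]:
    # Best of 5 repeats, 10 passes over the headers each.
    timings[label] = min(timeit.repeat(lambda: parse_all(pattern), number=10, repeat=5))
    print(f'{label}: {timings[label] * 1000:.1f} ms per 10 passes')
```

Results on such a small synthetic set say little about real databases; rerunning the IPython session above on an actual FASTA file remains the meaningful test.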

levitsky / pyteomics

Hard-code key=value pairs in Uniprot headers as described in the spec #93