Failing to parse UniProt description with space in gene symbol

caetera commented 5 months ago

I have found some proteins in UniProt that have a space in the gene name, for example O64332:

sp|O64332|TIPL_BPN15 Tail tip protein L OS=Escherichia phage N15 OX=1604876 GN=gene 18 PE=3 SV=1

The regex currently used for parsing the header fails to parse the OS, OX, and name groups for such entries. A possible solution is to change it to:

^(?P<db>\\w+)\\|(?P<id>[-\\w]+)\\|(?P<entry>\\w+)\\s+(?P<name>.*?)(?:(\\s+OS=(?P<OS>[^=]+))|(\\s+OX=(?P<OX>\\d+))|(\\s+GN=(?P<GN>[^=]+))|(\\s+PE=(?P<PE>\\d))|(\\s+SV=(?P<SV>\\d+)))*\\s*$

This one parses the problematic entries correctly, but I am not sure whether it will introduce other problems. It would be nice if you looked at it as well. You know... I got 99 problems, so I used regex and now I have 100.
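For reference, a minimal sketch of applying the proposed pattern to the O64332 header with Python's standard `re` module. It assumes the doubled backslashes in the pattern above are string escaping, so the pattern is written here as a raw string with single backslashes. The names `UNIPROT_HEADER`, `header`, and `fields` are only for illustration and are not part of pyteomics.

```python
import re

# Proposed pattern from this issue, written as a raw string.
# Assumption: the doubled backslashes in the post are string escaping.
UNIPROT_HEADER = re.compile(
    r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>\w+)\s+(?P<name>.*?)'
    r'(?:(\s+OS=(?P<OS>[^=]+))|(\s+OX=(?P<OX>\d+))|(\s+GN=(?P<GN>[^=]+))'
    r'|(\s+PE=(?P<PE>\d))|(\s+SV=(?P<SV>\d+)))*\s*$'
)

# The problematic header quoted above (gene name contains a space).
header = ('sp|O64332|TIPL_BPN15 Tail tip protein L OS=Escherichia phage N15 '
          'OX=1604876 GN=gene 18 PE=3 SV=1')

match = UNIPROT_HEADER.match(header)
fields = match.groupdict() if match else {}

# Print every named group so the split between name, OS, OX and GN is visible.
for key in ('db', 'id', 'entry', 'name', 'OS', 'OX', 'GN', 'PE', 'SV'):
    print(key, '=', repr(fields.get(key)))
```

The `[^=]+` in the GN group first runs up to the next `=` and then backtracks until what follows is whitespace plus another known key, which is why `gene 18` is kept whole.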

levitsky commented 5 months ago

Thank you for reporting this. Since only predefined key-value pairs are expected from this part of the header to the end, the change seems pretty safe to me.

levitsky commented 5 months ago

I ran a comparison between the two regexes on a local copy of the whole UniProt and confirmed that the change fixes parsing for the couple hundred entries with spaces in the gene name. I noticed no undesirable changes.
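A rough sketch of how such a comparison could be set up, not the script actually used here. The previous pattern is not quoted in this thread, so `OLD_LIKE` below is a hypothetical stand-in that only differs by forbidding whitespace in GN. The helpers `build_pattern`, `parse`, and `changed_headers` and the second header are made up for illustration.

```python
import re

def build_pattern(gn_value):
    """Compile the header pattern with a configurable GN value sub-pattern."""
    return re.compile(
        r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>\w+)\s+(?P<name>.*?)'
        r'(?:(\s+OS=(?P<OS>[^=]+))|(\s+OX=(?P<OX>\d+))'
        r'|(\s+GN=(?P<GN>' + gn_value + r'))'
        r'|(\s+PE=(?P<PE>\d))|(\s+SV=(?P<SV>\d+)))*\s*$'
    )

NEW = build_pattern(r'[^=]+')   # pattern proposed in this issue
# Hypothetical stand-in: GN without spaces. NOT the real pyteomics pattern.
OLD_LIKE = build_pattern(r'\S+')

def parse(pattern, header):
    """Return the named groups as a dict, or None if the header does not match."""
    m = pattern.match(header)
    return m.groupdict() if m else None

def changed_headers(old, new, headers):
    """Yield (header, old_result, new_result) where the two patterns disagree."""
    for h in headers:
        a, b = parse(old, h), parse(new, h)
        if a != b:
            yield h, a, b

headers = [
    'sp|O64332|TIPL_BPN15 Tail tip protein L OS=Escherichia phage N15 '
    'OX=1604876 GN=gene 18 PE=3 SV=1',
    # Made-up header without a space in GN, for contrast.
    'sp|X00000|TEST_EXMPL Example protein OS=Example organism OX=1 '
    'GN=abc1 PE=1 SV=1',
]

diffs = list(changed_headers(OLD_LIKE, NEW, headers))
for h, old_result, new_result in diffs:
    print(h)
    print('  old-like:', old_result)
    print('  new:     ', new_result)
```

With the stand-in pattern, the O64332 header still matches, but the lazy name group swallows the OS, OX, and GN fields, so only that header is reported as changed.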

levitsky / pyteomics

Failing to parse UniProt description with space in gene symbol #139