Closed by caetera 5 months ago
Thank you for reporting. Considering that only pre-defined key-value pairs are expected from this part of the header till the end, this change seems pretty safe to me.
I ran a comparison between the two regexes on a local copy of the whole of UniProt and can confirm that the change fixes parsing for the couple hundred entries with spaces in the gene name. I noticed no undesirable changes.
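A comparison of this kind could be sketched roughly as follows in Python. This is only an illustration: `OLD`, `NEW`, `HEADERS` and `diff_patterns` are hypothetical names, the two patterns are toy stand-ins rather than the regexes discussed in this issue, and the headers are made up instead of being read from a UniProt download.

```python
import re

# Toy stand-ins, NOT the project's regexes: OLD rejects spaces in the
# gene name, NEW lets the name run lazily up to the next " PE=" key.
OLD = re.compile(r"GN=(?P<name>\S+) PE=")
NEW = re.compile(r"GN=(?P<name>.+?) PE=")

# Made-up headers in the UniProt FASTA style (assumed for illustration).
HEADERS = [
    "sp|P00000|TEST1_HUMAN Made-up protein OS=Homo sapiens OX=9606 GN=ABC1 PE=1 SV=1",
    "sp|P00001|TEST2_HUMAN Made-up protein OS=Homo sapiens OX=9606 GN=ABC 1 PE=1 SV=1",
]


def diff_patterns(old, new, headers):
    """Yield (header, old_groups, new_groups) where the two patterns disagree."""
    for header in headers:
        old_match = old.search(header)
        new_match = new.search(header)
        # None marks a header that the pattern failed to parse at all.
        old_groups = old_match.groupdict() if old_match else None
        new_groups = new_match.groupdict() if new_match else None
        if old_groups != new_groups:
            yield header, old_groups, new_groups


for header, before, after in diff_patterns(OLD, NEW, HEADERS):
    print(header, before, after, sep="\n  ")
```

Running it over every header line of a real FASTA file would list both the entries the new pattern fixes and any entry whose parsed groups change unexpectedly.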
I have found some proteins in UniProt that have a space in the gene name, for example, O64332.
The regex currently used for parsing the header fails to parse the OS, OX, and name groups for such entries. A possible solution is to change it to:

This one parses the problematic entries correctly, but I am not sure whether it will introduce other problems. It would be nice if you looked at it as well. You know... I got 99 problems, so I used regex and now I have 100.
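For illustration only, here is a minimal Python sketch of one way to tolerate spaces inside a value: each `KEY=value` field is read lazily up to the next known key or the end of the line. The pattern, the key list, the function name `parse_fields` and the sample header are assumptions made for this example; they are not the regex used in the project or the one proposed in this issue.

```python
import re

# Assumed set of two-letter keys found in UniProt FASTA headers.
KEYS = "OS|OX|GN|PE|SV"

# Hypothetical pattern: a key, "=", then the shortest text that is followed
# by whitespace plus another known key, or by the end of the header.
FIELD = re.compile(rf"\b({KEYS})=(.+?)(?=\s+(?:{KEYS})=|\s*$)")


def parse_fields(header: str) -> dict[str, str]:
    """Return the key=value fields of a header; values may contain spaces."""
    return {key: value for key, value in FIELD.findall(header)}


# Made-up header with a space in the gene name.
sample = "sp|P00000|TEST_HUMAN Made-up protein OS=Homo sapiens OX=9606 GN=ABC 1 PE=1 SV=2"
print(parse_fields(sample))
```

Anchoring each value on the next known key, rather than on the next space, is what keeps a gene name such as `ABC 1` in one piece. The trade-off is that a value containing text that looks like another key would be split there.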