levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

Hard-code UniRef keys, remove extra fields created by str.split() #102

Closed levitsky closed 1 year ago

levitsky commented 1 year ago

This PR changes the UniRef header parser in the same way that #93 changed the UniProt one: it now recognizes only the keys described in the UniProt specification for UniRef headers.
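To illustrate what "hard-coded keys" means here, below is a minimal sketch. It is not the code from this PR. It assumes the documented UniRef header layout (`>UniqueIdentifier ClusterName n=Members Tax=TaxonName TaxID=TaxonIdentifier RepID=RepresentativeMember`). The function name `parse_uniref_header` and the output key names are made up for the example.

```python
import re

# Illustrative sketch only, not the pyteomics implementation.
# Only the keys named in the UniRef header layout (n, Tax, TaxID, RepID)
# are recognized; the group names below are hypothetical.
UNIREF_HEADER = re.compile(
    r'^(?P<UniqueIdentifier>\S+)\s+(?P<ClusterName>.*?)'
    r'\s+n=(?P<n>\d+)'
    r'\s+Tax=(?P<Tax>.*?)'
    r'\s+TaxID=(?P<TaxID>\S+)'
    r'\s+RepID=(?P<RepID>\S+)\s*$'
)


def parse_uniref_header(header):
    """Parse a UniRef FASTA header into a dict with a fixed set of keys."""
    match = UNIREF_HEADER.match(header.lstrip('>'))
    if match is None:
        raise ValueError('Not a recognized UniRef header: {!r}'.format(header))
    info = match.groupdict()
    info['n'] = int(info['n'])  # number of cluster members
    return info


example = ('>UniRef50_Q9K794 Putative AgrB-like protein '
           'n=2 Tax=Bacillus TaxID=1386 RepID=AGRB_BACHD')
parsed = parse_uniref_header(example)
```

Because the pattern names every key explicitly, the output always has the same six keys, and identifier values such as `UniRef50_Q9K794` are kept whole rather than split.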

In addition to changing the pattern, it removes the extra keys and values that were produced by splitting the values of UniqueIdentifier and RepresentativeMember on underscores (_). That splitting was raising errors on the UniRef database downloaded from uniprot.org, so apparently the parser wasn't getting much use, and the change shouldn't affect anyone.

However, alternatives can still be discussed, such as providing the extra keys only when splitting succeeds. That would cause errors in user code whenever a key is unexpectedly missing, whereas with the currently proposed change these keys are consistently absent from the output.
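The trade-off can be sketched roughly as follows. This is a hypothetical illustration, not code from this PR or from pyteomics. The helper `split_if_possible`, the extra key names `rep_name` and `rep_species`, and the second sample value are all invented for the example.

```python
def split_if_possible(info, key, names):
    """Add the extra keys only when info[key] splits into len(names) parts.

    Hypothetical helper showing the "conditional extra keys" alternative.
    """
    parts = info[key].split('_')
    if len(parts) == len(names):
        info.update(zip(names, parts))
    return info


extra = ('rep_name', 'rep_species')  # made-up key names

# A value with exactly one underscore yields the extra keys...
ok = split_if_possible({'RepID': 'AGRB_BACHD'}, 'RepID', extra)

# ...while a value without an underscore (invented sample) silently does not.
odd = split_if_possible({'RepID': 'UPI0000000001'}, 'RepID', extra)

# User code that relies on the extra key then fails only for some entries.
try:
    odd['rep_species']
    missing = False
except KeyError:
    missing = True
```

With the change proposed in this PR, neither dict would carry the extra keys, so user code would behave the same for every entry.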