Update to regex for Uniprot Fasta header parser

levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.

http://pyteomics.readthedocs.io

Apache License 2.0


Update to regex for Uniprot Fasta header parser #75

Closed: hguturu closed this 1 year ago

hguturu commented 2 years ago

https://github.com/levitsky/pyteomics/blob/0a49c98316d92763b630ed371aa3e53d1ce5fec0/pyteomics/fasta.py#L405

The above regex fails when the entry name contains a dash, e.g. header = ">sp|O00453-10|LST1-10_HUMAN Isoform of O00453, Isoform 10 of Leukocyte-specific transcript 1 protein OS=Homo sapiens OX=9606 GN=LST1 PE=1 SV=2"

Here LST1-10_HUMAN won't match the (\w+)\s+([^=]*\S) part of the pattern, so that part should be changed to ([-\w]+)\s+([^=]*\S).

Suggested fix: header_pattern = r'^(\w+)\|([-\w]+)\|([-\w]+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'
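The difference between the two patterns can be checked with the standard `re` module alone. This is a minimal sketch, not the library's code: the `ORIGINAL` pattern is reconstructed from the description above (the suggested fix with the entry-name group reverted to `(\w+)`), and the leading `>` is dropped from the header on the assumption that the parser only sees the header text.

```python
import re

# Assumed original pattern, reconstructed from this thread rather than
# copied from fasta.py: the entry-name group (third group) is (\w+).
ORIGINAL = r'^(\w+)\|([-\w]+)\|(\w+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'
# Suggested fix: the entry-name group also accepts hyphens.
RELAXED = r'^(\w+)\|([-\w]+)\|([-\w]+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'

# Header from the report, without the leading '>' (assumption, see above).
header = ('sp|O00453-10|LST1-10_HUMAN Isoform of O00453, Isoform 10 of '
          'Leukocyte-specific transcript 1 protein OS=Homo sapiens OX=9606 '
          'GN=LST1 PE=1 SV=2')

original_match = re.match(ORIGINAL, header)
relaxed_match = re.match(RELAXED, header)

print(original_match is None)   # True: (\w+) stops at the hyphen in LST1-10_HUMAN
print(relaxed_match.group(3))   # LST1-10_HUMAN
```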

Let me know if this is a valid change or if the above name is not part of Uniprot header spec. I can submit a pull request if this makes sense and this is the right fix. @AmirAlavi as FYI.

levitsky commented 2 years ago

Hi @hguturu, There has been a very similar question on the mailing list. I mentioned there that the pattern was written in accordance with the spec which says entry names do not contain hyphens. May I ask if these records come from some software/resource or do you modify the records on your own? I am cautious about relaxing the pattern without a good reason as it may match something that it's not supposed to, unless this is a common use case that I don't know of. However, there is an easy way to get the behavior you need by subclassing the parser, as mentioned in my response on the mailing list, e.g.:

from pyteomics import fasta
class MyIndexedUniProt(fasta.IndexedUniProt):
    # Entry-name group (third group) relaxed to accept hyphens.
    header_pattern = r'^(\w+)\|([-\w]+)\|([-\w]+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'

and:

human_fasta = MyIndexedUniProt('HUMAN.fasta')

hguturu commented 1 year ago

Hi @levitsky, sorry for the delay. I was trying to chase down the origin of this header. It can be found in a UniProt file (see https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz). This file is linked from https://www.uniprot.org/help/downloads, in the UniProtKB table, as the "Isoform sequences" resource.

levitsky commented 1 year ago

Wow, thank you for the effort! I've downloaded the file and it appears that the entry names have been changed and now align to the spec. I don't see any entry names containing hyphens in the file. For example, the entry for the isoform O00453-10 now looks like this:

>sp|O00453-10|LST1_HUMAN Isoform 10 of Leukocyte-specific transcript 1 protein OS=Homo sapiens OX=9606 GN=LST1

so the specific isoform is no longer reflected in the entry name.
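A check like the one described above (looking for entry names that contain hyphens) could be sketched as follows. The helper name `hyphenated_entry_names` and the `ENTRY_NAME` regex are hypothetical, written for illustration only; the sample sequence line is a placeholder, not a real sequence.

```python
import re

# Hypothetical helper regex: capture the third '|'-separated field
# (the entry name) of a UniProt-style FASTA header line.
ENTRY_NAME = re.compile(r'^>\w+\|[^|]+\|(\S+)')

def hyphenated_entry_names(lines):
    """Return the entry names that contain a hyphen, in input order."""
    found = []
    for line in lines:
        m = ENTRY_NAME.match(line)
        if m and '-' in m.group(1):
            found.append(m.group(1))
    return found

sample = [
    # Header as it appears in the current file (hyphen only in the accession).
    '>sp|O00453-10|LST1_HUMAN Isoform 10 of Leukocyte-specific transcript 1 '
    'protein OS=Homo sapiens OX=9606 GN=LST1',
    'MSEQ',  # placeholder sequence line, skipped because it has no '>'
    # Header from the original report (hyphen in the entry name).
    '>sp|O00453-10|LST1-10_HUMAN Isoform of O00453, Isoform 10 of '
    'Leukocyte-specific transcript 1 protein OS=Homo sapiens OX=9606 '
    'GN=LST1 PE=1 SV=2',
]

print(hyphenated_entry_names(sample))  # ['LST1-10_HUMAN']
```

For a gzipped download, the same function should work on a file object opened with `gzip.open(path, 'rt')`, where `path` is wherever the file was saved.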

This leaves me kind of on the fence because, on the one hand, the spec is clearly being followed and there should not be this kind of issue in the future, but on the other hand, the currently supported patterns are distinct enough that there should be no harm in relaxing this one.

hguturu commented 1 year ago

Agreed, it is probably best to stick with the spec. I will update if I see this happening elsewhere. If it's frequent enough, then it might be worth revisiting.

hguturu commented 1 year ago

Hi, I am back and found these IDs in an official UniProt release. I think I was mixing up the files before since there are so many similar ones.

See https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606_additional.fasta.gz. It even has >sp|O00453-10|LST1_HUMAN Isoform of O00453, Isoform 10 of Leukocyte-specific transcript 1 protein OS=Homo sapiens OX=9606 GN=LST1.

And I noticed they even reference such naming in https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/README under the section Protein FASTA files (*.fasta and *_additional.fasta), so looks like it is a convention they use and not just a one off/bug.

So it might be worthwhile updating the regex to handle this case.

levitsky commented 1 year ago

Hi @hguturu, I think this is an example of valid syntax where dashes are present in the accession identifier (for isoforms), not in the "entry name". The current regex supports that, and the IndexedUniProt parser seems to index the file without issues, and I don't see any dashes in entry names in the file.
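The distinction between the two fields can be illustrated with the header quoted from the `_additional.fasta` file. This is a sketch under the assumption that the current pattern is the one reconstructed from this thread (hyphens allowed in the accession group, `\w+` for the entry name); it is not copied from the library, and the leading `>` is dropped.

```python
import re

# Assumed current pattern, reconstructed from this thread:
# group 2 (accession) allows hyphens, group 3 (entry name) does not.
CURRENT = r'^(\w+)\|([-\w]+)\|(\w+)\s+([^=]*\S)((\s+\w+=[^=]+(?!\w*=))+)\s*$'

# Header quoted earlier in the thread, without the leading '>'.
header = ('sp|O00453-10|LST1_HUMAN Isoform of O00453, Isoform 10 of '
          'Leukocyte-specific transcript 1 protein OS=Homo sapiens '
          'OX=9606 GN=LST1')

m = re.match(CURRENT, header)
accession, entry_name = m.group(2), m.group(3)

print(accession)   # O00453-10  (hyphen is in the accession, which is allowed)
print(entry_name)  # LST1_HUMAN (no hyphen, so (\w+) is sufficient)
```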

hguturu commented 1 year ago

Oh, you are right. I think this mystery file has been bothering me and I have forgotten where I got it and now it appears I am forgetting the original issue was regarding the entry name.