levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0

New problem with parsing FASTA files #136

Closed tristanbrown closed 6 months ago

tristanbrown commented 6 months ago

After the release of https://github.com/levitsky/pyteomics/pull/120 in v4.6.2, I now get the following error traceback when trying to parse FASTA files:

     99 with _get_filesystem(fasta_uri).open(fasta_uri, "r") as fastafile:
    100     results = []
--> 101     for description, sequence in MyUniProt(fastafile):
    102         description['sequence'] = sequence
    103         results.append(description)

File /opt/conda/lib/python3.8/site-packages/pyteomics/auxiliary/file_helpers.py:178, in IteratorContextManager.__next__(self)
    176 def __next__(self):
    177     # try:
--> 178     return next(self._reader)

File /opt/conda/lib/python3.8/site-packages/pyteomics/fasta.py:232, in FASTA._read(self)
    230     sequence = sequence[:-1]
    231 if self.parser is not None:
--> 232     description = self.parser(description)
    233 yield Protein(description, sequence)
    234 accumulated_strings = [stripped_string[1:]]

File /opt/conda/lib/python3.8/site-packages/pyteomics/fasta.py:144, in _add_raw_field.<locals>._new_parser(instance, descr)
    142     parsed[RAW_HEADER_KEY] = descr
    143 else:
--> 144     raise aux.PyteomicsError('Cannot save raw protein header, since the corresponsing'
    145                             'key ({}) already exists.'.format(RAW_HEADER_KEY))
    146 return parsed

PyteomicsError: Pyteomics error, message: 'Cannot save raw protein header, since the corresponsingkey (__raw__) already exists.'

MyUniProt is just a custom parser with a more robust regex pattern:

import logging

from pyteomics import fasta

_logger = logging.getLogger(__name__)


class MyUniProt(fasta.UniProt):
    """Redefine the header-parsing pattern to tolerate '-' in the entry field."""

    header_pattern = r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>[-\w]+)\s+(?P<name>.*?)(?:(\s+OS=(?P<OS>[^=]+))|(\s+OX=(?P<OX>\d+))|(\s+GN=(?P<GN>\S+))|(\s+PE=(?P<PE>\d))|(\s+SV=(?P<SV>\d+)))*\s*$'

    def parser(self, header):
        """
        Catch errors when parsing a header and return a simpler dict; this allows
        parsing FASTAs where not all entries are in a valid UniProt format.
        """
        try:
            return fasta.UniProt.parser(self, header)
        except Exception:
            # Fall back to a minimal description instead of aborting the whole file.
            _logger.warning("Error parsing header: %s", header, exc_info=True)
            return {
                "id": header,
                "entry": header,
            }

This parsing works without a problem in v4.6.1.
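
For reference, the relaxed header pattern can be checked in isolation with the standard `re` module. This is only a sketch: the sample header is made up for illustration and is not taken from a real UniProt entry.

```python
import re

# Pattern copied from MyUniProt.header_pattern above.
HEADER_PATTERN = (
    r'^(?P<db>\w+)\|(?P<id>[-\w]+)\|(?P<entry>[-\w]+)\s+(?P<name>.*?)'
    r'(?:(\s+OS=(?P<OS>[^=]+))|(\s+OX=(?P<OX>\d+))|(\s+GN=(?P<GN>\S+))'
    r'|(\s+PE=(?P<PE>\d))|(\s+SV=(?P<SV>\d+)))*\s*$'
)

# Hypothetical header with '-' in both the id and the entry fields.
sample_header = (
    "sp|P12345-2|TEST-1_HUMAN Example protein "
    "OS=Homo sapiens OX=9606 GN=ABC1 PE=1 SV=2"
)

match = re.match(HEADER_PATTERN, sample_header)
if match is None:
    print("header did not match")
else:
    # Show only the fields that were actually captured.
    print({k: v for k, v in match.groupdict().items() if v is not None})
```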

mobiusklein commented 6 months ago

I see, the parent method is already wrapped, but the metaclass cannot tell, so it tries to wrap it again. The interim solution is to remove fasta.RAW_HEADER_KEY from the return value of fasta.UniProt.parser(self, header) before returning it.
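
A minimal sketch of that interim workaround, assuming the key is the `__raw__` string shown in the error message (available as `fasta.RAW_HEADER_KEY` in pyteomics). `strip_raw_header` is a hypothetical helper name, not part of the library.

```python
# Value taken from the error message in the traceback; in real code,
# prefer fasta.RAW_HEADER_KEY over a hard-coded string.
RAW_HEADER_KEY = "__raw__"


def strip_raw_header(parsed, raw_key=RAW_HEADER_KEY):
    """Return a copy of the parsed description without the raw-header key."""
    cleaned = dict(parsed)
    cleaned.pop(raw_key, None)  # no error if the key is absent
    return cleaned


# Hypothetical use inside MyUniProt.parser:
#     return strip_raw_header(fasta.UniProt.parser(self, header))
already_wrapped = {"id": "P12345", "entry": "TEST_HUMAN", "__raw__": "sp|P12345|TEST_HUMAN ..."}
print(strip_raw_header(already_wrapped))
```

With the key removed, the wrapper applied to the subclass's `parser` can add the raw header itself without finding it already set.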

The longer-term solution would be to modify the check in the _add_raw_field wrapper so that it does not raise an error when the fasta.RAW_HEADER_KEY key is already present and its value is the same as the string we would otherwise assign to it.
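
The relaxed check could look roughly like the following self-contained sketch. It is not the code from the actual fix: `add_raw_field`, `Base` and `Child` are stand-ins written for illustration, and `ValueError` stands in for `aux.PyteomicsError`.

```python
import functools

RAW_HEADER_KEY = "__raw__"  # same value as in the error message


def add_raw_field(parser):
    """Stand-in for the wrapper: store the raw header, tolerating an identical existing value."""
    @functools.wraps(parser)
    def _new_parser(instance, descr):
        parsed = parser(instance, descr)
        if RAW_HEADER_KEY not in parsed:
            parsed[RAW_HEADER_KEY] = descr
        elif parsed[RAW_HEADER_KEY] != descr:
            # Only a conflicting value is an error.
            raise ValueError(
                "Cannot save raw protein header, since the corresponding "
                "key ({}) already exists.".format(RAW_HEADER_KEY))
        return parsed
    return _new_parser


class Base:
    @add_raw_field
    def parser(self, header):
        return {"id": header.split("|")[1]}


class Child(Base):
    @add_raw_field
    def parser(self, header):
        # The parent method is already wrapped, so its result carries the raw header.
        return Base.parser(self, header)


result = Child().parser("sp|P12345|TEST_HUMAN")
print(result)
```

Here the double wrapping is harmless because the second wrapper sees the same string it would have stored anyway.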

levitsky commented 6 months ago

Thank you @tristanbrown for reporting and @mobiusklein for your suggestion. I tried implementing it in https://github.com/levitsky/pyteomics/commit/f9d7f7c83d351147a84df3d8c7923a9cfda397f5.

@tristanbrown could you try the latest master and see if it works for you?

tristanbrown commented 6 months ago

@levitsky Yes, the latest master branch fixes my issue. Thanks!