levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

Add raw fasta file header to parsers #120

Closed caetera closed 10 months ago

caetera commented 10 months ago

This PR adds an entry to FASTA file parsers that contains raw (i.e. non-parsed header). The latter is used by the write to write parsed protein entries.

Useful, for the fast filtering of FASTA files, for example:

from pyteomics import fasta
fasta_entries = fasta.UniProt('uniprot_multi_organism.fasta')
fasta.write([protein for protein in fasta_entries if protein.description.get('OX', -1) == 123], 'uniprot_single_organism.fasta')
levitsky commented 10 months ago

Thank you, this looks great. I made a couple of minor changes, if you don't mind.

I also looked into adding this via a metaclass so that this is injected in every FlavoredMixin subclass automatically, but do we have a universal way to add a proper metaclass? It looks like the add_subclass add_metaclass routine actually modifies the class itself and does not affect its descendants, unless I'm missing something, which is totally possible since I don't deal with metaclasses often.

mobiusklein commented 10 months ago

@levitsky I'm not sure what the add_subclass name you're referring to is. By "universal way to add a proper metaclass", do you mean syntactically to deal with the differences between Py2 and Py3? There's six.add_metaclass, which produces functionally the same thing.

levitsky commented 10 months ago

@mobiusklein thanks for chiming in. I mean the function auxiliary.utils.add_metaclass, and yes, I was referring to a way to make it work on Py2 and Py3. We've been avoiding adding six as a dependency and I thought utils.add_subclass was functionally equivalent?

mobiusklein commented 10 months ago

Yes, pyteomics.auxiliary.utils.add_metaclass is functionally equivalent to six.add_metaclass. I forgot I added that (probably when I wrote the mztab module). It does affect subclasses, because the object bound to the name of the class you decorate isn't the type you wrote out syntactically. add_metaclass creates a copy of that type object's internal data structures, and then creates a new type object that is an instance of the metaclass instead of the built-in type.

So to accomplish what you're trying to do with _add_raw_field with a metaclass, you'd write something like this:

class _FastaParserFlavorMeta(type):
    def __new__(mcs, name, bases, namespace):
        if "parse" in namespace:
            namespace["parse"] = _add_raw_field(namespace["parse"])
        return super().__new__(mcs, name, bases, namespace)

@add_metaclass(_FastaParserFlavorMeta)
class FlavorMixin(...):
    ...

Here, namespace is a dict of all the names bound to the class definition, e.g. methods, class attributes, annotations, docstring, and a few other things. If the class defines a parse method, it'll be in the namespace. Otherwise, that'll be resolved by the usual MRO from bases. If every class you create that defines a parse method is an instance of _FastaParserFlavorMeta, they'll always have this wrapped in _add_raw_field. If you were to use a mixin type that is not a _FastaParserFlavorMeta to inject your parse method won't be wrapped unless you explicitly write a forwarding method like so:

class ExternalParseMixin:
    def parse(self, *args, **kwargs):
        ...

# Inherit `_FastaParserFlavorMeta` from `FlavorMixin`, 
class NewFlavorMixin(FlavorMixin, ExternalParseMixin):
    def parse(self, *args, **kwargs):
        return ExternalParseMixin.parse(self, *args, **kwargs)

There are a few other things we did to the FASTA parsers over the years we might be able to make less repetitive with a metaclass, like _add_init, but I'd need to re-read it to remember what problem was.

levitsky commented 10 months ago

@mobiusklein Thank you for clearing this up. What I was apparently missing in regard to the effect of metaclasses on the subclasses is that it's only there if the metaclass returns an instance of a type other than type. I didn't realize this distinction was important, seeing the examples of metaclasses that just used type() to return an altered class definition.

For instance, this metaclass:

class NotGreatMeta(type):
    def __new__(cls, name, bases, attrs):
        print('called for', name)
        return type(name, bases, attrs)

does not print anything for a subclass of a class, but this does:

class SeriousMeta(type):
    def __new__(cls, name, bases, attrs):
        print('called for', name)
        return super().__new__(cls, name, bases, attrs)

while both work for the very class to which they are applied. The reason is, apparently, that the fist one produces instances of type and the second one produces instances of SeriousMeta.


Now, trying to apply your example, I am running into another issue:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 from pyteomics import fasta

File ~/py/pyteomics/pyteomics/fasta.py:516
    511         _intify(info, ('n',))
    512         return info
    515 @_add_init
--> 516 class UniRef(UniRefMixin, FASTA):
    517     pass
    520 @_add_init
    521 class IndexedUniRef(UniRefMixin, TwoLayerIndexedFASTA):

TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

This happens with all the concrete classes that inherit both from FlavoredMixin and from FASTA or IndexedFASTA, as both of the latter are subclasses of FileReader, which in turn has abc.ABCMeta for metaclass. Apparently, since _FastaParserFlavorMeta is not a subclass of abc.ABCMeta, we have a metaclass conflict. As far as I can tell, ABCMeta was added to FileReader in order to register MzMLb as a subclass. Would you have any suggestions on resolving this? I imagine we could make _FastaParserFlavorMeta inherit from abc.ABCMeta, unless it has any undesirable consequences?

mobiusklein commented 10 months ago

It's safe to derive _FastaParserFlavorMeta from ABCMeta, with the caveat that if you decorate parser with abstractmethod, I think the current trick of patching the namespace will cause trouble, though this can be countered by only running that code when the type is not the base FlavorMixin.

Since that's not currently the design, that shouldn't be an issue, although semantically parser is treated like an abstract method, we just don't enforce it, nor do we need to make things more complicated.

levitsky commented 10 months ago

Thank you all! Looks ready to me.