levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

Parsing mztab 2.0 sections in `MzTab` #22

Closed annalefarova closed 3 years ago

annalefarova commented 3 years ago

Class MzTab was extended to parse sections from mztab 2.0. The following sections were added: "small molecule feature table" and "small molecule evidence table".

mobiusklein commented 3 years ago

It looks like the __iter__ behavior being tested in tests/test_mztab.py needs to be updated:

    def test_iter(self):
        reader = mztab.MzTab(self.path)
        tables = list(reader)
        self.assertEqual(len(tables), 4)
        [self.assertEqual(len(t), 2) for t in tables]

Our test mzTab file doesn't have the small molecule tables in it, but the iteration behavior is now including those two in it. Is it possible to force the parser class to detect that given the schema version in the file metadata that a table should be present or not and so decide whether or not to yield it?

annalefarova commented 3 years ago

@mobiusklein, I've decided to follow the logic of the previous contributors, namely you always return all possible tables, but those not present in a file are empty. Now, for the two formats case, that means that if you are reading mztab1 you will get two additional empty tables. For the mztab2 you will get empty tables that are only present in mztab1. What do you think about that?

mobiusklein commented 3 years ago

I think the behavior required to parse a new format should not change the behavior of the parser of an old format. This kind of backwards incompatible change is undesirable.

You can determine which schema version you're dealing with by looking at mztab.MzTab(path).metadata['mzTab-version'].

I'd prefer to see a solution that does something like this:

import re

def determine_schema_version(self):
    version_raw = self.metadata['mzTab-version']
    version_parsed, variant = re.search(r"(?P<schema_version>\d+.\d+.\d+)(?:-(?P<schema_variant>[MP]))?", ver).groups()
    if variant is None:
        variant = "P"
   version_parsed = [int(v) for v in version_parsed.split(".")]
   self.version = version_parsed
   self.variant = variant
   return version_parsed, variant

and then use this to decide which tables should be yielded in __iter__. That way mzTab-1.0.0 files behave identically to the way they were before and new mzTab-M files don't waste time showing the user unexpected and empty tables either. This could just be done by checking self.variant in __iter__.

mobiusklein commented 3 years ago

Sorry, forgot to explicitly say the negation in the prologue. We don't want a new format to break an old format.

annalefarova commented 3 years ago

Thanks for your comments. I've added the function that you've proposed to the MzTab class. However, I've decided not to overwrite self.version property just in case smb. relies on it being a string. I keep it in the additional num_version property. Now __iter__ returns tables depending on variant of mztab file.

levitsky commented 3 years ago

Thank you so much @annalefarova for this contribution and @mobiusklein for your comments! @mobiusklein, are you happy with the current implementation? Further feedback is very much welcome.

A couple of questions I have after a cursory look at this:

  1. mode and type properties are apparently not supported with mzTab-M, corresponding metadata entries are not listed in the spec. Should we check for variant in the property methods and return None for M or do some other error handling? (I think we probably should.)
  2. Should we expose other metadata as properties for mzTab-M? Like mzTab-ID or title? (I feel like it wouldn't hurt. By analogy, it would be None for mzTab 1.0)
  3. Should we maybe expose description for both variants, as it is common? For 1.0 it is also mandatory. (Again, I don't see why not.)
  4. I would prefer having a test for the new mzTab variant as well. Should we just grab the first example from HUPO-PSI/mzTab?
mobiusklein commented 3 years ago

mode and type properties are apparently not supported with mzTab-M, corresponding metadata entries are not listed in the spec. Should we check for variant in the property methods and return None for M or do some other error handling? (I think we probably should.)

This sounds reasonable. The type dichotomy was born from the separation of ID and quantification in proteomics, but the same signals are often used for both in metabolomics.

Should we expose other metadata as properties for mzTab-M? Like mzTab-ID or title? (I feel like it wouldn't hurt. By analogy, it would be None for mzTab 1.0)

Those are actually optional but permitted in mzTab 1.0 for both types, so they are just self.metadata.get instead of needing to do complicated error handling.

Should we maybe expose description for both variants, as it is common? For 1.0 it is also mandatory. _(Again, I don't see why not.)

It's reasonable if @annalefarova wants to do it.

Note - This next bit is just for fun.

To avoid the proliferation of repetitive property code , we could add six and simplify matters with some metaprogramming.

from six import add_metaclass

class MetadataBackedProperty(object):
    '''Our descriptor type which uses the instance's metadata attribute to carry its values'''
    def __init__(self, name, variant_required=None):
        self.name = name
        self.variant_required = variant_required

    def __repr__(self):
        return "{self.__class__.__name__}(name={self.name!r}, variant_required={self.variant_required})".format(self=self)

    def __get__(self, obj, objtype=None):
        if obj is None and objtype is not None:
            # So the property can be seen for what it is
            return self
        value = obj.metadata.get(self.name)
        if value is None and self.variant_required and obj.variant in self.variant_required:
            raise AttributeError("{0} is missing from a mzTab-\"{1}\" document where it is required!".format(
                self.name, obj.variant))
        return value

    def __set__(self, obj, value):
        obj.metadata[self.name] = value

    def __delete__(self, obj):
        del obj.metadata[self.name]

class MetadataPropertyAnnotator(type):
    '''A simple metaclass to do some class-creation time introspection
    and descriptor binding.

    Uses a list of strings or 3-tuples from :attr:`__metadata_properties__` to
    bind :class:`MetadataBackedProperty` onto the class during its creation.
    '''
    def __new__(mcls, name, bases, attrs):
        props = attrs.get('__metadata_properties__', [])
        # Gather from parent classes so we can use inheritance for overriding this
        # behavior too.
        for base in bases:
            props.extend(getattr(base, '__metadata_properties__', []))
        for prop in props:
            if isinstance(prop, str):
                prop_name = prop
                attr_name = prop_name.replace("mzTab-", '').replace('-', '_').lower()
                variant_required = None
            else:
                prop_name, attr_name, variant_required = prop
            # attach the property descriptor to the class definition
            attrs[attr_name] = MetadataBackedProperty(
                prop_name, variant_required=variant_required)
        return super(MetadataPropertyAnnotator, mcls).__new__(mcls, name, bases, attrs)

And then we can write the following:

@add_metaclass(MetadataPropertyAnnotator)
class MzTab(_MzTabParserBase):
    ...
    __metadata_properties__ = [
        ('mzTab-version', 'version', 'MP'),
        ('mzTab-mode', 'mode', 'P'),
        ('mzTab-type', 'type', 'P'),
        ('mzTab-ID', 'id', 'M'),
        'title',
        'description',
    ]

Presto, version, mode, type, id, title, and description properties are created for us automatically, with getters and validation by variant.

We could also go even further and gather the various collections and complex objects in the file metadata (sample_processing, instrument, software, e.t.c.) with collapse_properties and expose them with properties too, but this is just a sign that the metaprogramming poison is in too deep. Another time, maybe.

I would prefer having a test for the new mzTab variant as well. Should we just grab the first example from HUPO-PSI/mzTab?

A good idea. A simple test of whether the variant detection code is properly altering __iter__'s behavior.

mobiusklein commented 3 years ago

The new tests and behaviors look good to me. Thank you for implementing them.

annalefarova commented 3 years ago

Thank you @mobiusklein @levitsky for a lot of interesting information. What would you say, if for this pr we will stick to the current design and this additional functionality, that I could already use in my project? As for general refactoring, I would try to play with it, but metaclass is something new for me and I need more time. I would propose to make it with follow-up pr.

mobiusklein commented 3 years ago

As I said, the metaprogramming idea was just for fun. Thank you for expanding the scope of the work you were willing to do to complete the PR's requirements. I'm happy with the changes.

Metaclasses are an esoteric part of Python that don't get used often, so when I thought of this as a chance to showcase them, it seemed like a good idea. There's more mzTab-specific collection folding code that I think is needed for them to really shine though, so I can address that later.

levitsky commented 3 years ago

Thank you all again. I've made some minor changes and as long as everyone is OK with it, I'm ready to merge.

What I changed was to remove the empty string as substitute for missing metadata and add a couple more properties I mentioned in a previous comment.

mobiusklein commented 3 years ago

Good by me. I'm going to re-read the two specs this weekend and generate some fancier method-generating code based on the metaclass example I showed earlier (Inlining the six.add_metaclass so we don't have to add a dependency).

annalefarova commented 3 years ago

Thanks for your help. I would like to merge then. By the way, the newbie question, after the branch is merged, when the changes would be available at pypi?

levitsky commented 3 years ago

@mobiusklein thank you!

@annalefarova the changes are not available on PyPI until I make a new release and upload it. I usually only do it when I've finished everything that I have started working on, or when a serious issue is fixed; however, it's no problem to make a new release early if it makes things easier for you.

annalefarova commented 3 years ago

@levitsky Of course I understand, that it can take time. I'm just wondering, maybe you can say approximately how long might it take.

mobiusklein commented 3 years ago

@annalefarova I will be leaving variant in-place.

levitsky commented 3 years ago

@annalefarova I added a test to ensure the variant property is available, and made a new release.

annalefarova commented 3 years ago

That was fast =) Thanks a lot