Closed annalefarova closed 3 years ago
It looks like the __iter__
behavior being tested in tests/test_mztab.py
needs to be updated:
def test_iter(self):
reader = mztab.MzTab(self.path)
tables = list(reader)
self.assertEqual(len(tables), 4)
[self.assertEqual(len(t), 2) for t in tables]
Our test mzTab file doesn't have the small molecule tables in it, but the iteration behavior is now including those two in it. Is it possible to force the parser class to detect that given the schema version in the file metadata that a table should be present or not and so decide whether or not to yield it?
@mobiusklein, I've decided to follow the logic of the previous contributors, namely you always return all possible tables, but those not present in a file are empty. Now, for the two formats case, that means that if you are reading mztab1 you will get two additional empty tables. For the mztab2 you will get empty tables that are only present in mztab1. What do you think about that?
I think the behavior required to parse a new format should not change the behavior of the parser of an old format. This kind of backwards incompatible change is undesirable.
You can determine which schema version you're dealing with by looking at mztab.MzTab(path).metadata['mzTab-version']
.
I'd prefer to see a solution that does something like this:
import re
def determine_schema_version(self):
version_raw = self.metadata['mzTab-version']
version_parsed, variant = re.search(r"(?P<schema_version>\d+.\d+.\d+)(?:-(?P<schema_variant>[MP]))?", ver).groups()
if variant is None:
variant = "P"
version_parsed = [int(v) for v in version_parsed.split(".")]
self.version = version_parsed
self.variant = variant
return version_parsed, variant
and then use this to decide which tables should be yielded in __iter__
. That way mzTab-1.0.0 files behave identically to the way they were before and new mzTab-M files don't waste time showing the user unexpected and empty tables either. This could just be done by checking self.variant
in __iter__
.
Sorry, forgot to explicitly say the negation in the prologue. We don't want a new format to break an old format.
Thanks for your comments. I've added the function that you've proposed to the MzTab
class. However, I've decided not to overwrite self.version
property just in case smb. relies on it being a string. I keep it in the additional num_version
property. Now __iter__
returns tables depending on variant
of mztab file.
Thank you so much @annalefarova for this contribution and @mobiusklein for your comments! @mobiusklein, are you happy with the current implementation? Further feedback is very much welcome.
A couple of questions I have after a cursory look at this:
mode
and type
properties are apparently not supported with mzTab-M, corresponding metadata entries are not listed in the spec. Should we check for variant in the property methods and return None for M
or do some other error handling? (I think we probably should.)mzTab-ID
or title
? (I feel like it wouldn't hurt. By analogy, it would be None
for mzTab 1.0)description
for both variants, as it is common? For 1.0 it is also mandatory. (Again, I don't see why not.)mode and type properties are apparently not supported with mzTab-M, corresponding metadata entries are not listed in the spec. Should we check for variant in the property methods and return None for M or do some other error handling? (I think we probably should.)
This sounds reasonable. The type dichotomy was born from the separation of ID and quantification in proteomics, but the same signals are often used for both in metabolomics.
Should we expose other metadata as properties for mzTab-M? Like mzTab-ID or title? (I feel like it wouldn't hurt. By analogy, it would be None for mzTab 1.0)
Those are actually optional but permitted in mzTab 1.0 for both types, so they are just self.metadata.get
instead of needing to do complicated error handling.
Should we maybe expose description for both variants, as it is common? For 1.0 it is also mandatory. _(Again, I don't see why not.)
It's reasonable if @annalefarova wants to do it.
To avoid the proliferation of repetitive property
code , we could add six
and simplify matters with some metaprogramming.
from six import add_metaclass
class MetadataBackedProperty(object):
'''Our descriptor type which uses the instance's metadata attribute to carry its values'''
def __init__(self, name, variant_required=None):
self.name = name
self.variant_required = variant_required
def __repr__(self):
return "{self.__class__.__name__}(name={self.name!r}, variant_required={self.variant_required})".format(self=self)
def __get__(self, obj, objtype=None):
if obj is None and objtype is not None:
# So the property can be seen for what it is
return self
value = obj.metadata.get(self.name)
if value is None and self.variant_required and obj.variant in self.variant_required:
raise AttributeError("{0} is missing from a mzTab-\"{1}\" document where it is required!".format(
self.name, obj.variant))
return value
def __set__(self, obj, value):
obj.metadata[self.name] = value
def __delete__(self, obj):
del obj.metadata[self.name]
class MetadataPropertyAnnotator(type):
'''A simple metaclass to do some class-creation time introspection
and descriptor binding.
Uses a list of strings or 3-tuples from :attr:`__metadata_properties__` to
bind :class:`MetadataBackedProperty` onto the class during its creation.
'''
def __new__(mcls, name, bases, attrs):
props = attrs.get('__metadata_properties__', [])
# Gather from parent classes so we can use inheritance for overriding this
# behavior too.
for base in bases:
props.extend(getattr(base, '__metadata_properties__', []))
for prop in props:
if isinstance(prop, str):
prop_name = prop
attr_name = prop_name.replace("mzTab-", '').replace('-', '_').lower()
variant_required = None
else:
prop_name, attr_name, variant_required = prop
# attach the property descriptor to the class definition
attrs[attr_name] = MetadataBackedProperty(
prop_name, variant_required=variant_required)
return super(MetadataPropertyAnnotator, mcls).__new__(mcls, name, bases, attrs)
And then we can write the following:
@add_metaclass(MetadataPropertyAnnotator)
class MzTab(_MzTabParserBase):
...
__metadata_properties__ = [
('mzTab-version', 'version', 'MP'),
('mzTab-mode', 'mode', 'P'),
('mzTab-type', 'type', 'P'),
('mzTab-ID', 'id', 'M'),
'title',
'description',
]
Presto, version
, mode
, type
, id
, title
, and description
properties are created for us automatically, with getters and validation by variant
.
We could also go even further and gather the various collections and complex objects in the file metadata (sample_processing
, instrument
, software
, e.t.c.) with collapse_properties
and expose them with properties too, but this is just a sign that the metaprogramming poison is in too deep. Another time, maybe.
I would prefer having a test for the new mzTab variant as well. Should we just grab the first example from HUPO-PSI/mzTab?
A good idea. A simple test of whether the variant detection code is properly altering __iter__
's behavior.
The new tests and behaviors look good to me. Thank you for implementing them.
Thank you @mobiusklein @levitsky for a lot of interesting information. What would you say, if for this pr we will stick to the current design and this additional functionality, that I could already use in my project? As for general refactoring, I would try to play with it, but metaclass is something new for me and I need more time. I would propose to make it with follow-up pr.
As I said, the metaprogramming idea was just for fun. Thank you for expanding the scope of the work you were willing to do to complete the PR's requirements. I'm happy with the changes.
Metaclasses are an esoteric part of Python that don't get used often, so when I thought of this as a chance to showcase them, it seemed like a good idea. There's more mzTab-specific collection folding code that I think is needed for them to really shine though, so I can address that later.
Thank you all again. I've made some minor changes and as long as everyone is OK with it, I'm ready to merge.
What I changed was to remove the empty string as substitute for missing metadata and add a couple more properties I mentioned in a previous comment.
Good by me. I'm going to re-read the two specs this weekend and generate some fancier method-generating code based on the metaclass example I showed earlier (Inlining the six.add_metaclass
so we don't have to add a dependency).
Thanks for your help. I would like to merge then. By the way, the newbie question, after the branch is merged, when the changes would be available at pypi?
@mobiusklein thank you!
@annalefarova the changes are not available on PyPI until I make a new release and upload it. I usually only do it when I've finished everything that I have started working on, or when a serious issue is fixed; however, it's no problem to make a new release early if it makes things easier for you.
@levitsky Of course I understand, that it can take time. I'm just wondering, maybe you can say approximately how long might it take.
@annalefarova I will be leaving variant
in-place.
@annalefarova I added a test to ensure the variant
property is available, and made a new release.
That was fast =) Thanks a lot
Class MzTab was extended to parse sections from mztab 2.0. The following sections were added: "small molecule feature table" and "small molecule evidence table".