levitsky / pyteomics

Pyteomics is a collection of lightweight and handy tools for Python that help to handle various sorts of proteomics data. Pyteomics provides a growing set of modules to facilitate the most common tasks in proteomics data analysis.
http://pyteomics.readthedocs.io
Apache License 2.0
105 stars 34 forks source link

Add automatic generation of metadata accessors #23

Closed mobiusklein closed 3 years ago

mobiusklein commented 3 years ago

Implements the metaclass-based property generator and collection folding logic.

There are almost enough properties generated to justify the amount of extra metaprogramming code added.

levitsky commented 3 years ago

Wow, this is a lot more (and sooner) than I expected! Thank you!

This is not easy to soak in, to be honest, but I'll do my best to work through this code to appreciate the beauty of it. It would definitely help if you could briefly demonstrate all of the magic the parser does in terms of (meta)data access, though, especially with path parsing. I want to document it so others can enjoy it as well.

One question that I've already had since the previous discussion is this: do we want to collect lists of metadata from all base classes? That way no subclasses can ever remove any metadata, is that right? I do realize that having to repeat the list in subclasses would be undesirable.

levitsky commented 3 years ago

Thank you for extending the documentation. Apparently I misunderstood the code yesterday and thought that extract_path and related code was supposed to handle some user-supplied queries, while in fact it serves for collapsing of metadata into groups of collections.

I'm ready to merge as is, but I'm curious about one thing: if we traverse all base classes to collect metadata specifications, does it mean that there is no straightforward way to remove any metadata in subclasses? This is not a practical concern, though.

mobiusklein commented 3 years ago

I've added a bit more documentation to the metaclass and added explicit override support. To prevent a metadata property from being added, you can explicitly set the name to some value or define a method/property with the same name.

class MyMzTab(mztab.MzTab):
      title = "foo"

This will not get a title property automatically generated. However, this removal isn't heritable. You can also turn off metadata property inheritance completely by setting __inherit_metadata_properties__ = False within the class.

The path parsing is something I'm not quite satisfied with since I implemented it to deal with complex collections in table headings e.g. search_engine_score[1]_ms_run[48] which after gathering would be Group(["search_engine_score", Group([(1, <x>), ..., (48, <y>), ...])]), so that the top-level group could say g.get_path("search_engine_score[1]_ms_run[48]") and get the same answer as saying g['search_engine_score'][1]['ms_run'][48], but it's doesn't mesh well with the current interface.

Furthermore, outside of using the tables in "raw" mode, nobody will get that behavior, just the existing dict or DataFrame interface on the individual tables. I've been pretty dismissive of mzTab in the past because it flattens the structural hierarchy unevenly and inconsistently. This was partially an exercise to reconstruct that hierarchy, but I lack a good example where it's compelling. I would still need to add "column processors", a system by which specific columns register as being parsed by a special function that interprets something specific about their format e.g. PSM's modifications: '6-CHEMMOD:15.9949146221, 8-CHEMMOD:15.9949146221' this column value is not really useful until it's split by lambda x: [(self._cast_value(y[0]), y[1]) for y in x.split(", ")]).

levitsky commented 3 years ago

Indeed, categorizing some data as lists and perhaps even some rich parsing, like for modifications, would be a significant enhancement. I don't have any specific applications for this, though. I guess we can keep it in mind as a direction for future development.

Thank you again for the work. Should I merge now?

mobiusklein commented 3 years ago

I was distracted and didn't document the new raft of properties. This has been added by automatically setting the __doc__ attribute of the properties (not very detailed, but no source to read this from). By the power of autodoc these get noted automatically in the Sphinx documentation.

I wasn't sure how to add this to CHANGELOG, just start a new section?

All set to merge. Probably should have made this a draft.

levitsky commented 3 years ago

The no-members directive didn't seem to work, so class definitions were duplicated in Sphinx. I took the liberty of moving all of the meaningful content into the module docstring, leaving only the boilerplate code in RST. I also added a note in the changelog.

mobiusklein commented 3 years ago

I thought I had tested that option with Sphinx 3.2.1 and it worked locally. It doesn't matter though. Thank you for adding the changelog entry. All set to merge.