[DISCUSSION] User defined modules: refining the way modules are defined?

This is a followup to #1872, but I am writing it here to start a separate discussion on this.

Background: now that anvi'o has a way to help people define their own metabolic modules, the question is how to help them do it without too much pain and suffering. Iva already started a conversation in #1872, in which she proposes a new script to make this process easier. It is indeed a very good idea to help people generate user-defined modules with software help, but I am not sure it is reasonable to expect people to type something like this on the command line with comfort (and without making mistakes):

anvi-script-gen-user-module-file -I "UD0023" \
                                 -e enzymes.txt \
                                 -n "Frankenstein pathway for demo purposes" \
                                 -c "User modules; Demo set; Frankenstein metabolism" \
                                 -d "K01657+K01658 PF06603.14,(COG1362 TIGR01709.2)"

This is all to turn something logical into the illogical form of KEGG definitions. But I think there is no advantage to KEGG's format, since it will inevitably go extinct. More specifically, I think perhaps the revolution Iva has started to democratize metabolic module definitions should have started at the level of 'how to define metabolic modules'.

What if we define a new way to describe modules that does not require defining modules as some silly text files with an arbitrary number of fields that are unfriendly to both humans and computers? What if we come up with a new format using a YAML template?

OK. Here is an example user-defined module from #1872:

ENTRY       UD0023
NAME        Frankenstein pathway for demo purposes
DEFINITION  K01657+K01658 PF06603.14,(COG1362 TIGR01709.2)
ORTHOLOGY   K01657  anthranilate synthase component I [EC:4.1.3.27]
            K01658  anthranilate synthase component II [EC:4.1.3.27]
            PF06603.14  UpxZ
            COG1362  Aspartyl aminopeptidase
            TIGR01709.2  type II secretion system protein GspL
CLASS       User modules; Demo set; Frankenstein metabolism
ANNOTATION_SOURCE  K01657  KOfam
                   K01658  KOfam
                   PF06603.14  METABOLISM_HMM
                   COG1362  COG20_FUNCTION
                   TIGR01709.2  TIGRFAM
///

The following is a YAML template that we could use to define the same thing:

(...)
UD0023:
  name: Frankenstein pathway for demo purposes
  version: 1
  author:
    name: Iva Veseli
    email: iva@veseli.org
    url: https://merenlab.org/
    citation: Veseli et al. (2022), doi:xxx
  class:
    first: User modules
    second: Demo set
    third: Frankenstein metabolism
  steps:
    - (K01657 or K01658) and PF06603.14
    - COG1362 or TIGR01709.2
  enzymes:
    K01657:
      description: anthranilate synthase component I
      source: KOfam
      EC-number: 4.1.3.27
    K01658:
      description: anthranilate synthase component II
      source: KOfam
      EC-number: 4.1.3.27
    PF06603.14:
      description: UpxZ
      source: METABOLISM_HMM
    COG1362:
      description: Aspartyl aminopeptidase
      source: COG20_FUNCTION
    TIGR01709.2:
      description: type II secretion system protein GspL
      source: TIGRFAM
(...)

Why?

We can add as much information as we want for a given module (author contact info, version, or source-specific additions to enzymes such as the Enzyme Commission numbers provided by KEGG, etc.) to make individual modules or collections of modules self-descriptive and accessible.
With a YAML template like this, sanity checking of modules would be easy, the syntax to define steps would be easy, describing the sources and accession ids of each enzyme would be easy, documenting the format would be easy, and so on.
It is not only more human-readable and editable, but also more parsable. Any programming language can parse YAML files. It would also make visualization and automatic module generation easier.
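
To make the sanity-checking point concrete, here is a minimal sketch of what a checker could look like once a module has been loaded into a Python dict. The function name `check_module`, the list of required fields, and the rules themselves are assumptions made up for illustration; they are not part of anvi'o or of any agreed-upon format.

```python
import re

# Assumed rules for illustration only: which fields are required is not decided yet.
REQUIRED_FIELDS = ("name", "steps", "enzymes")
OPERATORS = {"(", ")", "and", "or"}
TOKEN = re.compile(r"[()]|[^\s()]+")  # parentheses, or any run of other characters


def check_module(module_id, module):
    """Return a list of human-readable problems found in one module definition."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in module:
            problems.append(f"{module_id}: missing required field '{field}'")

    enzymes = module.get("enzymes", {})
    for accession, info in enzymes.items():
        if "source" not in info:
            problems.append(f"{module_id}: enzyme {accession} has no annotation source")

    # Every accession used in a step should be described under 'enzymes', and vice versa.
    used = set()
    for step in module.get("steps", []):
        used.update(t for t in TOKEN.findall(step) if t not in OPERATORS)
    for accession in sorted(used - set(enzymes)):
        problems.append(f"{module_id}: step uses undefined enzyme {accession}")
    for accession in sorted(set(enzymes) - used):
        problems.append(f"{module_id}: enzyme {accession} is not used in any step")
    return problems


module = {
    "name": "Frankenstein pathway for demo purposes",
    "steps": ["(K01657 or K01658) and PF06603.14", "COG1362 or TIGR01709.2"],
    "enzymes": {
        "K01657": {"source": "KOfam"},
        "K01658": {"source": "KOfam"},
        "PF06603.14": {"source": "METABOLISM_HMM"},
        "COG1362": {"source": "COG20_FUNCTION"},
        "TIGR01709.2": {"source": "TIGRFAM"},
    },
}
print(check_module("UD0023", module))  # an empty list means no problems were found
```

Returning a list of problems rather than raising on the first one would let a user fix everything in a single pass, which seems friendlier for hand-written files.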

For this to work, we would first need to update the codebase in such a way that anvi-setup-kegg-kofams would turn every KEGG format into our definition, and then we can extend those definitions by asking users to define their own modules the same way.
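
As a rough idea of what that conversion could involve, the sketch below parses the KEGG-style entry from #1872 into a nested dict shaped like the YAML template. The function names `parse_kegg_module` and `to_module_dict` are hypothetical, and the sketch deliberately keeps the raw DEFINITION string under a `definition` key instead of translating it into `steps`, since that translation is the harder design question.

```python
import re

KEGG_TEXT = """\
ENTRY       UD0023
NAME        Frankenstein pathway for demo purposes
DEFINITION  K01657+K01658 PF06603.14,(COG1362 TIGR01709.2)
ORTHOLOGY   K01657  anthranilate synthase component I [EC:4.1.3.27]
            K01658  anthranilate synthase component II [EC:4.1.3.27]
            PF06603.14  UpxZ
            COG1362  Aspartyl aminopeptidase
            TIGR01709.2  type II secretion system protein GspL
CLASS       User modules; Demo set; Frankenstein metabolism
ANNOTATION_SOURCE  K01657  KOfam
                   K01658  KOfam
                   PF06603.14  METABOLISM_HMM
                   COG1362  COG20_FUNCTION
                   TIGR01709.2  TIGRFAM
///
"""


def parse_kegg_module(text):
    """Group the lines of one KEGG-style entry by field name."""
    fields, current = {}, None
    for line in text.splitlines():
        if line.startswith("///") or not line.strip():
            continue  # skip the entry terminator and blank lines
        if not line[0].isspace():
            # A field name starts at column 0; indented lines continue the last field.
            current, _, value = line.partition(" ")
        else:
            value = line
        fields.setdefault(current, []).append(value.strip())
    return fields


def to_module_dict(fields):
    """Reshape parsed fields into the nested layout of the proposed template."""
    enzymes = {}
    for entry in fields.get("ORTHOLOGY", []):
        accession, _, description = entry.partition(" ")
        info = {"description": description.strip()}
        ec = re.search(r"\s*\[EC:([^\]]+)\]$", info["description"])
        if ec:  # move a trailing [EC:...] tag into its own key
            info["description"] = info["description"][:ec.start()]
            info["EC-number"] = ec.group(1)
        enzymes[accession] = info
    for entry in fields.get("ANNOTATION_SOURCE", []):
        accession, source = entry.split()
        enzymes.setdefault(accession, {})["source"] = source
    levels = [level.strip() for level in fields["CLASS"][0].split(";")]
    return {
        fields["ENTRY"][0].split()[0]: {
            "name": fields["NAME"][0],
            "class": dict(zip(("first", "second", "third"), levels)),
            "definition": fields["DEFINITION"][0],  # kept verbatim, not translated
            "enzymes": enzymes,
        }
    }


converted = to_module_dict(parse_kegg_module(KEGG_TEXT))
print(converted["UD0023"]["enzymes"]["K01657"])
```

This only covers the fields that appear in the example above; real KEGG module files carry more fields, which would need their own handling.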

For instance, this is what that particular Frankenstein module looks like to Python (easier on your eyes, easier on Python):

python -c "from anvio.utils import get_yaml_as_dict as g; import json; print(json.dumps(g('test-module.yaml'), indent=2))"

{
  "UD0023": {
    "name": "Frankenstein pathway for demo purposes",
    "version": 1,
    "author": {
      "name": "Iva Veseli",
      "email": "iva@veseli.org",
      "url": "https://merenlab.org/metabolic-modules/",
      "citation": "Veseli et al. (2022), doi:xxx"
    },
    "class": {
      "first": "User modules",
      "second": "Demo set",
      "third": "Frankenstein metabolism"
    },
    "steps": [
      "(K01657 or K01658) and PF06603.14",
      "COG1362 or TIGR01709.2"
    ],
    "enzymes": {
      "K01657": {
        "description": "anthranilate synthase component I",
        "source": "KOfam",
        "EC-number": "4.1.3.27"
      },
      "K01658": {
        "description": "anthranilate synthase component II",
        "source": "KOfam",
        "EC-number": "4.1.3.27"
      },
      "PF06603.14": {
        "description": "UpxZ",
        "source": "METABOLISM_HMM"
      },
      "COG1362": {
        "description": "Aspartyl aminopeptidase",
        "source": "COG20_FUNCTION"
      },
      "TIGR01709.2": {
        "description": "type II secretion system protein GspL",
        "source": "TIGRFAM"
      }
    }
  }
}
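
Once the module is a plain dict like this, the `and`/`or` step strings can be evaluated with a small parser. The following is a sketch under stated assumptions: the names `step_is_satisfied` and `module_completeness` are hypothetical, `and` is assumed to bind tighter than `or`, and the "fraction of satisfied steps" score is only a stand-in, not the estimation algorithm anvi'o uses.

```python
import re

TOKEN = re.compile(r"[()]|[^\s()]+")  # parentheses, or any run of other characters


def step_is_satisfied(step, present):
    """Evaluate one step expression against the set of accessions found in a genome."""
    tokens = TOKEN.findall(step)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            rhs = parse_and()  # always parse the right side, even if value is True
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError(f"step ends unexpectedly: {step}")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError(f"missing closing parenthesis in step: {step}")
            pos += 1
            return value
        if token in (")", "and", "or"):
            raise ValueError(f"unexpected '{token}' in step: {step}")
        return token in present  # an accession is true when it was annotated

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected '{tokens[pos]}' in step: {step}")
    return result


def module_completeness(steps, present):
    """Fraction of steps satisfied (a simplified stand-in score)."""
    satisfied = [step_is_satisfied(step, present) for step in steps]
    return sum(satisfied) / len(satisfied)


steps = ["(K01657 or K01658) and PF06603.14", "COG1362 or TIGR01709.2"]
print(module_completeness(steps, {"K01657", "PF06603.14"}))  # 0.5: only step 1 is met
```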

What do you all think?

Best wishes,

From my layman's point of view, the YAML template is much more comprehensible and therefore easier to follow, although it looks longer at first glance. Just filling in gaps in a template that actually makes sense is, I think, much less error-prone. I would prefer that version if it is feasible to implement.

I think it's a good idea to generalize the module format like this, allowing additional fields to be added but still being able to convert back to a KEGG module entry. Here are a few advantages that I can think of.

It will make more sense to users that modules defined from sources besides KEGG are not shoehorned into a KEGG-formatted file.
In the case of "signature" modules, such as an RNA polymerase module defined by RNAP subunits, the "steps" field can be made optional or another field can be added to specify that the module is a signature rather than a pathway.
The "enzymes" field can be generalized to "genes" or even "sequence units," since one can imagine having promoters identified by some tool in a module with genes.
A field could be added for "parent" and "child" modules. For example, Pathway B may be a section of Pathway A. This is a situation that often arises even in KEGG modules: "M00002 Glycolysis, core module involving three-carbon compounds" is the latter section of "M00001 Glycolysis (Embden-Meyerhof pathway)", but there is currently no indication in the module definitions that this is the case. The default value of the parent and child fields for KEGG modules would be "unknown" (unless we write a program to parse this from each module).
Since multiple sources can be used in user-defined modules, gene annotations from multiple sources can be required for a step to increase confidence in its assignment. For example, step one may require a gene to be called with a certain KOfam and COG annotation.
A program will generate a default template, and any changes in the template format will use a migration script for existing templates.

The obvious downside is that @ivagljiva will have to invest yet more time in changing everything, including her detailed artifacts, when no one even uses this yet.

I also think this improvement does not really change the amount of user effort that goes into creating the module: 95+% of the effort is in finding exactly which genes should go in the module in the first place.

I firmly agree that this is the future we are (and should be) headed towards. The biggest advantage I see is that the YAML format is more extensible, as has been discussed, so that we can adapt it to the evolving needs of the user community and of the estimation algorithm itself. I could talk a lot about how I agree with the above points, but I think it will be more efficient if I instead mention my concerns :)

My first concern is that I don't feel qualified to make assumptions about what is important to users in terms of defining modules. As someone who has always relied on KEGG pathway definitions and exclusively worked with metabolism on a theoretical level, I don't know enough about the process (yet?) to see all of the missed opportunities in the current format that we could solve while re-designing the module definition strategy. That is why I am very glad to see the support and ideas from @jessika-fuessel and @semiller10 on this topic.

But I feel that we should hear from even more people, who work with metabolic pathways in different contexts, to gain a consensus on what is worth adding (or taking away) from the current strategy of defining modules. For example, is the three-tiered CLASS value from KEGG actually useful and something we should keep? I've mostly found myself ignoring all but the first category, but perhaps others would disagree. Should we include information about expected operon structure, which is something I personally am very interested in but perhaps does not apply to enough pathways to make it a worthwhile addition?

User-defined metabolism is a very new feature, so there hasn't been much community feedback about it yet. Perhaps that makes this an ideal time to make these sorts of changes (before everyone gets used to the current way of doing things), but I worry that rushing to implement a new strategy (though it looks foolproof to my naive programmer's eyes) would cause us to miss opportunities for even more improvement or smarter design. This change will take some effort to implement and will influence the breadth of possible future use-cases and features of the metabolism reconstruction code, so I would like to design this thoughtfully and do it right the first time :)

That brings me to my second concern, which is that it will not be so simple to update the codebase for this change. Since I (perhaps wrongly) relied exclusively on the KEGG format from the get-go, there are little nuances in the code that were written with this format in mind, and would have to be re-written. For instance, we unroll module definitions into all possible paths through the module by relying on KEGG-formatted DEFINITION strings. Another example is the information that gets included in the output files - metadata like CLASS values, products and substrates are in there simply because those are the kinds of data that KEGG chose to include in their module files (which circles back to my question of 'what is even important here'). The central estimation algorithm would likely not have to change, and indeed it will probably be easier to write new code based on this smarter format, but finding and re-working all the little spots in the code to work with the new format will take some time.
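
For reference, unrolling into all possible paths could look roughly like the sketch below if the steps were written in the proposed `and`/`or` syntax. This is not the existing anvi'o code, which works on KEGG DEFINITION strings; `unroll_step` and `unroll_module` are hypothetical names, and `and` is assumed to bind tighter than `or`.

```python
import re
from itertools import product

TOKEN = re.compile(r"[()]|[^\s()]+")  # parentheses, or any run of other characters


def unroll_step(step):
    """Return every alternative set of enzymes that satisfies one step expression."""
    tokens = TOKEN.findall(step) + [None]  # None marks the end of the input
    pos = 0

    def take():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def alternatives():  # expr := term ("or" term)*
        paths = combinations()
        while tokens[pos] == "or":
            take()
            paths = paths + combinations()  # 'or' adds more alternative paths
        return paths

    def combinations():  # term := atom ("and" atom)*
        paths = atom()
        while tokens[pos] == "and":
            take()
            # 'and' merges every path on the left with every path on the right
            paths = [left | right for left, right in product(paths, atom())]
        return paths

    def atom():
        token = take()
        if token == "(":
            paths = alternatives()
            if take() != ")":
                raise ValueError(f"unbalanced parentheses in step: {step}")
            return paths
        if token in (None, ")", "and", "or"):
            raise ValueError(f"unexpected token {token!r} in step: {step}")
        return [frozenset([token])]

    paths = alternatives()
    if tokens[pos] is not None:
        raise ValueError(f"unexpected token {tokens[pos]!r} in step: {step}")
    return paths


def unroll_module(steps):
    """Combine one alternative per step into complete paths through the module."""
    per_step = [unroll_step(step) for step in steps]
    return [frozenset().union(*choice) for choice in product(*per_step)]


steps = ["(K01657 or K01658) and PF06603.14", "COG1362 or TIGR01709.2"]
for path in unroll_module(steps):
    print(sorted(path))
```

For the Frankenstein example this yields four paths, one for each combination of the two alternatives in each step.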

In short, I am enthusiastic about the change, but hesitant to start implementing it. I think the question is not if we should start on this new strategy, but when to do it. I could do it now, which has the advantage of least disrupting future users who would have to otherwise re-learn the definition strategy, but the downside is that I may not do it the best way. Or I could wait to hear from people who are starting to use this tool on real data and incorporate their suggestions and ideas, but then this risks getting delayed for a long time because it will take time to test it out, set up meetings, synthesize ideas from multiple parties, etc.

So I would like to hear what the rest of you think about this, especially the when question :)

After thinking about this all day, I have my own opinion about it, which is that I could start these changes now, but only implement the change from KEGG's text file format to YAML, without adding or modifying which data values are included or not. We've been constrained by KEGG for so long that perhaps it won't matter if we stick to it for a while longer, and the YAML should be extensible enough that when people start getting back to us with feedback the future changes will be incremental and relatively straightforward. But this strategy does run the risk of me making some stupid decision on a central aspect of the design (example: how should we distinguish required enzymes from non-required enzymes? enzyme components from enzymes that do not form a complex?) and then having to either be limited by that stupid decision or re-do the whole thing again at some point in the future.

Thoughts? :)

To address some of the concerns raised by @ivagljiva, the YAML file could include "metadata" sections for different possible sources. The YAML file is currently based on KEGG, so it has "class" and "steps" fields, but these could be nested in a "KEGG" metadata field.

It is important to retain backward compatibility to KEGG, especially since KEGG module data should be able to be automatically turned into one of these YAML files. So I would not start removing aspects of "class" or "steps," though as I argued before, it might be worth renaming "steps" in the YAML file, since the term doesn't make sense for a KEGG signature module.

Of course, the when question is very important. I don't have a firm personal answer to that. But there are a few points I'd like to make:

My first concern is that I don't feel qualified to make assumptions about what is important to users in terms of defining modules

In the worst case scenario, YAML recipes will be as good and as comprehensive as the current recipes used by KEGG. Turning it into YAML would give an extensible framework where user input would find its place to change the evolving format. Currently we can't even see how to improve user defined modules because we're stuck with the KEGG definitions that are limiting.

I could do it now, which has the advantage of least disrupting future users who would have to otherwise re-learn the definition strategy, but the downside is that I may not do it the best way

Sadly we are never able to do things the best possible way, and it would be unfair to expect that from ourselves :) BUT, the advantages of introducing user-defined modules within a framework that is easier to learn, easier to use, and extensible are twofold: (1) being able to collect new ideas and quickly integrate them by extending the existing framework (since reorganizing the fields or structure of a YAML file will be much easier than doing the same with KEGG module descriptions), and (2) being able to say that the anarchy (or democracy, if you like that better) comes with a considerable amount of novelty :)

to @semiller10's point on

it might be worth renaming "steps" in the YAML file, since the term doesn't make sense for a KEGG signature module.

I agree. But we can add a new keyword, like this one:

UD0023:
  name: Frankenstein pathway for demo purposes
  type: signature
  version: 1
  author:
    name: Iva Veseli
    email: iva@veseli.org
    url: https://merenlab.org/
    citation: Veseli et al. (2022), doi:xxx
  (...)

And the type could be signature, standard, multistep, or whatever someone may come up with to link it to a different functional route in the class later :)

My 2 cents, and I thank everyone for their time to participate in this exchange.

As a user, YAML would be easier to write/implement if I wanted to make my own custom pathway.

Regarding the naming conventions, each "step" in the pathway modules is called a "reaction", but in addition to the enzymes involved, reactions also indicate the compounds, which is why they can link up different modules in the pathway maps.

About the types, KEGG currently has 2 module types that I'm aware of: "pathway" and "signature" modules. Pathway modules are the usual ones we are all familiar with, but the signature modules can be defined as collections of KOs or collections of other modules, e.g. the "acetogen" module (M00618) is defined as the combination of "Reductive acetyl-CoA pathway (Wood-Ljungdahl pathway)" (M00377) and the "Phosphate acetyltransferase-acetate kinase pathway, acetyl-CoA => acetate" (M00579). Following the BRITE hierarchy might be the easiest way forward regarding "types", esp. if users consider sharing the module definitions.
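
A module defined as a collection of other modules could be resolved recursively, as in the sketch below. The `modules` key, the `expand_signature` function, and the `enzyme_A`/`enzyme_B`/`enzyme_C` placeholders are all hypothetical; the placeholders are not the real KO content of M00377 or M00579.

```python
def expand_signature(module_id, modules, _seen=()):
    """Collect the enzymes of a module, following references to other modules."""
    if module_id in _seen:
        chain = " -> ".join(_seen + (module_id,))
        raise ValueError(f"circular module reference: {chain}")
    module = modules[module_id]
    enzymes = set(module.get("enzymes", []))
    for child in module.get("modules", []):
        enzymes |= expand_signature(child, modules, _seen + (module_id,))
    return enzymes


# Placeholder content: the enzyme names below are made up for illustration.
modules = {
    "M00618": {"type": "signature", "modules": ["M00377", "M00579"]},
    "M00377": {"type": "pathway", "enzymes": ["enzyme_A", "enzyme_B"]},
    "M00579": {"type": "pathway", "enzymes": ["enzyme_C"]},
}
print(sorted(expand_signature("M00618", modules)))
```

Tracking the chain of visited modules guards against two user-defined modules accidentally referencing each other.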

Very interesting discussion.

So I would like to hear what the rest of you think about this, especially the when question :)

As an outsider, relatively speaking, I agree that this has to happen at some point because it's just too good not to be a thing. And I think the best time to do it is now.

But this strategy does run the risk of me making some stupid decision on a central aspect of the design (example: how should we distinguish required enzymes from non-required enzymes? enzyme components from enzymes that do not form a complex?) and then having to either be limited by that stupid decision or re-do the whole thing again at some point in the future.

I wouldn't be too concerned about nailing the YAML format. But since we anticipate/know that the YAML format will evolve and change over time, perhaps dramatically, there is something I think you should worry about.

In my opinion, the only code that should ever know anything about the structure of the YAML file should be a single class. All applications should write/read/query using get/set/write/save/load methods in this class. You completely restructure the YAML file and how steps are defined? No problem, because after you rewrite get_steps(), the 10 places where get_steps() is called have no idea that any change even occurred. This will make inevitable refactors a cake walk.
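
A minimal sketch of that idea, assuming the modules have already been loaded into a dict: the class name `ModuleStore` and the nested `metadata`/`KEGG` layout are hypothetical, the latter borrowed from the suggestion earlier in this thread to nest KEGG-derived fields under a metadata section.

```python
class ModuleStore:
    """The only place that knows how module definitions are laid out."""

    def __init__(self, modules):
        self._modules = modules

    def module_ids(self):
        return sorted(self._modules)

    def get_name(self, module_id):
        return self._modules[module_id]["name"]

    def get_steps(self, module_id):
        module = self._modules[module_id]
        # Flat layout from the template in this thread.
        if "steps" in module:
            return list(module["steps"])
        # Hypothetical later layout with KEGG-derived fields nested under metadata.
        return list(module["metadata"]["KEGG"]["steps"])

    def get_sources(self, module_id):
        """Map each enzyme accession to its annotation source."""
        enzymes = self._modules[module_id]["enzymes"]
        return {accession: info["source"] for accession, info in enzymes.items()}


steps = ["(K01657 or K01658) and PF06603.14", "COG1362 or TIGR01709.2"]
enzymes = {"K01657": {"source": "KOfam"}, "COG1362": {"source": "COG20_FUNCTION"}}

flat = ModuleStore({"UD0023": {"name": "demo", "steps": steps, "enzymes": enzymes}})
nested = ModuleStore(
    {"UD0023": {"name": "demo", "metadata": {"KEGG": {"steps": steps}}, "enzymes": enzymes}}
)

# Callers get the same answer from either layout and never touch the raw dict.
print(flat.get_steps("UD0023") == nested.get_steps("UD0023"))
```

Returning copies from `get_steps()` keeps callers from mutating the stored definition by accident, which matters once several programs share one store.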

I would be happy to work with you on the design.

Just my two cents on the YAML format (from someone who has parsed KEGG files in the past to get information on metabolites): this is a great idea. Frankly, anything is an improvement over having to parse out the KEGG file.

merenlab / anvio

[DISCUSSION] User defined modules: refining the way modules are defined? #1873