Refactor the internal protocol representation

Here are my goals for the refactoring:

Allow the user to easily switch between different output formats (e.g. text, LaTeX, HTML).
Allow protocols to be written in languages other than python, by using JSON instead of pickle to serialize protocols.
Allow plugins to add both new output types and new markup nodes.

I think the best way to do all of this is to take the following approach:

Represent the nodes themselves as simple list/dict data structures. Every node would be a dict that would have at least a "type" key. The rest of the keys could be anything.
Enumerate all the actions that stepwise needs to be able to perform on nodes. Right now, there are just two:
- format
- search and replace using a regular expression
Create a registry of functions that perform these actions for each node type. Plugins would be able to add to this registry, which would enable them to (i) add entirely new node types and/or output formats and (ii) modify the behavior of existing nodes/formats.
- There would need to be a way for plugins to declare priorities.
- The user should also be able to add to the registry manually.
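
To make the registry idea above concrete, here is a minimal sketch of how priorities might work. The names `register`, `lookup`, and the `priority` argument are hypothetical, not an existing stepwise API, and the rule that later registrations win ties is just one possible choice.

```python
# Hypothetical registry: maps (action, format, node type) to a list of
# (priority, function) candidates.
_REGISTRY = {}

def register(action, node_type, fmt=None, priority=0):
    """Decorator that adds a handler to the registry."""
    def decorator(func):
        key = action, fmt, node_type
        _REGISTRY.setdefault(key, []).append((priority, func))
        return func
    return decorator

def lookup(action, node_type, fmt=None):
    """Return the highest-priority handler; later registrations win ties."""
    candidates = _REGISTRY.get((action, fmt, node_type), [])
    if not candidates:
        raise LookupError(f"no {action!r} handler for node type {node_type!r}")
    best_priority, best_func = candidates[0]
    for priority, func in candidates[1:]:
        if priority >= best_priority:
            best_priority, best_func = priority, func
    return best_func

# A built-in handler, and a plugin that overrides it with a higher priority.
@register('format', 'p', fmt='text')
def format_p_default(node):
    return node['content']

@register('format', 'p', fmt='text', priority=10)
def format_p_plugin(node):
    return node['content'].upper()

handler = lookup('format', 'p', fmt='text')
print(handler({'type': 'p', 'content': 'Lorem ipsum'}))
```

The user could add to the registry manually with the same decorator, using a priority high enough to beat any plugin.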

For what it's worth, I think of this architecture as being heavily inspired by the idea of multiple dispatch.

Some questions:

For the python API, it's nice that paragraph lists and all that are represented by real objects, with useful methods to edit things in place. How would I keep this?
- One approach would be to derive said classes from dict. The downside is that the dict API could easily get in the way of the node API. For instance, both would very likely want to implement __getitem__() in different ways.
- Another approach would be to define a protocol, e.g. __node__(), that is responsible for returning a dict. That way, a protocol could store a mix of dicts and objects implementing this protocol, and convert everything to dicts at the last moment. Note that changes to the returned dict would have to be reflected in the object (i.e. the dict would have to be the true source of data) so that edit-in-place functions like replace() could work.
- It's important that the python API be compatible with plain dictionaries (in addition to these fancy objects), because there could be custom node types that don't have wrappers.
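
A rough sketch of the __node__() approach, assuming the wrapper keeps the dict as its only state. The `Paragraph` class and the `as_node` helper are hypothetical names used only for illustration.

```python
class Paragraph:
    """Hypothetical wrapper whose only state is the underlying node dict."""

    def __init__(self, content):
        self._node = {'type': 'p', 'content': content}

    def __node__(self):
        # Return the dict itself (not a copy), so that in-place edits made
        # by generic functions are visible through the wrapper.
        return self._node

    @property
    def content(self):
        return self._node['content']

    def __iadd__(self, text):
        self._node['content'] += text
        return self


def as_node(obj):
    """Convert a wrapper to its dict; pass plain dicts through unchanged."""
    to_node = getattr(obj, '__node__', None)
    return to_node() if to_node is not None else obj


p = Paragraph('Do this.')
as_node(p)['content'] = 'Do that.'   # edit via the dict...
print(p.content)                     # ...and the wrapper sees the change

custom = {'type': 'custom', 'data': 42}
print(as_node(custom) is custom)     # plain dicts need no wrapper
```

Because the wrapper never caches anything outside the dict, edit-in-place functions like replace() would not need to know whether they were handed a wrapper or a plain dict.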
How would footnotes work?
- Right now I manage using replace().
- The problem is that I don't have the concept of inline formatting, which is what footnotes are. I could add that concept though. Basically this would just mean making paragraph nodes lists of inline nodes.
- I'd also need to add some sort of shared state to the formatting process, but that's not a big deal; just a state={} argument.
- It might also be nice to provide each node with a list of its parent nodes, e.g. so inline nodes could complain if used outside of a paragraph.
- This would make the python API more verbose (and would break a lot of existing scripts).
  - Right now I can just say pr += "Do this [1]." and define 1 later.
  - Instead, I'll have to do something like pr += p("Do this ", footnote("Important note"), ".")
  - Better syntax: pr += p("Do this [].", "Important note"). Basically, use '[]' as a replacement string and any arguments after the first as footnotes. I could even expand the syntax to allow different inline types, if it comes to that.
  - I could even just parse the string as markdown, e.g. pr += "Do this ^[Important note]". That's probably the best solution.
  - Unfortunately, CommonMark doesn't include the inline footnote syntax I used here, but it seems like a common extension and not something that would be hard to add myself.
- I think adding the concept of inline nodes is the way to go. The current approach is fragile anyway, and geared towards not really having to parse text files. The proposed approach is much more semantic.
- Note that even though the footnotes themselves would be inline, they can contain markup that's not (e.g. paragraphs, tables, etc.). I'll have to make sure that's supported.
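
Here is a small sketch of the inline-footnote idea, assuming the `^[...]` syntax is parsed with a regular expression rather than a real markdown parser, and that footnotes are numbered through a shared `state` dict. The function names and the 'text'/'footnote' node types are hypothetical, and the sketch simplifies by storing footnote content as a plain string with no nested brackets or block-level markup.

```python
import re

# Matches an inline footnote such as "^[Important note]".
FOOTNOTE = re.compile(r'\^\[([^\]]*)\]')

def parse_inline(text):
    """Split text into a paragraph node holding a list of inline nodes."""
    items = []
    pos = 0
    for match in FOOTNOTE.finditer(text):
        if match.start() > pos:
            items.append({'type': 'text', 'content': text[pos:match.start()]})
        items.append({'type': 'footnote', 'content': match.group(1)})
        pos = match.end()
    if pos < len(text):
        items.append({'type': 'text', 'content': text[pos:]})
    return {'type': 'p', 'items': items}

def format_inline_p_text(node, state):
    """Render a paragraph, collecting footnotes in the shared state."""
    parts = []
    for item in node['items']:
        if item['type'] == 'footnote':
            notes = state.setdefault('footnotes', [])
            notes.append(item['content'])
            parts.append(f'[{len(notes)}]')
        else:
            parts.append(item['content'])
    return ''.join(parts)

state = {}
p = parse_inline('Do this ^[Important note].')
print(format_inline_p_text(p, state))   # Do this [1].
print(state['footnotes'])               # ['Important note']
```

Since the state is shared across the whole formatting pass, footnote numbers would keep increasing from one paragraph to the next, and the collected notes could be printed at the end of the protocol.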
Is it worth finding an existing library to do all this?
- I don't think so. Formatting documents is pretty close to the core functionality that stepwise provides, so in order to use a third-party library it'd have to be a perfect fit. In particular, the abilities to serialize to JSON and to replace text in-place seem like features that wouldn't be present in a general-purpose library. I haven't looked super hard, though.

Some miscellaneous notes:

I'd have to drop support for text files, and replace it with support for markdown files.
It might be nice for nodes to have fallback types. For example, I'm thinking that a reaction node could basically be a table, but with the ability to be formatted more nicely if the format knows how. Basically inheritance for nodes.
stepwise.Protocol() will become a pretty thin wrapper around these formatting nodes (akin to pl, really). It won't really be important for anything other than making it easier to manipulate protocols from python. It will probably also implement __node__().
Marko looks like a great library for parsing markdown.
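
The fallback-type idea from the notes above could look something like this. This is only a sketch: the `FALLBACKS` table, the `find_formatter` helper, and the 'rows' key on table nodes are all assumptions made for illustration.

```python
def format_table_text(node):
    # Very plain table rendering: one row per line, cells separated by spaces.
    return '\n'.join('  '.join(row) for row in node['rows'])

# Formatters keyed by (format, node type), as in the registries below.
FORMAT = {
    ('text', 'table'): format_table_text,
}

# Hypothetical fallback table: a reaction can be treated as a table by any
# format that doesn't know how to render reactions specially.
FALLBACKS = {
    'reaction': 'table',
}

def find_formatter(fmt, node_type):
    """Walk the fallback chain until some type has a registered formatter."""
    seen = set()
    current = node_type
    while current is not None and current not in seen:
        seen.add(current)
        try:
            return FORMAT[fmt, current]
        except KeyError:
            current = FALLBACKS.get(current)
    raise LookupError(f"no {fmt!r} formatter for node type {node_type!r}")

reaction = {
    'type': 'reaction',
    'rows': [['water', '10 uL'], ['buffer', '2 uL']],
}
print(find_formatter('text', reaction['type'])(reaction))
```

A plugin that knows how to typeset reactions would just register a ('text', 'reaction') formatter, which would be found before the fallback is consulted.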

Here's some pseudocode:

# Example document
root = {
    'type': 'pl',
    'items': [
        {
            'type': 'p',
            'content': 'Lorem ipsum...',
        },
        {
            'type': 'ul',
            'items': [
                {
                    'type': 'p',
                    'content': 'Dolor sit amet...',
                },
            ],
        },
    ],
}

# Example plugin functions
import re
import textwrap

def format_p_text(node):
    # Wrap the paragraph and return it as a single string.
    return textwrap.fill(node['content'])

def format_pl_text(node):
    return '\n\n'.join(
            format(item, 'text')
            for item in node['items']
    )

def replace_p(node, pat, repl, count):
    # Substitute in place and report how many replacements were made.
    node['content'], n = re.subn(pat, repl, node['content'], count=count)
    return n

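def replace_pl(node, pat, repl, count):
    # As with re.subn(), count == 0 means "no limit".
    total = 0
    for item in node['items']:
        if count and total >= count:
            break
        total += replace(item, pat, repl, count - total if count else 0)
    return total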

# The actual registries:
FORMAT = {
        ('text', 'p'): format_p_text,
        ('text', 'pl'): format_pl_text,
}
REPLACE = {
        'p': replace_p,
        'pl': replace_pl,
}

# The generic functions:
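def format(node, fmt):
    # Accept either plain dicts or objects implementing __node__().
    try:
        node = node.__node__()
    except AttributeError:
        pass

    # The registry is keyed by (format, node type).
    key = fmt, node['type']
    func = FORMAT[key]
    return func(node)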

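def replace(node, pat, repl, count):
    try:
        node = node.__node__()
    except AttributeError:
        pass

    # The node is edited in place; the return value is the number of
    # replacements made, which container nodes use to track the count.
    type = node['type']
    func = REPLACE[type]
    return func(node, pat, repl, count)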

kalekundert / stepwise

Refactor the internal protocol representation #58