tskit-dev / tskit


Add draw_tikz method #791

Open jeromekelleher opened 4 years ago

jeromekelleher commented 4 years ago

Quoted from @grahamgower:

To digress a little, I ended up generating some tikz code to do what I wanted. I'll leave it here in case it's useful to someone. Inkscape reads the resulting pdf just fine, and it also seems to be reasonably easy to convert to svg using dvisvgm (see pgfmanual.pdf regarding SVG output).

import string
import textwrap

def draw_tikz(tree, tree_height_scale="rank", aspect=16/9, scale=4, node_size=1):
    """
    Return a string containing latex/tikz commands to draw a tskit.Tree.

    Save the string to a file (e.g. tree.tex) and include it in your latex
    document with ``\\input{tree.tex}``.
    """
    template = textwrap.dedent(
        """\
        \\begin{tikzpicture}[
            scale=$scale,
            edge/.style = {
                draw,
                thick,
            },
            node/.style = {
                circle,
                fill=red,
                line width=1,  % border
                % radius
                minimum size=$node_size mm,
                inner sep=0,
            },
            leaftext/.style = {
                below
            },
        ]
            \\foreach \\name/\\x/\\y in {$node_coords}
                \\node[node] (\\name) at (\\x, \\y) {};
            \\foreach \\a/\\b in {$edges}
                \\path[edge] (\\a) |- (\\b);
            \\foreach \\name/\\text in {$leaf_nodes_text}
                \\node[leaftext] at (\\name) {\\text};
        \\end{tikzpicture}\
        """
    )

    root_time = tree.time(tree.root)
    num_leaves = sum(1 for _ in tree.leaves())
    leaf_x_inc = aspect / num_leaves
    leaf_x = 0
    edges = []
    x_coords = dict()
    y_coords = dict()
    for node in tree.nodes(order="postorder"):
        if tree.is_leaf(node):
            x = leaf_x
            leaf_x += leaf_x_inc
        else:
            edges.extend([f"n{u}/n{node}" for u in tree.children(node)])
            children_x = [x_coords[u] for u in tree.children(node)]
            x = (children_x[0] + children_x[-1]) / 2
        x_coords[node] = x
        y_coords[node] = tree.time(node) / root_time

    if tree_height_scale == "rank":
        nodes = sorted(y_coords.keys(), key=y_coords.__getitem__)
        y_coords = {
            node: (i - num_leaves) / num_leaves if i > num_leaves else 0
            for i, node in enumerate(nodes, 1)
        }
    elif tree_height_scale != "time":
        raise ValueError(f"unknown tree_height_scale={tree_height_scale}")

    node_coords = [
        f"n{node}/{x_coords[node]}/{y_coords[node]}" for node in tree.nodes()
    ]

    return string.Template(template).substitute(
        scale=scale,
        node_size=node_size,
        node_coords=", ".join(node_coords),
        edges=", ".join(edges),
        leaf_nodes_text=", ".join(f"n{u}/{u}" for u in tree.leaves()),
    )

if __name__ == "__main__":
    import msprime

    ts = msprime.simulate(10, Ne=1000, random_seed=123)
    tree = ts.first()
    s = draw_tikz(tree)
    print(s)

Originally posted by @grahamgower in https://github.com/tskit-dev/tskit/issues/765#issuecomment-678615456

jeromekelleher commented 4 years ago

This is exciting! Personally, I'd be delighted to skip the SVG->PDF phase and work directly with tex for my typesetting and presentation needs. I skipped doing a tex version because I couldn't bear working with latex picture and nobody else seems to be using asymptote (which I moved to from metapost years ago). Tikz looks like it has good traction in the latex world, so if we have enough enthusiasm and potential users I'd be all for adding support for drawing trees and tree sequences in tikz.

What do you think @grahamgower - would it be much work to generalise this a bit to support some of the options we have in text and svg?

@gtsambos - you're the only other person I know who can tikz - what do you think?

grahamgower commented 4 years ago

It probably wouldn't be too hard to come up with something more general than the code above. But I'm not very familiar with tikz, so the design might need some more experienced eyes. Tikz actually has extensive tree and graph drawing facilities, which might be more appropriate here. I just did the manual drawing of nodes and edges because my first attempt at using tikz trees looked awful, and it was quicker to use the code above than read the manual properly.

jeromekelleher commented 4 years ago

Cool, thanks @grahamgower. Let's see what @gtsambos thinks.

molpopgen commented 4 years ago

This is exciting! Personally, I'd be delighted to skip the SVG->PDF phase and work directly with tex for my typesetting and presentation needs. I skipped doing a tex version because I couldn't bear working with latex picture and nobody else seems to be using asymptote (which I moved to from metapost years ago). Tikz looks like it has good traction in the latex world, so if we have enough enthusiasm and potential users I'd be all for adding support for drawing trees and tree sequences in tikz.

What do you think @grahamgower - would it be much work to generalise this a bit to support some of the options we have in text and svg?

@gtsambos - you're the only other person I know who can tikz - what do you think?

I use tikz a fair bit, but only simple stuff, and mostly for classes. I'd do more if I took the time to learn the next steps. So that's a 👍 from me.

gtsambos commented 4 years ago

just getting around to this now -- nice work @grahamgower! I'll just throw in a few comments as I think of them

Tikz actually has extensive tree and graph drawing facilities, which might be more appropriate here. I just did the manual drawing of nodes and edges because my first attempt at using tikz trees looked awful, and it was quicker to use the code above than read the manual properly.

this is also how I draw trees with tikz, tbh! It's been a while since I looked into the graph-drawing libraries properly, but from memory, I ended up not using them because they offered less control over the exact placement of nodes than I needed. This control is particularly important when plotting tree sequences, since the edge lengths correspond to times and so on.

The graph/tree libraries are optimised for use cases where you care about qualitative descriptors of the node positions (this node below that one etc), and you want tikz to spread the nodes out in an equally spaced, aesthetically pleasing way, like if you were drawing a DAG or something. You can override the defaults to specify particular positions for the nodes, but tbh this is harder than just starting with the nodes and edges in the first place, as you've done here.

I'm also in favour of sticking with nodes/edges syntax here as it's 'base tikz', and so less likely to require extra packages installed by the user

gtsambos commented 4 years ago

Here's what I got when I ran your example code -- it's looking really nice! (The screenshot quality isn't doing it justice)

image

gtsambos commented 4 years ago

Some really minor, mostly aesthetic things I'd change are:

gtsambos commented 4 years ago

(^ if you wanted to be really fancy/convenient: you could write the function so that it calls tex from within Python, and then the user won't even have to look at the tex code)

grahamgower commented 4 years ago

Thanks @gtsambos, there are some really good ideas here! I thought about putting the node labels inside the node circle, but this is different to what draw_svg does. Should we match the svg output by default, and have this as an option? I have actually just implemented the svg-style node_labels functionality. I'll submit a minimal pr ASAP, so that we can both work on the same thing.

gtsambos commented 4 years ago

I thought about putting the node labels inside the node circle, but this is different to what draw_svg does.

This is a good point -- so actually, I was thinking that it would be good to have the node IDs as circles because it helps to distinguish them from mutations, which are drawn as red dots by draw_svg().

gtsambos commented 4 years ago

I was also gonna suggest that we have a closer look at the draw_svg() code -- I don't know it well, but I suspect that reusing parts of it could help to streamline this. For example, there's a lot of code in your function that's essentially calculating the (x,y) locations of each of the nodes, but I'm guessing that this exact same information must have been produced inside the other draw methods as well.

Now that there are a few different plotting methods available, it might be a bit nicer to have a wider class of drawing methods including draw_tikz() and draw_svg() that each use common x, y indices to produce output in different formats. (This would also help to guard against errors/inconsistencies that might occur when one of the methods is changed later, and would just be a useful sanity check)

grahamgower commented 4 years ago

Now that there are a few different plotting methods available, it might be a bit nicer to have a wider class of drawing methods including draw_tikz() and draw_svg() that each use common x, y indices to produce output in different formats.

I completely agree.

gtsambos commented 4 years ago

Should we match the svg output by default, and have this as an option?

This is a question for the wider tskit team, as it's an aesthetic/design preference as much as anything else. @hyanwong? @jeromekelleher? Personally I'm in favour of the node IDs inside the circles simply because it's a neat way to show all of the node IDs on a plot, and that extra information is sometimes useful. But your point about keeping things as consistent as possible between the plotting methods is also important.

gtsambos commented 4 years ago

What do you think @grahamgower - would it be much work to generalise this a bit to support some of the options we have in text and svg?

It probably wouldn't be too hard to come up with something more general than the code above.

To have a fuller opinion on this, I guess it's important to think about which of these other options are most important?

The main two generalisations that I think this would need to become a fully fledged plotting method in tskit are:

  1. support for mutations
  2. support for multiple trees plotted together (either horizontally as in draw_svg(), or vertically one after another as in draw(), which I think might be deprecated now? not sure)

(1) shouldn't be too hard, especially for mutations where you have a time. For other mutations where we only know the edge, we'd need to count up the total number of mutations on a given edge of a given tree, and then space them at equally spaced fractions along the edges between the nodes. This would be a bit annoying but I don't think it would look too different from the code that you've already got to put the internal nodes in the correct positions.
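A rough sketch of the equal-spacing idea for mutations without a time, assuming plain (x, y) tuples rather than tskit objects. The helper name `mutation_positions` is hypothetical. It relies on the fact that an edge drawn as `(child) |- (parent)` goes vertically from the child first, so the mutations sit at the child's x coordinate:

```python
def mutation_positions(child_xy, parent_y, num_mutations):
    """
    Return (x, y) positions for ``num_mutations`` mutations on one edge.

    The vertical part of the edge runs from the child's y up to the
    parent's y at the child's x, so the mutations are placed at equally
    spaced fractions of that segment, excluding the end points.
    """
    x, y = child_xy
    step = (parent_y - y) / (num_mutations + 1)
    return [(x, y + step * k) for k in range(1, num_mutations + 1)]


if __name__ == "__main__":
    # Three mutations on an edge from a leaf at (0.5, 0) to a parent at y=1.
    for pos in mutation_positions((0.5, 0.0), 1.0, 3):
        print(pos)
```

Mutations that do have a time could skip this and use their own y coordinate directly.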

(2) seems relatively straightforward to me as well -- you'd just need to define another set of nodes corresponding to the boundaries of the chromosome where the trees change, and you'd then need to add these coordinates to each of the node coordinates in each consecutive tree. Some of the information that you're currently pulling out of the individual tree might then need to be taken from the tree sequence as a whole (eg. I think you'd need to scale according to the max root time), but that's a relatively minor change too.
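To make (2) concrete, here is a minimal sketch that assumes each tree has already been laid out in a unit-width box. The function `place_trees`, the `gap` parameter and the dictionary layout are all assumptions for illustration, not part of tskit:

```python
def place_trees(trees, gap=0.1):
    """
    Lay out several trees side by side.

    ``trees`` is a list of dicts with two keys: "x" maps node -> x in [0, 1]
    and "time" maps node -> node time. Every tree is shifted right by its
    index times (1 + gap), and all y values are scaled by the largest time
    over all trees, so heights are comparable between trees.
    """
    max_time = max(max(t["time"].values()) for t in trees)
    placed = []
    for index, t in enumerate(trees):
        offset = index * (1 + gap)
        placed.append(
            {u: (t["x"][u] + offset, t["time"][u] / max_time) for u in t["x"]}
        )
    return placed


if __name__ == "__main__":
    first = {"x": {0: 0.0, 1: 1.0, 2: 0.5}, "time": {0: 0.0, 1: 0.0, 2: 2.0}}
    second = {"x": {0: 0.0, 1: 1.0, 3: 0.5}, "time": {0: 0.0, 1: 0.0, 3: 4.0}}
    for coords in place_trees([first, second], gap=0.5):
        print(coords)
```

A real version would probably derive the offsets from the genomic intervals of the trees instead of a fixed gap.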

In addition to (1) and (2), there are some minor bells and whistles I can think of, eg. colouring nodes according to their population labels, but these would be fairly easy to incorporate as well I think.

Overall I'm excited for this, thanks for getting it started @grahamgower! I'm happy to help with this too, if you'd like it.

jeromekelleher commented 4 years ago

This is great, I'm delighted you're both excited about this!

My take would be to make whatever aesthetic choices are appropriate for the drawing format, and we shouldn't worry at all about keeping things visually consistent between them. Text output looks different to SVG looks different to tikz, and that's fine. If you both like circles around the nodes, then let's have circles around the nodes.

I would push back a little bit on the standalone thing - from my perspective I'd like to embed the tikz directly into a latex doc and not have them as external images that you have to \includegraphics on. I don't want to have to edit the output here to strip off headers/trailers. We can easily add an option, though.
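A minimal sketch of what such an option could look like, assuming a hypothetical helper `wrap_tikz` and a `standalone` flag that defaults to off. The preamble uses the standalone document class, which is one plausible choice rather than a settled design:

```python
def wrap_tikz(picture, standalone=False):
    """
    Return ``picture`` unchanged by default, so it can be pulled into a
    larger document with \\input. With ``standalone=True``, wrap it in a
    complete document that compiles on its own.
    """
    if not standalone:
        return picture
    header = "\\documentclass[tikz,border=1mm]{standalone}\n\\begin{document}\n"
    footer = "\n\\end{document}\n"
    return header + picture + footer


if __name__ == "__main__":
    body = "\\begin{tikzpicture}\n\\end{tikzpicture}"
    print(wrap_tikz(body, standalone=True))
```

Keeping the bare picture as the default means nobody has to strip headers or trailers before embedding it.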

grahamgower commented 4 years ago

Bring on the bells and whistles @gtsambos! If you want to help out, adding support for mutations seems like a logical next step. As for mutation/node colours, I was hoping there'd be some nice pgfkeys wizardry that could help in providing elegant flexibility here. Say if we had a key nodes/ and then node X would apply nodes/nX/.style, but if nodes/nX/.style doesn't exist, then the default node style is applied. But I don't know how to do this (yet), as I'm a pgfkeys noob.

Some other decisions we should make:

jeromekelleher commented 4 years ago

Are we happy to output latex, rather than plain tex? I think for tikz the difference is minimal, so we could probably output plain tex. I don't know if anyone still eschews latex in favour of plain tex these days, but maybe this is useful for ConTeXt users?

I'm pretty happy with latex, tbh. I'm sure if someone wants tex output we can add an option to do so then.

How do we test this? Presumably we want to run pdflatex on it in a CI framework? tex-live is a monster dev dependency though...

We could compare the output against a known good value for a bunch of examples. I.e., we draw some trees and look at the compiled results. Once we're happy with them, we store a bunch of examples in files, and then for the tests we just compare the text output by the two. I don't think we want to start compiling tex as part of our test suite, it's just too brittle across all the different combinations of platforms (and we'd have to put in workarounds for windows).

Is tikz output high-level enough for this to be practical, or will we need to rewrite all the example files every time someone makes a change (which would render it a bit pointless)?

jeromekelleher commented 4 years ago

What does the output tikz code look like for a small example?

grahamgower commented 4 years ago

Small example:

import msprime
ts = msprime.simulate(10, Ne=1000, random_seed=123)
tree = ts.first()
tree.draw_tikz(path="b.tex", node_labels={}, standalone=True)

\documentclass[tikz,border=1mm]{standalone}
\begin{document}

\begin{tikzpicture}[
    scale=2.302585092994046,
    edge/.style = {
        draw=black,
    },
    node/.style = {
        circle,
        fill=black,
        line width=0,  % border
        % radius
        minimum size=2,
        inner sep=0,
    },
    node text/.style = {
        %font=\bfseries,
    },
    left node text/.style = {
        node text,
        above left=.1,
    },
    right node text/.style = {
        node text,
        above right=.1,
    },
    leaf node text/.style = {
        node text,
        below=.1,
    },
    % User customisation goes here. 
]
    \foreach \name/\x/\y in {n18/0.5874999999999999/0.9, n9/0.8999999999999999/0, n17/0.27499999999999997/0.8, n0/0/0, n16/0.5499999999999999/0.7, n14/0.375/0.5, n6/0.5/0, n13/0.25/0.4, n11/0.35000000000000003/0.2, n3/0.30000000000000004/0, n8/0.4/0, n12/0.15000000000000002/0.3, n1/0.1/0, n4/0.2/0, n15/0.7249999999999999/0.6, n5/0.7999999999999999/0, n10/0.6499999999999999/0.1, n2/0.6/0, n7/0.7/0}
        \node[node] (\name) at (\x, \y) {};
    \foreach \a/\b in {n1/n12, n4/n12, n3/n11, n8/n11, n11/n13, n12/n13, n6/n14, n13/n14, n2/n10, n7/n10, n5/n15, n10/n15, n14/n16, n15/n16, n0/n17, n16/n17, n9/n18, n17/n18}
        \path[edge] (\a) |- (\b);
    \foreach \name/\text in {}
        \node[leaf node text] at (\name) {\text};
    \foreach \name/\text in {}
        \node[left node text] at (\name) {\text};
    \foreach \name/\text in {}
        \node[right node text] at (\name) {\text};
\end{tikzpicture}
\end{document}

grahamgower commented 4 years ago

We could compare the output against a known good value for a bunch of examples.

But won't most changes to the code deliberately change the output, rather than just how the output is generated?

hyanwong commented 4 years ago

Should we match the svg output by default, and have this as an option?

This is a question for the wider tskit team, as it's an aesthetic/design preference as much as anything else.

I think it would be easy to create an SVG stylesheet that will produce a layout that matches whatever tikz format you decide upon (e.g. numbers within node circles is trivial, although I'm not sure if that's true once the numbers get to many digits, and need wrapping or whatever).

jeromekelleher commented 4 years ago

But won't most changes to the code deliberately change the output, rather than just how the output is generated?

Well, we'd hope that most changes would be additions that wouldn't affect the existing drawings. I.e., no matter how many options we add, we shouldn't change the basic output you've given here. I dunno though, maybe this isn't practical.

(One minor thing - we should reduce the precision of the output coordinates, there's no point in 14 digits of precision here and it just makes the output harder to read.)
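As an illustration only, a small formatter along these lines would do it. The name `format_coord` and the default of four decimal places are assumptions, and this is not the existing helper in the tskit drawing code:

```python
def format_coord(value, digits=4):
    """
    Format a coordinate with at most ``digits`` decimal places, dropping
    trailing zeros and a trailing decimal point.
    """
    text = f"{value:.{digits}f}".rstrip("0").rstrip(".")
    # Tiny negative values round to "-0"; print them as plain zero.
    return "0" if text in ("", "-0") else text


if __name__ == "__main__":
    for value in (0.5874999999999999, 0.30000000000000004, 0, 10.0):
        print(format_coord(value))
```

Stripping only after the decimal point has been emitted keeps whole numbers such as 10 intact.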

jeromekelleher commented 4 years ago

Or, we could parse the tikz output to make sure it contains the things we think it should. It looks pretty well structured. That combined with a few good known examples that we compare against should give good enough testing.
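A sketch of what that parsing might look like, assuming the output keeps the `\foreach ... in {...}` lists in the form produced so far. The helper names are hypothetical, and the regular expression only handles lists without nested braces:

```python
import re


def foreach_items(tikz, variables):
    """Return the items of the first \\foreach list over ``variables``."""
    pattern = r"\\foreach " + re.escape(variables) + r" in \{([^}]*)\}"
    match = re.search(pattern, tikz)
    if match is None:
        raise ValueError(f"no \\foreach list over {variables} found")
    return [item.strip() for item in match.group(1).split(",") if item.strip()]


def parse_node_coords(tikz):
    """Map node name -> (x, y) from the node coordinate list."""
    coords = {}
    for item in foreach_items(tikz, "\\name/\\x/\\y"):
        name, x, y = item.split("/")
        coords[name] = (float(x), float(y))
    return coords


def parse_edges(tikz):
    """Return (child, parent) name pairs from the edge list."""
    return [tuple(item.split("/")) for item in foreach_items(tikz, "\\a/\\b")]


if __name__ == "__main__":
    sample = (
        "\\foreach \\name/\\x/\\y in {n2/0.5/1, n0/0/0, n1/1/0}\n"
        "    \\node[node] (\\name) at (\\x, \\y) {};\n"
        "\\foreach \\a/\\b in {n0/n2, n1/n2}\n"
        "    \\path[edge] (\\a) |- (\\b);\n"
    )
    print(parse_node_coords(sample))
    print(parse_edges(sample))
```

A test could then check structural properties, for example that every edge end point is a declared node and that each child has exactly one parent, without fixing the exact text.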

hyanwong commented 4 years ago

(One minor thing - we should reduce the precision of the output coordinates, there's no point in 14 digits of precision here and it just makes the output harder to read.)

There's a function in the drawing code that @benjeffery wrote to do this. Which brings up an additional point: how much code should we aim to share between SVG drawing and tikz drawing? There may be generalizations (trivially, the precision reduction code) that would be useful to share.

jeromekelleher commented 4 years ago

Which brings up an additional point: how much code should we aim to share between SVG drawing and tikz drawing? There may be generalizations (trivially, the precision reduction code) that would be useful to share.

Probably a fair bit, since they're both working in a continuous coordinate space. The class structure should help, hopefully there could be some superclass of both that does the coordinate calculation. It's not certain though - I ended up making things a lot more complicated than they needed to be by originally sharing code between the text and SVG backends. If it's simpler to just make something standalone, I'd do that. Unless the coordinates that we need in the two end up being identical, then it's probably best to keep them separate.

hyanwong commented 4 years ago

Other high-level ideas to share: does tikz have the concept of grouping elements together? If so, do you want to group the "types" of drawing elements, like node symbols, labels, etc, (as we did in the first SVG iteration) or do you want to use nested grouping to capture the tree structure itself? This is super-powerful in SVG, but perhaps overkill for tikz.

jeromekelleher commented 4 years ago

That's too much I think. The backends are different, trying to share too much between them will just make them all more complicated than they need to be. I wouldn't try to share anything except the coordinate computations.

gtsambos commented 4 years ago

If you want to help out, adding support for mutations seems like a logical next step.

great, I'll look into this! The node/mutation colour stuff can be left till later, I think

hyanwong commented 4 years ago

Yeah, I wasn't meaning sharing any code in this respect, merely whether the idea was a fruitful one to think about. But I agree, it's probably overkill (and a preliminary inspection of tikz doesn't reveal anything in the line of logical grouping).

FYI - the logical groupings in SVG are used to transform the coordinate space so that each subtree starts again with the root node at [0,0] (locally), so unless you wanted to duplicate that idea, then even the coordinate calculations wouldn't be easily ported over from SVG.

jeromekelleher commented 4 years ago

True - probably easiest to just do all the coordinate calcs separately. I think the tree layout stuff is stable enough now that we don't imagine things changing much in future, so having three versions of almost the same thing isn't so bad.

grahamgower commented 4 years ago

I think there might be issues with sharing coordinates between tex and svg implementations. Consider that units of length in tex are not at all like those available in a general purpose programming language. In tex, dimensions are limited to +/- 16383.99999pt. I'm not at all sure how this would affect us, but it means that any intermediate calculations performed inside tex must also not exceed these limits---so this could bite in surprising ways.
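One way to guard against this on the Python side is a conservative check before writing the file. This is only a sketch: it assumes the default tikz unit of 1cm, treats `scale` as a plain multiplier, and ignores any intermediate arithmetic that tex does internally. The function name is hypothetical:

```python
# TeX stores lengths as integer multiples of 1/65536 pt, with the largest
# representable value being 2**30 - 1 of those units (about 16383.99999pt).
MAX_DIMEN_PT = (2**30 - 1) / 65536
# TeX points per centimetre (72.27 pt per inch, 2.54 cm per inch).
PT_PER_CM = 72.27 / 2.54


def exceeds_tex_limit(coords, scale=1.0):
    """
    Return True if any scaled coordinate, read as centimetres, would be
    larger than the maximum tex dimension.
    """
    return any(
        abs(value) * scale * PT_PER_CM > MAX_DIMEN_PT
        for xy in coords
        for value in xy
    )


if __name__ == "__main__":
    print(exceeds_tex_limit([(0.0, 0.0), (1.0, 0.9)], scale=4))
    print(exceeds_tex_limit([(1000.0, 0.0)]))
```

With unit width and height the final positions are nowhere near the limit, which is another argument for keeping coordinates in the unit square.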

Also consider that for svg, width and height parameters in pixels make a lot of sense for a web browser. On the other hand, one rarely talks of pixels in a tex document. Scaling is done to the space available, e.g. like 0.5\textwidth, so I thought it might be simpler to use unit width/height in the tikz code. Probably this is less important for the issue of code sharing though (as the svg drawing code could easily multiply unit lengths by width and height).

gtsambos commented 4 years ago

FYI - the logical groupings in SVG are used to transform the coordinate space so that each subtree starts again with the root node at [0,0] (locally), so unless you wanted to duplicate that idea, then even the coordinate calculations wouldn't be easily ported over from SVG.

True - probably easiest to just do all the coordinate calcs separately.

I think there might be issues with sharing coordinates between tex and svg implementations.

ah, that's a shame, that's fine though! I hadn't dug into the details like this, I was just wondering whether it was possible/practical. Sounds like it isn't

hyanwong commented 4 years ago

FYI - the logical groupings in SVG are used to transform the coordinate space so that each subtree starts again with the root node at [0,0] (locally), so unless you wanted to duplicate that idea, then even the coordinate calculations wouldn't be easily ported over from SVG.

True - probably easiest to just do all the coordinate calcs separately.

I think there might be issues with sharing coordinates between tex and svg implementations.

ah, that's a shame, that's fine though! I hadn't dug into the details like this, I was just wondering whether it was possible/practical. Sounds like it isn't

If you did want to use the same principle, I guess you could keep hold of a stack of transforms within python (they are only ever x/y translations), and apply the transformation stack before outputting the coords to tikz. Sounds complicated, but it's possible you might end up doing something like this anyway, without realising it!

grahamgower commented 4 years ago

Giving each tree in a sequence its own set of coordinates sounds logical, and simpler than the alternative. I don't see any reason we couldn't do that and just output one tikz \pic per tree.

hyanwong commented 4 years ago

Giving each tree in a sequence its own set of coordinates sounds logical, and simpler than the alternative. I don't see any reason we couldn't do that and just output one tikz \pic per tree.

No, sorry, I meant subtree. So each internal node within a single tree has a simple x/y translation applied to it, and this is applied iteratively, in a nested way down the tree.
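For what it's worth, turning nested translations like these back into absolute coordinates is short in Python. This sketch assumes a hypothetical representation where `offsets[u]` is the (dx, dy) translation of node `u` relative to its parent, and it does not reflect the actual data structures in the svg code:

```python
def absolute_coords(children, offsets, root):
    """
    Accumulate nested translations down the tree.

    ``children`` maps node -> list of child nodes, and ``offsets`` maps
    node -> (dx, dy) relative to the node's parent (for the root, relative
    to the origin). Returns node -> absolute (x, y).
    """
    coords = {}
    stack = [(root, (0.0, 0.0))]
    while stack:
        node, (parent_x, parent_y) = stack.pop()
        dx, dy = offsets[node]
        coords[node] = (parent_x + dx, parent_y + dy)
        for child in children.get(node, []):
            stack.append((child, coords[node]))
    return coords


if __name__ == "__main__":
    children = {2: [0, 1]}
    offsets = {2: (0.5, 1.0), 0: (-0.5, -1.0), 1: (0.5, -1.0)}
    print(absolute_coords(children, offsets, root=2))
```

The absolute values could then be written straight into the tikz node list.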

grahamgower commented 4 years ago

Ah, ok. Sounds like I need to go through and grok the svg code.

hyanwong commented 4 years ago

But actually, looking at the code again, we do define universal x,y coordinates for nodes in a tree (saved in node_x_coord_map and node_y_coord_map), but then we translate those into successive transformations when outputting to SVG. So there is some possibility of sharing the code for coordinate calculation here.

Some of the complexities I've had to deal with just now are what to do when there are mutations above the root node (do we add an extra branch above the root to place the mutations), and how to plot the X axis in a tree sequence, so that we have an idea of how much genome each tree covers (see e.g. https://github.com/tskit-dev/tskit/raw/master/python/tests/data/svg/ts.svg ).

If you want to play with the SVG code, then addressing https://github.com/tskit-dev/tskit/issues/580 would scratch an itch, and you might want something like that in the tikz code anyway. But it might be too distracting! Getting a workable tikz implementation seems more of a priority.

hyanwong commented 4 years ago

How much of this "provide a visual output in format X" should go in the https://github.com/tskit-dev/tsviz repo, and how much should be base tskit, do you think @jeromekelleher ?

petrelharp commented 4 years ago

My inclination is towards tsviz, since it'd be nice if e.g. breakages in someone's latex installation didn't prevent them from running the tskit tests.

hyanwong commented 4 years ago

It would also mean that it won't be too bad to include some minimal TeX installation as a dev dependency, I suppose.

gtsambos commented 4 years ago

My inclination is towards tsviz, since it'd be nice if e.g. breakages in someone's latex installation didn't prevent them from running the tskit tests.

I don't think the tskit tests will involve latex, I think we'll just be parsing the .tex output string and comparing it against a known output/list of features.

Btw, this is an active PR now, feel free to look at progress over at #798 (sorry I've been a bit slow on my end @grahamgower !)

grahamgower commented 4 years ago

Even if we decided that a TeX installation is an ok dev requirement, it's probably not a reasonable CI requirement. Unless "apt-get install texlive" completes on a CI machine in reasonable time? (I have low expectations!).

I'm agnostic about where draw_tikz ends up. It's really easy to call import foo; foo.draw_tikz(ts, ...) instead of ts.draw_tikz(...). The bigger caveat I think, would be that an external home for draw_tikz makes it less likely we'll merge common code between draw_tikz and draw_svg.

petrelharp commented 4 years ago

I don't think the tskit tests will involve latex, I think we'll just be parsing the .tex output string and comparing it against a known output/list of features.

Ah, ok. That makes sense.

The bigger caveat I think, would be that an external home for draw_tikz makes it less likely we'll merge common code between draw_tikz and draw_svg.

That's a very good point. Perhaps we should leave it in tskit. (up to you all, really)

hyanwong commented 4 years ago

The bigger caveat I think, would be that an external home for draw_tikz makes it less likely we'll merge common code between draw_tikz and draw_svg.

That's a very good point. Perhaps we should leave it in tskit. (up to you all, really)

We could make the common code accessible from tskit for use in tsviz, I suppose? Perhaps not a "public" API, but something for internal use? it might force us to think about what we want to expose, in case anyone wants to create another output format.

jeromekelleher commented 4 years ago

I think we should keep it simple here. Tsviz isn't a packaged repository, and setting it up/bringing it up to production quality would be a whole pile of work. The rule I've been working with recently for supported viz/exporting to formats is that we keep things within tskit so long as we don't incur any extra dependencies. So, outputting tikz code is fine because it's just a bit of text. Writing out PNGs, on the other hand, is clearly out because we'd have to depend on some heavyweight library to do this.

Testing is an issue, here. I agree we shouldn't make tex a development or CI dependency. If we can't test effectively without compiling the tex, then maybe we have to consider making it third-party (but I hope we don't).

hyanwong commented 4 years ago

The rule I've been working with recently for supported viz/exporting to formats is that we keep things within tskit so long as we don't incur any extra dependencies. So, outputting tikz code is fine because it's just a bit of text.

Sounds sensible. Eventually, however, I would rather like a generic way of calling the drawing routines, so that it's easy to add another using the same convention. For example, ts.draw(tskit.drawing.Svg, **kwargs) could be an alternative convention to ts.draw_svg(**kwargs). Similarly, ts.draw_tikz(**kwargs) == ts.draw(tskit.drawing.Tikz, **kwargs). That way we can easily plug in something from tsviz, e.g. ts.draw(tsviz.Png, **kwargs), or whatever formats we eventually support.

ts.draw() is already used, so if we don't want to obsolete this, we could have ts.plot() or something similar.
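A toy sketch of that calling convention, to show the shape of it. Everything here is hypothetical: the two backend classes stand in for things like tskit.drawing.Svg or tskit.drawing.Tikz (which are not existing classes), and the input is just a list of node IDs rather than a tree sequence:

```python
class TextSketch:
    """Hypothetical backend that renders node IDs as a plain-text list."""

    def __init__(self, sep=", "):
        self.sep = sep

    def render(self, nodes):
        return self.sep.join(str(u) for u in nodes)


class TikzSketch:
    """Hypothetical backend that renders node IDs as a row of tikz nodes."""

    def __init__(self, scale=1):
        self.scale = scale

    def render(self, nodes):
        body = "\n".join(
            f"    \\node (n{u}) at ({i}, 0) {{{u}}};" for i, u in enumerate(nodes)
        )
        return (
            f"\\begin{{tikzpicture}}[scale={self.scale}]\n"
            f"{body}\n\\end{{tikzpicture}}"
        )


def draw(nodes, backend, **kwargs):
    """Build the given backend class with ``kwargs`` and render with it."""
    return backend(**kwargs).render(nodes)


if __name__ == "__main__":
    print(draw([0, 1, 2], TextSketch))
    print(draw([0, 1, 2], TikzSketch, scale=2))
```

Any class with a matching constructor and `render` method would plug in the same way, including one defined in another package.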

benjeffery commented 4 years ago

For my two cents: This belongs in tskit, as it is something we want to support and maintain. Putting it anywhere else also reduces discoverability.

Testing hurdles are surmountable - for example if we really need to compile the tex, only run it on CI by default. CI can use a docker container to avoid a lengthy texlive install.