linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License
35 stars 6 forks source link

Add treemap plot #43

Closed JobLeonard closed 7 years ago

JobLeonard commented 8 years ago

As mentioned in #15:

Paper on how to make squarified treemaps:

http://www.win.tue.nl/~vanwijk/stm.pdf

Found on the page for this jQuery based implementation:

http://jstreemap.com/

Recharts also has a react-implementation of a treemap, but it sadly comes with too much baggage to be directly usable:

http://recharts.org/examples/#CustomContentTreemap

However, it has made me think about how to implement the data structure underlying the treemap.

Here is how Recharts does it. It is an easy-to-read but not very efficient lay-out of the data:

const data = [
          {
            name: 'data',
            children: [
              { name: 'Data', size: 20544 },
              { name: 'DataList', size: 19788 },
              { name: 'DataSprite', size: 10349 },
              { name: 'EdgeSprite', size: 3301 },
              { name: 'NodeSprite', size: 19382 },
              {
                name: 'render',
                children: [
                  { name: 'ArrowType', size: 698 },
                  { name: 'EdgeRenderer', size: 5569 },
                  { name: 'IRenderer', size: 353 },
                  { name: 'ShapeRenderer', size: 2247 },
                ],
              },
              { name: 'ScaleBinding', size: 11275 },
              { name: 'Tree', size: 7147 },
              { name: 'TreeBuilder', size: 9930 },
            ],
          },
        ];

The tree is built up out of an array of objects (just one in this example), which can contain more arrays of objects.

Each object has a name, and either a size or an array of children; if no size is present it's calculated by looping over the children. The treemap is built up recursively that way.

This is intuitive, but what's more efficient is to have an objects of arrays:

const data = {
          names: [ 'data' ],
          sizes: [ 110583 ],
          children: [ 
            {
              names: [
                'Data', 'DataList', 'DataSprite',
                'EdgeSprite', 'NodeSprite', 'render',
                'ScaleBinding', 'Tree', 'TreeBuilder',
              ],
              sizes: [
                20544, 19788, 10349,
                3301, 19382,  8867 ,
                11275, 7147, 9930,
              ],
              children: [ 
                null, null, null,
                null, null, {
                  names: [
                    'ArrowType', 'EdgeRenderer', 
                    'IRenderer', 'ShapeRenderer',
                  ],
                  sizes: [
                    698, 5569, 353, 2247,
                  ],
                  children: null,
                },
                null, null, null,
              ],
            },
          ],
        };

It may not be directly obvious why this is more efficient. Compare looking up the names of children before and after. Before, we had to:

In the second version, we:

Aside from the initial array dereferencing, we skip an object dereference for per child.

Same with the sizes array. First of all, there's no reason to calculate the size for the nodes every time, since the data is fixed anyway. Just calculate it once and store it in the sizes array; if we want to display two nesting levels of a tree that goes ten levels deep that saves us pointless recursion. Second of all, JS engines are smart enough to detect if an array has only integers, and creates a fast, dense array. This is even more memory- and cache friendly. Heck, we could even go nuts and use a TypedArray.

Another benefit of this split is that quite often we only care about the sizes. The output of Recharts' treemap itself is actually a good illustration of this:

image

We don't care about the names of deeply nested objects! With this set-up we can just loop over the sizes array to fill a square and then for each sub-tile recurse over children.

(aside: this treemap is trying to pack its contents as squares, which has all kinds of issues, as explained in the paper by Van Wijk)

That leaves the children array. Again, the benefit of separating this from the size array is that we only need to loop over it if we want to display deeper levels.

However, for this recursion we still need to check for every entry if it's a leaf or a node. Applying the same optimisation trick as before to separate leaves and nodes fixes that:

const data = {
          nodeNames: [ 'data' ],
          nodeSizes: [ 110583 ],
          nodeChildren: [ 
            {
              leafNames: [
                'Data', 'DataList', 'DataSprite',
                'EdgeSprite', 'NodeSprite', 'render',
                'ScaleBinding', 'Tree', 'TreeBuilder',
              ],
              leafSizes: [
                20544, 19788, 10349,
                3301, 19382,  8867 ,
                11275, 7147, 9930,
              ],
              nodeNames: [ 'render' ],
              nodeSizes: [ 8867 ],
              nodeChildren: [ 
                {
                  leafNames: [
                    'ArrowType', 'EdgeRenderer', 
                    'IRenderer', 'ShapeRenderer',
                  ],
                  leafSizes: [
                    698, 5569, 353, 2247,
                  ]
                },
              ],
            },
          ],
        };

Admittedly, neither of these are as human-readable compared to the first data structure. However, it's more efficient for computers. Also, it's not going to be made by humans, it's going to be computer-generated.

Note that the examples given here do not show off the benefits best - these are bigger when there are more leaves and children for every node. However, it should be noted the worst case scenario for this scheme (one child per node, many levels deep) no worse than the original scheme, and every other scenario is better.

BONUS CONTENT: my shitty sketches of these three schemes

2016-08-13 18 04 25-2

2016-08-13 18 04 38-2

2016-08-13 18 07 55-2

slinnarsson commented 8 years ago

One thing to remember is that I would like the tiles to not be simply uniformly colored, but to actually show single cells. That is, if there’s a tile deep down which represents 200 cells, it might be a 20x10 pixel rectangle (or, say 40x20) and each pixel would have an intensity given by the expression of some gene in those cells.

Recall that we have an expression matrix (loom file) with ~25,000 rows (genes) and, say, 200,000 columns (cells). There would be a total of 25,000 different treemaps, one per gene. The layout would be the same for all of them, but the pixel intensities would be different. So one challenge is how to (efficiently) take a single row of data from the matrix and paint it onto the tiles of the treemap, while still being able to zoom in and out by clicking on tiles. One solution would be to have off-canvas bitmaps for each tile, paint them with data for the given gene, then draw them onto the canvas.

On 13 Aug 2016, at 18:31, Job van der Zwan notifications@github.com<mailto:notifications@github.com> wrote:

As mentioned in #15https://github.com/linnarsson-lab/Loom/issues/15:

Paper on how to make squarified treemaps:

http://www.win.tue.nl/~vanwijk/stm.pdfhttp://www.win.tue.nl/%7Evanwijk/stm.pdf

Found on the page for this jQuery based implementation:

http://jstreemap.com/

Recharts also has a react-implementation of a treemap, but it sadly comes with too much baggage to be directly usable:

http://recharts.org/examples/#CustomContentTreemap

However, it has made me think about how to implement the data structure underlying the treemap.

Here is how Recharts does it. It is an easy-to-read but not very efficient lay-out of the data:

const data = [ { name: 'data', children: [ { name: 'Data', size: 20544 }, { name: 'DataList', size: 19788 }, { name: 'DataSprite', size: 10349 }, { name: 'EdgeSprite', size: 3301 }, { name: 'NodeSprite', size: 19382 }, { name: 'render', children: [ { name: 'ArrowType', size: 698 }, { name: 'EdgeRenderer', size: 5569 }, { name: 'IRenderer', size: 353 }, { name: 'ShapeRenderer', size: 2247 }, ], }, { name: 'ScaleBinding', size: 11275 }, { name: 'Tree', size: 7147 }, { name: 'TreeBuilder', size: 9930 }, ], }, ];

The tree is built up out of an array of objects (just one in this example), which can contain more arrays of objects.

Each object has a name, and either a size or an array of children; if no size is present it's calculated by looping over the children. The treemap is built up recursively that way.

This is intuitive, but what's more efficient is to have an objects of arrays:

const data = { names: [ 'data' ], sizes: [ 110583 ], children: [ { names: [ 'Data', 'DataList', 'DataSprite', 'EdgeSprite', 'NodeSprite', 'render', 'ScaleBinding', 'Tree', 'TreeBuilder', ], sizes: [ 20544, 19788, 10349, 3301, 19382, 8867 , 11275, 7147, 9930, ], children: [ null, null, null, null, null, { names: [ 'ArrowType', 'EdgeRenderer', 'IRenderer', 'ShapeRenderer', ], sizes: [ 698, 5569, 353, 2247, ], children: null, }, null, null, null, ], }, ], };

It may not be directly obvious why this is more efficient. Compare looking up the names of children before and after. Before, we had to:

In the second version, we:

Aside from the initial array dereferencing, we skip an object dereference for per child.

Same with the sizes array. First of all, there's no reason to calculate the size for the nodes every time, since the data is fixed anyway. Just calculate it once and store it in the sizes array; if we want to display two nesting levels of a tree that goes ten levels deep that saves us pointless recursion. Second of all, JS engines are smart enough to detect if an array has only integers, and creates a fast, dense array. This is even more memory- and cache friendly. Heck, we could even go nuts and use a TypedArray.

Another benefit of this split is that quite often we only care about the sizes. The output of Recharts' treemap itself is actually a good illustration of this:

[image]https://cloud.githubusercontent.com/assets/259840/17644150/6df3ab7a-617f-11e6-8f69-3bf849b03e29.png

We don't care about the names of deeply nested objects! With this set-up we can just loop over the sizes array to fill a square and then for each sub-tile recurse over children.

(aside: this treemap is trying to pack its contents as squares, which has all kinds of issues, as explained in the paper by Van Wijk)

That leaves the children array. Again, the benefit of separating this from the size array is that we only need to loop over it if we want to display deeper levels.

However, for this recursion we still need to check for every entry if it's a leaf or a node. Applying the same optimisation trick as before to separate leaves and nodes fixes that:

const data = { nodeNames: [ 'data' ], nodeSizes: [ 110583 ], nodeChildren: [ { leafNames: [ 'Data', 'DataList', 'DataSprite', 'EdgeSprite', 'NodeSprite', 'render', 'ScaleBinding', 'Tree', 'TreeBuilder', ], leafSizes: [ 20544, 19788, 10349, 3301, 19382, 8867 , 11275, 7147, 9930, ], nodeNames: [ 'render' ], nodeSizes: [ 8867 ], nodeChildren: [ { leafNames: [ 'ArrowType', 'EdgeRenderer', 'IRenderer', 'ShapeRenderer', ], leafSizes: [ 698, 5569, 353, 2247, ] }, ], }, ], };

Admittedly, neither of these are as human-readable compared to the first data structure. However, it's more efficient for computers. Also, it's not going to be made by humans, it's going to be computer-generated.

Note that the examples given here do not show off the benefits best - these are bigger when there are more leaves and children for every node. However, it should be noted the worst case scenario for this scheme (one child per node, many levels deep) no worse than the original scheme, and every other scenario is better.

BONUS CONTENT: my shitty sketches of these three schemes

[2016-08-13 18 04 25-2]https://cloud.githubusercontent.com/assets/259840/17644290/79727cd0-6182-11e6-9930-60f807c12ca8.jpg

[2016-08-13 18 04 38-2]https://cloud.githubusercontent.com/assets/259840/17644291/79761bce-6182-11e6-99b4-0311b71d1008.jpg

[2016-08-13 18 07 55-2]https://cloud.githubusercontent.com/assets/259840/17644289/79720552-6182-11e6-9533-76ecf32dc686.jpg

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/Loom/issues/43, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AKKagzeFe3kbBbmU0WzotcXS4XXyGbL_ks5qffF7gaJpZM4Jjtbs.

JobLeonard commented 8 years ago

One thing to remember is that I would like the tiles to not be simply uniformly colored, but to actually show single cells.

Aaah, yes! I remember now! Now things get interesting

Typed arrays are definitely the way to go in that case, but if we want to cycle over all N-thousand cells that might require a slightly different tree-structure; some form of flattening for optimisations perhaps.

Also, it might be the case that a voronoi treemap is more appropriate, given that a lot of the times our data will not neatly fit into a rectangular shape if we want every pixel to be a meaningful cluster; the flexibility of using convex cells of arbitrary shapes would fix that.

There would be a total of 25,000 different treemaps, one per gene. The layout would be the same for all of them, but the pixel intensities would be different.

This sounds like we might even want to do something as crazy as old-school palette cycling, which can in fact be done in WebGL with shaders.

The idea would be to generate a (voronoi?) treemap image once (per tile/zoom level), with unique pixel values for each cell, and then use a shader to combine these values with a look-up table to generate an output image. Each gene would then have its own look-up table.

Oooh, this is fun to think about, looking forward to mess around with this.

JobLeonard commented 8 years ago

An existing D3 implementation of a Voronoi treemap:

http://cse512-14w.github.io/fp-plvines-djpeter/demo.html

slinnarsson commented 8 years ago

This looks powerful:

https://get.carrotsearch.com/foamtree/demo/demos/index.html

“For pricing, contact us”, so I guess expensive, but maybe they would like to work with us?

Sten Linnarsson, PhD Professor of Molecular Systems Biology Karolinska Institutet Unit of Molecular Neurobiology Department of Medical Biochemistry and Biophysics Scheeles väg 1, 171 77 Stockholm, Sweden +46 8 52 48 75 77 (office) +46 70 399 32 06 (mobile)

On 14 Aug 2016, at 15:34, Job van der Zwan notifications@github.com<mailto:notifications@github.com> wrote:

An existing D3 implementation of a Voronoi treemap:

http://cse512-14w.github.io/fp-plvines-djpeter/demo.html

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/linnarsson-lab/Loom/issues/43#issuecomment-239673888, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AKKagyU9QCaFC0oBezaGVA97i-8S_ZFBks5qfxmBgaJpZM4Jjtbs.

JobLeonard commented 8 years ago

Aww, come on! I was really looking forward to working on implementing this dataviz, and now you show up with a ready-made product? :-P

(Also, it's basically a closed system with a long list of settings you can tweak; that kind of enterprise-y black-box solution kind of isn't appealing to me on a personal level, but I'll happily admit that it can save a lot of time if it fits our use-case)

I have to admit that that's impressive performance for sure, and tweaking the "Relaxation quality treshold" setting in this long list of settings suggests that they're doing Lloyd relaxation in-browser, which implies full Voronoi-treemap generation in the browser. And I obviously can't recreate every feature in that incredibly long list of settings, since they've been working on that with a team for well over a year.

It has strange bugs thoug:

image

image

The interface also has some strange quirks, and it does behave somewhat strangely on my mobile browser (not that that is a main target of us).

Also, just by a quick looking at the page source and the settings they provide, I think I have a decent idea of how they do it, and aligns with how I wanted to approach it too:

image

Regardless of whether we're going to go with this, they have a lot of settings that really allow you to tweak the structure of the treemap (squarified or Voronoi? Are the cells initialised randomly or according to some order? Are they rendered hierarchically or flattened? And so on), so maybe we could look at that next week and figure out which of these would be best suited for analysing single cell RNA data, and which features just get in the way.