@slinnarsson: unless you have another use for it, the schema is completely obsolete now as far as the client is concerned. So you can remove the schema from the server/loom spec if you want.
TL;DR: as part of the small-fixes round I implemented a simple form of type inference after acquiring attribute or gene data from the server. It is based on the data itself and does not use the schema provided. This also seems to improve speed/memory performance a bit, because a lot of gene data turns out to fit into `uint8` arrays, and most float values fit into `float32` instead of the default `float64`.
The following assumptions are made when determining the data type:
- If the data is string-based, we check how many unique values there are; if fewer than 256, we index it (like before), so we get the (quite significant) speed boost of using `uint8` arrays.
- If the array contains numbers, we check whether the values are floats or integers (by testing whether `array[i] === (array[i] | 0)`). If float, we also check whether all values fit within `float32`; we "compress" the array if so, and use `float64` otherwise. If integer, we compress to the smallest possible container that can hold all values (so the very common case of numbers representing categories, with fewer than 256 categories, gets converted to a `uint8` array).
In other words: we now always convert to the most compact data type, inferred at runtime from the data itself. We do this once, upon acquisition, for attributes as well as genes (it turns out a lot of gene values are just a bunch of integers; quite a few of them convert to `uint8` arrays). After that, the metadata keeps track of the array type, so copying requires no further checks and is very fast.
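Roughly, the numeric part of the inference looks like this (a sketch, not the actual implementation; the promotion rules are simplified):

```js
// Sketch: pick the most compact typed array for a plain number array.
// The integer test mirrors the `array[i] === (array[i] | 0)` check above;
// note that `| 0` only works for values within int32 range.
function inferTypedArray(data) {
  let isInt = true;
  let min = Infinity, max = -Infinity;
  for (let i = 0; i < data.length; i++) {
    const v = data[i];
    if (v !== (v | 0)) { isInt = false; }
    if (v < min) { min = v; }
    if (v > max) { max = v; }
  }
  if (isInt) {
    // smallest integer container that fits all values
    if (min >= 0 && max < 256) { return Uint8Array.from(data); }
    if (min >= 0 && max < 65536) { return Uint16Array.from(data); }
    if (min >= -128 && max < 128) { return Int8Array.from(data); }
    if (min >= -32768 && max < 32768) { return Int16Array.from(data); }
    if (min >= 0) { return Uint32Array.from(data); }
    return Int32Array.from(data);
  }
  // floats: use float32 if no value loses precision, float64 otherwise
  const f32 = Float32Array.from(data);
  for (let i = 0; i < data.length; i++) {
    if (f32[i] !== data[i]) { return Float64Array.from(data); }
  }
  return f32;
}
```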
New sub-task for the "minor clean-up" task that (ironically) keeps growing: restructure the redux store tree, rename fields where appropriate, and rewrite (read: simplify) the code that matches it.
Because the store state grew organically as I added features, the structure is pretty ad-hoc in some places. This has led to a few issues that should be ironed out, as they cause a lot of dumb problems while coding.
I think it's good to have a "whole-program" check-up like this once in a while anyway, to see if the parts still fit together nicely or if they feel more stuck together with duct tape.
Here's a view of the redux store after loading a dataset and a few genes (which represents the full data structure we have now):
Here are the issues I've spotted so far, and how I want to tackle them:
- `data` stores all fetched data except for the initial list of projects, `dataSets` the datasets, and `dataset` the name of one particular dataset. I have frequent typo-based bugs because I mix them up with each other and with local variables named `dataSet`, `datasets`, and so on.
- The field names in the `viewState` tree should be renamed, as they're converted to URL encoding and are just confusing and too long at the moment.
- Heatmap-related state (`zoomRange`, `fullZoomHeight`, etc.) is spread over a bunch of different fields in the dataset, which don't make it clear they're heatmap-related at all.
- `colAttrs`, `rowAttrs` and `fetchedGenes` all follow more or less the same logic. Furthermore, `fetchedGenes` is shaped identically to `colAttrs`, since both are collections of arrays with cell data. This pattern could perhaps be represented in the data structure, simplifying some of the code dealing with it (I'm already using "generic" functions for converting arrays to objects that contain typed arrays + metadata, for example). Fields like `rowOrder`, `colOrder`, `rowFiltered` and `colFiltered` could also be encapsulated in these objects.
- We no longer use `fetchingGenes` anywhere, but it's still in the code. There might be more things like this.

This will make it easier to modify code later on, and probably reveal a few sneaky bugs we have in there right now.
Ok, so I'm almost done and almost happy with the restructuring. Remember that the idea is as follows:
If no type is mentioned, it's an object holding other objects, as part of the data tree.
For each dataset entry:

- `[ { key: 'lastModified', asc: false}, ... ]` (sort order of the dataset list)
- `string` (generated from project+filename, to minimise chance of duplicate names)
- `string` (we primarily query by dataset, so it makes more sense to reverse order here)
- eight more `string` metadata fields
- `col` and `row` (`col` also contains the fetched genes), each holding:
  - `number array` (keeps track of which indices are filtered out, and by how many filters; works similar to reference-counting)
  - `[ { key: '(original order)', asc: false}, ... ]`
  - `array of strings` (attribute keys)
  - `array of strings` (`geneKeys`, `col` only)
  - per attribute:
    - `name`: `string` (might look like superfluous information, since we need the name to look up an attribute anyway, but this makes it easier for plotters to label an axis, for example, since the passed attribute object will have all required information)
    - `data`: `(typed) array`
    - `filteredData`: `(typed) array`
    - type of array
    - `uniques`: `[{ val: 3058, count: 3, filtered: false }, ...]`
    - look-up hashmap (colour indices)
    - two `number`s and a `boolean`

**`projects` and `dataSets`**

Instead of separate `dataSets` and `projects` fields, we just have a list of datasets, with `project` being a field on each individual dataset. This reflects the practical usage in the code better, even though the top-down "hierarchical" organisation is different.
While downloading the list of datasets we get every field except `data`. When opening a dataset, we then detect whether it has been fetched by checking if this `data` field is defined or not.
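As a sketch (field names hypothetical, apart from `data` and `lastModified`), a list entry and the fetched-check would look something like:

```js
// A dataset entry as it arrives in the list; `data` stays undefined
// until the dataset itself is opened and fetched.
const dataset = {
  project: 'SomeProject',    // hypothetical example values
  filename: 'cortex.loom',
  lastModified: '2016-10-01',
  data: undefined,
};

// Detecting whether a dataset has been fetched:
const isFetched = (ds) => ds.data !== undefined;
```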
Perhaps a server-side rewrite to serve JSON following this structure also makes sense.
I'm trying to structure the data in such a way that code reuse is easier. The idea is that by structuring the data in the right hierarchy and with the right metadata, I can make components agnostic to whether the data represents cells or genes (or columns or rows).
Here's how:
- `cell` and `gene` views both have metadata and scatterplot views, so we structure them similarly.
- `col` and `row` attributes are essentially the same thing, in the sense of being an array of data + metadata. So if we structure them the same way, they can use the same code.
- `col` and `gene` data have even more overlap: identically sized arrays representing (meta)data about all cells.

That last point is worth expanding upon, because we have two conflicting needs. First, we want to be able to sort and filter by both genes and column attributes on the same set of cells. That means we probably want to put all of these attributes in one big lookup object. At the same time, we need to be able to distinguish genes from column attributes, so we don't want them to be treated the same where they are not.
My solution is to have a field with `keys` representing attributes (`col` or `row`) and `geneKeys` for genes (`col` only). That way we can just pass the whole attribute object with all this metadata, and depending on the context our functions use/ignore either or both sets of keys.
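Roughly what I have in mind (the attribute names and the `attrs` field are illustrative):

```js
const col = {
  keys: ['Class', 'Total_molecules'], // column attributes
  geneKeys: ['Actb', 'Gad1'],         // fetched genes (col only)
  attrs: {
    // every key above maps to an array + metadata object
    Class: { /* ... */ },
    Total_molecules: { /* ... */ },
    Actb: { /* ... */ },
    Gad1: { /* ... */ },
  },
};

// A component can use either or both sets of keys, without caring
// whether a given entry is an attribute or a gene:
function selectableKeys(axisData, includeGenes) {
  return includeGenes && axisData.geneKeys
    ? axisData.keys.concat(axisData.geneKeys)
    : axisData.keys;
}
```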
This will also remove the need for adding `(gene)` to the input selections and testing for it in the code; you can just type to search for an attribute or gene in the same field. That should make the interaction a little more "fluid", and it also removes all the ugly `if (attr === '(gene)'){ ... }` code (at least I think it's ugly). The only downside would be that attribute names cannot overlap with gene names - is that a realistic worry?
(also, it just occurred to me why I had so much conceptual confusion in the beginning with col/row attributes: column attributes represent data about all columns, but that is essentially a row of data, so in some ways there's some odd flipping of column/row going on)
We don't want to restructure all of this again soon, so we should think ahead. This is also why I'm not entirely happy with the structure yet.
I still would like to get caching of datasets and fetched genes implemented. For that to work properly we need (among other things) to check whether the file on the server is the same as the one in the browser-side cache. This is quite simple in our scheme: check if `lastModified` has changed. So we don't have to do anything special to prepare for this.
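A minimal sketch of that check (the cache shape and the `fetchDataset` helper are hypothetical):

```js
// Reuse the browser-side copy only if the server's lastModified matches.
function getDataset(cache, meta) {
  const cached = cache[meta.path];
  if (cached && cached.lastModified === meta.lastModified) {
    return Promise.resolve(cached);
  }
  return fetchDataset(meta.path); // hypothetical network fetch
}
```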
Right now, whenever we enable/disable a filter, all attributes get re-filtered. This makes sense in the metadata page, where we display all of them, but if we render a scatterplot and only show two attributes, it's kind of silly to filter out dozens of arrays (even though the rendering is the bigger bottleneck). This problem gets worse with larger datasets, and if one has looked at (and thus downloaded) many genes.
So ideally, we would only calculate the filtered data for the attributes that are shown, and then memoize it. I'm not sure how to approach this yet, but again: it's not the main issue right now, so I can go ahead without figuring it out just yet. I also don't expect the implementation to require anything fundamental; it probably just needs some way of marking data as "dirty", plus on-the-fly calculation.
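One possible shape for this, assuming the reference-counting filter array described earlier (all names illustrative):

```js
// When a filter changes, just mark every attribute dirty...
function markDirty(attrs) {
  for (const key of Object.keys(attrs)) {
    attrs[key].dirty = true;
  }
}

// ...and recompute filteredData lazily, only when an attribute is shown.
function filteredData(attr, filterCount) {
  if (attr.dirty) {
    const kept = [];
    for (let i = 0; i < attr.data.length; i++) {
      // filterCount[i] === 0 means no filter excludes this index
      if (filterCount[i] === 0) { kept.push(attr.data[i]); }
    }
    attr.filteredData = kept;
    attr.dirty = false;
  }
  return attr.filteredData;
}
```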
The issue with Lasso Selection is that we end up with two very different ways of selecting data:
The latter being lasso selection. This needs to be structured in such a way that the two play together nicely, while also making it obvious which is which, so the code is easy to follow.
First, I think the most sensible thing to do would be to store the selected points, not the path that selected them; as soon as we change the x/y attributes, that path data is useless. So for that to work smoothly, there are two things we can store:
The reason is how we are going to use lasso selections: the main use will be as a filter, so we need to store the indices of the filtered data (because we need those to turn the filter off later). We also want to pre-calculate and re-use as much data as possible, so we only filter when a selection is made; at the same time it would be nice if we could highlight the (filtered) elements (which by definition are not in `filteredData`). So we need to put that somewhere too. And then it all has to play nicely with the aforementioned filtering scheme.
This is a big deal, so I'll hold off on the rewrite until I've figured out a scheme for this that doesn't feel "hacky" and works nicely with the existing features.
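For reference, a sketch of what storing selections as indices plus an always-increasing id could look like (the `nextId` bookkeeping is hypothetical; the `{ indices, id, filtered }` shape matches the scheme I settle on below):

```js
// nextId is kept outside the array so ids keep increasing even
// after selections are deleted.
function addSelection(selections, nextId, selectedIndices) {
  const selection = { indices: selectedIndices, id: nextId, filtered: false };
  return { selections: selections.concat([selection]), nextId: nextId + 1 };
}
```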
So still working on this.
Addendum: I forgot to mention that the proposed fusing of column attributes and genes also makes sorting by both attributes and genes trivial!
Side-note: I noticed that `loom_cache.py` tests for the presence of a loom file in the cache, but doesn't check if that file has changed. Maybe we want to implement that?
Ok, so I settled on this scheme:
For each dataset entry:

- `[ { key: 'lastModified', asc: false}, ... ]` (sort order of the dataset list)
- `string` (`project`+`name`, reduce duplicate names)
- `string` (we query by dataset, so invert hierarchy)
- eight more `string` metadata fields

`viewState` entries encapsulate view states per dataset.

Each `data` entry has the following fields:

- `col` and `row` (`col` also contains the fetched genes), each holding:
  - `array of strings` (attribute `keys`)
  - `array of strings` (`geneKeys`, `col` only)
  - `[ { key: '(original order)', asc: false}, ... ]`
  - `number array` (which indices are filtered out + by how many filters)
  - `[ { indices: <number array>, id: <number>, filtered: <boolean> }, ... ]` (array of objects with a unique id (always increasing - I doubt we'll have to worry about overflowing a uint32) plus an array of the selected indices)

Finally, the `attrs` entries can follow one of two kinds of schemes:

- `name`: `string`
- type of array
- `data`: `(typed) array`
- `filteredData`: `(typed) array`
- `indexedVal`: `(typed) array` (if defined, `data` and `filteredData` are indices to actual values, which can be looked up here)
- `uniques`: `[{ val: 3058, count: 3, filtered: false }, ...]` (values as in `data`/`filteredData`, so with indexed arrays this uses indices)
- `colorIndices`: two look-up hashmaps (twenty most frequent, twenty largest)
- two `number`s and a `boolean`

Or:

- `name`: `string`
- `uniqueVal`: value
The latter is to accommodate the fairly common case of data arrays that have one unique value. This will reduce the memory footprint, make testing for them really easy (is `uniqueVal` defined or not), and optimise our interfaces (for example, we can filter them out of the selectable attributes list for plotters, since the output would be garbage anyway).
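Sketched out, with field names taken from the discussion where given and assumed otherwise:

```js
// Normal attribute: data + metadata. `data` and `filteredData` hold
// indices into `indexedVal` when the array is indexed.
const normalAttr = {
  name: 'Class',
  data: new Uint8Array([0, 1, 0]),
  filteredData: new Uint8Array([0, 1]),
  indexedVal: ['neuron', 'glia'], // only defined for indexed arrays
  uniques: [
    { val: 0, count: 2, filtered: false },
    { val: 1, count: 1, filtered: false },
  ],
};

// Degenerate attribute: the whole column is one value.
const singleValAttr = {
  name: 'Species',
  uniqueVal: 'Mus musculus',
};

// Testing for the degenerate case is then trivial:
const isSingleValued = (attr) => attr.uniqueVal !== undefined;
```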
A few more tweaks have been made to the attribute schema:
- Instead of `mostFrequent` we now have `uniques`, which simply lists all unique values, in undefined sorting order.
- `colorIndices` is now an object with two hashmaps: the twenty most common values and the twenty biggest values. I figured those are the two most relevant types of stats. It's easy to extend this if other kinds (least common, smallest values) are also useful.
- The keys of `colorIndices` are always based on the values in `data`/`filteredData`, regardless of whether an array is indexed. This means colouring code does not need to care about whether an array is indexed or not.
- `uniques[i].val` is always the value seen in `data`/`filteredData`, regardless of whether an array is indexed.
- Converting indices back to original values is only needed for the `uniques` array. These are relatively rare and light-weight operations compared to filtering the `data` array and colouring the plots, which are the more common operations.
- Instead of `indexedString` (like we have now), any data type could theoretically be indexed, which is simply checked by the presence of an `indexedVal` field.
- Indices of indexed arrays are assigned in order of `mostFrequent`. By doing so, when colouring indexed arrays by `mostFrequent` (which is the more common case), we can replace `colorLUT[colorIndices[filteredData[i]]]` with `colorLUT[filteredData[i]]` (see the sketch after this list). This should improve render speeds, since that's one layer of indirection gone and `colorIndices` is a hashmap (our two main bottlenecks are memory access and drawing operations; this optimises the former).
- Selections will be stored as an array of indices, plus an id per created selection. The latter is used in the interface, so you can enable/disable filtering out a selection (or everything not selected), or delete a selection.
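To make the saved indirection concrete, here are the two colouring paths side by side (a sketch; names as used above):

```js
// Generic path: value -> colorIndices hashmap -> colour look-up table.
function colorAt(i, filteredData, colorIndices, colorLUT) {
  return colorLUT[colorIndices[filteredData[i]]];
}

// Fast path for indexed arrays: since indices are assigned by frequency,
// the stored index doubles as the colour index.
function colorAtIndexed(i, filteredData, colorLUT) {
  return colorLUT[filteredData[i]];
}
```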
I have also been thinking about making the canvas interactive, looking at the libraries out there. I'll post a write-up about my plans regarding that in a separate issue.
Parts of the code to update to new data scheme (should be less work than this large list makes it look like):
So, all 110 commits (I fixed the offset-bug in heatmap labels, and added a title) have been pushed. Let me know what I broke along the way! :)
So as far as I can see, there are two "blocking" issues that need to be resolved before I push the current update to the main server:
I'll try to get the last thingy sorted today, and then go back to the next update, which is rewriting the new server approach until I get feedback from you, Sten. Yes?
Continued in #98
I figured it might be nice to have a "general" issue that tracks what I'm focusing on right now, since all of these separate issues would make it trickier to follow which one is being prioritised. I'll comment here as things progress, so it's easier to track where I am (for both of us, really).
So, as noted in #71, these are the tasks I'm focusing on right now, in order:
However, with #73 it has become obvious that point 2 will be quite a large effort, so I'll shuffle things around: first the clean-up, then the paste-genes element, then the UI migration, then lasso select.
So my current task list looks like this: