linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Roadmap #75

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

I figured it might be nice to have a "general" issue that tracks what I am focusing on right now, since with all of these separate issues it's tricky to follow which one is being prioritised. I'll comment here as things progress, so it's easier to track where I am (for both of us, really).

So as noted in #71, I'm focusing on the following tasks right now, in order:

  1. minor clean-up of bugs
  2. redesign UI
  3. "paste genes" UI element
  4. lasso select

However, with #73 it becomes obvious that point 2 will be quite a large effort, so I'll shuffle this around: first clean-up, paste genes element, then UI migration, then lasso select.

So my current task list looks like this:

JobLeonard commented 7 years ago

@slinnarsson: unless you have another use for it, the schema is completely obsolete now as far as the client is concerned. So you can remove the schema from the server/loom spec if you want.

TL;DR: as part of the small-fixes round I implemented a simple form of type inference, applied after acquiring attribute or gene data from the server. It is based on the data itself and does not use the provided schema. This also seems to improve speed/memory performance a bit, because a lot of gene data turns out to fit into uint8 arrays, and most float values fit into float32 instead of the default float64.

The following assumptions are made when determining the data type:

If the data is string-based, we check how many unique values there are; if there are fewer than 256, we index it (like before) so we get the (quite significant) speed boost of using uint8 arrays.

If the array contains numbers, we check whether the values are floats or integers (by checking whether array[i] === (array[i] | 0)). If float, we also check whether all values fit within float32; if so we "compress" the array to float32, and use float64 otherwise. If integer, we compress to the smallest possible container that can hold all values (so the very common case of numbers representing categories, with fewer than 256 categories, gets converted to a uint8 array).

In other words: we now always convert to the most compact data type, inferred at runtime from the data itself. We do this once, upon acquisition, for attributes as well as genes (it turns out a lot of gene values are just a bunch of integers; quite a few of them are converted to uint8 arrays). After that the metadata keeps track of the array type, so copying requires no further checks and is very fast.
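For concreteness, here's a minimal sketch of that inference logic, assuming plain arrays fresh off the wire; the function name and return shape are illustrative, not the actual implementation:

```js
// Minimal sketch of runtime type inference; names are illustrative.
// Note: the real scheme orders string indices by frequency
// (see the later comments in this thread); this sketch uses first-seen order.
function inferArrayType(data) {
  if (typeof data[0] === 'string') {
    // Index strings when there are fewer than 256 unique values,
    // so the array can be stored as uint8.
    const index = new Map();
    for (const v of data) {
      if (!index.has(v)) { index.set(v, index.size + 1); } // indices start at 1
    }
    if (index.size < 256) {
      const indexed = new Uint8Array(data.length);
      for (let i = 0; i < data.length; i++) { indexed[i] = index.get(data[i]); }
      return { data: indexed, indexedVal: [...index.keys()] };
    }
    return { data }; // too many unique strings to index
  }
  // Numbers: integer vs float, then the smallest container that fits.
  let isInt = true, min = Infinity, max = -Infinity;
  for (const v of data) {
    isInt = isInt && v === (v | 0);
    if (v < min) { min = v; }
    if (v > max) { max = v; }
  }
  if (!isInt) {
    // Use float32 only if every value survives the round-trip.
    const fitsF32 = data.every(v => Math.fround(v) === v);
    return { data: fitsF32 ? Float32Array.from(data) : Float64Array.from(data) };
  }
  if (min >= 0 && max < 256) { return { data: Uint8Array.from(data) }; }
  if (min >= 0 && max < 65536) { return { data: Uint16Array.from(data) }; }
  if (min >= -32768 && max < 32768) { return { data: Int16Array.from(data) }; }
  return { data: Int32Array.from(data) };
}
```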

JobLeonard commented 7 years ago

New sub-task for the "minor clean-up" task that (ironically) keeps growing: restructure the redux store tree, rename fields where appropriate, and rewrite (read: simplify) the code that touches it.

Because the store state grew organically as I added features, the structure is pretty ad-hoc in some places. This has led to a few inconsistencies that should be ironed out, as they cause a lot of dumb mistakes while coding.

I think it's good to have a "whole-program" check-up like this once in a while anyway, to see if the parts still fit together nicely or if they feel more stuck together with duct tape.

Here's a view of the redux store after loading a dataset and a few genes (which represents the full data structure we have now):

(screenshot of the redux store state)

Here are the issues I've spotted so far, and how I want to tackle them:

This will make it easier to modify code later on, and probably reveal a few sneaky bugs we have in there right now.

JobLeonard commented 7 years ago

Ok, so I'm almost done and almost happy with the restructuring. Remember that the idea is as follows:

Proposed Structure

If no type is mentioned, it's an object holding other objects, as part of the data tree.

Explanations for the proposed changes

fusing projects and dataSets

Instead of separate dataSets and projects fields, we just have a list of datasets, with project being a field of each individual dataset. This reflects the practical usage in the code better, even though the top-down "hierarchical" organisation is different.

When downloading the list of datasets we get every field except data. When opening a dataset we then detect whether it has been fetched by checking if this data field is undefined or not.
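A rough sketch of what a fused entry might look like (title is a made-up field; project, data and lastModified are from the scheme described here):

```js
// Hypothetical shape of the fused dataset list.
const datasets = {
  'cortex.loom': {
    project: 'MouseBrain',   // project is now just a per-dataset field
    title: 'Cortex',         // illustrative field
    lastModified: 'Mon Jan 30 2017',
    data: undefined,         // stays undefined until the dataset is opened
  },
};

// Detecting whether a dataset still needs fetching:
const needsFetch = (path) => datasets[path].data === undefined;
```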

Perhaps a server-side rewrite to serve JSON following this structure also makes sense.

make the data structure follow the usage patterns

I'm trying to structure the data in such a way that code reuse is easier. The idea is that by structuring the data in the right hierarchy and with the right metadata, I can make components agnostic to whether the data represents cells or genes (or columns or rows).

Here's how:

That last point is worth expanding upon, because we have two conflicting needs. First, we want to be able to sort and filter by both genes and column attributes on the same set of cells. That means we probably want to put all of these attributes in one big lookup object. At the same time, we need to be able to distinguish genes from column attributes, so we don't want them treated identically everywhere.

My solution is to have a keys field for attributes (col or row) and geneKeys for genes (col only). That way we can just pass the whole attribute object with all this metadata, and depending on the context our functions use or ignore either or both sets of keys.

This will also remove the need for adding (gene) to the input selections and testing for it in the code; you can just type to search for an attribute or gene in the same field. That should make the interaction a little more "fluid", and it also removes all the ugly if (attr === '(gene)'){ ... } code (at least I think it's ugly). The only downside would be that attribute names cannot overlap with gene names - is that a realistic worry?
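To illustrate the keys/geneKeys split (attribute and gene names made up):

```js
// Sketch of the combined attribute object with separate key lists.
const col = {
  keys: ['Age', 'Class', 'Total_molecules'], // column attributes
  geneKeys: ['Actb', 'Gad1'],                // fetched genes (col only)
  attrs: {
    Age:  { /* attribute data + metadata */ },
    Actb: { /* gene data, same shape as any attribute */ },
  },
};

// A gene-agnostic component searches both key sets in one field...
const searchable = col.keys.concat(col.geneKeys);
// ...while a gene-only component just uses col.geneKeys.
```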

(also, it just occurred to me why I had so much conceptual confusion in the beginning with col/row attributes: column attributes represent data about all columns, but that is essentially a row of data, so in some ways there's some odd flipping of column/row going on)

Structuring the data for future features:

We don't want to restructure all of this again soon, so we should think ahead. This is also why I'm not entirely happy with the structure yet.

Caching

I still would like to get caching of datasets and fetched genes implemented. For that to work properly we need (among other things) to check whether the file on the server is the same as the one in the browser-side cache. This is quite simple in our scheme: check if lastModified has changed. So we don't have to do anything special to prepare for this.
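The check itself would be a one-liner, assuming the cached entry stores the server's reported lastModified:

```js
// Minimal sketch of the cache-freshness check (names illustrative).
function cacheIsFresh(cached, serverEntry) {
  return cached !== undefined &&
    cached.lastModified === serverEntry.lastModified;
}
```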

Lazy filtering

Right now, whenever we enable/disable a filter, all attributes get re-filtered. This makes sense on the metadata page, where we display all of them, but if we render a scatterplot and only show two attributes, it's kind of silly to filter dozens of arrays (even though the rendering is the bigger bottleneck). This problem gets worse with larger datasets, and when one has looked at (and thus downloaded) many genes.

So ideally, we would only calculate the filtered data for the attributes that are shown, and then memoize it. I'm not sure how to approach this yet, but again: it's not the main issue right now, so I can go ahead without figuring it out just yet. I also don't expect the implementation to be really fundamental; it probably just requires some way of marking data as "dirty" and calculating it on the fly.
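Something along these lines, as a sketch of the dirty-flag idea (all names illustrative):

```js
// Flip a dirty flag when the filter changes, and only re-filter
// attributes that actually get rendered.
function markAllDirty(attrs) {
  for (const key in attrs) { attrs[key].dirty = true; }
}

function filteredData(attr, keepMask) {
  if (attr.dirty) {
    // TypedArray.prototype.filter returns a new typed array of the same type.
    attr.filteredData = attr.data.filter((_, i) => keepMask[i]);
    attr.dirty = false;
  }
  return attr.filteredData;
}
```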

Lasso Selections

The issue with Lasso Selection is that we end up with two very different ways of selecting data:

The latter is lasso selection. This needs to be structured in such a way that it plays together nicely, while also making it obvious which is which so the code is easy to follow.

First, I think the most sensible thing to do would be to store the selected points, not the path that selected them; as soon as we change the x/y attributes, that data is useless. So for that to work smoothly, there are two things we can store:

The reason is how we are going to use lasso selections. The main use will be as a filter, so we need to store the indices of the filtered data (because we need those to turn the filter off later). We also want to pre-calculate and re-use as much data as possible, so we only filter when a selection is made. At the same time, it would be nice if we could highlight the filtered-out elements (which by definition are not in filteredData), so we need to put those somewhere too. And then it has to play nicely with the aforementioned

This is a big deal, so I'll hold off on the rewrite until I've figured out a scheme that doesn't feel "hacky" and works nicely with the existing features.

So still working on this.

JobLeonard commented 7 years ago

Addendum: I forgot to mention that the proposed fusing of column attributes and genes also makes sorting by both attributes and genes trivial!

JobLeonard commented 7 years ago

Side-note: I noticed that loom_cache.py tests for the presence of a loom file in the cache, but doesn't check if that file has changed. Maybe we want to implement that?

JobLeonard commented 7 years ago

Ok, so I settled on this scheme:

viewState entries encapsulate view states per dataset:

Each data entry has the following fields:

Finally, the attrs entries can follow one of two kinds of schemes:

Or:

The latter accommodates the fairly common case of data arrays that hold a single unique value. This reduces the memory footprint, makes testing for them really easy (is uniqueVal undefined or not?), and optimises our interfaces (for example, we can filter them out of the selectable attributes list for plotters, since the output would be garbage anyway). A hedged sketch of both schemes follows the list of tweaks below.

A few more tweaks have been made to the attribute schema:

  1. Instead of mostFrequent we now have uniques, which simply lists all unique values, in undefined sorting order.
  2. colorIndices is now an object with two hashmaps: the twenty most common values, and the twenty biggest values. I figured those are the two most relevant kinds of stats. It's easy to extend this if other kinds (least common, smallest values) also turn out to be useful.
  3. The keys in colorIndices are always based on the values in data/filteredData, regardless of whether an array is indexed. This means colouring code does not need to care about whether an array is indexed or not.
  4. Similarly, uniques[i].val is always the value seen in data/filteredData, regardless of whether an array is indexed.
    • So far I have seen only two use-cases where the actual value matters: labelling, and sorting the uniques array. These are relatively rare and light-weight operations compared to filtering the data array and colouring the plots, which are the more common operations.
  5. Instead of indexedString (like we have now), any data type can theoretically be indexed, which is simply signalled by the presence of an indexedVal field.
    • Indexing will be ordered by mostFrequent. By doing so, when colouring indexed arrays by most frequent value (which is the more common case), we can replace colorLUT[colorIndices[filteredData[i]]] with colorLUT[filteredData[i]]. This should improve render speeds, since that's one layer of indirection gone and colorIndices is a hashmap (our two main bottlenecks are memory access and drawing operations; this optimises the former).
    • As a consequence of this and the fact that we treat zero values differently for colour lookups, indexes start at 1.
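To make the above concrete, here's a rough reconstruction of the two attr schemes; any field name not mentioned in this thread (like mostFreq and maxVal inside colorIndices) is a guess for illustration:

```js
// Hedged reconstruction of the two attr schemes, pieced together from
// the description above; not the actual loom-viewer implementation.
const regularAttr = {
  arrayType: 'uint8',           // inferred type, tracked in metadata
  data: new Uint8Array(0),      // full array (index values if indexedVal is set)
  filteredData: new Uint8Array(0),
  uniques: [],                  // [{ val, count }, ...] in undefined order
  colorIndices: {
    mostFreq: {},               // value -> colour index, twenty most common values
    maxVal: {},                 // value -> colour index, twenty biggest values
  },
  indexedVal: undefined,        // original values when the array is indexed;
                                // indices start at 1 (0 is the "zero" colour)
};

// The degenerate scheme for arrays holding one unique value:
const uniformAttr = {
  uniqueVal: 'wildtype',
};

// Testing which scheme applies is then a single check:
const isUniform = (attr) => attr.uniqueVal !== undefined;
```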

Selections will be stored as an array of indices plus an id per created selection. The latter is used in the interface, so you can enable/disable filtering out a selection (or everything not selected), or delete a selection.
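As a sketch, a stored selection might look like this (names illustrative):

```js
// Hypothetical shape of a stored lasso selection.
const selection = {
  id: 'lasso-1',                           // used by the UI to toggle or delete it
  indices: Uint32Array.from([3, 14, 15]),  // selected points, not the lasso path
};
```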

I have also been thinking about making the canvas interactive, looking at the libraries out there. I'll post a write-up about my plans regarding that in a separate issue.

JobLeonard commented 7 years ago

Parts of the code to update to the new data scheme (this should be less work than the large list makes it look like):

JobLeonard commented 7 years ago

So, all 110 commits (I fixed the offset bug in heatmap labels, and added a title) have been pushed. Let me know what I broke along the way! :)

JobLeonard commented 7 years ago

So as far as I can see, there are two "blocking" issues that need to be resolved before I push the current update to the main server:

I'll try to get the last thingy sorted today, and then get back to the next update, which is rewriting the new server approach, until I get feedback from you, Sten. Yes?

JobLeonard commented 7 years ago

Continued in #98