@slinnarsson: unless you have another use for it, the schema is completely obsolete now as far as the client is concerned. So you can remove the schema from the server/loom spec if you want.
TL;DR: as part of the small-fixes round I implemented a simple form of type inference after acquiring attribute or gene data from the server. It is based on the data itself and does not use the schema provided. This also seems to improve speed/memory performance a bit, because a lot of gene data turns out to fit into `uint8` arrays, and most float values fit into `float32` instead of the default `float64`.
The following assumptions are made when determining the data type:
- If the data is string-based, we check how many unique values there are; if fewer than 256, we index it (like before), so we get the (quite significant) speed boost of using `uint8` arrays.
- If the array contains numbers, we check whether the values are floats or integers (by testing whether `array[i] === (array[i] | 0)`). If float, we also check whether all values fit within `float32`; we "compress" the array if so, and use `float64` otherwise. If integer, we compress to the smallest possible container that can hold all values (so the very common case of numbers representing categories, with fewer than 256 categories, gets converted to a `uint8` array).
In other words: we now always convert to the most compact data type, inferred at runtime from the data itself. We do this once, upon acquisition, for attributes as well as genes (it turns out a lot of gene values are just a bunch of integers; quite a few of them convert to `uint8` arrays). After that, the metadata keeps track of the array type, so copying requires no further checks and is very fast.
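Roughly, the numeric part of the inference looks like this (a sketch, not the actual implementation; the promotion rules are simplified):

```js
// Sketch: pick the most compact typed array for a plain number array.
// The integer test mirrors the `array[i] === (array[i] | 0)` check above;
// note that `| 0` only works for values within int32 range.
function inferTypedArray(data) {
  let isInt = true;
  let min = Infinity, max = -Infinity;
  for (let i = 0; i < data.length; i++) {
    const v = data[i];
    if (v !== (v | 0)) { isInt = false; }
    if (v < min) { min = v; }
    if (v > max) { max = v; }
  }
  if (isInt) {
    // smallest integer container that fits all values
    if (min >= 0 && max < 256) { return Uint8Array.from(data); }
    if (min >= 0 && max < 65536) { return Uint16Array.from(data); }
    if (min >= -128 && max < 128) { return Int8Array.from(data); }
    if (min >= -32768 && max < 32768) { return Int16Array.from(data); }
    if (min >= 0) { return Uint32Array.from(data); }
    return Int32Array.from(data);
  }
  // floats: use float32 if no value loses precision, float64 otherwise
  const f32 = Float32Array.from(data);
  for (let i = 0; i < data.length; i++) {
    if (f32[i] !== data[i]) { return Float64Array.from(data); }
  }
  return f32;
}
```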
New sub-task for the "minor clean-up" task that (ironically) keeps growing: restructure the redux store tree, rename fields where appropriate, and rewrite (read: simplify) the code that matches it.
Because the store state grew organically as I added features, the structure is pretty ad-hoc in some places. This has led to a few issues that should be ironed out, as they cause a lot of dumb problems while coding.
I think it's good to have a "whole-program" check-up like this once in a while anyway, to see if the parts still fit together nicely or if they feel more stuck together with duct tape.
Here's a view of the redux store after loading a dataset and a few genes (which represents the full data structure we have now):
Here are the issues I've spotted so far, and how I want to tackle them:
- `data` stores all fetched data except for the initial list of projects, `dataSets` the datasets, and `dataset` the name of one particular dataset. I have frequent typo-based bugs because I mix them up with each other and with local variables named `dataSet`, `datasets`, and so on.
- The field names in the `viewState` tree should be renamed, as they're converted to URL encoding and are just confusing and too long at the moment.
- Heatmap-related state (`zoomRange`, `fullZoomHeight`, etc.) is spread over a bunch of different fields in the dataset, which don't make it clear they're heatmap-related at all.
- `colAttrs`, `rowAttrs` and `fetchedGenes` all follow more or less the same logic. Furthermore, `fetchedGenes` is shaped identically to `colAttrs`, since both are collections of arrays with cell data. This pattern could perhaps be represented in the data structure, simplifying some of the code dealing with it (I'm already using "generic" functions for converting arrays to objects that contain typed arrays + metadata, for example). Fields like `rowOrder`, `colOrder`, `rowFiltered` and `colFiltered` could also be encapsulated in these objects.
- We no longer use `fetchingGenes` anywhere, but it's still in the code. There might be more things like this.

This will make it easier to modify code later on, and probably reveal a few sneaky bugs we have in there right now.
Ok, so I'm almost done and almost happy with the restructuring. Remember that the idea is as follows:
If no type is mentioned, it's an object holding other objects, as part of the data tree.
For each dataset entry:

- `[ { key: 'lastModified', asc: false}, ... ]` (sort order of the dataset list)
- `string` (generated from project+filename, to minimise chance of duplicate names)
- `string` (we primarily query by dataset, so it makes more sense to reverse order here)
- eight more `string` metadata fields
- `col` and `row` (`col` also contains the fetched genes), each holding:
  - `number array` (keeps track of which indices are filtered out, and by how many filters; works similar to reference-counting)
  - `[ { key: '(original order)', asc: false}, ... ]`
  - `array of strings` (attribute keys)
  - `array of strings` (`geneKeys`, `col` only)
  - per attribute:
    - `name`: `string` (might look like superfluous information, since we need the name to look up an attribute anyway, but this makes it easier for plotters to label an axis, for example, since the passed attribute object will have all required information)
    - `data`: `(typed) array`
    - `filteredData`: `(typed) array`
    - type of array
    - `uniques`: `[{ val: 3058, count: 3, filtered: false }, ...]`
    - look-up hashmap (colour indices)
    - two `number`s and a `boolean`

**`projects` and `dataSets`**

Instead of separate `dataSets` and `projects` fields, we just have a list of datasets, with `project` being a field on each individual dataset. This reflects the practical usage in the code better, even though the top-down "hierarchical" organisation is different.
While downloading the list of datasets we get every field except `data`. When opening a dataset, we then detect whether it has been fetched by checking if this `data` field is defined or not.
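As a sketch (field names hypothetical, apart from `data` and `lastModified`), a list entry and the fetched-check would look something like:

```js
// A dataset entry as it arrives in the list; `data` stays undefined
// until the dataset itself is opened and fetched.
const dataset = {
  project: 'SomeProject',    // hypothetical example values
  filename: 'cortex.loom',
  lastModified: '2016-10-01',
  data: undefined,
};

// Detecting whether a dataset has been fetched:
const isFetched = (ds) => ds.data !== undefined;
```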
Perhaps a server-side rewrite to serve JSON following this structure also makes sense.
I'm trying to structure the data in such a way that code reuse is easier. The idea is that by structuring the data in the right hierarchy and with the right metadata, I can make components agnostic to whether the data represents cells or genes (or columns or rows).
Here's how:
- `cell` and `gene` views both have metadata and scatterplot views, so we structure them similarly.
- `col` and `row` attributes are essentially the same thing, in the sense of being an array of data + metadata. So if we structure them the same way, they can use the same code.
- `col` and `gene` data have even more overlap: identically sized arrays representing (meta)data about all cells.

That last point is worth expanding upon, because we have two conflicting needs. First, we want to be able to sort and filter by both genes and column attributes on the same set of cells. That means we probably want to put all of these attributes in one big lookup object. At the same time, we need to be able to distinguish genes from column attributes, so we don't want them to be treated the same where they are not.
My solution is to have a field with `keys` representing attributes (`col` or `row`) and `geneKeys` for genes (`col` only). That way we can just pass the whole attribute object with all this metadata, and depending on the context our functions use/ignore either or both sets of keys.
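Roughly what I have in mind (the attribute names and the `attrs` field are illustrative):

```js
const col = {
  keys: ['Class', 'Total_molecules'], // column attributes
  geneKeys: ['Actb', 'Gad1'],         // fetched genes (col only)
  attrs: {
    // every key above maps to an array + metadata object
    Class: { /* ... */ },
    Total_molecules: { /* ... */ },
    Actb: { /* ... */ },
    Gad1: { /* ... */ },
  },
};

// A component can use either or both sets of keys, without caring
// whether a given entry is an attribute or a gene:
function selectableKeys(axisData, includeGenes) {
  return includeGenes && axisData.geneKeys
    ? axisData.keys.concat(axisData.geneKeys)
    : axisData.keys;
}
```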
This will also remove the need for adding `(gene)` to the input selections and testing for it in the code; you can just type to search for an attribute or gene in the same field. That should make the interaction a little more "fluid", and it also removes all the ugly `if (attr === '(gene)'){ ... }` code (at least I think it's ugly). The only downside would be that attribute names cannot overlap with gene names - is that a realistic worry?
(also, it just occurred to me why I had so much conceptual confusion in the beginning with col/row attributes: column attributes represent data about all columns, but that is essentially a row of data, so in some ways there's some odd flipping of column/row going on)
We don't want to restructure all of this again soon, so we should think ahead. This is also why I'm not entirely happy with the structure yet.
I still would like to get caching of datasets and fetched genes implemented. For that to work properly we need (among other things) to check whether the file on the server is the same as the one in the browser-side cache. This is quite simple in our scheme: check if `lastModified` has changed. So we don't have to do anything special to prepare for this.
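A minimal sketch of that check (the cache shape and the `fetchDataset` helper are hypothetical):

```js
// Reuse the browser-side copy only if the server's lastModified matches.
function getDataset(cache, meta) {
  const cached = cache[meta.path];
  if (cached && cached.lastModified === meta.lastModified) {
    return Promise.resolve(cached);
  }
  return fetchDataset(meta.path); // hypothetical network fetch
}
```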
Right now, whenever we enable/disable a filter, all attributes get re-filtered. This makes sense in the metadata page, where we display all of them, but if we render a scatterplot and only show two attributes, it's kind of silly to filter out dozens of arrays (even though the rendering is the bigger bottleneck). This problem gets worse with larger datasets, and if one has looked at (and thus downloaded) many genes.
So ideally, we would only calculate the filtered data for the attributes that are shown, and then memoize it. I'm not sure how to approach this yet, but again: it's not the main issue right now, so I can go ahead without figuring it out just yet. I also don't expect the implementation to require anything fundamental; it probably just needs some way of marking data as "dirty", plus on-the-fly calculation.
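One possible shape for this, assuming the reference-counting filter array described earlier (all names illustrative):

```js
// When a filter changes, just mark every attribute dirty...
function markDirty(attrs) {
  for (const key of Object.keys(attrs)) {
    attrs[key].dirty = true;
  }
}

// ...and recompute filteredData lazily, only when an attribute is shown.
function filteredData(attr, filterCount) {
  if (attr.dirty) {
    const kept = [];
    for (let i = 0; i < attr.data.length; i++) {
      // filterCount[i] === 0 means no filter excludes this index
      if (filterCount[i] === 0) { kept.push(attr.data[i]); }
    }
    attr.filteredData = kept;
    attr.dirty = false;
  }
  return attr.filteredData;
}
```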
The issue with Lasso Selection is that we end up with two very different ways of selecting data:
The latter being lasso selection. This needs to be structured in such a way that the two play together nicely, while also making it obvious which is which, so the code is easy to follow.
First, I think the most sensible thing to do would be to store the selected points, not the path that selected them; as soon as we change the x/y attributes, that path data is useless. So for that to work smoothly, there are two things we can store:
The reason is how we are going to use lasso selections: the main use will be as a filter, so we need to store the indices of the filtered data (because we need those to turn the filter off later). We also want to pre-calculate and re-use as much data as possible, so we only filter when a selection is made; at the same time it would be nice if we could highlight the (filtered) elements (which by definition are not in `filteredData`). So we need to put that somewhere too. And then it all has to play nicely with the aforementioned filtering scheme.
This is a big deal, so I'll hold off on the rewrite until I've figured out a scheme for this that doesn't feel "hacky" and works nicely with the existing features.
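For reference, a sketch of what storing selections as indices plus an always-increasing id could look like (the `nextId` bookkeeping is hypothetical; the `{ indices, id, filtered }` shape matches the scheme I settle on below):

```js
// nextId is kept outside the array so ids keep increasing even
// after selections are deleted.
function addSelection(selections, nextId, selectedIndices) {
  const selection = { indices: selectedIndices, id: nextId, filtered: false };
  return { selections: selections.concat([selection]), nextId: nextId + 1 };
}
```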
So still working on this.
Addendum: I forgot to mention that the proposed fusing of column attributes and genes also makes sorting by both attributes and genes trivial!
Side-note: I noticed that `loom_cache.py` tests for the presence of a loom file in the cache, but doesn't check if that file has changed. Maybe we want to implement that?
Ok, so I settled on this scheme:
For each dataset entry:

- `[ { key: 'lastModified', asc: false}, ... ]` (sort order of the dataset list)
- `string` (`project`+`name`, reduce duplicate names)
- `string` (we query by dataset, so invert hierarchy)
- eight more `string` metadata fields

`viewState` entries encapsulate view states per dataset.

Each `data` entry has the following fields:

- `col` and `row` (`col` also contains the fetched genes), each holding:
  - `array of strings` (attribute `keys`)
  - `array of strings` (`geneKeys`, `col` only)
  - `[ { key: '(original order)', asc: false}, ... ]`
  - `number array` (which indices are filtered out + by how many filters)
  - `[ { indices: <number array>, id: <number>, filtered: <boolean> }, ... ]` (array of objects with a unique id (always increasing - I doubt we'll have to worry about overflowing a uint32) plus an array of the selected indices)

Finally, the `attrs` entries can follow one of two kinds of schemes:

- `name`: `string`
- type of array
- `data`: `(typed) array`
- `filteredData`: `(typed) array`
- `indexedVal`: `(typed) array` (if defined, `data` and `filteredData` are indices to actual values, which can be looked up here)
- `uniques`: `[{ val: 3058, count: 3, filtered: false }, ...]` (values as in `data`/`filteredData`, so with indexed arrays this uses indices)
- `colorIndices`: two look-up hashmaps (twenty most frequent, twenty largest)
- two `number`s and a `boolean`

Or:

- `name`: `string`
- `uniqueVal`: value
The latter is to accommodate the fairly common case of data arrays that have one unique value. This will reduce the memory footprint, make testing for them really easy (is `uniqueVal` defined or not), and optimise our interfaces (for example, we can filter them out of the selectable attributes list for plotters, since the output would be garbage anyway).
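Sketched out, with field names taken from the discussion where given and assumed otherwise:

```js
// Normal attribute: data + metadata. `data` and `filteredData` hold
// indices into `indexedVal` when the array is indexed.
const normalAttr = {
  name: 'Class',
  data: new Uint8Array([0, 1, 0]),
  filteredData: new Uint8Array([0, 1]),
  indexedVal: ['neuron', 'glia'], // only defined for indexed arrays
  uniques: [
    { val: 0, count: 2, filtered: false },
    { val: 1, count: 1, filtered: false },
  ],
};

// Degenerate attribute: the whole column is one value.
const singleValAttr = {
  name: 'Species',
  uniqueVal: 'Mus musculus',
};

// Testing for the degenerate case is then trivial:
const isSingleValued = (attr) => attr.uniqueVal !== undefined;
```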
A few more tweaks have been made to the attribute schema:
- Instead of `mostFrequent` we now have `uniques`, which simply lists all unique values, in undefined sorting order.
- `colorIndices` is now an object with two hashmaps: the twenty most common values and the twenty biggest values. I figured those are the two most relevant types of stats. It's easy to extend this if other kinds (least common, smallest values) are also useful.
- The keys of `colorIndices` are always based on the values in `data`/`filteredData`, regardless of whether an array is indexed. This means colouring code does not need to care about whether an array is indexed or not.
- `uniques[i].val` is always the value seen in `data`/`filteredData`, regardless of whether an array is indexed.
- Converting indices back to original values is only needed for the `uniques` array. These are relatively rare and light-weight operations compared to filtering the `data` array and colouring the plots, which are the more common operations.
- Instead of `indexedString` (like we have now), any data type could theoretically be indexed, which is simply checked by the presence of an `indexedVal` field.
- Indices of indexed arrays are assigned in order of `mostFrequent`. By doing so, when colouring indexed arrays by `mostFrequent` (which is the more common case), we can replace `colorLUT[colorIndices[filteredData[i]]]` with `colorLUT[filteredData[i]]` (see the sketch after this list). This should improve render speeds, since that's one layer of indirection gone and `colorIndices` is a hashmap (our two main bottlenecks are memory access and drawing operations; this optimises the former).
- Selections will be stored as an array of indices, plus an id per created selection. The latter is used in the interface, so you can enable/disable filtering out a selection (or everything not selected), or delete a selection.
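To make the saved indirection concrete, here are the two colouring paths side by side (a sketch; names as used above):

```js
// Generic path: value -> colorIndices hashmap -> colour look-up table.
function colorAt(i, filteredData, colorIndices, colorLUT) {
  return colorLUT[colorIndices[filteredData[i]]];
}

// Fast path for indexed arrays: since indices are assigned by frequency,
// the stored index doubles as the colour index.
function colorAtIndexed(i, filteredData, colorLUT) {
  return colorLUT[filteredData[i]];
}
```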
I have also been thinking about making the canvas interactive, looking at the libraries out there. I'll post a write-up about my plans regarding that in a separate issue.
Parts of the code to update to new data scheme (should be less work than this large list makes it look like):
So, all 110 commits (I fixed the offset-bug in heatmap labels, and added a title) have been pushed. Let me know what I broke along the way! :)
So as far as I can see, there are two "blocking" issues that need to be resolved before I push the current update to the main server:
I'll try to get the last thingy sorted today, and then go back to the next update, which is rewriting the new server approach until I get feedback from you, Sten. Yes?
Continued in #98
I figured it might be nice to have a "general" issue that tracks what I'm focusing on right now, since all of these separate issues would make it trickier to follow which one is being prioritised. I'll comment here as things progress, so it's easier to track where I am (for both of us, really).
So, as noted in #71, these are the tasks I'm focusing on right now, in order:
However, with #73 it has become obvious that point 2 will be quite a large effort, so I'll shuffle things around: first the clean-up, then the paste-genes element, then the UI migration, then lasso select.
So my current task list looks like this: