Closed JobLeonard closed 7 years ago
See, I already made a mistake in the previous scheme: we're not going to filter per attribute, but per unique value in an attribute. So Class
is an attribute, oligodendrocytes
a unique value within this array. We filter on the latter.
Also, if one would filter out all oligo's in the Class
attribute, then obviously all associated subclasses in values in Subclass
would also be filtered out. It is probably useful to keep track of this, so that we, for example, can hide invisible items from the legend as well.
{
dataSets: {
[dataset]: {
filteredRowIndices: [0, 3, 4, 5, 14, 15, …],
rowAttrs: {
'(original order)': [0, 1, 2, 3, …],
[rowAttribute]: {
data: ['array', 'with', 'attribute', 'data', …],
mostFrequent: [
{
val: 'most common value',
count: /* nr of occurrences */,
filtered: false,
visible: true,
},
{
val: 'second most common value',
count: /* nr of occurrences */,
filtered: false,
visible: true,
},
…
],
colorIndices: /* hasmap of the twenty most
frequent values as keys,
with value being an index
for the color LUTs */
}
}
}
}
}
List of things to update:
filteredIndices
to select which data to pass to plotsfilteredIndices
?BTW, this probably looks like more work than it is (although it is a lot of work) - I'm just making a fine-grained overview of everything that needs updating so no part gets forgotten and we don't get new bugs.
Aaaand yet another change of plans. Luckily, it will make things much simpler.
I just remembered how wickedly dynamic JavaScrip really is: specifically, that arrays are objects (with some special tricks) in JS . Even typed arrays. Which means that instead of wrapping the attribute data arrays in an object with metadata, I can also just add all of this metadata to the array object itself. This will simplify things greatly, because it means all necessary metadata will always be available to components working with them; no need to pass the schema
through the props or anything.
float32
, float64
, string
, etc)colors
: hashmapfiltered
: boolean visible
: boolean(Also, In the process of switching to this scheme I've also implemented indexedString
arrays. String rows/columns with less than 256 unique values us it. Under the hood uses Uint8
arrays with a tiny string look-up-table. The decrease in lag is notable on my machine.)
Ok, scratch that last idea: it requires copying the entire original data array whenever we change the filtered data. Very wasteful. I also just wasted a day hunting bugs related to accidentally cyclic data structures because of this "great" plan. Arg.
The new structure is somewhere in-between the last two ideas:
{
dataSets: {
[dataset]: {
rowAttrs: {
[rowAttribute]: {
data: ['array', 'with', 'all', 'attribute', 'data', …],
filteredData: ['array', 'with', 'attribute', 'data', …],
colorIndices: /* hasmap of the twenty most frequent values as keys,
with value being an index for the color LUTs */
mostFrequent: [
{
val: 'most common value',
count: /* nr of occurrences */,
filtered: false,
visible: true,
},
{
val: 'second most common value',
count: /* nr of occurrences */,
filtered: false,
visible: true,
},
…,
{
val: 'all',
count: /* nr of occurrences */,
filtered: true,
visible: false,
},
…
],
}
}
}
}
}
We're getting there, more slowly than I want, but we're getting there.
I just implemented a basic check when the data arrays are first fetched to see if float64
arrays are actually not secretly integer
arrays. And one that compacts integer
arrays to their smallest value. Turns out 90% of our data fits in uint8
arrays.
The performance boost is real, holy moly.
So this is almost finished - a lot of improvements have been made along the way. Some small, some big.
Furthermore, these optimisations improved speed and reduced memory footprint:
Float64
array contains only integer values between 0 and 256, it is converted to Uint8
; this is notably faster, I might add)Note that only the last bullet point is directly related to this issue... But you know, all this stuff is interconnected, and I basically just did a lot of long-overdue cleanup along the way.
There are a few known bugs left to iron out, I'll spend tomorrow fixing those and checking if I missed any other sneaky ones. If I don't find anything obvious I'll push the new version to the server.
Implemented
As always, this is self-documentation and Rubber Duck Debugging/Architecture Planning. This is just a plan, things will change as I implement it, but since it touches almost every part of the client-side code-base I want to write it out first.
So this actually grew out of of having to fix that colouring bug in the metadata views. Currently we do a lot of recalculations of information that is not changing.
For example, the row and column metadata will not change for a given dataset. Yet every time we render a sparkline or scatterplot we calculate the 20 most frequent values on the fly, then determine the colours. This is wasteful, but ok: in my tests so far, even mobile browsers barely hiccup on this. Similarly, the way I have implemented sorting the data in sparklines is on the fly, every time the sparkline view is rendered.
We can do better, and probably simplify and clean up the code in the process.
What I want to do is calculate all of this information once, then let the views just display static data whenever possible. This should be done at the point when a dataset has been successfully fetched (or in the case of sorting/filtering: whenever the relevant setting changes).
For sorting/filtering, we'll store the original row and column orders as simple arrays (
[0, 1, 2, 3, ..]
). For values that are filtered out, we copy the array to afilteredIndices
without the indexes of the values filtered out (I've already made this, btw, it's just not used anywhere yet).The next step is to store the keys we sort by and in what order:
We can implement stable sort by moving a selected sort key to the front of the array, and then sort
filteredIndices
using a custom comparison function that goes through the array.In all of our views where we display row- or columndata, we then pass these filtered indices along with the data, then make a copy of the data using filtered indices:
Then we use
dataSubset
where we would use the originaldata
before.As for colour selection, we'll just calculate
nMostfrequent
for each row/column attribute when we receive the dataset, store that, and create a lookup hashtable for the colours.So the relevant part of the redux state tree will change as follows:
(changes are same for column attributes)
Note that I added an
(original order)
attribute. This is a simple trick to make sorting by original order trivial: just put that "attribute" to the front of the list of sort keys; means we can simplify our algorithm for that.This will require some deep rewrites in the plotting functions and the views, but the result should be a much more elegant code base (maybe smaller, even!), and easier to follow data structures in the redux side. It will also make sharing settings between the views easier, or sometimes even happen for free: since whether or not a row attribute is filtered will become part of the row attribute, that setting will be automatically shared between the views.