Pre-calc and memoize data when sensible

JobLeonard commented 7 years ago

As always, this is self-documentation and Rubber Duck Debugging/Architecture Planning. This is just a plan, things will change as I implement it, but since it touches almost every part of the client-side code-base I want to write it out first.

So this actually grew out of of having to fix that colouring bug in the metadata views. Currently we do a lot of recalculations of information that is not changing.

For example, the row and column metadata will not change for a given dataset. Yet every time we render a sparkline or scatterplot we calculate the 20 most frequent values on the fly, then determine the colours. This is wasteful, but ok: in my tests so far, even mobile browsers barely hiccup on this. Similarly, the way I have implemented sorting the data in sparklines is on the fly, every time the sparkline view is rendered.

We can do better, and probably simplify and clean up the code in the process.

What I want to do is calculate all of this information once, then let the views just display static data whenever possible. This should be done at the point when a dataset has been successfully fetched (or in the case of sorting/filtering: whenever the relevant setting changes).

For sorting/filtering, we'll store the original row and column orders as simple arrays ([0, 1, 2, 3, ..]). For values that are filtered out, we copy the array to a filteredIndices without the indexes of the values filtered out (I've already made this, btw, it's just not used anywhere yet).

The next step is to store the keys we sort by and in what order:

[{ key: <first row attribute to sort by>, ascending: true}, { key: <second row attribute>, ascending: false }, …]

We can implement stable sort by moving a selected sort key to the front of the array, and then sort filteredIndices using a custom comparison function that goes through the array.

In all of our views where we display row- or columndata, we then pass these filtered indices along with the data, then make a copy of the data using filtered indices:

let dataSubset = [];
for (let i = 0; i < filteredIndices.length; i++){
  dataSubset.push(data[filteredIndices[i]]);
}

Then we use dataSubset where we would use the original data before.

As for colour selection, we'll just calculate nMostfrequent for each row/column attribute when we receive the dataset, store that, and create a lookup hashtable for the colours.

So the relevant part of the redux state tree will change as follows:

//before
{ 
  dataSets: { 
    [dataset]: { 
      rowAttrs: { 
        [rowAttribute]: ['array', 'with', 'attribute', 'data'] 
      } 
    } 
  } 
}

//after
{ 
  dataSets: { 
    [dataset]: {
      filteredRowIndices: [0, 3, 4, 5, 14, 15, ...],
      rowAttrs: { 
        '(original order)': [0, 1, 2, 3, ...], 
        [rowAttribute]: {
          data: ['array', 'with', 'attribute', 'data'],
          mostFrequent: <result of nMostFrequent call>,
          colours: <hasmap of 20 most frequent values as keys,
                    and index value for color table>,
          filtered: false,
        }
      }
    }
  }
}

(changes are same for column attributes)

Note that I added an (original order) attribute. This is a simple trick to make sorting by original order trivial: just put that "attribute" to the front of the list of sort keys; means we can simplify our algorithm for that.

This will require some deep rewrites in the plotting functions and the views, but the result should be a much more elegant code base (maybe smaller, even!), and easier to follow data structures in the redux side. It will also make sharing settings between the views easier, or sometimes even happen for free: since whether or not a row attribute is filtered will become part of the row attribute, that setting will be automatically shared between the views.

JobLeonard commented 7 years ago

See, I already made a mistake in the previous scheme: we're not going to filter per attribute, but per unique value in an attribute. So Class is an attribute, oligodendrocytes a unique value within this array. We filter on the latter.

Also, if one would filter out all oligo's in the Class attribute, then obviously all associated subclasses in values in Subclass would also be filtered out. It is probably useful to keep track of this, so that we, for example, can hide invisible items from the legend as well.

{ 
  dataSets: { 
    [dataset]: {
      filteredRowIndices: [0, 3, 4, 5, 14, 15, …],
      rowAttrs: { 
        '(original order)': [0, 1, 2, 3, …], 
        [rowAttribute]: {
          data: ['array', 'with', 'attribute', 'data', …],
          mostFrequent: [
            {
              val: 'most common value',
              count: /* nr of occurrences */,
              filtered: false,
              visible: true,
            }, 
            {
              val: 'second most common value',
              count: /* nr of occurrences */,
              filtered: false,
              visible: true,
            }, 
            …
          ],
          colorIndices: /* hasmap of the twenty most
                           frequent values as keys,
                           with value being an index
                           for the color LUTs */
        }
      }
    }
  }
}

JobLeonard commented 7 years ago

List of things to update:

[x] actions: fetching datasets
[x] redux reducers
[x] update component props to match new structure
- [x] views
- [x] cell metadata
- [x] gene metadata
- [x] cell scatterplot
- [x] gene scatterplot
- [x] sparkline
- [x] heatmap
- [x] side-panels
- [x] cell scatterplot
- [x] gene scatterplot
- [x] sparklines
- [x] heatmap
- [x] plots
- [x] sparkline plotters
- [x] scatterplotters
[ ] action dispatching
- [ ] move metadata view state (i.e. which sparkline plot to show) to redux
- [x] side-panels
- [x] match new structure where needed
- [x] add clickable legends
[x] other logic
- [x] views
- [x] use filteredIndices to select which data to pass to plots
- [ ] should heatmap/categorical plotting logic be moved to views? Create a generic in-between component for that?
- [x] side-panels
- [x] add color legends
- [x] plots
- [x] remove legends
- [x] data processing
  - [x] use color LUT based on indices passed as props
  - [x] ~~use filteredIndices?~~ Passing the correct data the responsibility of the view

BTW, this probably looks like more work than it is (although it is a lot of work) - I'm just making a fine-grained overview of everything that needs updating so no part gets forgotten and we don't get new bugs.

JobLeonard commented 7 years ago

Aaaand yet another change of plans. Luckily, it will make things much simpler.

I just remembered how wickedly dynamic JavaScrip really is: specifically, that arrays are objects (with some special tricks) in JS . Even typed arrays. Which means that instead of wrapping the attribute data arrays in an object with metadata, I can also just add all of this metadata to the array object itself. This will simplify things greatly, because it means all necessary metadata will always be available to components working with them; no need to pass the schema through the props or anything.

[x] type : string (float32, float64, string, etc)
[x] mostFrequent : array
[x] colors : hashmap
[x] filtered : boolean
[x] visible : boolean

(Also, In the process of switching to this scheme I've also implemented indexedString arrays. String rows/columns with less than 256 unique values us it. Under the hood uses Uint8 arrays with a tiny string look-up-table. The decrease in lag is notable on my machine.)

JobLeonard commented 7 years ago

Ok, scratch that last idea: it requires copying the entire original data array whenever we change the filtered data. Very wasteful. I also just wasted a day hunting bugs related to accidentally cyclic data structures because of this "great" plan. Arg.

The new structure is somewhere in-between the last two ideas:

{ 
  dataSets: { 
    [dataset]: {
      rowAttrs: { 
        [rowAttribute]: {
          data: ['array', 'with', 'all', 'attribute', 'data', …],
          filteredData: ['array', 'with', 'attribute', 'data', …],
          colorIndices: /* hasmap of the twenty most frequent values as keys,
                           with value being an index for the color LUTs */
          mostFrequent: [
            {
              val: 'most common value',
              count: /* nr of occurrences */,
              filtered: false,
              visible: true,
            }, 
            {
              val: 'second most common value',
              count: /* nr of occurrences */,
              filtered: false,
              visible: true,
            }, 
            …, 
            {
              val: 'all',
              count: /* nr of occurrences */,
              filtered: true,
              visible: false,
            },
            …
          ],
        }
      }
    }
  }
}

We're getting there, more slowly than I want, but we're getting there.

JobLeonard commented 7 years ago

I just implemented a basic check when the data arrays are first fetched to see if float64 arrays are actually not secretly integer arrays. And one that compacts integer arrays to their smallest value. Turns out 90% of our data fits in uint8 arrays.

The performance boost is real, holy moly.

JobLeonard commented 7 years ago

So this is almost finished - a lot of improvements have been made along the way. Some small, some big.

the new searchable dataset overview as the default landing page
an extra heatmap colour ramp
"search metadata" support
"sort by attribute" support
legend moved from scatterplot to side-panel (+ filter capabilities)
legend added to sparklines (+ filter capabilities)
genescape/landscape matrix support

Furthermore, these optimisations improved speed and reduced memory footprint:

indexed string arrays
typed arrays
client-side optimisation of number arrays (example: if a fetched Float64 array contains only integer values between 0 and 256, it is converted to Uint8; this is notably faster, I might add)
pre-calc metadata of arrays for re-use:
- min/max values
- nr of unique values + nr of occurrences
- the result of filtering/sorting the data with current settings
- color LUT

Note that only the last bullet point is directly related to this issue... But you know, all this stuff is interconnected, and I basically just did a lot of long-overdue cleanup along the way.

There are a few known bugs left to iron out, I'll spend tomorrow fixing those and checking if I missed any other sneaky ones. If I don't find anything obvious I'll push the new version to the server.

JobLeonard commented 7 years ago

Implemented

linnarsson-lab / loom-viewer

Pre-calc and memoize data when sensible #66