linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Move attribute type "inference" to the python side #95

Closed: JobLeonard closed this issue 7 years ago

JobLeonard commented 7 years ago

This is an enhancement we can apply if the browser becomes too slow to load large datasets. It only applies to receiving a dataset for the first time! In other words: it reduces the time between clicking a dataset and being able to do something with the plots.

Currently, one remaining bottleneck on this front is converting the arrays passed to the client into efficient TypedArrays - for example, arrays that contain only integers between 0 and 255 become Uint8Arrays. By far the biggest bottleneck in this process is converting strings: in most situations, we have endless lists of only a handful of unique string literals:

[screenshots: attribute arrays containing only a handful of distinct string values]

Counting unique strings and converting them to an indexed representation is slow in JS.

Mind you, this conversion is absolutely necessary to keep things snappy!

Meanwhile, numpy has a built-in for that:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html
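
For reference, a minimal sketch (not project code; the labels below are made up) of what that built-in gives us in a single call:

```python
# numpy.unique with return_inverse yields, for every element, an index into
# the array of unique values - i.e. exactly the indexed representation we
# currently compute by hand in JS.
import numpy as np

labels = np.array(["CA1", "CA1", "DG", "CA3", "DG", "CA1"])
uniques, indices, counts = np.unique(labels, return_inverse=True, return_counts=True)

print(uniques)   # ['CA1' 'CA3' 'DG']
print(indices)   # [0 0 2 1 2 0]  -> can be cast to uint8 when len(uniques) < 256
print(counts)    # [3 1 2]
```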

What I want to do is create the type-inferred object server-side, in JSON. It would not contain typed arrays (not supported by JSON), but it would save all the information needed (whether it's a string or a number, and in the latter case the min/max values and whether they are all integers) to avoid having to test the whole array client-side before deciding what type to convert it to.
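
As a rough sketch of the numeric case, here is one way the server could build that object; the function name, field spellings and type cut-offs are illustrative assumptions, not the actual loom-viewer code:

```python
import numpy as np

def infer_numeric_metadata(arr):
    """Illustrative only: summarise a numeric attribute so the client can
    pick a TypedArray without scanning the whole array itself."""
    arr = np.asarray(arr)
    amin, amax = float(arr.min()), float(arr.max())
    is_int = bool(np.all(np.mod(arr, 1) == 0))
    # Pick the smallest TypedArray the client could safely use (cut-offs assumed).
    if is_int and amin >= 0 and amax <= 255:
        array_type = 'uint8'
    elif is_int and amin >= 0 and amax <= 65535:
        array_type = 'uint16'
    elif is_int and -2**31 <= amin and amax < 2**31:
        array_type = 'int32'
    else:
        array_type = 'float32'
    return {
        'arrayType': array_type,
        'min': amin,
        'max': amax,
        'data': arr.tolist(),  # plain JSON list; the client converts it to a TypedArray
    }
```

For example, a sketch like this would report `arrayType: 'uint8'` with `min: 0` and `max: 255` for an integer column in that range, so the client can allocate the right TypedArray immediately.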

Furthermore, for string arrays it could pre-compute the indexed arrays using the numpy function above. This should result in a huge improvement in both size and speed, for a number of reasons:

JobLeonard commented 7 years ago

Ok, so I just realised: if I do this now, I'll save myself a lot of time, because I won't have to wait 20+ seconds for large loom datasets to load every time I test. And I test a lot, so those minutes add up quickly. So I'm going to spend some time on this for that frustration alone (it will pay for itself in saved time quickly enough).

JobLeonard commented 7 years ago

TL;DR: to keep performance high, we don't just use plain arrays for the data, but objects wrapping (ideally typed (ideally integer)) arrays with a bit of metadata. This is currently inferred at runtime, on the client side.

Instead, we can infer this on the server side. Benefits:

Previous issues documenting how things are currently implemented:

"Roadmap" - contains explanation of schema for client-side attribute, row and column data

"Support fetching multiple rows/columns" - explains how data is currently sent over

Schema:

{
  arrayType, // see below
  data, // (typed) array of data
  indexedVal, // look-up table (used for indexed strings)
  uniques: [{ val, count }, ...], // array of unique values and their count
  colorIndices: {
    // LUT or dictionary for color index values,
    // at most 20 values, and starting at 1
    mostFreq, 
  },
  min, // min value (for strings: lexicographically first string)
  max, // max value (for strings: lexicographically last string)
}

indexedVal is only used when there are fewer than 256 unique elements, in which case data will be a Uint8Array of indices and indexedVal the look-up table those indices point into
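
A hedged sketch of how the string branch might fill these fields server-side; only the field names come from the schema above, the helper itself is hypothetical:

```python
import numpy as np

def infer_string_metadata(values):
    """Illustrative only: index a string attribute when it has few uniques."""
    arr = np.asarray(values)
    uniques, indices, counts = np.unique(arr, return_inverse=True, return_counts=True)
    metadata = {
        'arrayType': 'string',
        'uniques': [{'val': str(v), 'count': int(c)} for v, c in zip(uniques, counts)],
        'min': str(uniques[0]),    # np.unique sorts, so this is the lexicographically first string
        'max': str(uniques[-1]),   # ...and this the lexicographically last
    }
    if len(uniques) < 256:
        # Fewer than 256 uniques: send uint8 indices plus a look-up table.
        metadata['indexedVal'] = [str(v) for v in uniques]
        metadata['data'] = indices.astype(np.uint8).tolist()
    else:
        # Too many uniques to index into a Uint8Array: fall back to plain strings.
        metadata['data'] = [str(v) for v in arr]
    return metadata
```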

Array "types" supported by the metadata: