linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Move attribute type "inference" to the python side #95

Closed: JobLeonard closed this issue 7 years ago

JobLeonard commented 7 years ago

This is an enhancement we can apply if the browser becomes too slow to load large datasets. It only applies to receiving a dataset for the first time! In other words: it reduces the time between clicking a dataset and being able to do something with the plots.

Currently, one remaining bottleneck on this front is converting the arrays passed to the client into efficient TypedArrays - for example, arrays that contain only integers between 0 and 255 become Uint8Arrays. By far the biggest bottleneck in this process is converting strings: in most situations, we have endless lists of only a handful of unique string literals:

[screenshots: attribute arrays containing only a handful of distinct string values]

Counting unique strings and converting them to an indexed representation is slow in JS.

Mind you, this conversion is absolutely necessary to keep things snappy!

Meanwhile, numpy has a built-in for that:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html
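
For reference, a minimal sketch (not project code; the labels below are made up) of what that built-in gives us in a single call:

```python
# numpy.unique with return_inverse yields, for every element, an index into
# the array of unique values - i.e. exactly the indexed representation we
# currently compute by hand in JS.
import numpy as np

labels = np.array(["CA1", "CA1", "DG", "CA3", "DG", "CA1"])
uniques, indices, counts = np.unique(labels, return_inverse=True, return_counts=True)

print(uniques)   # ['CA1' 'CA3' 'DG']
print(indices)   # [0 0 2 1 2 0]  -> can be cast to uint8 when len(uniques) < 256
print(counts)    # [3 1 2]
```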

What I want to do is create the type-inferred object server-side, in JSON. It would not contain typed arrays (not supported by JSON), but it would save all the information needed (whether it's a string or a number, and in the latter case the min/max values and whether they are all integers) to avoid having to test the whole array client-side before deciding what type to convert it to.
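
As a rough sketch of the numeric case, here is one way the server could build that object; the function name, field spellings and type cut-offs are illustrative assumptions, not the actual loom-viewer code:

```python
import numpy as np

def infer_numeric_metadata(arr):
    """Illustrative only: summarise a numeric attribute so the client can
    pick a TypedArray without scanning the whole array itself."""
    arr = np.asarray(arr)
    amin, amax = float(arr.min()), float(arr.max())
    is_int = bool(np.all(np.mod(arr, 1) == 0))
    # Pick the smallest TypedArray the client could safely use (cut-offs assumed).
    if is_int and amin >= 0 and amax <= 255:
        array_type = 'uint8'
    elif is_int and amin >= 0 and amax <= 65535:
        array_type = 'uint16'
    elif is_int and -2**31 <= amin and amax < 2**31:
        array_type = 'int32'
    else:
        array_type = 'float32'
    return {
        'arrayType': array_type,
        'min': amin,
        'max': amax,
        'data': arr.tolist(),  # plain JSON list; the client converts it to a TypedArray
    }
```

For example, a sketch like this would report `arrayType: 'uint8'` with `min: 0` and `max: 255` for an integer column in that range, so the client can allocate the right TypedArray immediately.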

Furthermore, for string arrays it could pre-compute the indexed arrays using the numpy function above. This should result in a huge improvement in both size and speed, for a number of reasons:

JobLeonard commented 7 years ago

Ok, so I just realised: if I do this now, I'll save myself a lot of time, because I won't have to wait 20+ seconds for large loom datasets to load every time I test. And I test a lot, so those minutes add up quickly. So I'm going to spend some time on this for that frustration alone (it will pay for itself in saved time quickly enough).

JobLeonard commented 7 years ago

TL;DR: to keep performance high, we don't just use plain arrays for the data, but objects wrapping (ideally typed (ideally integer)) arrays with a bit of metadata. This is currently inferred at runtime, on the client side.

Instead, we can infer this on the server side. Benefits:

Previous issues documenting how things are currently implemented:

"Roadmap" - contains explanation of schema for client-side attribute, row and column data

"Support fetching multiple rows/columns" - explains how data is currently sent over

Schema:

{
  arrayType, // see below
  data, // (typed) array of data
  indexedVal, // look-up table (used for indexed strings)
  uniques: [{ val, count }, ...], // array of unique values and their count
  colorIndices: {
    // LUT or dictionary for color index values,
    // at most 20 values, and starting at 1
    mostFreq, 
  },
  min, // min value (for strings: lexicographically first string)
  max, // max value (for strings: lexicographically last string)
}

indexedVal is only used when there are fewer than 256 unique elements, in which case data will be a Uint8Array of indices and indexedVal the look-up table those indices point into
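
A hedged sketch of how the string branch might fill these fields server-side; only the field names come from the schema above, the helper itself is hypothetical:

```python
import numpy as np

def infer_string_metadata(values):
    """Illustrative only: index a string attribute when it has few uniques."""
    arr = np.asarray(values)
    uniques, indices, counts = np.unique(arr, return_inverse=True, return_counts=True)
    metadata = {
        'arrayType': 'string',
        'uniques': [{'val': str(v), 'count': int(c)} for v, c in zip(uniques, counts)],
        'min': str(uniques[0]),    # np.unique sorts, so this is the lexicographically first string
        'max': str(uniques[-1]),   # ...and this the lexicographically last
    }
    if len(uniques) < 256:
        # Fewer than 256 uniques: send uint8 indices plus a look-up table.
        metadata['indexedVal'] = [str(v) for v in uniques]
        metadata['data'] = indices.astype(np.uint8).tolist()
    else:
        # Too many uniques to index into a Uint8Array: fall back to plain strings.
        metadata['data'] = [str(v) for v in arr]
    return metadata
```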

Array "types" supported by the metadata: