linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Support fetching multiple rows/columns at once #84

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

Currently, we only support single-row fetching with GET /loom/{project}/{filename}/row/{row}. This severely slows down the page if we fetch many genes at once (which is going to happen). Explanation further below.

It would be much better if we could request multiple rows at once. To do so, we need a way to pass a list of numbers through the URL to Flask. Here is the Flask documentation for how to implement that type of request:

We’ll take an arbitrary number of elements separated by plus-signs, convert them to a list with a ListConverter class and pass the list of elements to the view function.

# myapp/util.py

from werkzeug.routing import BaseConverter

class ListConverter(BaseConverter):

    def to_python(self, value):
        # '1+2+3' -> ['1', '2', '3']
        return value.split('+')

    def to_url(self, values):
        # ['1', '2', '3'] -> '1+2+3'
        # (the snippet as quoted calls BaseConverter.to_url(value) unbound,
        #  which raises a TypeError; bind it through super() instead)
        return '+'.join(super(ListConverter, self).to_url(value)
                        for value in values)

The rest of the documentation is here

Using the above converter we could, for example, change our existing routing scheme to accept a +-separated list of rows: GET /loom/Published/cortex.loom/row/1+2+3+4+5 would fetch the first five genes. Existing URLs would keep working, since they'd simply parse as lists of length one. A sketch of what the wiring could look like follows below.
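If we go this route, the Flask side stays small. A minimal sketch, assuming the converter above lives in myapp/util.py; load_rows() is a hypothetical stand-in for the actual loom-file access:

from flask import Flask, jsonify

from myapp.util import ListConverter

app = Flask(__name__)
app.url_map.converters['list'] = ListConverter

@app.route('/loom/<project>/<filename>/row/<list:rows>')
def get_rows(project, filename, rows):
    # the converter hands us a list of strings:
    # '1+2+3+4+5' -> ['1', '2', '3', '4', '5'], and '1' -> ['1'],
    # so existing single-row URLs keep working unchanged
    row_indices = [int(r) for r in rows]
    # load_rows() is a placeholder for however we read the loom file
    return jsonify(load_rows(project, filename, row_indices))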

The API would be updated the same way for columns.

@slinnarsson, if you're OK with this update to the server scheme, I'll assign myself to the whole issue (so including the python side), since it'll be easier to check if the whole thing is integrated properly.

Why individual GET requests slow down the page

This can really slow down the website when loading a whole bunch of genes at once, and recall that Amit requested the ability to paste a list of genes into the sparklines view to be fetched in one go. With the current single-row scheme, all those genes are fetched one by one. First, HTTP 1.1 handles requests on a connection in FIFO order, so twenty genes means twenty sequential, slow round-trips (the sketch at the end of this comment makes that cost concrete). Then, on the client side, things get slowed down even further.

I can already test this in practice: I have a few URLs that load the sparkline view with a large list of genes, and doing so freezes the tab for multiple minutes. However, once all the genes are fetched, page updates (for example, switching between bar plots and heatmaps) are very fast. Even switching to another view and coming back is pretty fast.

I can (and will) try to optimise our React components so they only re-render the newly fetched genes and keep all unchanged elements as they are, but I don't think that will help much: the main cause of the slowdown is that every fetch triggers a separate update, resulting in twenty round-trips through React and Redux to re-render the page, nineteen of which are immediately discarded. I'm pretty sure we're accidentally quadratic (if not cubic or worse). That is also why switching views is fast once the genes are downloaded: we mount all elements at once. Fetching all genes in one request would give the same kind of speed-up.
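To make the round-trip cost concrete, here is a hypothetical measurement sketch (the host, port, and gene indices are made up) comparing twenty sequential single-row fetches against one batched fetch:

import time

import requests

BASE = 'http://localhost:5000/loom/Published/cortex.loom/row'
genes = range(20)

start = time.time()
for g in genes:
    requests.get('%s/%d' % (BASE, g))  # one full round-trip per gene
print('sequential: %.2fs' % (time.time() - start))

start = time.time()
requests.get('%s/%s' % (BASE, '+'.join(str(g) for g in genes)))  # one round-trip
print('batched:    %.2fs' % (time.time() - start))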

JobLeonard commented 7 years ago

So I just realised that I can't continue what I was working on from my laptop, because the material from the last two weeks isn't pushed to the server! Besides, continuing anyway would just lead to merge conflicts.

So instead I'll look into this issue, and try to fix up the server while I'm at it.

JobLeonard commented 7 years ago

Slight change to the schema: GET /loom/Published/cortex.loom/row/1+2+3+4+5 returns

[
  { idx: 1, data: [ ... ] },
  { idx: 2, data: [ ... ] },
  { idx: 3, data: [ ... ] },
  { idx: 4, data: [ ... ] },
  { idx: 5, data: [ ... ] }
]

This will make putting the data in the right position easier client-side, and the code more self-documenting.
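On the server side, producing that shape is straightforward. A sketch, assuming h5py-style access to the loom file's main matrix (the actual loom-viewer code may differ):

import h5py

def rows_payload(path, row_indices):
    # one {idx, data} pair per requested row, so the client can place each
    # row correctly without depending on response ordering
    with h5py.File(path, 'r') as f:
        matrix = f['matrix']
        return [{'idx': i, 'data': matrix[i, :].tolist()} for i in row_indices]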

JobLeonard commented 7 years ago

So I discovered that the majority of the time spent fetching data goes to waiting for the server to start sending it:

[screenshot: network timeline of the fetch, before the changes]

The salient bit is the large fetch of 300+ rows: it takes 16.45 seconds for 1.1 MB (compressed from 60 MB), of which less than a second is spent on the actual download.

After replacing the default json lib with ujson and re-organising the way we fetch row data, I managed to get that down to... 14.39 seconds. Sigh.

[screenshot: network timeline of the fetch, after the changes]
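For reference, the ujson swap itself is tiny. A sketch, assuming the JSON response is built by hand rather than with jsonify:

import ujson

from flask import Response

def json_response(payload):
    # ujson serialises large nested lists considerably faster than the
    # stdlib json module, though as the timings above show, serialisation
    # was not the main bottleneck here
    return Response(ujson.dumps(payload), mimetype='application/json')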

I dunno, maybe it's my laptop that's causing the problem; I also didn't see any speed-up from bigger chunks in the HDF5 files, even though Gioele says he does see an improvement.