linnarsson-lab / loom-viewer

Tool for sharing, browsing and visualizing single-cell data stored in the Loom file format
BSD 2-Clause "Simplified" License

Expand loom into pickle files for static webserving #92

Closed JobLeonard closed 7 years ago

JobLeonard commented 7 years ago

Ok, so I completely forgot to make a separate issue for this; it came up while discussing the "cleaning up global namespace" issue (which is otherwise unrelated):

https://github.com/linnarsson-lab/Loom/issues/89#issuecomment-286709342

https://github.com/linnarsson-lab/Loom/issues/89#issuecomment-286741390

The good news is that I almost finished implementing it! Woo! There's quite a speed boost, although sadly not the hoped-for order of magnitude.

Here is the time spent trawling through loom files when serving 461 randomised, out-of-order rows of genes on the cortex dataset of 3005 cells:

::1 - - [2017-03-28 16:30:46] "GET /loom/Published/cortex_repack.loom/row/<461 random genes>" 200 193459 0.534398

Same link when serving pickled data:

"GET /loom/Published/cortex_repack.loom/row/<461 random genes>" 200 189936 0.153744

I haven't tested with larger datasets yet because it takes hours to expand them; I'll let my laptop do that overnight and test tomorrow.

And semi-related:

The last two commands will make it easy to scan over the entire data folder and add expanded files when needed

JobLeonard commented 7 years ago

One thing I want to try is joblib's persistence functions (joblib.dump and joblib.load). They have specific optimisations for saving NumPy arrays to disk and loading them, so they might be much faster (and compress better) than the built-in pickle tools for the rows and columns.

While it does not guarantee compatibility between various systems, the expansion should only be used locally anyway so that won't be an issue.
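
For the record, a minimal sketch of the kind of usage I have in mind (file names and array sizes are made up):

    import numpy as np
    import joblib
    import pickle

    # a made-up gene row; the real rows come out of the loom file
    row = np.random.rand(20000).astype('float32')

    # joblib persistence: optimised for NumPy arrays, with optional compression
    joblib.dump(row, 'row_000001.joblib', compress=3)
    restored = joblib.load('row_000001.joblib')

    # built-in pickle, for comparison
    with open('row_000001.pickle', 'wb') as f:
        pickle.dump(row, f, protocol=pickle.HIGHEST_PROTOCOL)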

JobLeonard commented 7 years ago

Ok, so joblib isn't faster for loading objects when serving (in fact, for pure Python objects it's slower, while for the rows/columns as NumPy arrays it's about equally fast). For storing NumPy arrays, however, it's tens of times faster, so for now I'm using it for the rows and columns. It's basically the difference between letting my computer run for a few hours to pre-calculate all datasets, or a few minutes.

Another thing we might investigate here is BLOSC, although this might be more interesting for the loom format in general:

http://blosc.org/

It seems to have HDF5 support:

https://hdfgroup.org/wp/2016/02/the-blosc-meta-compressor/

But h5py isn't on board:

https://github.com/h5py/h5py/issues/611

Sigh... however, there is also bcolz:

https://github.com/Blosc/bcolz

bcolz provides columnar, chunked data containers that can be compressed either in-memory and on-disk. Column storage allows for efficiently querying tables, as well as for cheap column addition and removal. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects, but it also comes with support for import/export facilities to/from HDF5/PyTables tables and pandas dataframes.

Well, that sounds like a suspiciously good fit... I'll look into this

(aside: @slinnarsson, maybe using bcolz will speed up some of your NumPy code too? I wouldn't be surprised if many of those are memory bound)
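
For reference, a rough sketch of what an on-disk bcolz container for the expanded matrix could look like (untested; shapes and paths are made up):

    import numpy as np
    import bcolz

    # made-up genes x cells matrix standing in for a loom file's main matrix
    matrix = np.random.rand(20000, 3005).astype('float32')

    # carray chunks along the first dimension and compresses the chunks with Blosc;
    # rootdir turns it into a persistent on-disk container
    rows = bcolz.carray(matrix, rootdir='cortex_rows.bcolz', mode='w')
    rows.flush()

    # reading back a single gene row only decompresses the chunks that row lives in
    reopened = bcolz.open('cortex_rows.bcolz', mode='r')
    one_gene = reopened[42]

Note that this chunking along the first dimension is exactly the limitation discussed in the next comment.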

JobLeonard commented 7 years ago

Hmm, perhaps Zarr will be useful here too:

http://alimanfoo.github.io/2016/04/14/to-hdf5-and-beyond.html

In the example above, Bcolz is more than 10 times faster at storing (compressing) the data than HDF5.

Speed really makes a difference when working interactively with data, so I started using the bcolz.carray class where possible in my analyses, especially for storing intermediate data. However, it does have some limitations. A bcolz.carray can be multidimensional, but because Bcolz is not really designed for multi-dimensional data, a bcolz.carray can only be chunked along the first dimension. This means taking slices of the first dimension is efficient, but slicing any other dimension will be very inefficient, because the entire array will need to be read and decompressed to access even a single column of a matrix.

To explore better ways of working with large multi-dimensional data, I recently created a new library called Zarr. Zarr like Bcolz uses Blosc internally to handle all compression and decompression operations. However, Zarr supports chunking of arrays along multiple dimensions, enabling good performance for multiple data access patterns.
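
To make that concrete, a small sketch of a Zarr array chunked in both dimensions (untested; shape, chunk size and path are made up):

    import numpy as np
    import zarr

    # made-up genes x cells matrix
    matrix = np.random.rand(20000, 3005).astype('float32')

    # chunked along both dimensions, so row *and* column slices stay cheap
    z = zarr.open('cortex.zarr', mode='w', shape=matrix.shape,
                  chunks=(64, 64), dtype='f4')
    z[:] = matrix

    one_gene = z[42, :]    # touches one row of chunks
    one_cell = z[:, 1000]  # touches one column of chunks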

JobLeonard commented 7 years ago

I just realised something extremely dumb: why not just store the data as JSON files? It's much faster to save and load:

If you’re here for the short answer — JSON is 25 times faster in reading (loads) and 15 times faster in writing (dumps). https://konstantin.blog/2010/pickle-vs-json-which-is-faster/

(and that's the built-in JSON library, ujson is even faster)

On top of that, we're always converting to JSON anyway, so storing as JSON and loading that would save the conversion at serve time. That could make serving the data faster.

http://aripollak.com/pythongzipbenchmarks/
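
Those benchmarks are a few years old, so the exact factors will differ; here's a quick sketch for checking the dumps/loads claim against a row-shaped payload of our own (payload size is made up, and ujson is a drop-in replacement for json here):

    import json
    import pickle
    import timeit

    import numpy as np

    # a made-up row payload, roughly the shape of what the server sends
    row = {'idx': 42, 'data': np.random.rand(3005).round(3).tolist()}

    print('json dumps:  ', timeit.timeit(lambda: json.dumps(row), number=1000))
    print('pickle dumps:', timeit.timeit(lambda: pickle.dumps(row), number=1000))

    as_json = json.dumps(row)
    as_pickle = pickle.dumps(row)
    print('json loads:  ', timeit.timeit(lambda: json.loads(as_json), number=1000))
    print('pickle loads:', timeit.timeit(lambda: pickle.loads(as_pickle), number=1000))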

JobLeonard commented 7 years ago

Ok, I declare the JSON method the winner. Base case: 608 random genes in the biggest dataset. Without row expansion:

"GET /loom/Build%20161109/Forebrain_E9-E18.5.loom/row/<608 random genes>" 200 4832632 28.100091

With JSON expansion:

"GET /loom/Build%20161109/Forebrain_E9-E18.5.loom/row/<608 random genes>" 200 4056062 4.020271

Seven times faster! Joblib and pickle were around five times faster, so JSON wins. It's also smaller on disk, and since I'm really just gzip-compressing a plaintext JSON string, it can be read by third parties and is not Python-specific.

On top of that, I added some code to the expansion functions that tests whether all values in a row are integers (using NumPy to keep it fast) and converts the row to integers when they are:

        # 64 is the chunk size, so probably the most cache
        # friendly option to batch over
        total_rows = self.shape[0]
        i = 0
        while i + 64 < total_rows:
            row64 = self[i:i + 64, :]
            for j in range(64):
                row = row64[j]
                # if every value is a whole number, store as integers so
                # ujson doesn't append ".0" to each of them
                if np.all(np.mod(row, 1) == 0):
                    row = row.astype(int)
                row = {'idx': i + j, 'data': row.tolist()}
                row_file_name = '%s/%06d.jsonz' % (row_dir, i + j)
                save_compressed_json(row_file_name, row)
            i += 64
        # handle the remaining rows one at a time
        while i < total_rows:
            row = self[i, :]
            if np.all(np.mod(row, 1) == 0):
                row = row.astype(int)
            row = {'idx': i, 'data': row.tolist()}
            row_file_name = '%s/%06d.jsonz' % (row_dir, i)
            save_compressed_json(row_file_name, row)
            i += 1

The reason I'm doing this is that ujson appends .0 to all float values, even when they are whole numbers. Converting to integers significantly shortens the JSON string (especially for rows with mostly zeros and other single-digit values; those are almost cut in half, [ .. 0.0,0.0,0.0, .. ] becoming [ .. 0,0,0, .. ]). That cascades into further benefits: smaller files on disk, less data to compress, decompress and send, and less text for the client to parse.
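
(The save_compressed_json helper isn't shown above; a minimal sketch of what it boils down to, assuming gzip plus ujson with a fallback to the standard library:)

    import gzip

    try:
        import ujson as json  # faster drop-in replacement, if installed
    except ImportError:
        import json

    def save_compressed_json(file_name, value):
        # serialise to JSON and gzip it, so the .jsonz file is readable from
        # any language and can be served as-is with Content-Encoding: gzip
        with gzip.open(file_name, 'wb') as f:
            f.write(json.dumps(value).encode('utf-8'))

    def load_compressed_json(file_name):
        with gzip.open(file_name, 'rb') as f:
            return json.loads(f.read().decode('utf-8'))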

Of course, we probably don't really have to worry about anyone fetching hundreds of genes in the client, so for more realistic loads this is going to be lightning fast!

EDIT: Copied the wrong terminal output... sadly we're not thirty times faster, but still pretty darn fast.

JobLeonard commented 7 years ago

After yet another optimisation pass (I figured out a way to build strings more efficiently), the serving time is now down to somewhere between 0.8 and 3 seconds.

I also realised something else: if we turn all of these expanded values into gzipped JSON, then we have no Python-specific static files whatsoever, and can actually create a static server that doesn't use Python at all, if we ever get worried about performance.

However, the current server implementation in Python has the nice feature that it falls back seamlessly to loading from the loom file if no precomputed values are present.
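
Roughly, the fallback pattern looks like this (a sketch, not the actual handler; it assumes a Flask-style Response object, and serve_row_from_loom is a hypothetical stand-in for the existing loom-reading code):

    import os

    from flask import Response  # assuming a Flask-style server

    def serve_row(dataset_dir, loom_file, row_idx):
        # precomputed gzipped JSON from the expansion step, if present
        row_path = os.path.join(dataset_dir, 'rows', '%06d.jsonz' % row_idx)
        if os.path.exists(row_path):
            with open(row_path, 'rb') as f:
                gzipped = f.read()
            # already gzip-compressed JSON, so hand it over as-is
            return Response(gzipped, mimetype='application/json',
                            headers={'Content-Encoding': 'gzip'})
        # fallback: pull the row straight from the loom file and serialise it now
        return serve_row_from_loom(loom_file, row_idx)  # hypothetical helper

Serving the pre-compressed bytes with Content-Encoding: gzip avoids recompressing on every request, which is also what would make a dumb static server viable.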

JobLeonard commented 7 years ago

So latest timing in the epic number of sparklines test:

Without expansion: ~45 seconds between refresh and final render
With expansion: ~17 seconds between refresh and final render

A big bottleneck here is actually the initialisation of the data on the client side. I have an idea of how to optimise this (basically, move more precalculations to the server, specifically to the expansion step), but I don't think we really need to for now. The server is an order of magnitude faster and should survive realistic loads, and because everything is expanded as gzipped JSON files we could even switch to, say, a Go server that loads and serves the JSON as well as the tiles.

With that done, the more pressing priorities are the interface fixes, and then getting interactivity into our plots.

JobLeonard commented 7 years ago

So, I just tried deploying, and the issue is that loom cannot find the loom-datasets folder. A bit of sleuthing through the supervisor config files leads me to discover that the datasets are stored in /srv/datasets.

In order to expand all files, loom needs sudo access... except loom is not installed as root.

So for now I fixed it by chmod-ing the folder as part of the upgrade script:

#! /bin/bash

OLD_LOOM_VERSION=$(loom version)

pip install -U loompy scipy==0.17.1

NEW_LOOM_VERSION=$(loom version)

if [ "$OLD_LOOM_VERSION" == "$NEW_LOOM_VERSION" ]; then
  echo ""
  echo "Version did not change, no need to restart server."
  echo "Old version:    $OLD_LOOM_VERSION"
  echo "New version:    $NEW_LOOM_VERSION"
  echo ""
  echo "If you just uploaded the new version, then perhaps"
  echo "wait a few minutes to ensure PyPi has made the "
  echo "most recent loompy version available."
else
  echo ""
  echo "Upgraded $OLD_LOOM_VERSION to $NEW_LOOM_VERSION"
  echo "Expanding new loom files, if any"
  echo ""
  sudo chmod -R 777 /srv/datasets
  loom --dataset-path /srv/datasets exp_all
  echo ""
  echo "Restarting server"
  echo ""
  sudo /home/ubuntu/anaconda2/bin/supervisorctl restart loom-server
fi

It is happily expanding the loom files now.

@slinnarsson, is there any security risk to changing the folder permissions like that?

JobLeonard commented 7 years ago

BTW, it's still chugging away, but at a rate of about 2 minutes per 1000 cells. (This will obviously not scale to a million cells: at that rate it would take around 33 to 34 hours to expand one loom file.)

There are 257 loom files listed.

Average size... err... I guess around 4k cells per file. At 2 minutes per 1000 cells that's roughly 8 minutes per file; call it 10. Rounding the file count up too: 260 * 10 minutes / 60 is about 43 hours.
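
Back-of-envelope (the average file size is a guess):

    minutes_per_1000_cells = 2
    avg_cells_per_file = 4000        # rough guess
    n_files = 257

    minutes_per_file = avg_cells_per_file / 1000.0 * minutes_per_1000_cells  # ~8, round up to 10
    total_hours = n_files * 10 / 60.0
    print(minutes_per_file, total_hours)  # 8.0 minutes per file, ~43 hours total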

Well, guess I can come pick up my laptop in a few days then...

JobLeonard commented 7 years ago

(screenshot of the crash)

NOOOOOOO!!!

Ok, so at least it crashed when starting expansion, so correcting for this should be easy.

JobLeonard commented 7 years ago
2017-04-27 11:01:14,256 - INFO - Looking for Midbrain_E9-E18.5.loom
2017-04-27 11:01:14,256 - INFO - Looking for /srv/datasets/Build 161109/Midbrain_E9-E18.5.loom
2017-04-27 11:01:14,256 - INFO - Found Midbrain_E9-E18.5.loom at full path

So I cannot find this file on the Loom website:

(screenshot of the Loom website)

So presumably, that file is corrupted. I have moved it to /home/ubuntu/corrupted-datasets for now, since it wasn't accessible on the website anyway. @slinnarsson, was there anything special about this file?