nanshe-org / nanshe_workflow

Provides a Jupyter workflow for running nanshe.
Apache License 2.0

Compressing Bokeh Plots #24

Closed jakirkham closed 7 years ago

jakirkham commented 7 years ago

After a recent run, we discovered that our plots still take up quite a bit of space even when every large piece of data is encoded to base64.
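
For context, the kind of encoding meant here is roughly the following (a minimal sketch, not Bokeh's exact serialization): the raw bytes of a float array are base64-encoded into an ASCII string that can be embedded in the plot's JSON/HTML, which is far more compact than a JSON list of decimal floats, but still grows linearly with the data.

```python
import base64

import numpy as np

# Hypothetical example: traces for ~500 ROIs over ~12,000 frames as float32.
traces = np.random.standard_normal((500, 12000)).astype(np.float32)

raw = traces.tobytes()                           # raw bytes (native byte order)
encoded = base64.b64encode(raw).decode("ascii")  # ASCII-safe for JSON/HTML

# base64 adds ~33% over the raw bytes, but is still far smaller than
# writing each float out as decimal text in a JSON list.
print(len(raw), len(encoded))
```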

When exploring the contents of a particular run, we found the traces were the largest part of the data. In this particular instance, we had identified ~500 ROIs in ~12,000 frames of data. Note that the next largest portion of the data is likely the image projection we show the ROIs on, which is often 512x512 or smaller due to registration or artifact removal. Compared to the rest of the data in this file, the traces made up ~95% of the total file size.

Even when considering data with sparser ROI detection, it is not atypical to have at least 50 ROIs in an equivalent time range (~10,000 frames or more). That makes the lower bound a case where the traces are still ~66% of the total file size. For a wider field of view, this trend is likely to continue, with traces dominating the data: the density of ROIs will be largely similar (each one adding another ~10,000-frame trace), whereas additional points add relatively little to the image.

Therefore, to get any significant size reduction in the plots, either now or in the future, we must concentrate our efforts on reducing the size of the stored traces. In other words, we need to figure out the optimal way to compress the traces. That work can certainly be extended to other elements like the image projection and ROIs as appropriate. There are four main challenges for a successful compression implementation.

  1. Compression format needs to work in Python and JavaScript.
  2. Compression ratio must be sizable and consistent.
  3. Compression/decompression needs to be fast.
  4. Implementation should be agnostic in terms of compression used.

The importance of each of these is easy to understand. 1 is a bare minimum requirement, as compression must happen in Python and decompression must happen in JavaScript. 2 is necessary to provide payoff for the effort, particularly in solving 1 (w.r.t. JavaScript). 3 is necessary for interactivity. 4 is necessary to remain flexible as future needs change.

For the most part, 1 will be a pretty restrictive constraint. We also need to be careful that whatever dependency is introduced is reliable. This tends to leave us with things like zip, gzip, bzip2, and lzma. It looks like snappy might be an option, but initial results were not promising. Similar discussion has come up with Bokeh before; it would be great if we could find some common ground that satisfies both of our needs.

xref: https://github.com/bokeh/bokeh/issues/5792
xref: https://github.com/bokeh/bokeh/issues/2204#issuecomment-103504754

jakirkham commented 7 years ago

Just trying run-of-the-mill command-line utilities on some data I have to get an initial idea of what might work well. There seem to be JavaScript and Python implementations of all of these. Potentially we could even get the browser to work with some of these, like gzip.

All of this is being run only on the data needing the compression, not the full JSON or HTML file, so results on the full file may vary. However, given how much of the size is occupied by this data alone, the differences are likely negligible. Also, this is just one dataset; we will need to look at others to get a full picture. That being said, this data shares a common theme of being sparse in activity, mostly showing noisy fluctuations around an equilibrium point.

| compression | speed (MB/s) | ratio |
|-------------|--------------|-------|
| ZIP         | 17.72        | 1.361 |
| GZIP        | 18.82        | 1.360 |
| BZIP2       | 9.703        | 1.397 |
| XZ          | 1.108        | 1.601 |

While XZ clearly creates the smallest size, it takes quite a bit of time to do so. If it only had to run once at the beginning, it could be worthwhile. However, it would make load time pretty slow and may make a user antsy about whether it is working. So it is out.

Running BZIP2 compresses only slightly better than the other two and takes a little longer to run. It may still be worth considering.

However, GZIP and ZIP actually do a reasonable job of compressing the data quickly. As they both compress the data to a similar size, GZIP wins on speed alone. Even though it does not produce the smallest output, GZIP does a decent job of compressing the data and does so reasonably quickly.

Thus far, GZIP looks promising. Will try some more data to see if that trend holds.
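
For reference, a comparison along these lines can be approximated with Python's standard-library codecs rather than the command-line tools (a rough sketch; `traces.bin` is a hypothetical file holding the raw trace bytes, and the exact numbers will differ from the table above):

```python
import bz2
import gzip
import lzma
import time
import zlib

# Hypothetical input: raw trace bytes extracted from one plot.
with open("traces.bin", "rb") as f:
    data = f.read()

codecs = {
    "ZIP/ZLIB": zlib.compress,
    "GZIP": gzip.compress,
    "BZIP2": bz2.compress,
    "XZ": lzma.compress,
}

for name, compress in codecs.items():
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    # Report compression speed (MB/s) and ratio (uncompressed / compressed).
    print(f"{name}: {len(data) / elapsed / 1e6:.2f} MB/s, "
          f"ratio {len(data) / len(out):.3f}")
```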

jakirkham commented 7 years ago

Could leverage the changes in PR ( https://github.com/nanshe-org/nanshe_workflow/pull/27 ) to try out compression algorithms.

jakirkham commented 7 years ago

Have a WIP PR ( https://github.com/nanshe-org/nanshe_workflow/pull/29 ), which implements ZLIB (ZIP) compression/decompression of the traces in the plots to cut down on the HTML size.
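
The general idea is along these lines (a simplified sketch, not the PR's exact code; the JavaScript library named is only an example): deflate the raw trace bytes with zlib before base64-encoding them, so the string embedded in the HTML shrinks, and inflate them again on the JavaScript side before plotting.

```python
import base64
import zlib

import numpy as np

# Hypothetical traces to embed in the standalone HTML plot.
traces = np.random.standard_normal((500, 12000)).astype(np.float32)

# Deflate first, then base64-encode so the result can live in the HTML/JSON.
payload = base64.b64encode(zlib.compress(traces.tobytes())).decode("ascii")

# The JavaScript side would base64-decode the string and inflate it back into
# a Uint8Array (e.g. with a zlib port such as pako) before handing the values
# to the plot.
```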

jakirkham commented 7 years ago

Attempted LZMA compression in PR ( https://github.com/nanshe-org/nanshe_workflow/pull/36 ).


Edit:

Ran into some issues that are largely resolved, but they should be noted as caveats. They mainly pertain to the JavaScript implementation used (LZMA-JS), which seems to be the best option out there, but has a few quirks. These are noted below.

First, LZMA-JS doesn't support XZ, so we had to compress using the old LZMA format, which means we lose integrity checks and custom filters. Those could be used to verify the data was compressed correctly and possibly compress it further. Second, the stable releases of LZMA-JS are hugely inefficient because they do not use a TypedArray; that support only exists in a since-reverted commit. When using the reverted commit ( https://github.com/nmrugg/LZMA-JS/commit/b4a5e24aebe964ff92287675bad6880d1b2ef88a ) with TypedArray/Uint8Array support, performance improves drastically, but it is unclear whether this will be supported again in the future. ( https://github.com/nmrugg/LZMA-JS/pull/42 )
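
Since LZMA-JS only understands the legacy `.lzma` ("alone") container rather than XZ, the Python side has to emit that format explicitly. A minimal sketch of what that looks like with the standard library's `lzma` module (`traces.bin` is a hypothetical file of raw trace bytes; the actual PR may differ):

```python
import base64
import lzma

# Hypothetical input: raw trace bytes to compress.
with open("traces.bin", "rb") as f:
    data = f.read()

# FORMAT_ALONE produces the legacy .lzma container that LZMA-JS can decode;
# unlike FORMAT_XZ it carries no integrity check and allows no custom filters.
compressed = lzma.compress(data, format=lzma.FORMAT_ALONE, preset=9)

# base64-encode for embedding in the HTML/JSON payload.
payload = base64.b64encode(compressed).decode("ascii")
```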

jakirkham commented 7 years ago

Looked for a decent BZIP2 implementation for decompression in JavaScript. However, the options that surfaced were not great, either because they were very old, had viral licensing (e.g. GPL), or didn't quite fit our use case for any number of reasons (e.g. doesn't work with Uint8Arrays, is just JavaScript bindings to libbzip2, has no simple minified version usable via a CDN, etc.). This, combined with the fact that most benchmarks show LZMA outperforming BZIP2 in compression size and speed, makes it hard to justify spending time trying to get any one of these suboptimal solutions to work, even for a test.

ref: https://github.com/cscott/compressjs
ref: https://github.com/antimatter15/bzip2.js
ref: https://github.com/kevva/decompress-bzip2
ref: https://github.com/kirilloid/bzip2-js

jakirkham commented 7 years ago

Should note some general trends that we have seen from our data.

ZIP compression offers a pretty decent baseline even on binary data: it is fast to compress and decompress and does a decent job of reducing size. Bumping the compression level above the default doesn't seem to make a noticeable difference in compressed size or decompression time. The JavaScript implementation is clean, fast, and holds tightly to the ZLIB C implementation. No real tweaking was required.

BZIP2 has thus far been surprisingly uncompetitive: it has been significantly slower than ZIP while offering similar or worse compression for the same data (note this differs from BZIP2 on benchmarking data, where it offers better compression quality). While it is possible to bump up BZIP2's compression level, that ends up coming with a non-negligible time penalty on decompression. These measurements were all made Python side and do not factor in using JavaScript to do the decompression, largely because there aren't great options for doing that in JavaScript, which is really just another downside.

LZMA has offered a small improvement in size compared to ZIP, but at the cost of taking longer to compress and decompress. This can mostly be made up for by maximizing the compression level, which increases compression time a small amount but reduces decompression time. It still tends to take a little more time than ZIP, but may break even given the size reduction. However, some of the decisions the developers are making with that library are a little disconcerting, like the decision to revert support for Uint8Arrays, which basically makes it unusable for us. Hopefully they will reverse that decision. In the interim, we are left to use a particular commit to get the job done.
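
For illustration, the effect of the level/preset settings discussed above can be checked with a sketch like this (`traces.bin` again stands in for the raw trace bytes; the exact numbers depend on the data):

```python
import lzma
import time
import zlib

# Hypothetical input: raw trace bytes.
with open("traces.bin", "rb") as f:
    data = f.read()

# zlib: default level (6) vs maximum (9); for our traces the difference
# in compressed size was negligible.
for level in (6, 9):
    print(f"zlib level {level}: {len(zlib.compress(data, level))} bytes")

# LZMA: higher presets take longer to compress but, as noted above, tend to
# yield smaller output and somewhat faster decompression.
for preset in (6, 9):
    start = time.perf_counter()
    out = lzma.compress(data, format=lzma.FORMAT_ALONE, preset=preset)
    print(f"lzma preset {preset}: {len(out)} bytes, "
          f"{time.perf_counter() - start:.2f} s to compress")
```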

Considering the importance of size, a particularly important case is one of our plots (representative of typical data to come) that was initially ~28MB, ~25MB after ZIP compression, and ~24MB after LZMA compression. Getting it under 25MB is pretty crucial in order to make the HTML plot an acceptable attachment on most email services; some even have a lower limit of 20MB. So even though this 1MB reduction seems small, it is actually the most important 1MB. In fact, there is a good argument that there is still more room for improvement.

There are other factors to consider with decompression, like memory usage while it runs. We didn't profile these directly, but other sources provide values, and those numbers look acceptable even on the lowest-end mobile devices. So memory usage has been ignored in the comparison above.


Refs


Attachment Refs


ref: https://support.google.com/mail/answer/6584
ref: https://help.yahoo.com/kb/SLN5673.html
ref: https://support.office.com/en-us/article/Send-large-files-to-other-people-7005da19-607a-47d5-b2c5-8f3982c6cc83
ref: https://technet.microsoft.com/en-us/library/exchange-online-limits.aspx#MessageLimits


Compression Benchmark Refs


ref: https://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO
ref: http://tukaani.org/lzma/benchmarks.html

jakirkham commented 7 years ago

While there is room to compress the plots further and it is still of value, the bulk of the plot size is due to the traces for any decent-sized dataset. So further investigation should focus on particular implementations that work better at compressing traces and on how to make them preserve a nice user experience. This is in contrast to this issue, which generally sought a method of compressing the plots and all objects in them. Also, while we can explore compression of other objects in the plots (e.g. the projection), these are a bit more involved and come with a smaller gain in typical situations (short time scale data may be a different story). In other words, the focus will shift to more targeted problems with clear goals, like switching to LZMA compression ( https://github.com/nanshe-org/nanshe_workflow/pull/36 ), compressing other smaller plot objects, and/or getting compression to occur at load time instead of selection time, as opposed to solving compression in general. As such, it makes sense to close this and open new threads to follow these more specific topics.