miranov25 / RootInteractive

Data compression server -> client for CDS #96

miranov25 opened this issue 3 years ago

miranov25 commented 3 years ago

Data compression - to reduce the data transfer from server to client:

- Lossy compression of CDS columns
- Lossless compression
- The same compression on the server needs a corresponding decompression on the client; we have to define which compression to use.

miranov25 commented 3 years ago

This issue is also connected to the joins on the client - memory overhead.

Example columns for a slice histogram:

These could be represented as 2 tables (map of arrays -> CDS) with a common bin index. Using the common bin index we can query across both tables, as sketched below.
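
A minimal pandas sketch of this layout (the column names are made up for illustration):

    import numpy as np
    import pandas as pd

    # Coarse "map of arrays" table: one row per bin, indexed by a common bin index.
    binMap = pd.DataFrame({
        "binIndex": np.arange(4),
        "qPtMean": [-1.5, -0.5, 0.5, 1.5],
        "tglMean": [0.1, 0.2, 0.3, 0.4],
    })

    # Per-slice statistics table referring to the same bins.
    sliceStats = pd.DataFrame({
        "binIndex": [0, 0, 1, 2, 3, 3],
        "dcaR_RMS": [0.12, 0.11, 0.10, 0.09, 0.13, 0.14],
    })

    # The join on the common bin index rebuilds the flat table on demand,
    # instead of shipping the redundant wide table in a single CDS.
    flat = sliceStats.merge(binMap, on="binIndex")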

miranov25 commented 3 years ago

Related to: Lossless compression of binary data

bokeh/bokeh#5792 - https://github.com/bokeh/bokeh/issues/5792

https://github.com/bokeh/bokeh/issues/5792#issuecomment-336596476

Bokeh now supports an unencoded pure binary transfer of arrays. This has afforded a huge speedup over the base64 encoding, e.g. it is now possible to interactively scrub changes to ~1500x1500 images. If someone really needs compression on top of this, it's now possible to do it themselves: send compressed data as a binary array, and decompress in BokehJS with a custom extension. There is also the possibility of using DataShader with Bokeh for very large images (cf. a recent demo of four 8000x8000 images being linked-zoomed simultaneously).

With these options, I can't see a justification for adding this complexity directly to core Bokeh, so closing this now.

miranov25 commented 3 years ago

Hello @pl0xz0rz

You can check the issue https://github.com/bokeh/bokeh/issues/5792#issuecomment-336596476. I did not fully understand what functionality the author of the comment (https://github.com/bryevdv) had in mind, but we should check.

Let's do experiments, so we can also send a request to Bokeh explaining the gain in data transfer and memory usage. I'm sure it is worth the increase in complexity; we should demonstrate this in a toy example.

We can try to summarize the requirements and gains in the form of a Google document (including figures). Our histogram dashboard use case should serve as an example.

miranov25 commented 3 years ago

Example data to compress:

Coding of histograms - generating a NumPy histogram and coding only its non-empty bins, as sketched below.
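
A minimal sketch, assuming the sparse (bin index, content) coding of mostly empty histogram bins:

    import numpy as np

    # Toy 2D histogram: most bins are empty, so coding only the non-empty
    # bins as (bin index, content) pairs is much smaller than the dense array.
    rng = np.random.default_rng(42)
    data = rng.normal(size=(10_000, 2))
    H, edges = np.histogramdd(data, bins=(100, 100), range=((-5, 5), (-5, 5)))

    flat = H.ravel()
    nonEmpty = np.flatnonzero(flat)            # indices of the non-empty bins
    contents = flat[nonEmpty].astype(np.int32) # corresponding bin contents
    print(f"dense: {flat.size} bins, sparse: {nonEmpty.size} non-empty")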

miranov25 commented 3 years ago

Example test to check the entropy of the data and the compression:

https://github.com/miranov25/RootInteractive/blob/9526a1cef628aa266ac19fba1b9f5742a87c926c/RootInteractive/InteractiveDrawing/bokeh/test_Compression.py

miranov25 commented 3 years ago

Algorithm for entropy coding:

    if (user coding) {
        float -> integer
        entropy coding
    } else {
        1. unique() gives the number of distinct values
        2. if unique << size:
            1. entropy of value_counts
            2. entropy of delta value_counts
            3. use the coding with the smaller entropy
    }
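
A runnable sketch of the else-branch above; the function names and the unique_ratio threshold are ours:

    import numpy as np

    def shannon_entropy(values):
        """Shannon entropy in bits per element, from the value frequencies."""
        _, counts = np.unique(values, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def choose_coding(column, unique_ratio=0.1):
        """Value coding vs. delta coding - pick the one with smaller entropy."""
        if np.unique(column).size >= unique_ratio * column.size:
            return "raw"  # too many distinct values for entropy coding to pay off
        h_values = shannon_entropy(column)
        h_deltas = shannon_entropy(np.diff(column))
        return "delta" if h_deltas < h_values else "values"
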
miranov25 commented 3 years ago

Example use case dashboard - DCA resolution maps - possible data reduction

Task description: https://alice.its.cern.ch/jira/browse/PWGPP-586

Dashboard source code - Jupyter notebook to be added (protected GitLab):

from bokeh.io import output_file
from RootInteractive.InteractiveDrawing.bokeh.bokehDrawSA import bokehDrawSA
# dfDCAMap0: pandas DataFrame with the DCA resolution map, loaded elsewhere

output_file("dcaResol0.html")
optionsAll={"colorZvar":"tpcSignalMMean"}
figureArray = [
    # raw RMS
    [['qPtMean'], ['dcaRTPCN_RMS'], optionsAll],
    [['qPtMean'], ['dcaZTPCN_RMS'], optionsAll],
    [['qPtMean'], ['dcaZTPCPull_RMS'], optionsAll],
    [['qPtMean'], ['dcaRTPCPull_RMS'], optionsAll],
    # RMS for primaries
    [['qPtMean'], ['dcaRTPCNPrim_RMS'], optionsAll],
    [['qPtMean'], ['dcaZTPCNPrim_RMS'], optionsAll],
    [['qPtMean'], ['dcaRTPCPullPrim_RMS'], optionsAll],
    [['qPtMean'], ['dcaZTPCPullPrim_RMS'], optionsAll],
    # ratios all/primaries (the last two panels repeat the ratios to fill the 4-column row)
    [['qPtMean'], ['dcaRTPCN_RMS/dcaRTPCNPrim_RMS'], optionsAll],
    [['qPtMean'], ['dcaZTPCN_RMS/dcaZTPCNPrim_RMS'], optionsAll],
    [['qPtMean'], ['dcaRTPCN_RMS/dcaRTPCNPrim_RMS'], optionsAll],
    [['qPtMean'], ['dcaZTPCN_RMS/dcaZTPCNPrim_RMS'], optionsAll],
    {"size": 4}
]
widgetParams=[
    ['range', ['tglMean']],
    ['range', ['tpcSignalMMean']],
    ['range', ['qPtMean']],
    ['range', ['alphaMean']],
]
tooltips = [("qPtMean", "@qPtMean")]
widgetLayoutDesc=[ [0,1],[2,3], {'sizing_mode':'scale_width'} ]
figureLayoutDesc=[
    [0,1,2,3, {'plot_height':250,"commonY":0,'x_visible':1}],
    [4,5,6,7, {'plot_height':250,"commonY":0,'x_visible':1}],
    [8,9,10,11, {'plot_height':250,"commonY":4,'x_visible':1}],
    {'plot_height':250,'sizing_mode':'scale_width',"legend_visible":True}
]
fig = bokehDrawSA.fromArray(
    dfDCAMap0, "entries>100", figureArray, widgetParams,
    layout=figureLayoutDesc, tooltips=tooltips, sizing_mode='scale_width',
    widgetLayout=widgetLayoutDesc, nPointRender=1000, rescaleColorMapper=True)
miranov25 commented 3 years ago

Lossy + lossless (zlib) compression of the CDS - histogram emulation:

Real data use case (5-dimensional DCA histogram, 24×10^6 bins): https://github.com/miranov25/RootInteractive/blob/d79a5433ad52497560e5311aad911592f5e404c9/RootInteractive/InteractiveDrawing/bokeh/test_Compression.py#L46

miranov25 commented 3 years ago

Similar approach - lossy -> lossless

https://stackoverflow.com/questions/22400652/compress-numpy-arrays-efficiently/29111682#29111682

1.) Noise is incompressible. Thus, any part of the data that is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24-16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.

2.) Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate from noise and discard the noise). For example, if you know your data is bandwidth limited to 10MHz and you're sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.

3.) A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc.). Denoise could be the same as the bandwidth limit, or a nonlinear filter like a running median. The bandwidth limit can be implemented with an FIR/IIR filter. Delta compression is just y[n] = x[n] - x[n-1].
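
A minimal sketch of the suggested chain, with quantization standing in for the denoise/bandwidth-limit steps:

    import zlib
    import numpy as np

    def compress_pipeline(x, step=1e-3):
        """Quantize (lossy) -> delta code -> zlib (lossless)."""
        q = np.round(np.asarray(x) / step).astype(np.int64)
        deltas = np.diff(q, prepend=np.int64(0))  # y[n] = x[n] - x[n-1]
        return zlib.compress(deltas.tobytes(), 9)

    def decompress_pipeline(blob, step=1e-3):
        deltas = np.frombuffer(zlib.decompress(blob), dtype=np.int64)
        return np.cumsum(deltas) * step  # cumulative sum undoes the delta coding

    x = np.cumsum(np.random.default_rng(0).normal(size=100_000))  # smooth-ish signal
    blob = compress_pipeline(x)
    print(f"compression factor: {x.nbytes / len(blob):.1f}")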

miranov25 commented 3 years ago

Using histogram coding - compression to a factor of 10^-4 for the indices (compared with 10^-3 when using the values themselves)

miranov25 commented 3 years ago

replace in pandas is slow - checking another library for the coding - attempting to use Vaex instead of pandas. A NumPy alternative is sketched below.
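
A NumPy-only alternative for the dictionary-coding step; whether it beats Vaex here is untested:

    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.default_rng(1).choice([0.5, 1.5, 2.5], size=1_000_000))

    # pandas: values.replace({0.5: 0, 1.5: 1, 2.5: 2})  # slow on large Series
    # NumPy: one pass returns both the dictionary and the integer codes
    dictionary, codes = np.unique(values.to_numpy(), return_inverse=True)
    assert np.array_equal(dictionary[codes], values.to_numpy())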

miranov25 commented 3 years ago

Implementing relative precision rounding and a time benchmark:

Relative compression - binary with nBits: the compression factor (zlib) was compared with the data entropy.

Comparison (a sketch follows):
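
A minimal sketch of the comparison, assuming nBits counts the kept float32 mantissa bits; the helper names are ours, and the real RootInteractive implementation may round rather than truncate:

    import zlib
    import numpy as np

    def round_relative(x, n_bits):
        """Keep n_bits of the float32 mantissa (truncation for simplicity)."""
        x = np.asarray(x, dtype=np.float32).copy()
        bits = x.view(np.uint32)
        bits &= np.uint32((0xFFFFFFFF << (23 - n_bits)) & 0xFFFFFFFF)
        return x

    def entropy_bits(a):
        """Shannon entropy in bits per element."""
        _, counts = np.unique(a, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    data = np.random.default_rng(2).normal(1.0, 0.1, 100_000).astype(np.float32)
    for nBits in (4, 8, 12):
        r = round_relative(data, nBits)
        factor = data.nbytes / len(zlib.compress(r.tobytes(), 9))
        print(f"nBits={nBits:2d} entropy={entropy_bits(r):5.2f} bits zlib factor={factor:4.1f}")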

miranov25 commented 3 years ago

Trying to use pako to inflate data

npm i pako
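
On the server side, zlib produces a stream that pako.inflate understands; a minimal sketch (base64 only to embed the payload in JSON):

    import base64
    import zlib
    import numpy as np

    column = np.arange(100_000, dtype=np.float32)
    payload = base64.b64encode(zlib.compress(column.tobytes(), 9)).decode("ascii")
    # Client side (JavaScript, e.g. in a BokehJS custom extension):
    #   const bytes = pako.inflate(Uint8Array.from(atob(payload), c => c.charCodeAt(0)));
    #   const column = new Float32Array(bytes.buffer);
    print(f"{column.nbytes} bytes -> {len(payload)} base64 characters")
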
miranov25 commented 3 years ago

#96 - compression of arrays as a pipeline of actions + a unit test of the compression; a sketch of the idea follows.
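
A self-contained sketch of the pipeline-of-actions idea with a unit-test-style check; the action set and signatures are ours:

    import zlib
    import numpy as np

    def compress_column(column, actions):
        """Fold the forward actions; return the payload and the inverse chain."""
        data, inverses = column, []
        for forward, inverse in actions:
            data = forward(data)
            inverses.append(inverse)
        return data, inverses[::-1]  # the client applies them in reverse order

    def decompress_column(data, inverses):
        for inverse in inverses:
            data = inverse(data)
        return data

    # Example pipeline: delta-code sorted indices, then deflate.
    pipeline = [
        (lambda a: np.diff(a, prepend=np.int64(0)), np.cumsum),
        (lambda a: zlib.compress(a.tobytes(), 9),
         lambda b: np.frombuffer(zlib.decompress(b), dtype=np.int64)),
    ]
    indices = np.sort(np.random.default_rng(3).integers(0, 10**6, 50_000))
    blob, inverses = compress_column(indices, pipeline)
    assert np.array_equal(decompress_column(blob, inverses), indices)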

miranov25 commented 3 years ago

Data source compression pseudo-algorithm:
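
Assembling the steps discussed in this thread into one hedged sketch (quantize, dictionary-code, deflate; all names are ours):

    import zlib
    import numpy as np

    def compress_cds_column(column, step=1e-3):
        """Sketch: lossy quantization -> integer dictionary codes -> zlib."""
        q = np.round(np.asarray(column) / step).astype(np.int32)   # lossy step
        dictionary, codes = np.unique(q, return_inverse=True)      # integer coding
        blob = zlib.compress(codes.astype(np.int32).tobytes(), 9)  # lossless step
        return blob, dictionary

    def decompress_cds_column(blob, dictionary, step=1e-3):
        codes = np.frombuffer(zlib.decompress(blob), dtype=np.int32)
        return dictionary[codes] * step  # client-side reconstruction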

miranov25 commented 3 years ago

DCA example - the need for joins

variables:

To represent them we should have 2 CDSs:

    import numpy as np
    sigma0 = 1.0  # nominal DCA resolution scale (placeholder value)

    # qPt, tgl, mdEdx, alpha, dcaR
    rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
    bins = [50, 20, 20, 12, 400]
    H_R, edges_R = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)

    # qPt, tgl, mdEdx, alpha, dcaZ
    rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
    bins = [50, 20, 20, 12, 100]
    H_Z, edges_Z = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)
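
The two histograms share the first four axes (qPt, tgl, mdEdx, alpha), so the flattened common bin index can serve as the join key on the client. A pandas sketch continuing the snippet above; the reduction to entry counts is only illustrative:

    import pandas as pd

    # Reduce each histogram over its last (dca) axis; the shared 4D bin
    # index becomes the join key.
    nBinsCommon = 50 * 20 * 20 * 12
    dcaR = pd.DataFrame({"binIndex": np.arange(nBinsCommon),
                         "entriesR": H_R.sum(axis=4).ravel()})
    dcaZ = pd.DataFrame({"binIndex": np.arange(nBinsCommon),
                         "entriesZ": H_Z.sum(axis=4).ravel()})

    # One join on the common bin index instead of duplicating the 4D axes in both CDSs.
    joined = dcaR.merge(dcaZ, on="binIndex")
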
miranov25 commented 3 years ago

User interface for joins:

Exception handling:

miranov25 commented 3 years ago

Problems in Jupyter notebook - to be updated

The pako package needed to run the compression is not added to the cell. It is similar to the problem discussed in https://discourse.bokeh.org/t/bokeh-katex-jupyter/4232

miranov25 commented 3 months ago

New functionality for client-side compression is currently under development.

See documentation in https://github.com/miranov25/RootInteractive/blob/master/RootInteractive/InteractiveDrawing/bokeh/doc/READMEcompression.md

For more details, refer to this issue.

Issue closed.