miranov25 opened 3 years ago
Example columns for slice histogram
These could be represented as two tables (a map of arrays -> CDS) with a common bin index. Using the common bin index we can query slices.
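A minimal sketch of this representation (a 2D histogram for brevity; the column names are hypothetical): the bin axes go into one table, the histogram contents into another, and both share a flat bin index.

```python
import numpy as np
import pandas as pd

# Toy 2D histogram
counts, xEdges, yEdges = np.histogram2d(
    np.random.normal(size=10000), np.random.normal(size=10000), bins=(50, 20))
binIndex = np.arange(counts.size)
ix, iy = np.unravel_index(binIndex, counts.shape)

# Table 1: bin axes (centers), keyed by the common bin index
xCenter = 0.5 * (xEdges[:-1] + xEdges[1:])
yCenter = 0.5 * (yEdges[:-1] + yEdges[1:])
dfAxes = pd.DataFrame({"binIndex": binIndex,
                       "xCenter": xCenter[ix], "yCenter": yCenter[iy]})

# Table 2: histogram contents, keyed by the same bin index
dfContent = pd.DataFrame({"binIndex": binIndex, "entries": counts.ravel()})

# Query a slice through the common bin index
sliceIndex = dfAxes.loc[dfAxes["xCenter"].abs() < 0.5, "binIndex"]
print(dfContent.loc[sliceIndex, "entries"].sum())
```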
Related to: Lossless compression of binary data
https://github.com/bokeh/bokeh/issues/5792#issuecomment-336596476
> Bokeh now supports unencoded pure binary transfer of arrays. This has afforded a huge speedup over the base64 encoding, e.g. it is now possible to interactively scrub changes to ~1500x1500 images. If someone really needs compression on top of this, it is now possible to do it themselves: send the compressed data as a binary array, and decompress it in BokehJS with a custom extension. There is also the possibility of using DataShader with Bokeh for very large images (cf. a recent demo of four 8000x8000 images being linked-zoomed simultaneously).
> With these options, I can't see a justification for adding this complexity directly to core Bokeh, so closing this now.
Hello @pl0xz0rz
You can check issue https://github.com/bokeh/bokeh/issues/5792#issuecomment-336596476. I did not fully understand what functionality the author of the comment (https://github.com/bryevdv) had in mind, but we should check.
Let's do experiments, so we can also send a request to Bokeh explaining the gain in data transfer and memory usage. I'm sure it is worth the increase in complexity; we should demonstrate this in a toy example.
We can try to summarize the requirements and gains in the form of a Google document (including figures). Our histogram dashboard use case should be used as an example.
Coding of histograms - generate a NumPy histogram:

- 3D random normal distribution: https://github.com/miranov25/RootInteractive/blob/9526a1cef628aa266ac19fba1b9f5742a87c926c/RootInteractive/InteractiveDrawing/bokeh/test_Compression.py#L6
- edges entropy
- delta edges entropy
- histogram content entropy
```
if (user coding) {
    float -> integer
    entropy coding
} else {
    1. unique() gives the number of distinct values
    2. if nUnique << size:
       - entropy of value_counts
       - entropy of delta value_counts
       - use the coding with the smaller entropy
}
```
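A minimal Python sketch of the "else" branch above, with a hypothetical `entropy()` helper and an arbitrary nUnique threshold; the actual decision logic in RootInteractive may differ:

```python
import numpy as np
import pandas as pd

def entropy(series):
    # Shannon entropy (bits/entry) of the empirical value distribution
    p = series.value_counts(normalize=True).to_numpy()
    return -(p * np.log2(p)).sum()

def chooseCoding(series):
    # 1. unique gives the number of distinct values
    nUnique = series.nunique()
    if nUnique * 10 >= series.size:   # not nUnique << size: no gain expected
        return "raw"
    # 2. compare the entropy of the values with the entropy of their deltas
    hValue = entropy(series)
    hDelta = entropy(series.diff().iloc[1:])
    # 3. use the coding with the smaller entropy
    return "delta" if hDelta < hValue else "value"

s = pd.Series(np.repeat(np.linspace(-5, 5, 50), 200))  # 50 bin centers, repeated
print(chooseCoding(s))  # sorted, repetitive data -> deltas mostly zero -> "delta"
```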
See https://rootinteractive.web.cern.ch/RootInteractive/data/PWGPP-586/dcaMapResol.ipynb
A similar data compression (lossy with relative resolution + lossless) is used in AliRoot.
In the example below only a subset of the data is used.
Data volume in the dashboard example:
```python
from bokeh.io import output_file
from RootInteractive.InteractiveDrawing.bokeh.bokehDrawSA import bokehDrawSA
# dfDCAMap0 (used below) is prepared earlier in the notebook

output_file("dcaResol0.html")
optionsAll={"colorZvar":"tpcSignalMMean"}
figureArray = [
[['qPtMean'], ['dcaRTPCN_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCN_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCPull_RMS'], optionsAll],
[['qPtMean'], ['dcaRTPCPull_RMS'], optionsAll],
#
[['qPtMean'], ['dcaRTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaRTPCPullPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCPullPrim_RMS'], optionsAll],
#
[['qPtMean'], ['dcaRTPCN_RMS/dcaRTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCN_RMS/dcaZTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaRTPCN_RMS/dcaRTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCN_RMS/dcaZTPCNPrim_RMS'], optionsAll],
{"size": 4}
]
widgetParams=[
['range', ['tglMean']],
['range', ['tpcSignalMMean']],
['range', ['qPtMean']],
['range', ['alphaMean']],
]
tooltips = [("qPtMean", "@qPtMean")]
widgetLayoutDesc=[ [0,1],[2,3], {'sizing_mode':'scale_width'} ]
figureLayoutDesc=[
[0,1,2,3, {'plot_height':250,"commonY":0,'x_visible':1}],
[4,5,6,7, {'plot_height':250,"commonY":0,'x_visible':1}],
[8,9,10,11, {'plot_height':250,"commonY":4,'x_visible':1}],
{'plot_height':250,'sizing_mode':'scale_width',"legend_visible":True}
]
fig = bokehDrawSA.fromArray(dfDCAMap0, "entries>100", figureArray, widgetParams,
                            layout=figureLayoutDesc, tooltips=tooltips,
                            sizing_mode='scale_width', widgetLayout=widgetLayoutDesc,
                            nPointRender=1000, rescaleColorMapper=True)
```
Real data use case (5-dimensional DCA histogram, 24·10^6 bins): https://github.com/miranov25/RootInteractive/blob/d79a5433ad52497560e5311aad911592f5e404c9/RootInteractive/InteractiveDrawing/bokeh/test_Compression.py#L46
```python
# https://stackoverflow.com/questions/47057832/use-zlib-js-to-decompress-python-zlib-compress
import zlib
import pickle

dfD = df.shift(-1)
# compress the raw column vs. the rounded differences of consecutive values
compress0 = zlib.compress(df["qPtCenter"].to_numpy())
compressD = zlib.compress((dfD["qPtCenter"] - df["qPtCenter"])[0:-1].round(5).to_numpy())
print(len(pickle.dumps(df["qPtCenter"])), len(pickle.dumps(compress0)),
      len(pickle.dumps(compressD)),
      len(pickle.dumps(compressD)) / len(pickle.dumps(df["qPtCenter"])))
```
Output:

```
192000662 280119 187263 0.0009753247621614972
```
https://stackoverflow.com/questions/22400652/compress-numpy-arrays-efficiently/29111682#29111682
> 1.) Noise is incompressible. Thus, any part of the data that you have which is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24-16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.
> 2.) Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate from noise and discard the noise). For example, if you know your data is bandwidth limited to 10MHz and you're sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
> 3.) A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] - x[n-1].
```python
import numpy as np

# replace the distinct float values by small integer codes
valuesOrig = df["qPtCenter"].unique()
valuesReplace = np.arange(valuesOrig.size)
df["qPtCenterCode8"] = df["qPtCenter"].replace(valuesOrig, valuesReplace).astype("int8")
df["qPtCenterCode16"] = df["qPtCenter"].replace(valuesOrig, valuesReplace).astype("int16")
diffqPt = df["qPtCenterCode8"].diff()[1:].astype("int8")
size0 = len(pickle.dumps(df["qPtCenter"].to_numpy()))
size1 = len(pickle.dumps(zlib.compress(df["qPtCenter"].to_numpy())))
sizeC8 = len(pickle.dumps(zlib.compress(df["qPtCenterCode8"].to_numpy())))
sizeC16 = len(pickle.dumps(zlib.compress(df["qPtCenterCode16"].to_numpy())))
sizeD = len(pickle.dumps(zlib.compress(diffqPt.to_numpy())))
print(size0, size1, sizeC8, sizeC16, sizeD, sizeC8 / size0)
```

Output:

```
192000161 280119 23456 46799 23515 0.00012216656422491228
```
```python
%%time
df["tglCenter"].unique()
```

```
CPU times: user 110 ms, sys: 195 µs, total: 110 ms
Wall time: 108 ms
```

```python
%%time
dfv["tglCenter"].unique()
```

```
CPU times: user 477 ms, sys: 6.17 ms, total: 483 ms
Wall time: 62.6 ms
```
Relative compression - binary with nBits: the compression factor (zlib) was compared with the data entropy.
test_Compression0():

| column | nUnique | size (bytes) | zlib size (bytes) | ratio |
| --- | --- | --- | --- | --- |
| qPtCenter | 50 | 24000159 | 23456 | 0.00098 |
| tglCenter | 20 | 24000159 | 24566 | 0.00102 |
| mdEdxCenter | 20 | 24000159 | 87640 | 0.00365 |
| V | 100 | 24000159 | 81611 | 0.00340 |
| alphaCenter | 12 | 24000159 | 128028 | 0.00533 |
| mean | 389 | 48000159 | 48436 | 0.00101 |
| rms | 770 | 48000159 | 86435 | 0.00180 |
| weight | 1139 | 48000159 | 13761129 | 0.28669 |
| weightR5 | 115 | 24000159 | 7455282 | 0.31063 |
| weightR7 | 329 | 48000159 | 12065507 | 0.25136 |
| weightR8 | 529 | 48000159 | 12904398 | 0.26884 |
test_roundRelativeBinary(df):

```
0.09624060780074667 0.06656287196095499 0.045204186523616366 0.03675160498174379
0.07041459043666662 0.06584685098152705 0.06259448837842638 0.05465214964511605
4.506533787946664 4.214198462817731 4.006047256219288 3.497737577287427
```
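For illustration, a minimal sketch of relative rounding to nBits by mantissa truncation (assuming float32 input; the actual roundRelativeBinary in RootInteractive may differ, e.g. it may round rather than truncate):

```python
import numpy as np

def roundRelativeSketch(values, nBits):
    # Keep only nBits of the float32 mantissa by zeroing the low bits,
    # i.e. quantize to a relative precision of ~2**-nBits; the resulting
    # repeated bit patterns compress much better with zlib.
    a = np.asarray(values, dtype=np.float32)
    shift = 23 - nBits                               # float32 has a 23-bit mantissa
    mask = np.uint32((0xFFFFFFFF << shift) & 0xFFFFFFFF)
    return (a.view(np.uint32) & mask).view(np.float32)

x = np.random.normal(1.0, 0.1, 1000).astype(np.float32)
print(np.abs(roundRelativeSketch(x, 7) / x - 1).max())  # below ~2**-7
```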
Install the pako package (needed for client-side decompression): `npm i pako`
Compression is implemented as a pipeline of actions in RootInteractive/Tools/compressArray.py.
Unit tests: RootInteractive/InteractiveDrawing/bokeh/test_Compression.py
```
cat testCompression.py | grep -P "(def test|actionArray=)"
def test_CompressionSequence0(arraySize=10000):
    actionArray=[("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSequenceRel(arraySize=255,nBits=5):
    actionArray=[("relative",nBits), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSequenceAbs(arraySize=255,delta=0.1):
    actionArray=[("delta",delta), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSample0(arraySize=10000,scale=255):
    actionArray=[("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSampleRel(arraySize=10000,scale=255, nBits=7):
    actionArray=[("relative",nBits), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSampleDelta(arraySize=10000,scale=255, delta=1):
    actionArray=[("delta",delta), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSampleDeltaCode(arraySize=10000,scale=255, delta=1):
    actionArray=[("delta",delta), ("code",0), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8"),("decode",0)]
```
Data source compression pseudo-algorithm. To represent the variables we should have two CDSs (one per histogram):
```python
import numpy as np
# sigma0 is defined elsewhere

# qPt, tgl, mdEdx, alpha, dcaR
rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
bins = [50, 20, 20, 12, 400]
H, edges = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)

# qPt, tgl, mdEdx, alpha, dcaZ
rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
bins = [50, 20, 20, 12, 100]
H, edges = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)
```
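A sketch of flattening one such histogramdd result into the map of arrays that feeds a CDS (the reduced bin counts and the dcaCenter column name are assumptions of this sketch):

```python
import numpy as np

sigma0 = 1.0  # assumed here; defined elsewhere in the real test
rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
bins = [5, 4, 4, 3, 10]  # reduced for the sketch; the real map uses e.g. [50, 20, 20, 12, 100]
H, edges = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)

# One entry per bin: the centers of each axis plus the bin content,
# all columns sharing the same flat bin index
names = ["qPtCenter", "tglCenter", "mdEdxCenter", "alphaCenter", "dcaCenter"]
centers = [0.5 * (e[:-1] + e[1:]) for e in edges]
grid = np.meshgrid(*centers, indexing="ij")
cds = {name: g.ravel() for name, g in zip(names, grid)}
cds["weight"] = H.ravel()
print({k: v.shape for k, v in cds.items()})
```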
The pako package needed to run the decompression is not added to the cell. This is similar to the problem discussed in https://discourse.bokeh.org/t/bokeh-katex-jupyter/4232:
```
Uncaught ReferenceError: pako is not defined
    at CDSCompress.inflateCompressedBokehBase64 (eval at append_javascript (main.min.js:45772), <anonymous>:485:23)
    at CDSCompress.inflateCompressedBokehObjectBase64 (eval at append_javascript (main.min.js:45772), <anonymous>:513:37)
    at CDSCompress.initialize (eval at append_javascript (main.min.js:45772), <anonymous>:469:20)
    at CDSCompress.finalize (cdn.bokeh.org/bokeh/release/bokeh-2.2.3.min.js:180)
```
New functionality for client-side compression is currently under development. See the documentation in https://github.com/miranov25/RootInteractive/blob/master/RootInteractive/InteractiveDrawing/bokeh/doc/READMEcompression.md and this issue for more details.
Issue closed.
Data compression - to reduce the data transfer from server to client:

- Lossy compression of CDS columns
- Lossless compression
- The same compression on the server -> the corresponding decompression on the client; it has to be defined which compression to use (see the sketch below)
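A minimal sketch of that symmetric pair, assuming zlib + base64 on the Python server side; on the client, pako.inflate is the matching decompressor (mirrored here in Python for illustration):

```python
import base64
import zlib
import numpy as np

# Server side: compress a CDS column and base64-encode it for transfer
column = np.linspace(0, 1, 100000).astype("float32")
payload = base64.b64encode(zlib.compress(column.tobytes())).decode("ascii")
print(len(payload), "characters transferred instead of", column.nbytes, "bytes")

# Client side (in BokehJS: pako.inflate on the decoded payload -> Float32Array);
# mirrored here with zlib so the round trip can be checked in Python
restored = np.frombuffer(zlib.decompress(base64.b64decode(payload)), dtype="float32")
print(np.array_equal(restored, column))  # lossless round trip -> True
```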