miranov25 opened 3 years ago
Example columns for slice histogram
These could be represented as two tables (a map of arrays -> CDS) with a common bin index. Using the common bin index we can query slices.
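A minimal sketch of this representation (a 2D histogram for brevity; the column names are hypothetical): the bin axes go into one table, the histogram contents into another, and both share a flat bin index.

```python
import numpy as np
import pandas as pd

# Toy 2D histogram
counts, xEdges, yEdges = np.histogram2d(
    np.random.normal(size=10000), np.random.normal(size=10000), bins=(50, 20))
binIndex = np.arange(counts.size)
ix, iy = np.unravel_index(binIndex, counts.shape)

# Table 1: bin axes (centers), keyed by the common bin index
xCenter = 0.5 * (xEdges[:-1] + xEdges[1:])
yCenter = 0.5 * (yEdges[:-1] + yEdges[1:])
dfAxes = pd.DataFrame({"binIndex": binIndex,
                       "xCenter": xCenter[ix], "yCenter": yCenter[iy]})

# Table 2: histogram contents, keyed by the same bin index
dfContent = pd.DataFrame({"binIndex": binIndex, "entries": counts.ravel()})

# Query a slice through the common bin index
sliceIndex = dfAxes.loc[dfAxes["xCenter"].abs() < 0.5, "binIndex"]
print(dfContent.loc[sliceIndex, "entries"].sum())
```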
Related to: Lossless compression of binary data
https://github.com/bokeh/bokeh/issues/5792#issuecomment-336596476
> Bokeh now supports unencoded pure binary transfer of arrays. This has afforded a huge speedup over the base64 encoding, e.g. it is now possible to interactively scrub changes to ~1500x1500 images. If someone really needs compression on top of this, it is now possible to do it themselves: send the compressed data as a binary array, and decompress it in BokehJS with a custom extension. There is also the possibility of using DataShader with Bokeh for very large images (cf. a recent demo of four 8000x8000 images being linked-zoomed simultaneously).
> With these options, I can't see a justification for adding this complexity directly to core Bokeh, so closing this now.
Hello @pl0xz0rz
You can check issue https://github.com/bokeh/bokeh/issues/5792#issuecomment-336596476. I did not fully understand what functionality the author of the comment (https://github.com/bryevdv) had in mind, but we should check.
Let's do experiments, so we can also send a request to Bokeh explaining the gain in data transfer and memory usage. I'm sure it is worth the increase in complexity; we should demonstrate this in a toy example.
We can try to summarize the requirements and gains in the form of a Google document (including figures). Our histogram dashboard use case should be used as an example.
Coding of histograms - generate a NumPy histogram:

- 3D random normal distribution: https://github.com/miranov25/RootInteractive/blob/9526a1cef628aa266ac19fba1b9f5742a87c926c/RootInteractive/InteractiveDrawing/bokeh/test_Compression.py#L6
- edges entropy
- delta edges entropy
- histogram content entropy
```
if (user coding) {
    float -> integer
    entropy coding
} else {
    1. unique() gives the number of distinct values
    2. if nUnique << size:
       - entropy of value_counts
       - entropy of delta value_counts
       - use the coding with the smaller entropy
}
```
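A minimal Python sketch of the "else" branch above, with a hypothetical `entropy()` helper and an arbitrary nUnique threshold; the actual decision logic in RootInteractive may differ:

```python
import numpy as np
import pandas as pd

def entropy(series):
    # Shannon entropy (bits/entry) of the empirical value distribution
    p = series.value_counts(normalize=True).to_numpy()
    return -(p * np.log2(p)).sum()

def chooseCoding(series):
    # 1. unique gives the number of distinct values
    nUnique = series.nunique()
    if nUnique * 10 >= series.size:   # not nUnique << size: no gain expected
        return "raw"
    # 2. compare the entropy of the values with the entropy of their deltas
    hValue = entropy(series)
    hDelta = entropy(series.diff().iloc[1:])
    # 3. use the coding with the smaller entropy
    return "delta" if hDelta < hValue else "value"

s = pd.Series(np.repeat(np.linspace(-5, 5, 50), 200))  # 50 bin centers, repeated
print(chooseCoding(s))  # sorted, repetitive data -> deltas mostly zero -> "delta"
```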
See https://rootinteractive.web.cern.ch/RootInteractive/data/PWGPP-586/dcaMapResol.ipynb
A similar data compression (lossy with relative resolution + lossless) is used in AliRoot.
In the example below only a subset of the data is used.
Data volume in the dashboard example:
```python
from bokeh.io import output_file
from RootInteractive.InteractiveDrawing.bokeh.bokehDrawSA import bokehDrawSA
# dfDCAMap0 (used below) is prepared earlier in the notebook

output_file("dcaResol0.html")
optionsAll={"colorZvar":"tpcSignalMMean"}
figureArray = [
[['qPtMean'], ['dcaRTPCN_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCN_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCPull_RMS'], optionsAll],
[['qPtMean'], ['dcaRTPCPull_RMS'], optionsAll],
#
[['qPtMean'], ['dcaRTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaRTPCPullPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCPullPrim_RMS'], optionsAll],
#
[['qPtMean'], ['dcaRTPCN_RMS/dcaRTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCN_RMS/dcaZTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaRTPCN_RMS/dcaRTPCNPrim_RMS'], optionsAll],
[['qPtMean'], ['dcaZTPCN_RMS/dcaZTPCNPrim_RMS'], optionsAll],
{"size": 4}
]
widgetParams=[
['range', ['tglMean']],
['range', ['tpcSignalMMean']],
['range', ['qPtMean']],
['range', ['alphaMean']],
]
tooltips = [("qPtMean", "@qPtMean")]
widgetLayoutDesc=[ [0,1],[2,3], {'sizing_mode':'scale_width'} ]
figureLayoutDesc=[
[0,1,2,3, {'plot_height':250,"commonY":0,'x_visible':1}],
[4,5,6,7, {'plot_height':250,"commonY":0,'x_visible':1}],
[8,9,10,11, {'plot_height':250,"commonY":4,'x_visible':1}],
{'plot_height':250,'sizing_mode':'scale_width',"legend_visible":True}
]
fig = bokehDrawSA.fromArray(dfDCAMap0, "entries>100", figureArray, widgetParams,
                            layout=figureLayoutDesc, tooltips=tooltips,
                            sizing_mode='scale_width', widgetLayout=widgetLayoutDesc,
                            nPointRender=1000, rescaleColorMapper=True)
```
Real data use case (5-dimensional DCA histogram, 24·10^6 bins): https://github.com/miranov25/RootInteractive/blob/d79a5433ad52497560e5311aad911592f5e404c9/RootInteractive/InteractiveDrawing/bokeh/test_Compression.py#L46
```python
# https://stackoverflow.com/questions/47057832/use-zlib-js-to-decompress-python-zlib-compress
import zlib
import pickle

dfD = df.shift(-1)
# compress the raw column vs. the rounded differences of consecutive values
compress0 = zlib.compress(df["qPtCenter"].to_numpy())
compressD = zlib.compress((dfD["qPtCenter"] - df["qPtCenter"])[0:-1].round(5).to_numpy())
print(len(pickle.dumps(df["qPtCenter"])), len(pickle.dumps(compress0)),
      len(pickle.dumps(compressD)),
      len(pickle.dumps(compressD)) / len(pickle.dumps(df["qPtCenter"])))
```
Output:

```
192000662 280119 187263 0.0009753247621614972
```
https://stackoverflow.com/questions/22400652/compress-numpy-arrays-efficiently/29111682#29111682
> 1.) Noise is incompressible. Thus, any part of the data that you have which is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have 24 bits per sample with an effective number of bits (ENOB) equal to 16 bits, the remaining 24-16 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Non-uniform noise is compressible to the extent to which it is non-uniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.
> 2.) Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate from noise and discard the noise). For example, if you know your data is bandwidth limited to 10MHz and you're sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called "compressive sensing" which is related to this.
> 3.) A practical suggestion, suitable for many kinds of reasonably continuous data: denoise -> bandwidth limit -> delta compress -> gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] - x[n-1].
```python
import numpy as np

# replace the distinct float values by small integer codes
valuesOrig = df["qPtCenter"].unique()
valuesReplace = np.arange(valuesOrig.size)
df["qPtCenterCode8"] = df["qPtCenter"].replace(valuesOrig, valuesReplace).astype("int8")
df["qPtCenterCode16"] = df["qPtCenter"].replace(valuesOrig, valuesReplace).astype("int16")
diffqPt = df["qPtCenterCode8"].diff()[1:].astype("int8")
size0 = len(pickle.dumps(df["qPtCenter"].to_numpy()))
size1 = len(pickle.dumps(zlib.compress(df["qPtCenter"].to_numpy())))
sizeC8 = len(pickle.dumps(zlib.compress(df["qPtCenterCode8"].to_numpy())))
sizeC16 = len(pickle.dumps(zlib.compress(df["qPtCenterCode16"].to_numpy())))
sizeD = len(pickle.dumps(zlib.compress(diffqPt.to_numpy())))
print(size0, size1, sizeC8, sizeC16, sizeD, sizeC8 / size0)
```

Output:

```
192000161 280119 23456 46799 23515 0.00012216656422491228
```
```python
%%time
df["tglCenter"].unique()
```

```
CPU times: user 110 ms, sys: 195 µs, total: 110 ms
Wall time: 108 ms
```

```python
%%time
dfv["tglCenter"].unique()
```

```
CPU times: user 477 ms, sys: 6.17 ms, total: 483 ms
Wall time: 62.6 ms
```
Relative compression - binary with nBits: the compression factor (zlib) was compared with the data entropy.
test_Compression0():

| column | nUnique | size (bytes) | zlib size (bytes) | ratio |
| --- | --- | --- | --- | --- |
| qPtCenter | 50 | 24000159 | 23456 | 0.00098 |
| tglCenter | 20 | 24000159 | 24566 | 0.00102 |
| mdEdxCenter | 20 | 24000159 | 87640 | 0.00365 |
| V | 100 | 24000159 | 81611 | 0.00340 |
| alphaCenter | 12 | 24000159 | 128028 | 0.00533 |
| mean | 389 | 48000159 | 48436 | 0.00101 |
| rms | 770 | 48000159 | 86435 | 0.00180 |
| weight | 1139 | 48000159 | 13761129 | 0.28669 |
| weightR5 | 115 | 24000159 | 7455282 | 0.31063 |
| weightR7 | 329 | 48000159 | 12065507 | 0.25136 |
| weightR8 | 529 | 48000159 | 12904398 | 0.26884 |
test_roundRelativeBinary(df):

```
0.09624060780074667 0.06656287196095499 0.045204186523616366 0.03675160498174379
0.07041459043666662 0.06584685098152705 0.06259448837842638 0.05465214964511605
4.506533787946664 4.214198462817731 4.006047256219288 3.497737577287427
```
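For illustration, a minimal sketch of relative rounding to nBits by mantissa truncation (assuming float32 input; the actual roundRelativeBinary in RootInteractive may differ, e.g. it may round rather than truncate):

```python
import numpy as np

def roundRelativeSketch(values, nBits):
    # Keep only nBits of the float32 mantissa by zeroing the low bits,
    # i.e. quantize to a relative precision of ~2**-nBits; the resulting
    # repeated bit patterns compress much better with zlib.
    a = np.asarray(values, dtype=np.float32)
    shift = 23 - nBits                               # float32 has a 23-bit mantissa
    mask = np.uint32((0xFFFFFFFF << shift) & 0xFFFFFFFF)
    return (a.view(np.uint32) & mask).view(np.float32)

x = np.random.normal(1.0, 0.1, 1000).astype(np.float32)
print(np.abs(roundRelativeSketch(x, 7) / x - 1).max())  # below ~2**-7
```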
Install the pako package (needed for client-side decompression): `npm i pako`
Compression is implemented as a pipeline of actions in RootInteractive/Tools/compressArray.py.
Unit tests: RootInteractive/InteractiveDrawing/bokeh/test_Compression.py
```
cat testCompression.py | grep -P "(def test|actionArray=)"
def test_CompressionSequence0(arraySize=10000):
    actionArray=[("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSequenceRel(arraySize=255,nBits=5):
    actionArray=[("relative",nBits), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSequenceAbs(arraySize=255,delta=0.1):
    actionArray=[("delta",delta), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSample0(arraySize=10000,scale=255):
    actionArray=[("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSampleRel(arraySize=10000,scale=255, nBits=7):
    actionArray=[("relative",nBits), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSampleDelta(arraySize=10000,scale=255, delta=1):
    actionArray=[("delta",delta), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8")]
def test_CompressionSampleDeltaCode(arraySize=10000,scale=255, delta=1):
    actionArray=[("delta",delta), ("code",0), ("zip",0), ("base64",0), ("debase64",0),("unzip","int8"),("decode",0)]
```
Data source compression pseudo-algorithm. To represent the variables we should have two CDSs (one per histogram):
```python
import numpy as np
# sigma0 is defined elsewhere

# qPt, tgl, mdEdx, alpha, dcaR
rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
bins = [50, 20, 20, 12, 400]
H, edges = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)

# qPt, tgl, mdEdx, alpha, dcaZ
rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
bins = [50, 20, 20, 12, 100]
H, edges = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)
```
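A sketch of flattening one such histogramdd result into the map of arrays that feeds a CDS (the reduced bin counts and the dcaCenter column name are assumptions of this sketch):

```python
import numpy as np

sigma0 = 1.0  # assumed here; defined elsewhere in the real test
rangeH = ([-5, 5], [-1, 1], [0, 1], [0, 2 * np.pi], [-10 * sigma0 - 1, 10 * sigma0 + 1])
bins = [5, 4, 4, 3, 10]  # reduced for the sketch; the real map uses e.g. [50, 20, 20, 12, 100]
H, edges = np.histogramdd(sample=np.array([[0, 0, 0, 0, 0]]), bins=bins, range=rangeH)

# One entry per bin: the centers of each axis plus the bin content,
# all columns sharing the same flat bin index
names = ["qPtCenter", "tglCenter", "mdEdxCenter", "alphaCenter", "dcaCenter"]
centers = [0.5 * (e[:-1] + e[1:]) for e in edges]
grid = np.meshgrid(*centers, indexing="ij")
cds = {name: g.ravel() for name, g in zip(names, grid)}
cds["weight"] = H.ravel()
print({k: v.shape for k, v in cds.items()})
```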
The pako package needed to run the decompression is not added to the cell. This is similar to the problem discussed in https://discourse.bokeh.org/t/bokeh-katex-jupyter/4232:
```
Uncaught ReferenceError: pako is not defined
    at CDSCompress.inflateCompressedBokehBase64 (eval at append_javascript (main.min.js:45772), <anonymous>:485:23)
    at CDSCompress.inflateCompressedBokehObjectBase64 (eval at append_javascript (main.min.js:45772), <anonymous>:513:37)
    at CDSCompress.initialize (eval at append_javascript (main.min.js:45772), <anonymous>:469:20)
    at CDSCompress.finalize (cdn.bokeh.org/bokeh/release/bokeh-2.2.3.min.js:180)
```
New functionality for client-side compression is currently under development. See the documentation in https://github.com/miranov25/RootInteractive/blob/master/RootInteractive/InteractiveDrawing/bokeh/doc/READMEcompression.md and this issue for more details.
Issue closed.
Data compression - to reduce the data transfer from server to client:

- Lossy compression of CDS columns
- Lossless compression
- The same compression on the server -> the corresponding decompression on the client; it has to be defined which compression to use (see the sketch below)
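A minimal sketch of that symmetric pair, assuming zlib + base64 on the Python server side; on the client, pako.inflate is the matching decompressor (mirrored here in Python for illustration):

```python
import base64
import zlib
import numpy as np

# Server side: compress a CDS column and base64-encode it for transfer
column = np.linspace(0, 1, 100000).astype("float32")
payload = base64.b64encode(zlib.compress(column.tobytes())).decode("ascii")
print(len(payload), "characters transferred instead of", column.nbytes, "bytes")

# Client side (in BokehJS: pako.inflate on the decoded payload -> Float32Array);
# mirrored here with zlib so the round trip can be checked in Python
restored = np.frombuffer(zlib.decompress(base64.b64decode(payload)), dtype="float32")
print(np.array_equal(restored, column))  # lossless round trip -> True
```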