Open milankl opened 1 year ago
Hi @milankl
Here are some timing results. Not quite as fast as the Julia version, but also not dreadful.
```python
import numpy as np
import time

from ceda_icompress.BitManipulation.bitshave import BitShave
from ceda_icompress.InfoMeasures.bitinformation import bitinformation


def millis_now():
    return int(round(time.time() * 1000))


def bitshave_test():
    g = np.random.default_rng()
    A = g.random([100, 100, 100, 100], dtype=np.float32)
    # start timing
    start = millis_now()
    b = BitShave(A, 7)
    B = b.process(A)
    end = millis_now()
    print(f"BitShave time: {end - start}ms")


def bitinfo_test():
    g = np.random.default_rng()
    A = np.ma.array(g.random([100, 100, 100, 100], dtype=np.float32))
    start = millis_now()
    bi = bitinformation(A)
    end = millis_now()
    print(f"BitInfo time: {(end - start)/1000}s")


if __name__ == "__main__":
    bitshave_test()
    bitinfo_test()
```
Results:

```
BitShave time: 35ms
BitInfo time: 20.732s
```

If I don't convert the exponent:

```
BitShave time: 38ms
BitInfo time: 11.423s
```
Thanks @nmassey001 !!! @observingClouds could you, at some point, compare this to xbitinfo's performance?
Thanks @nmassey001 for posting these numbers and reaching out to us. I hope I find time soon to provide these numbers as well.
A belated update to this: I've finally got around to implementing the bitpaircount
function using Dask, so it can run in parallel.
Using 3 threads, the above timings are approximately halved.
Using more than 3 threads increases the memory footprint beyond the physical RAM available on my MacBook Pro (16 GB). This machine (M1 Pro) has 6 "proper" (performance) cores, so 6 threads should be possible, but the memory requirements blow up.
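The Dask implementation isn't shown in this thread, but the chunked pattern behind it can be sketched with the standard library's thread pool: each chunk produces its own small counter and only the counters are summed, so memory should grow by one counter per thread, not one array copy per thread. (`chunk_bit_counts` below is a hypothetical stand-in for the real per-chunk bitpaircount, not the ceda-icompress code.)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def chunk_bit_counts(chunk):
    """Per-chunk counter: how often each of the 32 bits is set.

    Hypothetical stand-in for the real per-chunk bitpaircount kernel.
    """
    bits = chunk.view(np.uint32)
    return np.array([int(((bits >> i) & 1).sum()) for i in range(32)],
                    dtype=np.int64)


def parallel_bit_counts(a, n_chunks=4, n_threads=2):
    # views, not copies: array_split on a 1-D array uses basic slicing
    chunks = np.array_split(a, n_chunks)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(chunk_bit_counts, chunks)
    # reduce the small per-chunk counters, never the data itself
    return sum(partials)
```

Dask's `map_blocks` followed by a sum reduction expresses the same idea, with the scheduler controlling how many chunks are resident at once.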
I believe this memory issue is something general we have to work on. Technically the algorithm should be almost allocation-free, as mentioned above: only a small counter array has to be allocated. Python, however, seems to do something else. I don't know enough about Python to easily identify where it allocates and why, but the problem seems to become prohibitive for larger datasets. @nmassey001, could you measure your memory allocations too? Maybe that would help @observingClouds understand why we also seem to copy the array, which we really shouldn't.
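One quick way to measure allocations from pure Python is the standard library's `tracemalloc` module (recent NumPy versions report their buffer allocations to it, so array copies show up in the peak). A minimal sketch, using `np.sort` only as an example of a function known to allocate a copy:

```python
import tracemalloc

import numpy as np


def peak_allocations(fn, *args):
    """Run fn(*args) and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


A = np.random.default_rng(0).random(1_000_000, dtype=np.float32)
# np.sort allocates a copy of A, so the peak should be around 4 MB
_, peak = peak_allocations(np.sort, A)
print(f"peak: {peak / 2**20:.1f} MiB")
```

Wrapping `bitinformation(A)` the same way would show whether the input array is being duplicated somewhere inside the call.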
@nmassey001 just reached out to point towards ceda-icompress, another Python implementation of the bitinformation algorithm and bit rounding.
The package is xarray-free, and I'm curious about the performance differences, as we've been suffering from allocations:
```julia
julia> @btime round!(A,7)
  25.276 ms (0 allocations: 0 bytes)
```
The bitinformation algorithm here is essentially allocation-free, as only the counter array has to be allocated while counting all 00, 01, 10, 11 bit-pair combinations in the data set. It reaches about 70 MB/s, which at that stage is on a similar order of magnitude as lossless codecs at higher compression levels.
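To make "only the counter array" concrete: for each of the 32 bit positions of a float32 array you tally the 2×2 histogram of that bit in adjacent elements. A simplified NumPy sketch (not the BitInformation.jl or ceda-icompress code, and `bincount` still allocates small temporaries per bit):

```python
import numpy as np


def bitpair_counts(a):
    """Count the 00/01/10/11 combinations of each bit between adjacent
    float32 elements; only the (32, 2, 2) counter array is kept."""
    bits = np.ascontiguousarray(a, dtype=np.float32).view(np.uint32).ravel()
    x, y = bits[:-1], bits[1:]
    counts = np.zeros((32, 2, 2), dtype=np.int64)
    for i in range(32):
        # encode the bit pair as 0..3 = 00, 01, 10, 11
        idx = ((x >> i) & 1) * 2 + ((y >> i) & 1)
        counts[i] = np.bincount(idx, minlength=4).reshape(2, 2)
    return counts
```

From these counts the mutual information per bit position follows; fully random bits give a near-uniform 2×2 table, while "meaningful" bits show strong correlation between neighbours.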
Neil, would you mind throwing in a similar quick benchmark of ceda-icompress?