Open milankl opened 1 year ago
Hi @milankl
Here are some timing results. Not quite as fast as the Julia version, but also not dreadful.
```python
import numpy as np
import time

from ceda_icompress.BitManipulation.bitshave import BitShave
from ceda_icompress.InfoMeasures.bitinformation import bitinformation


def millis_now():
    return int(round(time.time() * 1000))


def bitshave_test():
    g = np.random.default_rng()
    A = g.random([100, 100, 100, 100], dtype=np.float32)
    # start timing
    start = millis_now()
    b = BitShave(A, 7)
    B = b.process(A)
    end = millis_now()
    print(f"BitShave time: {end - start}ms")


def bitinfo_test():
    g = np.random.default_rng()
    A = np.ma.array(g.random([100, 100, 100, 100], dtype=np.float32))
    start = millis_now()
    bi = bitinformation(A)
    end = millis_now()
    print(f"BitInfo time: {(end - start)/1000}s")


if __name__ == "__main__":
    bitshave_test()
    bitinfo_test()
```
Results:

```
BitShave time: 35ms
BitInfo time: 20.732s
```

If I don't convert the exponent:

```
BitShave time: 38ms
BitInfo time: 11.423s
```
Thanks @nmassey001 !!! @observingClouds could you, at some point, compare this to xbitinfo's performance?
Thanks @nmassey001 for posting these numbers and reaching out to us. I hope I find time soon to provide these numbers as well.
A belated update to this: I've finally got around to implementing the bitpaircount
function using Dask, so it can run in parallel.
Using 3 threads, the above timings are approximately halved.
Using more than 3 threads increases the memory footprint beyond the physical RAM available on my MacBook Pro (16 GB). This machine (M1 Pro) has 6 "proper" (performance) cores, so 6 threads should be possible, but the memory requirements blow up.
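The Dask implementation isn't shown in this thread, but the chunked pattern behind it can be sketched with the standard library's thread pool: each chunk produces its own small counter and only the counters are summed, so memory should grow by one counter per thread, not one array copy per thread. (`chunk_bit_counts` below is a hypothetical stand-in for the real per-chunk bitpaircount, not the ceda-icompress code.)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def chunk_bit_counts(chunk):
    """Per-chunk counter: how often each of the 32 bits is set.

    Hypothetical stand-in for the real per-chunk bitpaircount kernel.
    """
    bits = chunk.view(np.uint32)
    return np.array([int(((bits >> i) & 1).sum()) for i in range(32)],
                    dtype=np.int64)


def parallel_bit_counts(a, n_chunks=4, n_threads=2):
    # views, not copies: array_split on a 1-D array uses basic slicing
    chunks = np.array_split(a, n_chunks)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(chunk_bit_counts, chunks)
    # reduce the small per-chunk counters, never the data itself
    return sum(partials)
```

Dask's `map_blocks` followed by a sum reduction expresses the same idea, with the scheduler controlling how many chunks are resident at once.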
I believe this memory issue is something general we have to work on. Technically the algorithm should be almost allocation-free, as mentioned above: only a small counter array has to be allocated. Python, however, seems to do something else. I don't know enough about Python to easily identify where it allocates and why, but the problem seems to become prohibitive for larger datasets. @nmassey001, could you measure your memory allocations too? Maybe that would help @observingClouds understand why we also seem to copy the array, which we really shouldn't.
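One quick way to measure allocations from pure Python is the standard library's `tracemalloc` module (recent NumPy versions report their buffer allocations to it, so array copies show up in the peak). A minimal sketch, using `np.sort` only as an example of a function known to allocate a copy:

```python
import tracemalloc

import numpy as np


def peak_allocations(fn, *args):
    """Run fn(*args) and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


A = np.random.default_rng(0).random(1_000_000, dtype=np.float32)
# np.sort allocates a copy of A, so the peak should be around 4 MB
_, peak = peak_allocations(np.sort, A)
print(f"peak: {peak / 2**20:.1f} MiB")
```

Wrapping `bitinformation(A)` the same way would show whether the input array is being duplicated somewhere inside the call.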
@nmassey001 just reached out to point towards ceda-icompress, another Python implementation of the bitinformation algorithm and bit rounding.
The package is xarray-free, and I'm curious about the performance differences, as we've been suffering from allocations:
```julia
julia> @btime round!(A,7)
  25.276 ms (0 allocations: 0 bytes)
```
The bitinformation algorithm here is essentially allocation-free, as only the counter array has to be allocated while counting all 00, 01, 10, 11 bit-pair combinations in the data set. It reaches about 70 MB/s, which at that stage is on a similar order of magnitude as lossless codecs at higher compression levels.
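To make "only the counter array" concrete: for each of the 32 bit positions of a float32 array you tally the 2×2 histogram of that bit in adjacent elements. A simplified NumPy sketch (not the BitInformation.jl or ceda-icompress code, and `bincount` still allocates small temporaries per bit):

```python
import numpy as np


def bitpair_counts(a):
    """Count the 00/01/10/11 combinations of each bit between adjacent
    float32 elements; only the (32, 2, 2) counter array is kept."""
    bits = np.ascontiguousarray(a, dtype=np.float32).view(np.uint32).ravel()
    x, y = bits[:-1], bits[1:]
    counts = np.zeros((32, 2, 2), dtype=np.int64)
    for i in range(32):
        # encode the bit pair as 0..3 = 00, 01, 10, 11
        idx = ((x >> i) & 1) * 2 + ((y >> i) & 1)
        counts[i] = np.bincount(idx, minlength=4).reshape(2, 2)
    return counts
```

From these counts the mutual information per bit position follows; fully random bits give a near-uniform 2×2 table, while "meaningful" bits show strong correlation between neighbours.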
Neil, would you mind throwing in a similar quick benchmark of ceda-icompress?