MemoryError when resampling file opened with compression='blosc'

ratal / mdfreader

Read Measurement Data Format (MDF) versions 3.x and 4.x file formats in python

MemoryError when resampling file opened with compression='blosc' #104

Closed danielhrisca closed 6 years ago

danielhrisca commented 6 years ago

from mdfreader import mdf

filename = 'test.mdf'
x1 = mdf(filename, compression='blosc')
x1.resample(0.01)

gives

Traceback (most recent call last):
  File "E:\WinPython-64bit-3.6.1.0Qt5\notebooks\untitled6.py", line 5, in <module>
    x1.resample(0.01)
  File "E:\WinPython-64bit-3.6.1.0Qt5\python-3.6.1.amd64\lib\site-packages\mdfreader\mdfreader.py", line 586, in resample
    masterData = arange(min(minTime), max(maxTime), samplingTime)
MemoryError

print(min(minTime), max(maxTime), samplingTime)
>>> 0.012625 1.83746864836e+13 0.01
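
To see why these values exhaust memory, the size of the requested array can be estimated without allocating it. This is a minimal sketch in plain Python, assuming float64 samples (8 bytes each); the helper name arange_footprint is hypothetical and not part of mdfreader. It relies only on the documented length of numpy's arange, ceil((stop - start) / step).

```python
import math

def arange_footprint(start, stop, step, itemsize=8):
    """Return (element count, bytes) that arange(start, stop, step) would need.

    itemsize=8 assumes float64 samples; this is an assumption, not read
    from the file.
    """
    count = max(0, math.ceil((stop - start) / step))
    return count, count * itemsize

# Values printed in the report above: min time, max time, sampling time.
count, nbytes = arange_footprint(0.012625, 1.83746864836e+13, 0.01)
print(f"{count:.3e} samples, {nbytes / 1e15:.1f} PB")
```

With a maximum time around 1.8e13 s and a 0.01 s step, the request is on the order of 1e15 samples, far beyond any machine's RAM, so the MemoryError points to the bogus maximum rather than to resample itself.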

ratal commented 6 years ago

Hi Daniel, thanks for pointing this out. Blosc compression seems to be misbehaving: the data are not the same after decompression, which gives a maximum of about 1E13 for the time channel, so arange tries to allocate far too much memory. Bcolz has poor performance, or even worse, with vectors, and now blosc looks unreliable; like you, I am tempted to give up on compression... Maybe plain zlib would be more reliable.

danielhrisca commented 6 years ago

Blosc is advertised as lossless, so maybe there is some other reason for the odd value.

ratal commented 6 years ago

Yes, maybe, but it looks like either a numpy or a blosc bug. When I use self.data = compress(a.tobytes()) for compression and fromstring(decompress(self.data), dtype=self.dtype) for decompression, I get the correct data back. Compressing through the array pointer is advised for speed, but it seems a bit risky: either numpy does not expose the correct pointer in __array_interface__, or there is a bug in blosc. Anyway, I made a quick patch in the last commit that seems to be working.
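
The bytes-copy round trip described above can be checked in isolation. The sketch below is only an illustration: it swaps in the standard-library zlib and array modules as stand-ins for blosc and numpy, and the function names compress_values and decompress_values are hypothetical, not mdfreader's API.

```python
import zlib
from array import array

def compress_values(values):
    """Compress from a copy of the array's bytes (no raw pointer involved)."""
    return zlib.compress(values.tobytes())

def decompress_values(blob, typecode="d"):
    """Rebuild an array of the given typecode from the decompressed bytes."""
    out = array(typecode)
    out.frombytes(zlib.decompress(blob))
    return out

# A made-up time channel sampled every 0.01 s, starting at 0.012625 s.
time_channel = array("d", (0.012625 + 0.01 * i for i in range(1000)))
restored = decompress_values(compress_values(time_channel))

# A lossless codec must return exactly the same samples.
assert restored == time_channel
```

Running the same comparison on a real channel, right after loading, would show whether a corrupted value such as 1.8e13 comes from the compression step or from somewhere else.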

danielhrisca commented 6 years ago

Trying to merge the two test files on 32 bit Python raises a MemoryError.

ratal commented 6 years ago

Hi Daniel, can you give a bit more detail about the script? I tried the following but could not reproduce the MemoryError:

x1=mdf('test.mf4', compression='blosc')
x1.resample(0.01)
x2=mdf('tests.mdf',compression='blosc')
x2.resample(0.01)
x1.mergeMDF(x2)

Maybe you are simply asking too much of your machine (I have 16 GB of RAM and 64-bit Python 3.5.3).

danielhrisca commented 6 years ago

I mentioned that the error occurs on 32 bit Python, not 64 bit Python.

ratal commented 6 years ago

I did read that, but I do not have a 32-bit OS to investigate easily; I would have to set up a virtual machine and so on... I will try to check how much memory it is allocating. A bit of googling suggests that numpy on 32-bit could be limited to 3.2 GB or, worse, 2 GB, depending on how it was compiled.
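
A quick way to tell which interpreter is running, and whether a given allocation could plausibly fit, is sketched below. The 50% usable fraction is an assumption chosen to match the 2 GB worst case mentioned above; the real limit depends on the OS and on how Python was built. Both function names are hypothetical.

```python
import struct

def pointer_bits():
    """Return 32 or 64 depending on the running interpreter's pointer size."""
    return struct.calcsize("P") * 8

def fits_in_address_space(nbytes, bits=None, usable_fraction=0.5):
    """Crude guard: is nbytes below an assumed usable share of the address space?

    usable_fraction=0.5 is an assumption (2 GB of a 4 GB 32-bit space).
    """
    if bits is None:
        bits = pointer_bits()
    return nbytes < (2 ** bits) * usable_fraction

print(pointer_bits(), "bit interpreter")
# A 3 GB request: too large under the assumed 32-bit limit, fine on 64-bit.
print(fits_in_address_space(3 * 2**30, bits=32))
print(fits_in_address_space(3 * 2**30, bits=64))
```

The check only reasons about address space, not about how much RAM is actually free, so it can rule an allocation out but cannot guarantee that it will succeed.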

danielhrisca commented 6 years ago

On Windows you could simply use a 32 bit WinPython distribution

danielhrisca commented 6 years ago

It seems the optimizations done since this issue was raised have lowered the RAM usage enough to avoid a MemoryError with the test files.