ratal / mdfreader

Read Measurement Data Format (MDF) versions 3.x and 4.x file formats in python

improvement: memory usage for MDF4 files #72

Closed danielhrisca closed 6 years ago

danielhrisca commented 7 years ago

With the test file the memory usage goes to 2.8GB. I think that there is a memory leak worth investigating.

ratal commented 7 years ago

I tried with my version and memory is limited to 700 MB, and to 500 MB when using the convertAfterRead=False argument. Again, this is with Python 3.5.3 and numpy 1.11.2. I am on the numpy discussion list, and for the moment there seem to be several memory issues with Python 3.6 that should be fixed in numpy 1.13.1. Which numpy version are you using?

danielhrisca commented 7 years ago

I got the high memory usage when saving the mf4 file to disk. Memory usage for file opening was around 700MB like you said. (Using Python 3.6.1 x64, Windows 7 x64, numpy 1.13, mdfreader 0.2.5).

ratal commented 7 years ago

This could be normal. Data stored in an mdf4 file can be compressed and use much less memory because it uses specific data types like uint8, which are then converted into float, for instance (based on the CCBlock), and the converted data takes much more memory when written back to an mdf4 file. This conversion is avoided with the argument convertAfterRead=False during reading, but the writing does not use the original data type, only the converted type. However, there could be a pointer issue in the writing function that inflates the file. I will try to reproduce your issue while writing.
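
To illustrate the effect described above, here is a minimal sketch of how converting raw uint8 samples to float inflates memory. It is not mdfreader code; the `factor` and `offset` values are made up for the example and stand in for a linear conversion rule.

```python
import numpy as np

# One million raw samples stored as 1-byte unsigned integers.
raw = (np.arange(1_000_000) % 256).astype(np.uint8)

# Hypothetical linear conversion to physical values (coefficients are invented).
factor, offset = 0.1, 0.5
physical = raw.astype(np.float64) * factor + offset

print(raw.dtype, raw.nbytes)            # 1 byte per sample
print(physical.dtype, physical.nbytes)  # 8 bytes per sample
```

Writing the converted array instead of the raw one therefore needs eight times as many bytes for this channel.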

ratal commented 6 years ago

I tried on my dev platform (Debian) and I barely consumed 0.2 GB during writing.

danielhrisca commented 6 years ago

I use this benchmark for evaluation. You can double check on your machine

https://github.com/danielhrisca/asammdf/tree/master/benchmarks

danielhrisca commented 6 years ago

Hi,

using the two test files (mdf version 3 and 4) I have:

| Save file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 26894 | 2002 |
| mdfreader 0.2.6 mdfv4 | 25403 | 2715 |

ratal commented 6 years ago

Hi Daniel, So far on Linux:

danielhrisca commented 6 years ago

The file is ok. I don't know what you have on your Dev PC but if you install mdfreader from pypi or GitHub the results are as I have shown (tested on Linux, Windows, python 2.7 and python 3.6). PS: you have the proper test file

ratal commented 6 years ago

Hi, Just tried on Win10 64 bit anaconda 4.3.1 (python 3.6.0) and winPython 3.6.1 (virtual machine in same linux machine)

RuntimeWarning: invalid value encountered in multiply return vect * P2 + P1

My command is essentially the same as your benchmark (no timer): yop = mdfreader.mdf('error.mdf'); yop.write(). I do not get it.

danielhrisca commented 6 years ago

Hi Aymeric,

* mdf3 reading is indeed about 3.5 s
* mdf4 reading is slow
* both mdf3 and mdf4 writes are slow and consume a lot of RAM in my tests

ratal commented 6 years ago

OK, I got confused by the issues. I will check the RAM consumption during writing.
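
One way to check peak memory for a single call is the standard-library `tracemalloc` module. This is only a sketch and not the method used by the benchmark linked in this thread; `peak_memory_mb` and `build_buffer` are hypothetical helpers, with `build_buffer` standing in for a write routine.

```python
import tracemalloc


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak MB traced by tracemalloc during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6


def build_buffer(n_bytes):
    # Hypothetical stand-in for a routine that builds its output in memory.
    return bytearray(n_bytes)


buf, peak_mb = peak_memory_mb(build_buffer, 10_000_000)
print(f"peak traced memory: {peak_mb:.1f} MB")
```

The same wrapper could be put around a real write call. Note that `tracemalloc` only sees allocations made through Python's allocators, so its numbers can differ from the process RAM reported by the operating system.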

ratal commented 6 years ago

It seems that the use of pack() is the problem. I will have to investigate an alternative such as .tobytes() from numpy.

ratal commented 6 years ago

Hi Daniel, I found an alternative to pack() using records fromarrays() and tobytes(), which gives a big speed-up and much lower memory consumption. However, I will have to test it in more detail. mdf3 could still be sped up further; that is next.
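
The difference between the two approaches can be sketched as follows. This is an illustration, not the actual mdfreader writer: the two channels, their names and the record layout are assumptions, and `np.rec.fromarrays` is used as the public alias of the numpy records function.

```python
import struct

import numpy as np

n = 1000
t = np.arange(n, dtype=np.float64)          # hypothetical time channel
v = (np.arange(n) % 256).astype(np.uint8)   # hypothetical 1-byte signal

# Per-record packing: one struct.pack() call and one small bytes object per sample.
packed = b"".join(
    struct.pack("<dB", ti, vi) for ti, vi in zip(t.tolist(), v.tolist())
)

# Vectorized alternative: build a record array once, then dump its buffer.
rec = np.rec.fromarrays([t, v], dtype=[("t", "<f8"), ("v", "u1")])
dumped = rec.tobytes()

print(len(packed), len(dumped), packed == dumped)
```

Both paths produce the same interleaved records, but the second one avoids creating a Python object per sample, which is where the time and memory savings would come from.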

danielhrisca commented 6 years ago

Hello Aymeric,

why is there such a high RAM usage for mdf version 4 with noDataLoading=True ?

Benchmark environment

Notations used in the results

Files used for benchmark:
* 183 groups
* 36424 channels

| Open file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 3698 | 542 |
| mdfreader 0.2.6 compression mdfv3 | 5041 | 262 |
| mdfreader 0.2.6 noDataLoading mdfv3 | 1933 | 193 |
| mdfreader 0.2.6 mdfv4 | 42596 | 1315 |
| mdfreader 0.2.6 compression mdfv4 | 46789 | 1027 |
| mdfreader 0.2.6 noDataLoading mdfv4 | 5001 | 948 |

ratal commented 6 years ago

Hi Daniel, I found a lazily written piece of code handling text channels and their encoding. I improved the code a bit, and the reading time for mdf4 should be drastically reduced. However, RAM usage with noDataLoading is indeed still too high. Work in progress, as you may have noticed.

ratal commented 6 years ago

Hi Daniel, by the way, did you try to benchmark with a 'big' file (> 1 GB) and far fewer channels (< 1000 channels)?

danielhrisca commented 6 years ago

| Open file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 264 | 567 |
| mdfreader 0.2.6 compression mdfv3 | 838 | 531 |
| mdfreader 0.2.6 compression bcolz 6 mdfv3 | 1625 | 543 |
| mdfreader 0.2.6 noDataLoading mdfv3 | 4 | 92 |
| mdfreader 0.2.6 mdfv4 | 273 | 586 |
| mdfreader 0.2.6 compression mdfv4 | 844 | 610 |
| mdfreader 0.2.6 compression bcolz 6 mdfv4 | 1635 | 613 |
| mdfreader 0.2.6 noDataLoading mdfv4 | 7 | 94 |

| Save file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 2117 | 1011 |
| mdfreader 0.2.6 compression mdfv3 | 2188 | 938 |
| mdfreader 0.2.6 compression bcolz 6 mdfv3 | 2663 | 937 |
| mdfreader 0.2.6 mdfv4 | 1967 | 1011 |
| mdfreader 0.2.6 compression mdfv4 | 2115 | 939 |
| mdfreader 0.2.6 compression bcolz 6 mdfv4 | 2381 | 937 |

| Get all channels | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 0 | 566 |
| mdfreader 0.2.6 compression mdfv3 | 331 | 531 |
| mdfreader 0.2.6 compression bcolz 6 mdfv3 | 524 | 543 |
| mdfreader 0.2.6 mdfv4 | 0 | 586 |
| mdfreader 0.2.6 nodata mdfv4 | 272 | 551 |
| mdfreader 0.2.6 compression mdfv4 | 328 | 610 |
| mdfreader 0.2.6 compression bcolz 6 mdfv4 | 520 | 613 |

danielhrisca commented 6 years ago

Results with commit https://github.com/ratal/mdfreader/commit/36dfe4aae917eb9d232e639bf603f30dfec5d7fa

| Open file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 3744 | 542 |
| mdfreader 0.2.6 compression mdfv3 | 5163 | 263 |
| mdfreader 0.2.6 compression bcolz 6 mdfv3 | 5288 | 1035 |
| mdfreader 0.2.6 noDataLoading mdfv3 | 2047 | 193 |
| mdfreader 0.2.6 mdfv4 | 7337 | 1315 |
| mdfreader 0.2.6 compression mdfv4 | 8517 | 1027 |
| mdfreader 0.2.6 compression bcolz 6 mdfv4 | 9082 | 1750 |
| mdfreader 0.2.6 noDataLoading mdfv4 | 5348 | 948 |

| Save file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 7273 | 574 |
| mdfreader 0.2.6 noDataLoading mdfv3 | 9414 | 574 |
| mdfreader 0.2.6 compression mdfv3 | 7629 | 536 |
| mdfreader 0.2.6 compression bcolz 6 mdfv3 | 7231 | 1035 |
| mdfreader 0.2.6 mdfv4 | 4293 | 1336 |
| mdfreader 0.2.6 noDataLoading mdfv4 | 6205 | 1336 |
| mdfreader 0.2.6 compression mdfv4 | 4911 | 1292 |
| mdfreader 0.2.6 compression bcolz 6 mdfv4 | 4776 | 1767 |

| Get all channels (36424 calls) | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.6 mdfv3 | 93 | 542 |
| mdfreader 0.2.6 nodata mdfv3 | 118503 | 414 |
| mdfreader 0.2.6 compression mdfv3 | 718 | 266 |
| mdfreader 0.2.6 compression bcolz 6 mdfv3 | 345 | 1036 |
| mdfreader 0.2.6 mdfv4 | 96 | 1314 |
| mdfreader 0.2.6 nodata mdfv4 | 172578 | 1185 |
| mdfreader 0.2.6 compression mdfv4 | 731 | 1035 |
| mdfreader 0.2.6 compression bcolz 6 mdfv4 | 455 | 1758 |

ratal commented 6 years ago

In the last commit I reduced memory use generally to almost the size of the original file data. However, bcolz seems disappointing, maybe because of too much overhead for each channel; blosc is much better for this use case.

danielhrisca commented 6 years ago

Hello Aymeric,

good work, the memory usage has been improved a lot since 0.2.5.

Regarding bcolz, it is indeed not suitable by default; it would probably work better with a transposition of the data block records. For myself, I've already dropped all compression options since they performed worse than not loading the raw record data.
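
Why a transposed layout can compress better may be sketched with the standard library alone. This uses `zlib` as a stand-in (bcolz and blosc may behave differently), and the two channels are invented: one constant status value and one noisy signal.

```python
import random
import struct
import zlib

n = 10_000
rng = random.Random(0)
status = [7] * n                                  # constant channel
noise = [rng.getrandbits(32) for _ in range(n)]   # noisy channel

# Record-oriented layout: samples of both channels interleaved, as in a data block.
records = b"".join(struct.pack("<II", s, x) for s, x in zip(status, noise))

# Transposed layout: each channel stored contiguously.
columns = struct.pack(f"<{n}I", *status) + struct.pack(f"<{n}I", *noise)

rec_size = len(zlib.compress(records))
col_size = len(zlib.compress(columns))
print(f"interleaved: {rec_size} bytes, transposed: {col_size} bytes")
```

With the channels separated, the compressor sees one long, highly repetitive run instead of short repeats broken up by noise every few bytes.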

Results

| Open file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.7 mdfv3 | 4319 | 458 |
| mdfreader 0.2.7 compress mdfv3 | 5997 | 195 |
| mdfreader 0.2.7 compress bcolz 6 mdfv3 | 6117 | 947 |
| mdfreader 0.2.7 noDataLoading mdfv3 | 1711 | 187 |
| mdfreader 0.2.7 mdfv4 | 5705 | 467 |
| mdfreader 0.2.7 compress mdfv4 | 7174 | 183 |
| mdfreader 0.2.7 compress bcolz 6 mdfv4 | 7331 | 907 |
| mdfreader 0.2.7 noDataLoading mdfv4 | 4172 | 261 |

| Save file | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.7 mdfv3 | 8704 | 481 |
| mdfreader 0.2.7 compress mdfv3 | 8672 | 451 |
| mdfreader 0.2.7 compress bcolz 6 mdfv3 | 8398 | 949 |
| mdfreader 0.2.7 mdfv4 | 6669 | 489 |
| mdfreader 0.2.7 compress mdfv4 | 8216 | 446 |
| mdfreader 0.2.7 compress bcolz 6 mdfv4 | 6642 | 922 |

| Get all channels (36424 calls) | Time [ms] | RAM [MB] |
|---|---|---|
| mdfreader 0.2.7 mdfv3 | 68 | 458 |
| mdfreader 0.2.7 compress mdfv3 | 645 | 196 |
| mdfreader 0.2.7 compress bcolz 6 mdfv3 | 272 | 949 |
| mdfreader 0.2.7 mdfv4 | 67 | 467 |
| mdfreader 0.2.7 compress mdfv4 | 670 | 189 |
| mdfreader 0.2.7 compress bcolz 6 mdfv4 | 295 | 914 |

danielhrisca commented 6 years ago

I guess it's up to you if you want to close this issue.

ratal commented 6 years ago

OK, thanks. I will review the compression status later.