ratal / mdfreader

Read Measurement Data Format (MDF) versions 3.x and 4.x file formats in python

added NorCom changes to commit: bb480efd517b420be7e0c1b665097165aaad9700 #98

Closed tbo-norcom closed 6 years ago

tbo-norcom commented 6 years ago

Hi Aymeric,

recently we applied some modifications to the source code of the mdfreader, which we would like to discuss with you. We predominantly focus on using the mdfreader for MDF4.1 files and for parsing subsets of available channels only. In this scenario, the original mdfreader consumes a lot of memory, regardless of whether only a single channel or many channels are parsed (see Figure 1 below for an example).

In the following, I will clarify what kind of changes we applied to the original sources:

  1. Function read4 may act as a generator function (mdf4reader.py). We introduced a new function called ‘read4_generator’. In our use case, this function is called instead of the original ‘read4’ function. Its code is almost a copy of the original method; however, instead of adding a channel to a dictionary after creation, we yield the channel. Thus, the generator yields one channel after another. Once a channel is yielded, we clear the data buffer associated with it. Please note that the original ‘read4’ method can still be used, and the generator functionality can be switched on and off via a flag.

  2. Reading data blocks (mdf4reader.py). The original ‘load’ method starts by reading in all data blocks of a data group, regardless of how many channels should be parsed. The whole (decompressed) data is stored in a bytearray, which is later passed to functions that convert the binary data to the respective data types. Furthermore, data is filtered during the conversion if a set of channel names is provided.

We modified the behavior of the ‘load’ method such that it does not need to concatenate the data blocks into a bytearray. Instead, we are processing each data block individually. Once a data block is parsed, we convert the data and filter for relevant channels. Subsequently, the resulting data is stored in a recarray, which is equivalent to the data structure you are using for returning the parsed data. Finally, we go on to the next data block and discard the binary data of the previous block. This procedure is repeated until all data blocks are processed.

The advantage of parsing the data like this is that we never store the binary data of a complete data group in memory. Obviously, when working on relatively small sets of channels we can save a lot of memory in comparison to the original mdfreader implementation (see Figure 1 for an example). Even if all channels are parsed at once, we profit from the fact that only small chunks of binary data are stored in a buffer. Here, our tests showed that we save up to 50% of memory.
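The block-by-block scheme described above can be sketched roughly as follows. This is an illustrative reconstruction with a synthetic record layout, not the actual mdf4reader.py code; the function name `read_channels_chunked` and the record dtype are hypothetical:

```python
import numpy as np

def read_channels_chunked(blocks, record_dtype, channel_names):
    """Parse each raw data block individually, keep only the requested
    channels, and yield (name, samples) pairs one channel at a time."""
    parts = []
    for block in blocks:
        records = np.frombuffer(block, dtype=record_dtype)  # view, no copy yet
        parts.append(records[channel_names])  # keep only requested channels
        # the raw `block` is no longer referenced after this iteration,
        # so only the filtered columns stay in memory
    data = np.concatenate(parts)  # structured array, like the recarray above
    for name in channel_names:
        yield name, data[name]

# synthetic data: 4 records of (time, speed, temp), split into 2 raw blocks
dtype = np.dtype([('time', '<f8'), ('speed', '<f4'), ('temp', '<f4')])
raw = np.zeros(4, dtype=dtype)
raw['time'] = [0.0, 0.1, 0.2, 0.3]
raw['speed'] = [1.0, 2.0, 3.0, 4.0]
blocks = [raw[:2].tobytes(), raw[2:].tobytes()]
channels = dict(read_channels_chunked(blocks, dtype, ['time', 'speed']))
```

The key point is that the full binary payload of a data group never has to sit in memory at once: each block is decoded, filtered, and then released.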

We originally applied our changes to version 0.2.2 of the mdfreader. I also integrated our changes into a recent commit (bb480efd517b420be7e0c1b665097165aaad9700). However, I realized that you have been working actively on the project during the last weeks and changed a lot of code. Therefore, this pull request is unfortunately not based on the latest mdfreader version.

It turned out that the version I was working on had some issues with some of our MDF4 files. Therefore, I applied some hotfixes, which are still somewhere in the code. However, it seems that this version (maybe also due to my fixes) is not stable and breaks for some of our test data.

It would be great if you could incorporate our ideas in the current release version. What do you think?

Best regards, Thomas

[Screenshot, 2017-10-25 16:06]

Figure 1: Runtime (X-axis) and memory consumption (Y-axis) for parsing a set of channels (10 channels, 4843078 data points per channel) with the original mdfreader (black curve) and with the NorCom modification (blue curve)

ratal commented 6 years ago

Hi Thomas, Thanks for sharing these improvements. I am still analysing your code, but a few clarifications:

  1. What if, instead of having this generator, we use noDataLoading=True? It does not load any data, only metadata/blocks. Then you could use getChannel() or getChannelData() to yield the channel; dataRead will be used behind the scenes, with bitarray as a fallback. For the current code, it will actually load data into the dictionary, but I could add a flag to avoid this behaviour. This seems simple to implement, and I am afraid a near-copy of the read function would be difficult to maintain.
  2. This seems like a good idea and appears efficient; I will start implementing it.
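The lazy-loading idea behind noDataLoading=True can be illustrated with a generic sketch. The class below is not mdfreader's internal implementation; `LazyReader`, its index layout, and the synthetic packed buffer are all hypothetical, only the pattern (parse metadata up front, decode samples on first access) mirrors what the thread describes:

```python
import struct

class LazyReader:
    def __init__(self, raw, index):
        # `index` maps channel name -> (byte offset, sample count) in `raw`;
        # no sample data is decoded here, mirroring noDataLoading=True
        self._raw = raw
        self._index = index
        self._cache = {}

    def get_channel_data(self, name):
        # decode a channel only when it is first requested, then cache it
        if name not in self._cache:
            offset, count = self._index[name]
            fmt = '<%dd' % count  # little-endian float64 samples
            self._cache[name] = struct.unpack_from(fmt, self._raw, offset)
        return self._cache[name]

# synthetic 'file': two float64 channels packed back to back
raw = struct.pack('<6d', 0.0, 1.0, 2.0, 10.0, 20.0, 30.0)
index = {'time': (0, 3), 'speed': (24, 3)}
reader = LazyReader(raw, index)
```

Constructing the reader costs almost nothing; memory is only spent on the channels actually requested, which matches the "parse a small subset of channels" use case from the original report.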
tbo-norcom commented 6 years ago

Hi Aymeric, did you find time to implement our suggestions? :)

Best, Thomas

ratal commented 6 years ago

Hi Thomas,

  1. If you agree with the first item of my previous comment, it is almost done.
  2. I first implemented reading by chunk in mdf3; it is part of the 2.7.1 version. So far it is working OK; I will also implement the same approach in version 4.x.
tbo-norcom commented 6 years ago

OK, sounds great. Thank you!

ratal commented 6 years ago

Hi Thomas, The last commit should give you a first implementation of the modifications you proposed:

  1. Yielding channels using .getChannel is possible with the noDataLoading=True argument.
  2. Reading by chunk is done in several places (reading all data, only a few channels, with data list, compressed blocks, etc.). The chunk size is tunable with the variable chunk_size_reading, arbitrarily set to 100 MB currently, but I guess that, depending on the computer, there is a trade-off between reading speed and memory consumption. After a few checks and solving other issues, I will release version 2.7.3 including all your proposals. Aymeric
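The tunable-chunk-size idea can be sketched with a minimal, self-contained example. The helper `read_in_chunks` and the tiny 4-byte chunk size are illustrative only (the thread's chunk_size_reading defaults to roughly 100 MB); the point is that at most one chunk of raw data is buffered at a time:

```python
import io

CHUNK_SIZE = 4  # bytes here for demonstration; ~100 MB in the real setting

def read_in_chunks(stream, total, chunk_size=CHUNK_SIZE):
    """Yield successive chunks of `stream` so that at most `chunk_size`
    bytes of raw data are held in memory at any time."""
    remaining = total
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # stream exhausted early
            break
        remaining -= len(chunk)
        yield chunk

stream = io.BytesIO(b'abcdefghij')
chunks = list(read_in_chunks(stream, 10))
```

A larger chunk size means fewer read calls (faster) but a bigger peak buffer; a smaller one saves memory at the cost of more iterations, which is exactly the trade-off mentioned above.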
tbo-norcom commented 6 years ago

Thank you Aymeric, I will give it a try in the next days!

ratal commented 6 years ago

Hi Thomas, I pushed version 2.7.4, which includes additional improvements for your use case compared to 2.7.3; please use that one instead. Aymeric

tbo-norcom commented 6 years ago

Hi Aymeric,

we have already integrated the new version in our workflow and it seems to work nicely! However, I am still testing the code and evaluating its performance. I implemented another small change to the new code, which I will communicate within the next days.

Best, Thomas

ratal commented 6 years ago

Hi Thomas, So it is working for you? Do not hesitate to open another pull request for your small modifications. If it is not too big or conflicting with recent changes, I could merge it. Otherwise, I will close this pull request.

tbo-norcom commented 6 years ago

Hi Aymeric, sorry for my late response. I will try to commit our changes by tomorrow.

Best, Thomas

ratal commented 6 years ago

Hi Thomas, Did you make progress on your commit? Regards, Aymeric