added NorCom changes to commit: bb480efd517b420be7e0c1b665097165aaad9700

tbo-norcom commented 6 years ago

Hi Aymeric,

recently we applied some modifications to the source code of the mdfreader, which we would like to discuss with you. We predominantly focus on using the mdfreader for MDF4.1 files and for parsing subsets of available channels only. In this scenario, the original mdfreader consumes a lot of memory, regardless of whether only a single channel or many channels are parsed (see Figure 1 below for an example).

In the following, I will clarify what kind of changes we applied to the original sources:

Function read4 may act as a generator function (mdf4reader.py) We introduced a new function, which is called ‘read4_generator’. In our use case, this function is called instead of the original ‘read4’ function. The code of the function is almost a copy of the original method. However, instead of adding a channel to a dictionary after creation, we yield the channel. Thus, the generator yields one channel after another. Once the channel is yielded, we clear the data buffer that is associated with this channel. Please note that the original ‘read4’ method can still be used and the generator functionality can be switched on and off via a flag.
Reading data blocks (mdf4reader.py) The original ‘load’ method starts by reading in all data blocks of a data group, independent of the number of channels that should be parsed. The whole (decompressed) data is stored in a bytearray, which later is subject to functions that convert the binary data to the respective data types. Furthermore, data is filtered during the conversion if a set of channel names is provided.

We modified the behavior of the ‘load’ method such that it does not need to concatenate the data blocks into a bytearray. Instead, we are processing each data block individually. Once a data block is parsed, we convert the data and filter for relevant channels. Subsequently, the resulting data is stored in a recarray, which is equivalent to the data structure you are using for returning the parsed data. Finally, we go on to the next data block and discard the binary data of the previous block. This procedure is repeated until all data blocks are processed.

The advantage of parsing the data like this is that we never store the binary data of a complete data group in memory. Obviously, when working on relatively small sets of channels we can save a lot of memory in comparison to the original mdfreader implementation (see Figure 1 for an example). Even if all channels are parsed at once, we profit from the fact that only small chunks of binary data are stored in a buffer. Here, our tests showed that we save up to 50% of memory.

Actually, we applied our changes to version 0.2.2 of the mdfreader. I also integrated our changes to a recent commit (bb480efd517b420be7e0c1b665097165aaad9700). However, I realized that you were working actively on the project during the last weeks and changed a lot of code. Therefore, this pull request is not based on the latest mdfreader version (unfortunately).

It turned out that the version I was working on had some issues with some of our MDF4 files. Therefore, I applied some hotfixes, which are still somewhere in the code. However, it seems that this version (maybe also due to my fixes) is not stable and breaks for some of our test data.

It would be great if you could incorporate our ideas in the current release version. What do you think?

Best regards, Thomas

Figure 1: Runtime (X-axis) and memory consumption (Y-axis) for parsing a set of channels (10 channels, 4843078 data points per channel) with the original mdfreader (black curve) and with the NorCom modification (blue curve)