Retrieving N lines of records

kozmad commented 11 months ago

Dear Developers,

I'd like to ask for your help: what would be the optimal way to implement the following algorithm using mdflib?

I'd like to iterate through the whole MDF file record by record and return N lines/records as a plain matrix or a dataframe. I see there is a way to iterate through groups/data groups/channels:

for (const auto &channel : observer_list) {
  size_t samples = channel->NofSamples();
  for (size_t sample = 0; sample < samples; ++sample) {
    // ... fetch the value of this channel for this sample ...
  }
}

But it seems a bit slow. Based on my experience with other MDF libraries, reading the MDF file record by record would give better performance, but I cannot see how to implement this with mdflib. My goal is to extract data as record-by-channel matrices and pass them to a Python interface, where I calculate channel/signal statistics. (At first, retrieving data as a double or vector is fine.) I'd like to keep memory consumption under control, so I'd like to keep number_of_channels * number_of_read_records smaller than a predefined constant when I ask for the next data chunk(s). (I know there is currently no Python interface for mdflib; I only mention it to describe the whole picture of my use case.)

Thank you for your help in advance!

Best regards, Daniel

ihedvall commented 11 months ago

When you call the function ReadData(), it reads the file record by record. Before it reads the records (i.e., the samples), it resizes each channel observer's internal memory so that all samples fit. So after the ReadData() call, all samples and channel values are in primary memory. You can twist your for loops so that you first step through each sample and then retrieve the channel values. Note that the channel values are stored in RAM; the scaled (engineering) values are calculated on the fly.
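A minimal sketch of the twisted loops, with the outer loop over samples and the inner loop over channels, returning at most a bounded number of rows per call. MockChannel, ValueAt(), RowsPerChunk() and ReadRows() are hypothetical names used only for illustration and are not part of mdflib; only NofSamples() appears in this thread. The sketch also assumes that all channels passed in belong to the same group and therefore have the same number of samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a channel observer after the whole file has been
// read: it simply holds every sample of one channel in memory.
struct MockChannel {
  std::vector<double> values;
  std::size_t NofSamples() const { return values.size(); }
  // Assumed accessor; the real library's value getter may look different.
  double ValueAt(std::size_t sample) const { return values[sample]; }
};

// Rows are records (samples), columns are channels.
using Matrix = std::vector<std::vector<double>>;

// Largest row count such that nof_channels * rows <= max_cells, but at least 1.
std::size_t RowsPerChunk(std::size_t nof_channels, std::size_t max_cells) {
  assert(nof_channels > 0);
  return std::max<std::size_t>(1, max_cells / nof_channels);
}

// Copy rows [first, first + count) into a record-by-channel matrix.
// Fewer rows are returned if the range runs past the last sample.
Matrix ReadRows(const std::vector<MockChannel>& channels, std::size_t first,
                std::size_t count) {
  assert(!channels.empty());
  const std::size_t total = channels.front().NofSamples();
  const std::size_t last = std::min(total, first + count);
  Matrix rows;
  for (std::size_t sample = first; sample < last; ++sample) {
    std::vector<double> row;
    row.reserve(channels.size());
    for (const auto& channel : channels) {
      row.push_back(channel.ValueAt(sample));
    }
    rows.push_back(std::move(row));
  }
  return rows;
}
```

A caller would compute the chunk height once with RowsPerChunk() and then call ReadRows() with first = 0, first = rows, first = 2 * rows, and so on until an empty matrix comes back. This only bounds the size of the returned chunks; the full data set still sits in the observers' memory.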

The conclusion is that the read speed is good, but at the cost of memory.

The ReadData() function doesn't know anything about the observer. Check the ISampleObserver::OnSample() callback. Your interface is an IChannelObserver object, but the implementation is in the ChannelObserver class. So the simplest solution is to create a new ChannelObserverEx class that holds only the last sample, and to add a callback on the reader class that fires when a sample is added (read). This will solve the memory consumption problem.
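The idea can be sketched without the library as follows. LastSampleObserver and MockReadData() are hypothetical stand-ins, and the OnSample() signature used here is an assumption made for illustration, not the actual mdflib interface. The observer keeps one record only and hands it to a user callback each time a record arrives, so memory use does not grow with the number of samples.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical observer that keeps only the most recent record in memory.
class LastSampleObserver {
 public:
  using Callback =
      std::function<void(std::size_t sample, const std::vector<double>& record)>;

  LastSampleObserver(std::size_t nof_channels, Callback on_sample)
      : record_(nof_channels, 0.0), on_sample_(std::move(on_sample)) {
    assert(on_sample_ != nullptr);
  }

  // Assumed to be called by the reader once per record, in file order.
  void OnSample(std::size_t sample, const std::vector<double>& values) {
    assert(values.size() == record_.size());
    record_ = values;             // overwrite the previous record
    on_sample_(sample, record_);  // let the caller consume it immediately
  }

 private:
  std::vector<double> record_;
  Callback on_sample_;
};

// Hypothetical stand-in for the reader loop: feeds records one at a time.
void MockReadData(const std::vector<std::vector<double>>& records,
                  LastSampleObserver& observer) {
  for (std::size_t sample = 0; sample < records.size(); ++sample) {
    observer.OnSample(sample, records[sample]);
  }
}
```

The callback is where running statistics (sums, minima, maxima) would be updated, or where records would be appended to a bounded chunk that is flushed once it reaches the chosen size.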

A Python library was planned, but there already exists an ASAMMDF library written in pure Python. A C# assembly exists, so a Python library should be possible through a pybind wrapper.

The problem comes when interfacing to Go, Java, Rust... The plan is to use a gRPC interface instead.

Well, there are a number of solutions to your problem. I suggest an MS Teams meeting or a similar application. Note that there is a GitHub project.

kozmad commented 10 months ago

Thank you for the detailed answer. I'm checking it. Regarding interfacing with other languages: I think SWIG is an excellent tool for this kind of purpose; maybe it would be helpful here as well. (I've tested it for interfacing C++ with Python and C# in a similar project, and it seemed very efficient.)

ihedvall commented 10 months ago

I propose to close this issue and add two new requirements to the MDF project: a new OnSample observer interface and Python support (similar to C#).

ihedvall / mdflib

Retrieving N lines of records #41