ihedvall / mdflib

Implementation of the ASAM MDF data file.
https://ihedvall.github.io/mdflib/
MIT License

Reading big files costs too much memory #67

Closed. songjmcn closed this issue 5 months ago

songjmcn commented 6 months ago

Hello developer: I find that when using this library to read big MDF files (such as 10 GB or 20 GB files), the library runs too slowly and costs too much memory. I reviewed the code and found that it reads all the data and stores it in memory. I think this is not good. I suggest adding an interface that reads one frame of data at a time and only stores one frame in memory; this will cost less memory. I am now developing the code and will push it later. Thanks.

songjmcn commented 6 months ago

I have already pushed the code to the fast_read branch.

ihedvall commented 6 months ago

The one-frame data interface already exists as the ISampleObserver class. You have two types of observers. The channel observer converts all sample values for one channel into a linear array in primary memory. It is intended to be used when you plot a handful of channels or operate against some type of Apache Arrow storage.
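
A minimal channel observer subscription looks roughly like the sketch below (the file path and channel names are placeholders taken from the unit tests, and the header path may differ in your setup):

#include <iostream>
#include <mdf/mdfreader.h>  // header path may differ

using namespace mdf;

int main() {
  MdfReader reader("big_file.mf4");  // placeholder file name
  if (!reader.IsOk() || !reader.ReadEverythingButData()) {
    return 1;  // block structure only, no sample data yet
  }
  const auto* header = reader.GetHeader();
  auto* data_group = header->LastDataGroup();
  const auto* group = data_group->GetChannelGroup("CAN_DataFrame");
  const auto* channel = group->GetChannel("CAN_DataFrame.BusChannel");  // placeholder channel
  if (group == nullptr || channel == nullptr) {
    return 1;
  }

  // The channel observer keeps every sample value of this channel in memory.
  auto observer = CreateChannelObserver(*data_group, *group, *channel);
  reader.ReadData(*data_group);  // fills the observer, memory grows with the DG size

  for (size_t sample = 0; sample < observer->NofSamples(); ++sample) {
    double value = 0.0;
    if (observer->GetChannelValue(sample, value)) {
      std::cout << "Sample " << sample << ": " << value << "\n";
    }
  }
  reader.Close();
  return 0;
}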

The sample observer has a callback for each sample (record ID) and, unlike the channel observer, it only has access to the last value. When using the sample observer, the user is responsible for storing or handling the values. There is one problem with the sample observer and VLSD CG blocks: the variable-length data is stored in one CG block while the index into that data is stored in a channel of another block, and the user only has the current record (frame) available. Typical bus loggers have this problem.

I suspect that you are using the channel observers to read a 10 GB data (DG) block and subscribing to all channels, which requires at least 10 GB of memory. I think switching from the IChannelObserver interface to the ISampleObserver interface should solve your memory problem, or at least it is then your problem.

If you have the time, please do three tests for me, as I don't have any large file to test with.

  1. First call ReadEverythingButData(), followed by a call to ReadData(), without any observers. Measure time and memory consumption. This defines the base time and memory consumption to compare with (see the sketch after this list).
  2. Do as in 1 but add a sample observer, optionally converting all channel values.
  3. Do as in 1 but add channel observers instead.
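
For test 1, something like the sketch below would do (std::chrono for the timing, the file path is a placeholder; memory consumption can be taken from the task manager):

#include <chrono>
#include <iostream>
#include <mdf/mdfreader.h>  // header path may differ

using namespace mdf;
using Clock = std::chrono::steady_clock;

int main() {
  const auto start = Clock::now();

  MdfReader reader("large_file.mf4");  // placeholder path
  reader.ReadEverythingButData();      // block structure only
  const auto info_done = Clock::now();

  auto* header = reader.GetHeader();
  auto* data_group = header->LastDataGroup();
  reader.ReadData(*data_group);        // no observers attached: baseline
  const auto data_done = Clock::now();
  reader.Close();

  using ms = std::chrono::duration<double, std::milli>;
  std::cout << "Read info (ms): " << ms(info_done - start).count() << "\n"
            << "Read data (ms): " << ms(data_done - info_done).count() << "\n";
  return 0;
}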

You can also send me the file, but I assume that might generate some other problems.

ihedvall commented 6 months ago

Code snippet from unit test: TEST_F(TestWrite, Mdf4SampleObserver)

  MdfReader reader(mdf_file.string());
  ASSERT_TRUE(reader.IsOk());
  // Read the block structure (header, DG/CG/CN blocks) but no sample data.
  ASSERT_TRUE(reader.ReadEverythingButData());
  const auto* header1 = reader.GetHeader();
  auto* last_dg1 = header1->LastDataGroup();
  ASSERT_TRUE(last_dg1 != nullptr);

  const auto* channel_group1 = last_dg1->GetChannelGroup("CAN_DataFrame");
  ASSERT_TRUE(channel_group1 != nullptr);

  const auto* channel1 = channel_group1->GetChannel("CAN_DataFrame.DataBytes");
  ASSERT_TRUE(channel1 != nullptr);

  // The sample observer only sees one record (frame) at a time.
  ISampleObserver sample_observer(*last_dg1);
  sample_observer.DoOnSample = [&](uint64_t sample1, uint64_t record_id,
                                   const std::vector<uint8_t>& record) {
    bool valid = true;
    std::string values;
    if (channel1->RecordId() == record_id) {
      valid = sample_observer.GetEngValue(*channel1, sample1, record, values);
      std::cout << "Sample: " << sample1
                << ", Record: " << record_id
                << ", Values: " << values << std::endl;
    }
    EXPECT_TRUE(valid);
  };
  reader.ReadData(*last_dg1);
  sample_observer.DetachObserver();
  reader.Close();

songjmcn commented 6 months ago

OK, I will test your suggestion.

ihedvall commented 6 months ago

Your fast reader observer is OK, but I think that if you inherit from ISampleObserver and override the OnSample() function, it would simplify the fast reader interface software. Note that there is a problem with SD data blocks in case they are large. They are currently read into memory before reading the "DT" block, as the sample frames have indexes into the SD block. So if you want to save memory, the SD block shouldn't be read into primary memory; it should instead be read through the file pointer. Slower, but it doesn't consume any memory.
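
Roughly like the sketch below; note that the exact OnSample() signature and constructor arguments are assumptions here and should be checked against the ISampleObserver header:

#include <cstdint>
#include <string>
#include <vector>
#include <mdf/ichannel.h>        // header paths may differ
#include <mdf/idatagroup.h>
#include <mdf/isampleobserver.h>

// Sketch only: OnSample() is assumed to be virtual with the same parameters
// as the DoOnSample callback in the unit test snippet above.
class FastReadObserver : public mdf::ISampleObserver {
 public:
  FastReadObserver(mdf::IDataGroup& data_group, const mdf::IChannel& channel)
      : mdf::ISampleObserver(data_group), channel_(&channel) {}

  void OnSample(uint64_t sample, uint64_t record_id,
                const std::vector<uint8_t>& record) override {
    if (channel_->RecordId() != record_id) {
      return;  // record belongs to another channel group
    }
    std::string value;
    if (GetEngValue(*channel_, sample, record, value)) {
      // Handle or store this single value; nothing else is kept in memory.
    }
  }

 private:
  const mdf::IChannel* channel_ = nullptr;
};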

songjmcn commented 6 months ago

I tested your suggested code, but I found that the read data is wrong. So I will send you the data so you can check the bug. Thanks.

ihedvall commented 6 months ago

Please let me know how and when you transfer the MDF file.

ihedvall commented 6 months ago

I have not understood the goal of the fast reader. Can you explain what you mean by a frame? MDF only has blocks; the meaning of a frame is undefined.

songjmcn commented 6 months ago

I uploaded the data here: https://1drv.ms/u/s!Ao_zFvyF7O33ga5bMtGvlNvvHwbang?e=VtJ2Y0 Please download the data.

songjmcn commented 6 months ago

To explain what I mean by a frame: I use the library as follows. When recording the data, I use VECTOR hardware and CANAPE. CANAPE records all signals in every loop. When I read the data back, I want to read all signals (in a channel group) in every read loop.

ihedvall commented 6 months ago

Thanks for the file.

I understand why you inherit from the IChannelObserver. It takes some time to download the file from OneDrive. I will not publish the file on GitHub. I will analyze the file and do some testing later today. I store all my files on an internal NAS drive (k:/test/mdf). Some of the unit tests (read tests) depend on these files. It might be a good idea to be able to set this path from CMake. I will investigate if that is possible.

I will return later today with preliminary results; that might be tomorrow for you.

ihedvall commented 6 months ago

The file consists of one measurement (DG) and 3 channel groups (CG). The first channel group has 3 channels (time, frame count, and frames). The frame channel has its frame bytes stored in the second CG group. This is a so-called VLSD group.

I measured the read-everything-but-data time, which depends on where the file is stored: SSD disk (<1 s), local hard drive (15 s), and NAS (30 s). In real life, it's the NAS that is relevant. Reading the time and frame count channels doesn't consume much memory, but reading all the frame bytes of the frame channel consumes 27 GB of memory, which is the same as the file size.

The DG data bytes use 2 DL/DT lists. When the DT block is split into several blocks, the reader creates a temporary file and copies all the data bytes to that temporary file. This temporary file then simulates a single DT block. This takes some time but doesn't consume any memory.

I have not tested your solution at this time, but how do you avoid reading all the frame bytes into memory? I would add a property to the frame channel observer to ignore VLSD data and just read in the offsets, and add a function that copies the VLSD data block to a file or something similar. A file channel observer for the frame channel data?

ihedvall commented 6 months ago

I have added a unit test to the fast_read branch. I suspect that the call to ReadOneData() is missing which sample to read, so I guess that I read the first sample, but this is not important in this test.

The test measures the time to read the header, read the file info (everything but no data), the time to create the temp file, and lastly the time to read the data from the temp file. Reading the header and info takes less than 1 second, so they are of minor importance for the time being. Note that no frame data is read in the tests below (Windows 11).

The temp file creation takes around 2 minutes while reading all the data takes 30 seconds.

The conclusion is that the temp file creation takes too long and needs to be fixed. The best solution is to skip the temp file entirely. A little bit tricky to code, but solvable. Maybe it should also be simpler to define which sample to read (might be a forgotten check-in).

BENCHMARK Large SSD Storage ReadData()
Open File (ms): 26.8382
Read Info (ms): 79.8333
First Sample (s): 125.24
Last Sample (s): 26.4075

BENCHMARK Large SSD Storage ReadOneData()
Open File (ms): 0.7968
Read Info (ms): 58.0225
Not a DT block, linkd data block list into temporary
File total size:28466571996
First Sample (s): 165.199
Last Sample (s): 0.0482707
ihedvall commented 6 months ago

It's possible to play with the read/write buffer size to come down to a First Sample time of 60 s. There is another request noting that some hardware/compiler combinations do not support temp files, so the best solution is to make a ReadCache object that returns record buffers and takes a DataBlockList object as input. I will do that on the main branch.

If the above is successful, I will do a channel observer where it is possible to select a low and high range of samples.

ihedvall commented 6 months ago

I have done the ReadCache class and added a performance test. The read speed has been improved, definitely when reading the file without subscribing to the VLSD (frame) data. Note that the NAS is connected to a 1 Gbit/s network, so the minimum read time for the entire file will be 280 s.

I have some cleanup to do and then I have to fix a ReadOneSampleData() function.

BENCHMARK Large (Local SSD Drive) Storage ReadData(without VLSD data)
Open File (ms): 1.4001
Read Info (ms): 79.133
Read Data (s): 0.0078827

BENCHMARK Large (Local SSD Drive) Storage ReadData(with VLSD data)
Open File (ms): 0.2668
Read Info (ms): 10.7877
Read Data (s): 19.5854

BENCHMARK Large (Local SSD Drive) Storage ReadData(VLSD offset data)
Open File (ms): 0.2783
Read Info (ms): 15.7814
Read Data (s): 0.0057762

BENCHMARK Large (Local Mechanical Drive) Storage ReadData(without VLSD data)
Open File (ms): 99.5428
Read Info (ms): 5130.84
Read Data (s): 0.0460032

BENCHMARK Large (Local Mechanical Drive) Storage ReadData(with VLSD data)
Open File (ms): 0.2383
Read Info (ms): 12.5052
Read Data (s): 244.95

BENCHMARK Large (Local Mechanical Drive) Storage ReadData(VLSD offset data)
Open File (ms): 0.2743
Read Info (ms): 14.0795
Read Data (s): 0.0051807

BENCHMARK Large (NAS) Storage ReadData(without VLSD data)
Open File (ms): 5.3059
Read Info (ms): 1439.42
Read Data (s): 0.807407

BENCHMARK Large (NAS) Storage ReadData(with VLSD data)
Open File (ms): 5.1699
Read Info (ms): 839.768
Read Data (s): 339.223

BENCHMARK Large (NAS) Storage ReadData(VLSD offset data)
Open File (ms): 8.1906
Read Info (ms): 1453.88
Read Data (s): 0.74899
songjmcn commented 5 months ago

When I get the channel signal ID, all channel IDs are the same. As below:

I20240108 19:42:59.260455 7607659 mdf_test.cpp:23] ReadEverythingButData:1
I20240108 19:42:59.261612 7607659 mdf_test.cpp:31] group Index:2416
I20240108 19:42:59.261631 7607659 mdf_test.cpp:38] Channel name:on Rawdata(VC0) Received
I20240108 19:42:59.261644 7607659 mdf_test.cpp:41] Sub channel name:t Id:1
I20240108 19:42:59.261706 7607659 mdf_test.cpp:41] Sub channel name:FrameCounter_VC0 Id:1
I20240108 19:42:59.261716 7607659 mdf_test.cpp:41] Sub channel name:VideoRawdata_VC0 Id:1

Because of the wrong ID, I cannot read the right value.

ihedvall commented 5 months ago

By Id, I assume you mean the IChannel::Index() function. Index() returns the file position of the block, but there might be an issue with so-called composite channels that aren't stored in the file. It looks OK in the MDF File Viewer. Can you give me more info?

Maybe a copy of mdf_test.cpp? Are you using mdflib.lib or mdflibrary.dll? Might it be an OS problem, Linux?

ihedvall commented 5 months ago

Maybe we should check if it is a simpler error, such as in the formatting of the output text. In "Id: 1", the 1 might be treated as a boolean, not as an int64?
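
For example, forcing an explicit integer conversion in the logging line (mdf_test.cpp:41 in your output) would rule that out:

// Cast the record ID so the stream cannot pick a bool/char overload.
LOG(INFO) << "    Sub channel name:" << cn4->Name()
          << " Id:" << static_cast<int64_t>(cn4->RecordId());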

songjmcn commented 5 months ago

I used the RecordId() function. The code is as follows:

for (const auto *cg4: cg_list) {
            LOG(INFO) << "  Channel name:" << cg4->Name();
            const auto cn_list = cg4->Channels();
            for (const auto *cn4: cn_list) {
                LOG(INFO) << "    Sub channel name:" << cn4->Name() << " Id:" << cn4->RecordId();
                // Create a subscriber and add it to the temporary list
                if (cn4->Name() == "FrameCounter_VC0") {
                    frame_counter_observer = CreateChannelObserver(*dg4, *cg4, *cn4);

                }
                else if (cn4->Name() == "VideoRawdata_VC0") {
                    frame_data_observer = CreateChannelObserver(*dg4, *cg4, *cn4);
                }
            }
        }

I found that all the channels return the same ID.

songjmcn commented 5 months ago

I used the new library, but I cannot read the byte array from the MF4 file. The code is as follows:

for (auto *dg4: dg_list) {
        LOG(INFO) << "group Index:" << dg4->Index();
        // Subscribers holds the sample data for a channel.
        // You should normally only subscribe on some channels.
        // We need a list to hold them.
        ChannelObserverList subscriber_list;
        auto cg_list = dg4->ChannelGroups();
        ChannelObserverPtr frame_counter_observer = nullptr;
        ChannelObserverPtr frame_data_observer = nullptr;
        for (const auto *cg4: cg_list) {
            LOG(INFO) << "  Channel name:" << cg4->Name();
            const auto cn_list = cg4->Channels();
            for (const auto *cn4: cn_list) {
                LOG(INFO) << "    Sub channel name:" << cn4->Name() << " Id:" << cn4->RecordId();
                // Create a subscriber and add it to the temporary list
                if (cn4->Name() == "FrameCounter_VC0") {
                    frame_counter_observer = CreateChannelObserver(*dg4, *cg4, *cn4);

                }
                else if (cn4->Name() == "VideoRawdata_VC0") {
                    frame_data_observer = CreateChannelObserver(*dg4, *cg4, *cn4);
                }
            }
        }

        // Now it is time to read in all samples
        float time_start = (float) clock();
        reader.ReadData(*dg4); // Read raw data from file
        float time_end = (float) clock();
        LOG(INFO) << "Read channel group cost time:" << ((time_end - time_start) / CLOCKS_PER_SEC) * 1000 << "ms";
        double channel_value = 0.0; // Channel value (no scaling)
        unsigned int frame_counter = 0;
        //std::string frame_data;

        for (size_t sample = 0; sample < frame_counter_observer->NofSamples(); ++sample) {

            //const auto eng_valid = obs->GetEngValue(sample, eng_value);
            auto data_valid = frame_counter_observer->GetChannelValue(sample, frame_counter);
            LOG(INFO) << "read valid:" << data_valid << ", frame counter : " << frame_counter;
            std::vector<uint8_t> frame_data;
            data_valid = frame_data_observer->GetChannelValue(sample, frame_data);
            LOG(INFO) << "read valid:" << data_valid << ", frame data size:" << frame_data.size();
            if (data_valid) {
                /*
                del_sci_data2image((uint8_t *) frame_data.data(), IMAGE_WIDTH * IMAGE_PIXEL_LEN,
                                   IMAGE_HEIGHT * IMAGE_WIDTH * IMAGE_PIXEL_LEN, uyvy_image.ptr());
                cv::Mat bgr_img;
                cv::cvtColor(uyvy_image, bgr_img, cv::COLOR_YUV2BGR_UYVY);
                cv::imshow("img", bgr_img);
                cv::waitKey(0);
                 */

            }

            // You should do something with data here

        }
ihedvall commented 5 months ago

If you want a unique ID for the channel, you should use the function Index(), not the function RecordId(). The RecordId() function returns the channel's parent group record ID. The Index() function exists on all MDF blocks and is identical to the block's file position.
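
Applied to the loop in your snippet, that is roughly:

// Index() is unique per block (the block's file position).
// RecordId() only identifies the parent channel group's records.
for (const auto* cn4 : cn_list) {
  LOG(INFO) << "    Sub channel name:" << cn4->Name()
            << " Index:" << cn4->Index()
            << " RecordId:" << cn4->RecordId();
}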

ihedvall commented 5 months ago

I detected that no raw data (byte array) was returned for VLSD CG data. This is a bug and I have corrected it, but I have not checked in the changes to Git. I suppose the read of the file will run fast (some seconds). If you want to read the entire 28 GB file from a network drive (NAS), the transfer time, assuming a 1 Gbit/s network, is 280 seconds.

I detected the error while I was trying to shorten the read time. This can be done by skipping records that don't need to be read. Skipping means that you step the file pointer forward instead of reading bytes.
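
In plain std::ifstream terms the difference is roughly the sketch below (not the actual mdflib reader code):

#include <cstdint>
#include <fstream>
#include <vector>

// Skipping a record only moves the read position; no bytes are transferred.
void SkipRecord(std::ifstream& file, std::streamsize record_size) {
  file.seekg(record_size, std::ios::cur);
}

// Reading a record transfers record_size bytes into the buffer.
void ReadRecord(std::ifstream& file, std::streamsize record_size,
                std::vector<uint8_t>& buffer) {
  buffer.resize(static_cast<size_t>(record_size));
  file.read(reinterpret_cast<char*>(buffer.data()), record_size);
}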

I am currently working on the following solution. Assume that the file is large and most of the raw data is variable-length signal data (a VLSD channel).

  1. The first step is to read all channel data (ReadData) but mark the VLSD channel to only store the offsets into the raw (signal) data. You set the VLSD channel observer to not read raw data.
  2. The next step is to read the channel's VLSD data (ReadVlsdData) by using the offset list from the previous read. You may shorten the read time by optimizing the offset list. The ReadVlsdData() function needs the VLSD channel, a list of offsets to read, and a function pointer (typically a lambda function) that returns the raw byte array.

The lambda (callback) function would typically include the "if (valid)" OpenCV stuff in your code. Note that sample numbers have a 1:1 relation to offsets, but there may be fewer offsets than samples.

I am halfway through the above solution, but I think I will be ready today with the above function. Will the above solution solve your memory problem?

ihedvall commented 5 months ago

I am ready with the MdfReader::ReadVlsdData() function. Example below. The first ReadData() call reads the sample-to-offset relation. Note that the IChannelObserver->ReadVlsdData(false) call stops the read of VLSD data.

The second call, ReadVlsdData(), fetches the frame data. You set the number of frames to fetch, and they are returned through the supplied callback function.

Reading one frame should take less than 2 seconds.

auto frame_observer = CreateChannelObserver(*last_dg, *channel_group, *frame_channel);
frame_observer->ReadVlsdData(false);  // Only store the offsets, not the VLSD bytes

oRead.ReadData(*last_dg);

// Sample -> offset relation
const auto& offset_list = frame_observer->GetOffsetList();

// Create a filter list, i.e. which offsets to read. In this example the first 100 samples.
auto filter_list = offset_list;
while (filter_list.size() > 100) {
  filter_list.pop_back();
}

std::function callback = [&](uint64_t offset, const std::vector<uint8_t>& buffer) {
  /* Do something with the raw data */
};

oRead.ReadVlsdData(*last_dg, *frame_channel, filter_list, callback);

songjmcn commented 5 months ago

I tested the new library. The VLSD data reading bug is fixed. The speed is much faster. I think the new code can be used.