Refactoring: Saving data

NP199 commented 1 month ago

I wanted to save a big chunck of data and the programm crashed. The measurment was like 3 hours long, which worked fine, but when i tried to save it the program chrashed.

I think it will be important to give the User somekind of progressbar or feedback that the data will be saved.

My first thought on this was to save the data chunckwise and delte the already written chunck from the data vector until its empty. The file stream can append each chunck to the file. Since the vector shrinks in size it could also be an indicator for an progressbar or feedback

AKMaily commented 2 weeks ago

I guess a solution would be to implement two features. The first one is to create a temp folder in the OmniView directory where the data chunks are saved temporarily while the measurement is taken. If the user wants to save the data, the saved file in the temp folder will be moved to the saves folder or the path the user selected. If the user does not want to save the file or if the user finished saving it in another folder the temp folder will be deleted. The second feature should be a progress bar that the user can see while the file from the one folder is moved to the folder he selected. @NP199

R-Abbasi commented 1 week ago

I think it will be important to give the User somekind of progressbar or feedback that the data will be saved.

The progress bar is a good idea but it can't be that helpful regarding the crash, I guess.

Took a measurement for about 10 minutes and saved it, but Windows couldn't open the created .csv file. Saw about 60 million lines/values when opened with Notepad. Most of the values are repetitive (sort of a recurrent pattern). Not sure if it's required to keep them intact.
Values, as I've explored, are positive and don't exceed 255. Hence, we may be able to use std::vector<std::pair<std::uint16_t, std::int16_t>> samples; like std::vector<std::pair<std::uint16_t, std::uint8_t>> samples; to occupy less memory.
As well as, not sure if the number of lines is entirely necessary. If not, std::vector<std::uint8_t> samples; seems much better.

My first thought on this was to save the data chunckwise and delte the already written chunck from the data vector until its empty.

It will make the issue worse, I assume.

AKMaily commented 1 week ago

Took a measurement for about 10 minutes and saved it, but Windows couldn't open the created .csv file. Saw about 60 million lines/values when opened with Notepad. Most of the values are repetitive (sort of a recurrent pattern). Not sure if it's required to keep them intact.

I guess it could be a good idea to give the user the opportunity to set the sampling rate that will be saved. In this case we could reduce the saved values at least for the workshop cases.

Values, as I've explored, are positive and don't exceed 255. Hence, we may be able to use std::vector<std::pair<std::uint16_t, std::int16_t>> samples; like std::vector<std::pair<std::uint16_t, std::uint8_t>> samples; to occupy less memory.

The values depend on what voltage is measured. As the software should be used for different devices later on we should not set a certain size for the saved values.

NP199 commented 1 week ago

Values, as I've explored, are positive and don't exceed 255. Hence, we may be able to use std::vector<std::pair<std::uint16_t, std::int16_t>> samples; like std::vector<std::pair<std::uint16_t, std::uint8_t>> samples; to occupy less memory.

The values can be in an 14 bit range so 16 bit should be fine.

It will make the issue worse, I assume.

The data capturing and displaying does work, but the saving part crashes and one problem could be that we try to copy a big vector into another one. So i want to avoid copying the whole vector and saving the data as a stream into the file. When we use an range_view or something like that onto the data, and delete the already saved data, the vecotr whill slowly decrease in size until all data is appended to the file. And since it is only a view on a subrange the copying is either not necessary anymore or will work because it is only a small chunck of the data.

The progressbar could be implemented ontop of that procedure. As an example it could work like this: Evaluate the chunks that need to be stored away. I assume the chunks have an fixed size. Than just count until every chunk is stored an then update the progressbar until every chunck is written to the file.

Any thoughts on this

NP199 commented 1 week ago

I guess it could be a good idea to give the user the opportunity to set the sampling rate that will be saved. In this case we could reduce the saved values at least for the workshop cases.

The user does not specify the sample rate. The sample rate is specified from the device which is connected to the software.

The values depend on what voltage is measured. As the software should be used for different devices later on we should not set a certain size for the saved values.

You cannot specify no size either. Right now the possible value which is send by the device is an 14 bit value, so the next best thing is an 16 bit value and in the future the device could have an embedded information whioch stores the possible y range and sample rate in itself like the omnaiscope right now

R-Abbasi commented 1 week ago

one problem could be that we try to copy a big vector into another one. So i want to avoid copying the whole vector and saving the data as a stream into the file.

That's a good point. The initial implementation of the save function copies the data into a std::string that could likely be fixed:

std::vector<std::pair<double,double>> samples{
    {0, 121},  {1, 107},  {2, 81},   {3, 79},   {4, 103},  {5, 121},
    {6, 107},  {7, 80},   {8, 79},   {9, 104},  {10, 121}, {11, 107},
    {12, 81},  {13, 79},  {14, 105}, {15, 121}, {16, 106}, {17, 81},
    {18, 79},  {19, 104}, {20, 121}, {21, 107}, {22, 80},  {23, 79},
    {24, 104}, {25, 121}, {26, 107}, {27, 80},  {28, 78},  {29, 104},
    {30, 121}, {31, 106}, {32, 80},  {33, 79},  {34, 104}, {35, 121},
    {36, 107}, {37, 80},  {38, 79},  {39, 104}, {40, 121}, {41, 106},
    {42, 81},  {43, 79},  {44, 104}, {45, 121}, {46, 106}, {47, 80},
    {48, 79},  {49, 104}, {50, 121}, {51, 106}, {52, 80},  {53, 79},
    {54, 105}, {55, 121}, {56, 105}, {57, 80},  {58, 79},  {59, 104},
    {60, 121}, {61, 106}, {62, 79},  {63, 80},  {64, 105}, {65, 121},
    {66, 106}, {67, 80},  {68, 79},  {69, 105}, {70, 121}, {71, 105},
    {72, 80},  {73, 79},  {74, 105}, {75, 121}, {76, 106}, {77, 80},
    {78, 80},  {79, 105}, {80, 121}, {81, 106}, {82, 80},  {83, 79},
    {84, 105}, {85, 121}, {86, 105}, {87, 79},  {88, 80},  {89, 105},
    {90, 121}, {91, 106}, {92, 79},  {93, 79},  {94, 106}, {95, 121},
    {96, 105}, {97, 80},  {98, 80},  {99, 105}};

const std::string outFile{"test.csv"};
std::fstream file{outFile};
file.open(outFile, std::ios::out | std::ios::app);

if (!file.is_open()) {
  file.clear();
  std::print("Could not create {} for writing!\n", outFile);
} else {
  std::print("Start saving {}.\n", outFile);
  for (const auto &e : samples)
    file << e.first << ',' << e.second << '\n';
  file.flush();
  file.close();
  std::print("Finished saving.\n");
}

When we use an range_view or something like that onto the data, and delete the already saved data,

Deletion at the beginning of std::vector (the first elements) presumably moves the rest of the elements back to the beginning. That's one of the costly parts that could make the issue worse, I suppose.

std::deque might have better performance in this regard. I might be wrong but the idea still doesn't sound so much good to me.

NP199 commented 1 week ago

Deletion at the beginning of std::vector (the first elements) presumably moves the rest of the elements back to the beginning. That's one of the costly parts that could make the issue worse, I suppose. std::deque might have better performance in this regard. I might be wrong but the idea still doesn't sound so much good to me.

You can just take a subrange of the vector until all data is copied, so nothing is deleted or added or copied. And we would hva a chance to update about the progress.

Your would be fine too. Just add an count to the range loop, and update an progressbar which max is calucalted before with the size of an the vector or something like that

AKMaily commented 1 week ago

I guess it could be a good idea to give the user the opportunity to set the sampling rate that will be saved. In this case we could reduce the saved values at least for the workshop cases.

The user does not specify the sample rate. The sample rate is specified from the device which is connected to the software.

Yes, but in regards to saving the data in a file it could be a possibility to set a "sample rate" where only every 5th or 10th value is saved to reduce the amount of data that is saved in the files. The sampling rate of the device would still be the same, but the .csv file would be smaller

The values depend on what voltage is measured. As the software should be used for different devices later on we should not set a certain size for the saved values.

You cannot specify no size either. Right now the possible value which is send by the device is an 14 bit value, so the next best thing is an 16 bit value and in the future the device could have an embedded information whioch stores the possible y range and sample rate in itself like the omnaiscope right now

That sounds like a good idea

R-Abbasi commented 1 week ago

To write the elements you need to access them. Accessing by itself seems quite simple and quick by default as std::vector makes use of contiguous memory. Writing the data into a file, on the other hand, has much more stuff to consider in regard with the current issue, I suppose.

Wrote this and took a measurement for 10 minutes:

static void save(const Omniscope::Id &device,
                 const std::vector<std::pair<double, double>> &values,
                 fs::path const &outFile, std::string allData) {

 // this section can be removed later on 
  std::vector<double> y_values(values.size()); // memory allocation only once 
  for (size_t i = 0; i < y_values.size(); i++)
    if (values[i].second)
      y_values[i] = values[i].second;

  std::string fileContent;
  const std::string::size_type new_cap{values.size()};
  fileContent.reserve(new_cap); // memory allocation only once 
  fileContent += allData;
  fileContent += fmt::format("\n{}-{}\n", device.type, device.serial);

  std::ofstream file(outFile, std::ios::binary);
  if (!file.is_open()) {
    file.clear();
    fmt::println("Could not create {} for writing!", outFile);
    return;
  }
  fmt::println("Start saving {}.", outFile);
  file.write(fileContent.data(), // write the strings and device ID each set on a line
             static_cast<std::streamsize>(fileContent.size()) * sizeof(char));
  file.write(reinterpret_cast<const char *>(y_values.data()), // write y-values
             static_cast<std::streamsize>(y_values.size()) * sizeof(double));
  file.flush();
  file.close();
  fmt::println("Finished saving.");
}

It took only 5 seconds to write the data into the file on my machine while it used to be about 124 seconds for the same measuring time.

NP199 commented 1 week ago

Yes this would be fast but this isnt not readable anymore or am i using it wrong?

NP199 commented 1 week ago

The cast from double to string does take a lot of time and if we can avoid it, that would reduce the time a lot. Maybe a .wav file format would be better?

R-Abbasi commented 1 week ago

Yes this would be fast

You'll be experiencing 3 to 4 times faster speed with this version in which the y_values vector is remove, I assume:

 file.write(fileContent.data(),
            static_cast<std::streamsize>(fileContent.size()) * sizeof(char));
 for (const auto &v : values)
   file.write(reinterpret_cast<const char *>(&v.second), sizeof(double));

but this isnt not readable anymore or am i using it wrong?

We aren't expected to read it ourselves I guess; we read it through code, e.g.,:

std::ifstream file(outFile, std::ios::binary | std::ios::ate);
if (!file.is_open()) {
  file.clear();
  fmt::println("Could not create {} for reading!", outFile);
  return;
}
auto size = std::filesystem::file_size(outFile);
file.seekg(0);
std::string line1, line2; // the first two lines of the file
std::getline(file, line1);
std::getline(file, line2);
size -= (static_cast<std::streamoff>(line1.size()) + line2.size());
auto count = size / sizeof(double);  // retrieve the number of y values 
std::vector<double> y_values(count);
file.read(reinterpret_cast<char *>(y_values.data()), size);

R-Abbasi commented 1 week ago

The performance can supposedly be even faster than this by having captureData reformed to std::map<Omniscope::Id, std::vector<double>> captureData; The container now stores a map of Omniscope::Ids to a vector of std::pairs with the key being a (unsigned) number starting from zero increasing chronologically. This fixed pattern seemingly repeats for every device making it a good choice to be resolved. Noted and suggested that earlier to @AKMaily and we're now dealing only with the y values in the FFT external function (API).

The cast from double to string

I don't see a conversion from double to std::string. Where is it, please?

NP199 commented 1 week ago

I don't see a conversion from double to std::string. Where is it, please?

Not in your new version, that was comment about the old code. We can't change the data format into some binary stream until we give the user the possibility to load the data and have a look at it. Right now the user has to use some third party software and wen we also change the data format into something they can`t interpret, that wouldn't be a good idea.

In the near future i agree with you style, but don`t think there is a performance problem if we throw away the yvalues. If we want to give the user the possibility to just change a fraction of the plot, then they need to know which one they saved.

R-Abbasi commented 1 week ago

More time in the old code would go to write data into the file using the formatted operator<<, I believe and there's no file observation happening now. The file is chosen in the file explorer and taken into the process.
There're merely numbers there, not sure what's going to be achieved by seeing them. Even if there should be a chance to look at the file of which I wasn't being informed, I suggest that we keep the current performance and work on a solution for the user side.

y values will be kept; x values are to be removed. However, they can be set locally easily where they're required.

R-Abbasi commented 4 days ago

Reached the following solutions:

It's doable to make use of the binary format and its good performance but to convert the file to a human-readable version we may need to load, read, and then write it back to its initial form using stringifing which will bring us to the first place where formatted saving got much time. There're of course some tools to handle the binary file and work it out like ldd, Hexdump, od, xxd but none composed an appropriate output as I tested. Console printing gave better result:

There's a C++23 feature std::basic_string<CharT,Traits,Allocator>::resize_and_overwrite that can save us from the binary and is yet fast enough:

static void save(const Omniscope::Id &device,
             const std::vector<std::pair<double, double>> &values,
             fs::path const &outFile, std::string allData) {
allData += fmt::format("\n{}-{}\n", device.type, device.serial);
std::string fileContent;

fileContent.resize_and_overwrite(
// dedicate five bytes for each y_value, three for the number itself 
// and two spaces as a separator between the numbers
  values.size() * 5, [&values](char *begin, std::size_t) {
    auto end = begin;
    for (const auto &pair : values) {
      end = std::to_chars(end, end + 3, pair.second).ptr;
      *end++ = ' ';
      *end++ = ' ';
    }
    return end - begin;
  });

// create a .csv file to write to it
std::ofstream file(outFile, std::ios::app);
if (!file.is_open()) {
file.clear();
fmt::print("Could not create {} for writing!\n", outFile.string());
return;
}

fmt::print("Start saving {}.\n", outFile.string());
file << allData << fileContent;
file.flush();
file.close();
}

It only took 6 seconds for a 10-minute measurement with about 60'000'000 y_values.

Both solutions result in a fast and correct output but the latter is simpler, terser and more preferable, to me.

skunkforce / OmniView

Refactoring: Saving data #128