scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

`dma_write` silently drops `iovec`s exceeding `IOV_MAX` #2023

Open sukhodolin opened 10 months ago

sukhodolin commented 10 months ago

The program at the bottom reproduces the problem: when the number of iovecs passed to dma_write exceeds IOV_MAX, the entries beyond the limit are silently dropped.

The file is expected to contain 8 megabytes of the letter 'A'. On my machine, however, it contains only 4 megabytes of 'A' (1024 iovecs of 4096 bytes each), and the rest is zeroes. To confirm, run:

> head -c 4194304 ./output.file | tail -c -1
A
> head -c 4194305 ./output.file | tail -c -1
<no output>

The reason is that the writev documentation says the number of iovecs must not exceed IOV_MAX (1024 on Linux), while the program clearly passes more than IOV_MAX iovecs to dma_write (which I believe is implemented in terms of writev).
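To illustrate the limit, here is a minimal caller-side sketch that splits a long iovec list into batches of at most IOV_MAX entries. The helper name `split_iovecs` is hypothetical and not part of Seastar, and the fallback value of 1024 is an assumption for platforms where `<climits>` does not expose IOV_MAX. Each batch would still have to be written at its own file offset by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>    // IOV_MAX on Linux
#include <cstddef>
#include <sys/uio.h>  // struct iovec
#include <vector>

#ifndef IOV_MAX
#define IOV_MAX 1024  // assumption: the Linux value, used only as a fallback
#endif

// Hypothetical helper (not a Seastar API): split `iovecs` into consecutive
// batches, each holding at most `max_per_call` entries.
std::vector<std::vector<iovec>> split_iovecs(
    const std::vector<iovec>& iovecs, std::size_t max_per_call = IOV_MAX) {
  assert(max_per_call > 0);
  std::vector<std::vector<iovec>> batches;
  for (std::size_t start = 0; start < iovecs.size(); start += max_per_call) {
    const std::size_t end = std::min(start + max_per_call, iovecs.size());
    // Copy the half-open range [start, end) into a new batch.
    batches.emplace_back(
        iovecs.begin() + static_cast<std::ptrdiff_t>(start),
        iovecs.begin() + static_cast<std::ptrdiff_t>(end));
  }
  return batches;
}
```

With the 2048 iovecs from the reproduction and a limit of 1024, this yields two batches of 1024 entries each.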

So Seastar's behavior is correct in the sense that it matches the documentation, and it is the program's fault for violating the limit on iovec entries. But the problem is very hard to debug, so maybe it's worth going the extra mile and adding a check that throws an exception here (perhaps only in debug builds of Seastar, if performance is a concern)? This would make it much easier for a developer to spot the error.
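The kind of check I have in mind could look roughly like the sketch below. The function name `check_iovec_count` and the choice of `std::invalid_argument` are hypothetical; this is not existing Seastar code, only an illustration of failing loudly instead of writing a prefix of the data.

```cpp
#include <climits>    // IOV_MAX on Linux
#include <cstddef>
#include <stdexcept>
#include <string>
#include <sys/uio.h>  // struct iovec
#include <vector>

#ifndef IOV_MAX
#define IOV_MAX 1024  // assumption: the Linux value, used only as a fallback
#endif

// Hypothetical guard (not a Seastar API): throw if the caller passes more
// iovecs than a single writev-style call accepts.
inline void check_iovec_count(const std::vector<iovec>& iovecs,
                              std::size_t limit = IOV_MAX) {
  if (iovecs.size() > limit) {
    throw std::invalid_argument(
        "dma_write: " + std::to_string(iovecs.size()) +
        " iovecs exceed the limit of " + std::to_string(limit));
  }
}
```

Calling such a guard at the top of the write path would turn the reproduction below into an immediate exception rather than a half-written file.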

The source code for the reproduction of the issue:

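#include <seastar/core/aligned_buffer.hh>
#include <seastar/core/app-template.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/file.hh>
#include <seastar/core/reactor.hh>
#include <seastar/core/seastar.hh>
#include <seastar/util/log.hh>

#include <sys/uio.h>
#include <vector>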

seastar::logger lg("iov-max");

const char *OUTPUT_FILE_NAME = "output.file";

int main(int argc, char **argv) {
  seastar::app_template app;

  return app.run(argc, argv, [&]() -> seastar::future<int> {
    auto output_file = co_await seastar::open_file_dma(
        OUTPUT_FILE_NAME,
        seastar::open_flags::wo | seastar::open_flags::create);

    // We're going to have 2048 blocks of 4096 bytes each.
    constexpr size_t BUFFER_BLOCKS = 2048;
    constexpr size_t BUFFER_BLOCK_SIZE = 4096;
    constexpr size_t BUFFER_SIZE = BUFFER_BLOCKS * BUFFER_BLOCK_SIZE;

    co_await output_file.truncate(BUFFER_SIZE);
    co_await output_file.allocate(0, BUFFER_SIZE);

    auto blocks = seastar::allocate_aligned_buffer<char>(
        BUFFER_SIZE, output_file.memory_dma_alignment());
    for (size_t i = 0; i < BUFFER_SIZE; ++i) {
      blocks[i] = 'A';
    }

    std::vector<iovec> iovecs;
    for (size_t i = 0; i < BUFFER_BLOCKS; ++i) {
      // Each iovec points at one 4096-byte block of the aligned buffer.
      char *current_block = blocks.get() + BUFFER_BLOCK_SIZE * i;
      iovecs.push_back(iovec{current_block, BUFFER_BLOCK_SIZE});
    }

    lg.info("Writing {} iovecs", iovecs.size());

    co_await output_file.dma_write(0, iovecs);
    co_await output_file.flush();

    co_return 0;
  });
}

avikivity commented 10 months ago

Did you check the return value of dma_write()? If it returned the number of bytes actually written, then it didn't silently drop anything.
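As a rough sketch of that check: sum the iovec lengths and compare the total against the byte count the write reported. The helper names `total_iov_bytes` and `is_short_write` are hypothetical, not Seastar APIs.

```cpp
#include <cstddef>
#include <sys/uio.h>  // struct iovec
#include <vector>

// Hypothetical helper: total number of bytes described by the iovecs.
inline std::size_t total_iov_bytes(const std::vector<iovec>& iovecs) {
  std::size_t total = 0;
  for (const iovec& v : iovecs) {
    total += v.iov_len;
  }
  return total;
}

// Hypothetical helper: true if the reported byte count is less than
// what the iovecs asked for, i.e. the write was short.
inline bool is_short_write(std::size_t written,
                           const std::vector<iovec>& iovecs) {
  return written < total_iov_bytes(iovecs);
}
```

In the reproduction this would mean keeping the result, for example `size_t written = co_await output_file.dma_write(0, iovecs);`, and then testing `is_short_write(written, iovecs)`, assuming dma_write resolves to the number of bytes written.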