async_io_buffer does not appear to be strong-scalable on SeqAn 3.2.0

umeshwww commented 2 years ago

Hi,

I am observing a performance issue with the strong-scalability of async_input_buffer Here are the code snippet and observations of the test case compiled with -O3 flag. SeqAn3 Version: 3.2.0 OneAPI/TBB Version: 2021.5.1 Input dataset (BAM): H-Sapien: K562_cytosol_LID8465_TopHat_v from the compression benchmark repository [1]

struct record_t {
    explicit record_t(std::string id, std::vector<seqan3::dna5> sequence):
    _id{std::move(id)}, _sequence{std::move(sequence)} {}

    std::string _id{};
    std::vector<seqan3::dna5> _sequence{}; 
   // ... more data.
};

    task_arena arena{(int) threads};
    task_group task_group;
    seqan3::contrib::bgzf_thread_count = 4;                     // limit the number of decompression threads.
    const size_t io_buffer = 1;                                 // hardcoded for brevity.
    std::vector<std::vector<record_t>> local_records(threads);  // thread-local data-sets.

    for (const auto &filename: filenames) {
        seqan3::sam_file_input input_view{filename.c_str()};
        // Spawns a background thread that tries to keep <io_buffer> records in the buffer.
        auto bam_view = input_view | seqan3::views::async_input_buffer(io_buffer);
        for (size_t slot = 0; slot < threads; ++slot) {
            auto &records = local_records[slot];
            records.reserve(1000 * 1024);
            auto parse = [&bam_view, &records]() {
                for (const auto &record: bam_view) {
                    auto & sequence = record.sequence();
                    records.emplace_back(record.id(), std::move(sequence));
                }
            };
            task_group.run(parse);
        }
        task_group.wait();
    }

//... merge records ...

Observations

Please advise:

Is the test setup adequate or anything is missing in my configuration?
Is there any public data set available showing the scalability of async_io_buffer?

[1] Comparison of high-throughput sequencing data compression tools, Nature Methods

smehringer commented 2 years ago

Hi @umeshwww,

thanks for reaching out.

I think your benchmark setting isn't ideal for the seqan3::views::async_input_buffer and thus your results are expected.

First, I describe your set up how I understand it: (1) The async_input_buffer prefetches BAM records with a single thread in the background, (2) only prefetching 1 record at a time (io_buffer = 1). (3) You then consume the buffer with multiple threads (for loop), (4) consuming here is the task of moving record.id() and record.sequence().

Now let me explain some issues with that:

Regarding (2), prefetching only a single record at a time isn't really prefetching anything because the IO thread always waits until its one and only record is consumed before it fetches the next one. Quoting our documentation

if the underlying range is an input file over short reads, a buffer size of 100 or 1000 could be beneficial
Regarding (4), your consuming task is really fast. This means that the consuming threads will pop the records real fast from the bam_view. The bam_view will have trouble keeping up with the prefetching and IO will be the bottleneck of your setup.
The above leads to the conclusion that when increasing the threads for (3) it speeds up the consumption of records from the bam_view even more but the bam_view doesn't prefetch records any faster (it just can't!). So what happens is that your threads always ask the bam_view "Gimme more records!!" but the bam_view answers "working on it, no records right now", which will make the threads spin idle, actually decreasing the performance.

If your sole task is to move the content of records to a vector, the async_input_buffer is not what helps you speed up your program. You need an expensive consuming work load to have the IO be "for free" with the async_input_buffer in the background. For example, if your parse lambda would actually do something with the record, e.g. realigning the read sequence or computing some statistics, then the consumer threads are busy and the bam_view can fill your buffer in the meantime. Let's say IO takes 20 minutes and processing all records would take 60 minutes. First reading and then processing all records would take you 20 + 60 = 80 minutes. Moving IO to the background with async_input_buffer can speed this up because the 20 minutes are being done in the background, you only need 60 minutes in total (or maybe 65 since one thread is always busy). Increasing the number of threads can then decrease the 60 minutes of processing further.

I hope this explanation was clear :)

umeshwww commented 2 years ago

Hi @smehringer,

Thank you for a nicely detailed explanation. First, my apology for missing out on some background of this test case. For a minimalist statement of the query, I apparently dropped some crucial information. Briefly, the purpose is to observe read throughput; given that the data is already in memory, and there is negligible compute load per access to the record. In other words, how well does the tool scale in an I/O bound test? I agree with the explanations regarding (1 - 4) above. However, there were two essential points missing in my question above:

The data is placed in the local shared memory device (/dev/shm). So the I/O latency is bound by access to the main memory.
Secondly, following the documentation, I first ran the test case with the I/O buffer 1024, and then with 1M to observe the effect of buffer size. The case of buffer = 1 is included only to cover the scenario described in the documentation as quoted below.

a buffer size of 100 or 1000 could be beneficial; if on the other hand, the file contains genome-sized sequences, it would be better to buffer only a single sequence (buffering 100 sequences would result in the entire file being preloaded and likely consuming significant memory).

From our past observations with other tools, after eliminating the disk I/O, the decompression step prevails the factors influencing read throughput. So, with I/O buffer 1024, one may expect the speedup to be driven by the number of decompression threads [4 in this case]. However, the results invalidate this assumption. I have included those observations in the figure below.

So, this unfolds two more questions:

(a) Is the guideline of 100 - 1000 short reads, applicable to in-memory data sets, or is the prefetch strategy suitable only for disk/remote I/O?

(b) The term "genome-sized" sequence being application dependent, is there such a guideline for general long read data where the read-length may have large variance? e.g [500 - 30K] Better yet, it would be nice if we interpret buffer size in terms of bytes rather than the number of reads so that prefetched read count may vary adaptively.

SGSSGene commented 2 years ago

Hi @umeshwww

I have not recreated your scenario, but we have observed similar behavior in other circumstances. I have seen that another library based on SeqAn3 (https://github.com/biocpp/biocpp-io) for BCF files doesn't properly scale over multiple threads, probably relying on the same internals as our BAM implementation, sharing the same bottleneck. Especially, the BAM implementation in SeqAn3 is slow and not competitive.

Overall, the I/O in SeqAn3 doesn't have the quality/performance that we are striving for. This is also one of the reasons the I/O API hasn't reached stable yet. We are currently reworking this part. If you want to have a faster I/O right now, I can recommend checking out https://github.com/seqan/seqan. We are confident that our future performance of SeqAn3 will match that of SeqAn2.

umeshwww commented 2 years ago

Hi @SGSSGene,

Thank you for this update. We look forward to the future versions of SeqAn3. +1 for the reference to biocpp-io.

seqan / product_backlog

async_io_buffer does not appear to be strong-scalable on SeqAn 3.2.0 #421