ornladios / ADIOS2

Next generation of ADIOS developed in the Exascale Computing Program
https://adios2.readthedocs.io/en/latest/index.html
Apache License 2.0

SST engine: Make QueueLimit be considered at BeginStep instead of EndStep #3461

Open franzpoeschel opened 1 year ago

franzpoeschel commented 1 year ago

Why is this feature important? I am setting up a simulation workflow using SST that has a large memory footprint, hence I set "QueueLimit" = "1". However, a second step is still opened and, accordingly, a second buffer is allocated. The writer is only blocked at Engine::EndStep(), effectively resulting in a memory footprint as if QueueLimit=2 had been specified.

Attached is the memory footprint of a simulation that outputs 4 ADIOS2 steps, each step with a size of 40 GB per rank. Every step after the first has a memory footprint roughly 40 GB higher than the first, because two steps are held in memory at once.

[Screenshot from 2023-02-02 11:16:24: per-rank memory footprint over time]

In this picture, the red area (arrows) is memory held by the SST buffer; the rest (orange and others) is PIConGPU memory outside of ADIOS2.

What is the potential impact of this feature in the community? Memory-sensitive setups could make better use of SST. The simulation would be blocked at a place that makes more sense: blocking at EndStep is not very useful, since at that point the data could just as well be published for the benefit of fast readers.

Is your feature request related to a problem? Please describe. Described above.

Describe the solution you'd like and potential required effort Instead of blocking the Engine::EndStep() call of the second step, it would be helpful for us if Engine::BeginStep() blocked instead. The required effort depends on the implementation and internal logic of SST; from my superficial point of view, it might be relatively small.

Describe alternatives you've considered and potential required effort There is currently no way to specify a QueueLimit that effectively is 1. Specifying a QueueLimit of n implies a memory usage of (n+1) times the per-step buffer size.
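
To make the memory implication concrete, here is a minimal sketch of the arithmetic only (plain C++, no ADIOS2); the 40 GB step size is taken from the measurement above, and the (n+1) factor reflects the currently observed behavior:

#include <cstddef>
#include <iostream>

int main()
{
    std::size_t stepSizeGB = 40; // per-rank size of one ADIOS2 step, as in the plot above
    std::size_t queueLimit = 1;  // value of the "QueueLimit" engine parameter
    // Observed: the buffer for the next step is already allocated at BeginStep,
    // so the peak is (queueLimit + 1) buffers instead of queueLimit buffers.
    std::cout << "Approximate peak SST buffer memory per rank: "
              << (queueLimit + 1) * stepSizeGB << " GB\n";
}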

Additional context

I reused the example from #3453 for a small demonstration of this behavior.

Writer:

#include <adios2.h>
#include <iostream>
#include <vector>

int main(int argsc, char **argsv)
{
    std::string engine_type = "sst";

    adios2::ADIOS adios{};
    adios2::IO IO = adios.DeclareIO("IO");
    IO.SetParameter("DataTransport", "WAN");
    IO.SetParameter("QueueLimit", "1");
    IO.SetParameter("InitialBufferSize", "100Mb");
    IO.SetParameter("Profile", "Off");
    IO.SetEngine(engine_type);
    adios2::Engine engine = IO.Open("stream", adios2::Mode::Write);
    std::vector<int> v(10, 17);
    auto var = IO.DefineVariable<int>("var", {10}, {0}, {10});
    std::vector<std::string> vecstring{"x", "y", "z"};
    IO.DefineAttribute("vecstring", vecstring.data(), vecstring.size());

    for (unsigned step = 0; step < 10; ++step)
    {
        engine.BeginStep();
        std::cout << "Began step " << step << std::endl;
        engine.Put(var, v.data());
        std::cout << "Closing step " << step << std::endl;
        engine.EndStep();
        std::cout << "Closed step " << step << std::endl;
    }
    std::cout << "Closing engine" << std::endl;
    engine.Close();
}

Reader:

#include <adios2.h>
#include <iostream>
#include <string>
#include <vector>
#include <unistd.h>

int main(int argsc, char **argsv)
{
    std::string engine_type = "sst";

    adios2::ADIOS adios{};
    adios2::IO IO = adios.DeclareIO("IO");
    IO.SetParameter("DataTransport", "WAN");

    IO.SetEngine(engine_type);
    adios2::Engine engine = IO.Open("stream", adios2::Mode::Read);

    std::vector<int> v(10);

    while (engine.BeginStep() == adios2::StepStatus::OK)
    {
        if(engine.CurrentStep() == 0)
        {
            sleep(15); // hold the first step open so that the writer's queue fills up
        }
        auto var = IO.InquireVariable<int>("var");
        var.SetSelection({{0}, {10}});
        engine.Get(var, v.data());
        engine.EndStep();
        std::cout << "In Step " << engine.CurrentStep() << ": ";
        for(auto val : v)
        {
            std::cout << val << ", ";
        }
        std::cout << std::endl;
        auto attr = IO.InquireAttribute<std::string>("vecstring");
        std::cout << "Data in vecstring: ";
        for (auto const & str : attr.Data())
        {
            std::cout << str << ", ";
        }
        std::cout << std::endl;
    }
    engine.Close();
}

The reader sleeps for 15 seconds after opening the first step. The output during those 15 seconds is the following, showing that the writer is allowed to create two steps:

> ./stream_read & ./stream_write 
[2] 1117217
Began step 0
Closing step 0
Closed step 0
Began step 1
Closing step 1
eisenhauer commented 1 year ago

Unfortunately, BeginStep is not a collective call on any engine as far as I can tell, but instead just does local operations. Giving it the ability to block on queue size would at least require making it collective so that all ranks could do the same thing. I think that the other engines are done with the data by the time they exit EndStep, with SST being unique in that it's not. Maybe we could make the argument that it's OK for BeginStep to be collective in SST and not in others? Maybe only collective if there's a queue limit set? Or maybe we should introduce another call that might block waiting for the last timestep to be consumed before continuing (which would be more flexible than just making that be BeginStep)?
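
For illustration only, a writer-side sketch of that last variant; WaitForEmptyQueue() is a hypothetical name, not an existing ADIOS2 call, and the snippet assumes the engine, var and v objects from the writer example above:

    // Hypothetical sketch: WaitForEmptyQueue() does not exist in ADIOS2 today.
    // The idea is to block here, before a new buffer is allocated, until the
    // reader has consumed enough queued steps that a new one fits under QueueLimit.
    for (unsigned step = 0; step < 10; ++step)
    {
        engine.WaitForEmptyQueue(); // hypothetical blocking call
        engine.BeginStep();         // the buffer for this step is allocated here
        engine.Put(var, v.data());
        engine.EndStep();           // would no longer need to block on a full queue
    }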

franzpoeschel commented 1 year ago

Hello Greg, thanks for the answer. To be fair, I was not aware that BeginStep() is not collective; so far we have always treated it as a collective call. So, speaking from our perspective only, making it collective in SST would be no problem.

> Maybe only collective if there's a queue limit set? Or maybe we should introduce another call that might block waiting for the last timestep to be consumed before continuing (which would be more flexible than just making that be BeginStep)?

Maybe an engine parameter? E.g., new options for QueueFullPolicy: BlockAtBeginStep and BlockAtEndStep.
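
For illustration, the suggestion could look like this on the writer's IO object; BlockAtBeginStep and BlockAtEndStep are hypothetical values that do not exist today (if I remember correctly, SST's current QueueFullPolicy only accepts Block and Discard):

    // Hypothetical parameter values, shown only to illustrate the suggestion.
    IO.SetParameter("QueueLimit", "1");
    IO.SetParameter("QueueFullPolicy", "BlockAtBeginStep");    // proposed: block already in BeginStep
    // IO.SetParameter("QueueFullPolicy", "BlockAtEndStep");   // proposed name for the current behavior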