openppl-public / ppl.llm.serving

Apache License 2.0

Example demo stuck (ppl_llm_server/client_sample) #44

Closed Kwinpeng closed 9 months ago

Kwinpeng commented 9 months ago

When I run the provided example by:

[server]
./ppl-build/ppl_llm_server llama_7b_config_example.json

[client]
./ppl-build/client_sample 127.0.0.1:23333

The program gets stuck and does not proceed.

After about two hours of debugging, I found that this may be caused by a misuse of StaticThreadPool. The worker thread is implemented as:

void* StaticThreadPool::ThreadWorker(void* arg) {
    auto info = (ThreadInfo*)arg;
    auto pool = info->pool;

    while (true) {
        pool->barrier_.Wait();
        if (!pool->func_) {
            break;
        }
        pool->func_(pool->threads_.size(), info->thread_idx);
        pool->barrier_.Wait();
    }

    return nullptr;
}

There are two barriers: one before func_ runs and one after. However, to allow asynchrony, an asynchronous method such as StaticThreadPool::RunAsync only calls Wait() on the first barrier, so it is left to the caller to pass the second barrier (by calling Wait()) before the next run. This is also noted by the author in a comment.

Strangely, the SECOND barrier still seems to be missed somewhere. When I add a pool->Wait() in two places, the program starts to make progress, and the client and server appear to interact with each other normally.

The two modifications are:

[utils.h]

template <typename TaskType, typename... TaskArgType>
ppl::common::RetCode ParallelExecute(ppl::common::StaticThreadPool* pool, TaskArgType&&... rest_args) {
    auto n = pool->GetNumThreads();
    ppl::common::Barrier finish_barrier;
    std::vector<ppl::common::RetCode> thr_rc(n); // avoid a non-standard VLA
    finish_barrier.Reset(n + 1);

    pool->RunAsync([&](uint32_t nthr, uint32_t ithr) {
        auto task = TaskType(ithr, std::forward<TaskArgType>(rest_args)...);
        thr_rc[ithr] = task.Process();
        finish_barrier.Wait();
    });

    finish_barrier.Wait();

    ppl::common::RetCode retcode = ppl::common::RC_SUCCESS;
    for (uint32_t i = 0; i < n; ++i) {
        if (thr_rc[i] != ppl::common::RC_SUCCESS) {
            LOG(ERROR) << "ParallelExecute task[" << i << "] failed";
            retcode = thr_rc[i];
            break;
        }
    }

    pool->Wait();

    return retcode;
}

and

[llama_worker.cc]

if (is_first_run_) {
    is_first_run_ = false;
} else {
    decoder_barrier_.Wait();
    decoder_thread_pool_.Wait();
}

So I'd like to know if this modification is correct and is reasonable according to the design?

Current output of the provided example:

[screenshot "out" attached]
Alcanderian commented 9 months ago

Yes, this design is unreasonable, as I told @Vincent-syr last week. The problem has been fixed in #35; you can check the latest code.

We cancelled the SECOND barrier so it won't get stuck on multi-GPU inference.

Kwinpeng commented 9 months ago

> Yes, this design is unreasonable, as I told @Vincent-syr last week. The problem has been fixed in #35; you can check the latest code.
>
> We cancelled the SECOND barrier so it won't get stuck on multi-GPU inference.

I pulled the latest master branch, and it's working well now!