vsg-dev / VulkanSceneGraph

Vulkan & C++17 based Scene Graph Project
http://www.vulkanscenegraph.org
MIT License
1.32k stars 212 forks source link

issue with OperationThreadQueue::Latch #164

Closed mp3butcher closed 4 years ago

mp3butcher commented 4 years ago

following https://github.com/vsg-dev/VulkanSceneGraph/pull/163 rejection Describe the bug MT Freeze (Latch stays locked after count reached 0)

To Reproduce

#include <vsg/all.h>

struct SimpleOp : public vsg::Operation
{
    SimpleOp(vsg::ref_ptr<vsg::Latch> l): latch(l) {}

    void run() override
    {
        latch->count_down();
    }

    vsg::ref_ptr<vsg::Latch> latch;
};

int main(int argc, char** argv)
{
    vsg::ref_ptr<vsg::OperationThreads> threads(new vsg::OperationThreads(1));
    threads->run();
    for(int i =0;i<1000000;++i)
    {
        vsg::ref_ptr<vsg::Latch> latch(new vsg::Latch(1));
        vsg::ref_ptr<SimpleOp> op(new SimpleOp(latch));
        threads->add(op);
        latch->wait();
    }
}

Expected behavior to exit properly

mp3butcher commented 4 years ago

@robertosfield Considering the time since which I contribute to your projects, I'd expect a little more respect ...

robertosfield commented 4 years ago

To provoke a bug you first subclassed from vsg::Latch and then suggest it's a bug in Latch.

This looks like it breaks the design of Latch and which is why it's not working for you. You implementation of reset looks invalid to me, it ignores the condition variable that needs to be managed correctly.

Next week I'll review Latch and if a reset() looks a desirable feature I'll implement it without breaking it and without needing the crazy refactor that you've proposed.

On future if you want a new feature added to a class like Latch go ask a question on vsg-users rather than presume that your changes are valid usage.

On Sat, 11 Apr 2020 at 15:27, Julien Valentin notifications@github.com wrote:

@robertosfield https://github.com/robertosfield Considering the time since which I contribute to your projects, I hope a little more respect ...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vsg-dev/VulkanSceneGraph/issues/164#issuecomment-612433344, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKEGUE57MBMRIXKRJQACLTRMB44RANCNFSM4MGAJ6LA .

mp3butcher commented 4 years ago

look issue on github please...

robertosfield commented 4 years ago

What happens when you remove the unnecessary repeated call to threads->run()? On call prior to the loop is all that should be done.

Is this only there to delay the call it latch->watch() in the main thread?

Is so then try sticking a sleep or other delay call as this would more reliably recreate the possible problem condition.

Next week when I'm back at work I'll investigate.

mp3butcher commented 4 years ago

What happens when you remove the unnecessary repeated call to threads->run()? On call prior to the loop is all that should be done.

same issue

Is this only there to delay the call it latch->watch() in the main thread?

no, I just wrote it too quickly

robertosfield commented 4 years ago

I had some time available today so I recreate the test case above and got it to recreate the hang, I then moved a Latch implementation local to the cpp, so it can be easily modified, and changed the code across to using ::create rather then new as it's good practice VSG usage (less likely to forgot to use ref_ptr<>> and it also means we can use auto for better code readability.

As a workaround for the hang I found that using std::condition_variable::wait_for() rather than wait() prevents the hang. bit it doesn't actually address the condition that is causing the hang so I would describe this as a workaround rather than a fix. I'd like to get to the bottom of why the count_down() and wait() are failing occasionally. My guess is there some issue with the atomic variable usage that need tightening up. I have some ideas but will tinker with these in the week.

With this version of the test program with a debug build I'm typically getting around 0-10 timeouts reported. My guess is that the timeout in this wait_for() version would a hang in the wait() version for each of these timeouts. With a release build I'm get far fewer timeout warnings, I have to use 10x number of iterations to provoke any.

`

include <vsg/all.h>

class MyLatch : public vsg::Inherit<vsg::Object, MyLatch> { public: MyLatch(size_t num) : _count(num) {}

void count_up()
{
    ++_count;
}

void count_down()
{
    --_count;
    if (is_ready())
    {
        std::unique_lock lock(_mutex);
        _cv.notify_all();
    }
}

bool is_ready() const
{
    return (_count == 0);
}

void wait()
{
    std::chrono::duration waitDuration = std::chrono::milliseconds(100);
    // use while loop to return immediate when latch already released
    // and to handle cases where the condition variable releases spuriously.
    while (_count != 0)
    {
        std::unique_lock lock(_mutex);
        if (_cv.wait_for(lock, waitDuration) == std::cv_status::timeout)
        {
            std::cout<<"_cv.wait_for(lock, waitDuration) timed out"<<std::endl;
        }
    }
}

std::atomic_size_t& count() { return _count; }

protected: virtual ~MyLatch() {}

std::atomic_size_t _count;
std::mutex _mutex;
std::condition_variable _cv;

};

struct SimpleOp : public vsg::Inherit<vsg::Operation, SimpleOp> { SimpleOp(vsg::ref_ptr l): latch(l) {}

void run() override
{
    latch->count_down();
}

vsg::ref_ptr<MyLatch> latch;

};

int main(int argc, char** argv) { vsg::CommandLine arguments(&argc, argv);

auto numThreads = arguments.value(1, "-t");
auto numIterations = arguments.value(1000000, "-n");

auto threads = vsg::OperationThreads::create(numThreads);
threads->run();
for(int i =0; i<numIterations; ++i)
{
    auto latch = MyLatch::create(1);
    auto op = SimpleOp::create(latch);
    threads->add(op);
    latch->wait();
}

return 0;

} `

The CmakeFile.txt used for this test was: `cmake_minimum_required(VERSION 2.6)

PROJECT(latchbug)

FIND_PACKAGE(vsg)

SET(SOURCES main.cpp )

set(CMAKE_CXX_STANDARD 17)

ADD_EXECUTABLE(latchbug ${SOURCES})

TARGET_LINK_LIBRARIES(latchbug vsg::vsg) `

robertosfield commented 4 years ago

I've spent some time testing and reviewing the code and have implemented the follow wait() implementation. I'm currently running a stress test with 40 threads and many iterations. it's been running for over half an hour successfully iterating away, still only 20% through the stress test with hints of hangs. This fix is to move the check of the _count to after the mutex lock but before the call to _cv.wait(). If this stress test goes fine I'll check in the fix tomorrow:

void wait()
{
    // use while loop to return immediate when latch already released
    // and to handle cases where the condition variable releases spuriously.
    if (_count > 0)
    {
        for(;;)
        {
            std::unique_lock lock(_mutex);
            if (_count > 0) _cv.wait(lock);
            else break;
        }
    }
}
mp3butcher commented 4 years ago

Active wait is not of my education.... Without taking that into account, this pattern seems abstractly ok. I really would like to know the source of your aversion to mutex for a simple barrier...

robertosfield commented 4 years ago

You asking me to teach you mulit-threading?

A condition variable used in the above wait is not an active wait, the thread is put to sleep while the thread sits in the wait. The thread(s) sitting on the condition_variable::wait() is(are) only woken by the thread that calls count_down() condition_variable::notify_all(), or notify_one().

The for(;;) loop is only there because on some implementations the condition_variable::wait() can be released spuriously, as mentioned in the comment explaining this.

The condition_variable exists for this type of threading task, a mutex exists for another, an atomic for another, you use the most efficient tool for the task.

If you don't understand something yet, that's OK, just ask for an explanation. It's also important to not presume something is wrong because it's different to what your current level of skill/understanding leads you to assume. I'm 30 years into my career and I'm still learning.

robertosfield commented 4 years ago

I have simplified the fix further and applied two other refinements of the Latch:

https://github.com/vsg-dev/VulkanSceneGraph/commit/279d636995ee422ea9df03e1cfd3de324eae3f74

With these changes the latchbug test program runs successful at all times and when running within valgrind tools=helgrind reports no issues.

Please note, the fix part of this commit is a single line change - simply moving the lock outside the loop. It's also important to understand that the wait() doesn't sit on the mutex lock, the condition_variable::wait() will release the acquired mutex while it's waiting, then require it once it's been signaled to continue. The benift for use of condition_variable here is to avoid acquiring mutex locks for any longer than is absolutely necessary.

mp3butcher commented 4 years ago

https://github.com/vsg-dev/VulkanSceneGraph/commit/279d636995ee422ea9df03e1cfd3de324eae3f74

is very ok thanks