Rewrite threads, mutexes, semaphores to be OS-specific for performance reasons

Jimmio92 commented 1 year ago

Describe the project you are working on

A procedurally generated open world RPG was the original goal -- who knows what the final form will take. Originally, I was simply opening a port using UPNP class; since it sleeps the thread until it finishes.. got a warning that said I "had to call wait_to_finish", and I dug deeper...

Describe the problem or limitation you are having in your project

Encountered a bug in a warning in Thread; seen C++ standard library threads being used and wept for Windows user's performance and "black box voodoo" nature of C++ library since it's different on each system.

Describe the feature / enhancement and how it helps to overcome the problem or limitation

Win32 threads have multiple ways to create them; and only one way is technically valid thanks to a bug. Who knows what the C++ library is going to use? _beginthreadex is the only safe version. CreateThread and others can overwrite static memory internally from a second call to CreateThread, thus returning invalid handles. We can hope and pray that the standard library is thoroughly hardened and tested for this -- or we can write it ourselves -- which we should do anyway because of the mutex potential speedup.

Win32 mutex is implemented at the kernel level. Every time a lock/unlock happens, a jump to ring level 0 happens, and this time is quite significant. This is most likely what std::mutex will use. Win32 has CriticalSection which acts very much like a mutex, however it is a spin lock. There's now a great chance of not needing to enter the kernel ring, thus saving huge amounts of time in naively written code, and worst case is as slow as a Win32 mutex.

I did not have to write an example (thank you, StackOverflow)

... for 1,000,000 uncontended acquires and releases, a mutex takes over one second. A critical section takes ~50 ms for 1,000,000 acquires. ...

this is a >20x speedup -- and with Juan's recent post about AAA-readiness, it makes a lot of sense to me to take advantage of speed where able, even if it's not likely to be used much by hobbyists.

Further, I've been informed that it previously was written in an OS-specific manner before, however it was decided that it would be better to trust the black-box that is the standard C++ library for this, despite it also not offering a way to cancel/kill a thread, and despite the likely speed concerns. It seems quite backwards to throw out working code just to rewrite it using something that ends up being a more poor fit for the job, and if this proposal is accepted, it will be a third time.

Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams

Most platforms support pthreads. Use them everywhere except Windows. Windows: use _beginthreadex for threading, Win32 CriticalSection for mutex, Win32 Semaphore for semaphore.

#define WIN32_LEAN_AND_MEAN 1
#include <Windows.h>
#include <process.h>
#include <synchapi.h>

#include <cstdint>

//declaration (headers)
//os.hpp
typedef void* NativeHandle;
struct Thread {
    typedef void* (*Func)(void*);

    NativeHandle handle;
    Func func;
    void *data;

    Thread();
    ~Thread();

    bool can_start() const;
    bool start(Func, void*);
    bool has_started() const;

    bool can_join() const;
    bool try_join();
    void* join();

    bool can_detach() const;
    void detach();
    bool has_detached() const;

    bool can_cancel() const;
    void cancel();
};
struct Mutex {
    NativeHandle handle;

    Mutex();
    ~Mutex();

    void lock();
    bool try_lock();
    void unlock();
};
struct Semaphore {
    NativeHandle handle;

    Semaphore();
    ~Semaphore();

    void post();
    void wait();
    bool try_wait();
};

//implementation (source) follows
//win32/thread.cpp
static uint32_t win32_thread_begin(void *obj) {
    Thread *t = (Thread*)obj;

    t->data = t->func(t->data);
    return 0;
}

Thread::Thread(): handle(nullptr), func(nullptr), data(nullptr) { }
Thread::~Thread() {
    if(has_started()) {
        //detach is what we should do; if the app is closing, the OS will
        //terminate threads still running -- this is unsafe for us to do
        //ourselves here
        detach();
    }
}

bool Thread::can_start() const {
    return handle == nullptr;
}
bool Thread::start(Thread::Func fn, void *data) {
    if(!can_start()) {
        return false;
    }

    //set data in thread object
    this->func = fn;
    this->data = data;

    //spawn thread
    handle = (NativeHandle)_beginthreadex(
        nullptr, //default security
        4096, //stack size TODO: customizable?
        win32_thread_begin,
        (void*)this,
        0, //initflag
        nullptr //thrdaddr
    );
    //if handle is 0, it failed
    if((size_t)handle == 0) {
        //TODO: Error; sets errno and _doserrno
        //EAGAIN too many threads, EINVAL if stack size incorrect -- arg invalid
        //EACCES insufficient resources (memory)
        return false;
    }

    return true;
}
bool Thread::has_started() const {
    return handle != nullptr;
}

bool Thread::can_join() const {
    return  has_started()
    &&      (WaitForSingleObject((HANDLE)handle, 0) == WAIT_OBJECT_0);
}
bool Thread::try_join() {
    if(can_join()) {
        join();
        return true;
    }

    return false;
}
void* Thread::join() {
    DWORD r = WaitForSingleObject((HANDLE)handle, INFINITE);
    switch(r) {
    case WAIT_OBJECT_0:
        return data;
    }

    return nullptr;
}

bool Thread::can_detach() const {
    return has_started();
}
void Thread::detach() {
    if(!can_detach()) return;

    //win32 detach means closing the handle; does NOT cancel a thread!
    CloseHandle((HANDLE)handle);
    //VITAL: Set handle to nullptr so can_detach, can_join fail on this thread
    //object; does handle need to be atomic?
    handle = nullptr;
}
bool Thread::has_detached() const {
    return !has_started();
}

bool Thread::can_cancel() const {
    return has_started();
}
void Thread::cancel() {
    if(!can_cancel()) {
        return;
    }

    BOOL b = TerminateThread((HANDLE)handle, 0);
    detach();
}

//win32/mutex.cpp
Mutex::Mutex(): handle(nullptr) {
    //allocate memory for CRITICAL_SECTION
    //TODO: should this be in a pool tagged os?
    void *mem = std::malloc(sizeof(CRITICAL_SECTION));
    if(mem == nullptr) {
        //TODO: use our own logging system
        std::perror("Mutex::Mutex()");
        return;
    }

    //initialize CRITICAL_SECTION
    InitializeCriticalSection((LPCRITICAL_SECTION)mem);

    //set handle to memory ptr
    handle = (NativeHandle)mem;
}
Mutex::~Mutex() {
    if(handle != nullptr) {
        DeleteCriticalSection((LPCRITICAL_SECTION)handle);
        std::free(handle);
        handle = nullptr;
    }
}

void Mutex::lock() {
    EnterCriticalSection((LPCRITICAL_SECTION)handle);
}
bool Mutex::try_lock() {
    //TryEnterCriticalSection returns zero when another thread owns it
    return TryEnterCriticalSection((LPCRITICAL_SECTION)handle) != 0;
}

void Mutex::unlock() {
    LeaveCriticalSection((LPCRITICAL_SECTION)handle);
}

//win32/semaphore.cpp
Semaphore::Semaphore(): handle(nullptr) {
    handle = (NativeHandle)CreateSemaphoreExW(
        nullptr, //default security attributes
        0, //initial count
        2147483647, //32 bit long max
        nullptr, //LPCWSTR name -- allows IPC
        0,
        DELETE | SYNCHRONIZE | SEMAPHORE_MODIFY_STATE
    );
}
Semaphore::~Semaphore() {
    CloseHandle((HANDLE)handle);
}

void Semaphore::post() {
    ReleaseSemaphore((HANDLE)handle, 1, nullptr);
}
void Semaphore::wait() {
    WaitForSingleObject((HANDLE)handle, INFINITE);
}
bool Semaphore::try_wait() {
    return WaitForSingleObject((HANDLE)handle, 0) == 0;
}

This is pulled from my own hobby engine's core library as it currently stands and smashed into one "file" here that should compile and be useful for any win32 implementation. All rights to my code here provided to any Godot devs that are interested in tackling this.

If this enhancement will not be used often, can it be worked around with a few lines of script?

As this is a core performance related proposal, no it cannot be worked around.

Is there a reason why this should be core and not an add-on in the asset library?

This is a proposal for modification to a core component.

RandomShaper commented 1 year ago

Regarding performance, can you provide an example of a project that ended up with performance issues because of the Windows implementation of std::mutex, or performance measurement of the Godot editor that show how much time is wasted locking/unlocking during, say, importing, or some other real-world case?

Regarding C++ threads not being killable, that's OK. You don't want to kill them anyway, but to let them finish and perform proper cleanup.

Also, as I already told, Godot has its own spin lock implementation, used in performance-critical areas.

I can share some of your point about being a bit uneasy about how the C++ library is implementing thread and sync classes, but so far it has worked perfectly.

Jimmio92 commented 11 months ago

Considering during a release candidate, a threading lib was added? Turns out this still needed done. I don't understand why I bother trying to contribute, I really don't.

Threads need to be killable. Ever hear of a cancel button on a long operation?

godotengine / godot-proposals