open-watcom / open-watcom-v2

Open Watcom V2.0 - Source code repository, Wiki, Latest Binary build, Archived builds including all installers for download.
Other
991 stars 163 forks source link

enhancement: Increase the speed of pthread_mutex_lock / pthread_mutex_unlock #1342

Open winspool opened 1 month ago

winspool commented 1 month ago

I saw an article about the speed of pthread_mutex implementations ( https://justine.lol/mutex/ ) and tried the mentioned test program (high contended scenario) with OpenWatcom. The test program uses one counter, 30 threads and 100000 iterations per thread.

Yes, that is just a test program, that makes an implementation detail visible, but with the default thread count, the OpenWatcom compiled programm uses 100% cpu, until i canceled it after ~5 minutes.

The test was created to compare the mutex implementation of the cosmopolitan libc (64 bit) and a part of the mutex implementation is based on nsync: https://github.com/google/nsync (nsync supports many more cpu`s, including x86, alpha, mips, ppc, ...)

When using all Hardware threads (12 on my system), the mutex implementation of cosmopolitan (64bit) is more than twice as fast as glibc (64bit), and the mutex in cosmopolitan is more than 100 times faster as the mutex in OpenWatcom (32bit). OpenWatcom is still about 35 times slower as glibc (32bit)

12 threads (supported by my hardware): 
cosmopolitan libc(64bit): 16,2ms
glibc(64bit): 36,3ms / glibc(32bit): 47,0ms
OpenWatcom: 1682ms (1,68s)

When the program uses more threads as available in hardware, the slowdown factor for the OpenWatcom compiled program increases dramatically to about ~ 7400 compared to cosmopolitan(64bit) ~ 2500 compared to glibc(32bit)

13 threads:
cosmopolitan libc(64bit): 17,3ms
glibc(64bit): 39,9ms / glibc(32bit): 50,4ms
OpenWatcom: 128819ms (128,8s)

Even when using only 4 threads, OpenWatcom is ~59 times slower than glibc(32bit) With only one or two threads, OpenWatcom is >100 times slower as glibc(32bit)

I also retested with 25 threads:

cosmopolitan libc(64bit): 31,4ms
glibc(64bit): 77,6ms / glibc(32bit): 89,7ms
OpenWatcom: 694362ms (694s) (A second run took 705 seconds)

########

The worker code for the thread is simple, but really slow, when compiled with OpenWatcom.

pthread_mutex_t g_locker = PTHREAD_MUTEX_INITIALIZER;
int g_chores;

void *worker(void *arg) {
    for (int i = 0; i < ITERATIONS; ++i)
    {
        pthread_mutex_lock(&g_locker);
        ++g_chores;
        pthread_mutex_unlock(&g_locker);
    }
    return 0;
}

The Test program was adapted to work with OpenWatcom (rusage is not available): pthread_mutex.c.txt

The implementation of pthread_mutex_lock is in ptmutex.c The Author mentioned at the head of the file is J. Armstrong. Is he still around?

pthread_mutex_lock uses sem_wait and sem_wait uses the futex syscall.

Is there something that can be changed, to handle some sync cases faster?

Is it possible to use some code or logic from other pthread_mutex_* or other semaphore implementations?

winspool commented 1 month ago

Some links to Futex and CPU cache related documents:

Jens Gustedt "Futex based locks for C11’s generic atomics"

Ulrich Drepper's paper "Futexes Are Tricky" Ulrich Drepper's paper "What Every Programmer Should Know About Memory"

ArmstrongJ commented 3 weeks ago

The pthread mutex code I implemented is a bit of a mess mostly because I did run into some issues with futexes. The issue isn't using futexes, it has something to do with the implementation of semaphores/futexes in OW. I remember it being a problem when I committed the code, but I don't remember the specifics. It should "work," but there is a time-consuming workaround if I'm not mistaken.