max0x7ba / atomic_queue

C++ lockless queue.
MIT License
1.47k stars 176 forks source link

Implementations for ppc64le and s390x? #33

Closed musicinmybrain closed 2 years ago

musicinmybrain commented 2 years ago

I’ve packaged this library for Fedora Linux as part of an effort to eventually package sfizz.

Both ppc64le (little-endian 64-bit PowerPC) and s390x (Linux on IBM Z) are supported primary architectures in Fedora. It’s fine that they aren’t supported here—we can just ensure that atomic_queue and any packages that depend on it aren’t built on those architectures.

Still, I wonder—is supporting these two architectures easy or difficult?

Unfortunately, I’m not entirely confident in the exact requirements of the spin_loop_pause() function, nor am I particularly familiar with the nuances of those two architectures, so it’s not easy for me to contribute implementations myself.

max0x7ba commented 2 years ago

I think supporting those architectures should be fairly trivial - just add spin_loop_pause function that uses the CPU specific pause instruction, if the CPU has one. It should work even if spin_loop_pause is defined as an empty function.

If I had access to such machines I would be able to test it.

musicinmybrain commented 2 years ago

If I had access to such machines I would be able to test it.

While I don’t have direct interactive shell access on these architectures, I can run the atomic_queue tests on them by doing RPM “scratch builds” on the Fedora builders. Does that help?

max0x7ba commented 2 years ago

Building helps. Ideally, test and/or benchmark executables must be built and run. They both do consistency checks to verify that the queue operates as expected.

I have never cross-compiled, so am not sure how I would go about making changes to support these architectures and trying to compile them.

max0x7ba commented 2 years ago

Made some changes for you: https://github.com/max0x7ba/atomic_queue/commit/ee3d91cf131e56aa3302a43aeebf7b57a3c97b06

musicinmybrain commented 2 years ago

Thanks. I’ll run the tests right away, and I’ll investigate whether I can get the benchmarks (with their extra dependencies) to build as a one-time hack.

musicinmybrain commented 2 years ago

It’s my understanding that cache lines are 128 bytes on powerpc64, and 256 bytes on s390x. You can detect s390x with the __s390x__ preprocessor macro (https://sourceforge.net/p/predef/wiki/Architectures/).

musicinmybrain commented 2 years ago

The tests passed. (The RPMs for that build will expire after a while, and some or all of the logs might too.)

I’ll work on trying to build and run the benchmarks later.

max0x7ba commented 2 years ago

It’s my understanding that cache lines are 128 bytes on powerpc64, and 256 bytes on s390x. You can detect s390x with the __s390x__ preprocessor macro (https://sourceforge.net/p/predef/wiki/Architectures/).

https://github.com/max0x7ba/atomic_queue/commit/f2a36a392e02cc2643b07292e639c4a2ae7d4a35

Fell free to submit pull requests, I will approve promptly.

max0x7ba commented 2 years ago

The tests passed. (The RPMs for that build will expire after a while, and some or all of the logs might too.)

I’ll work on trying to build and run the benchmarks later.

benchmark may be too CPU-intense to run in continious integrations systems.

"Unit tests with gcc" in GitHub CI runs make -rj2 TOOLSET=gcc example run_tests, it is intended to be invoked by continious integrations systems.

musicinmybrain commented 2 years ago

"Unit tests with gcc" in GitHub CI runs make -rj2 TOOLSET=gcc example run_tests, it is intended to be invoked by continious integrations systems.

That should be the same as what I’m running via meson test, I think. I started running the example too, and that’s also fine for ee3d91cf131e56aa3302a43aeebf7b57a3c97b06: https://koji.fedoraproject.org/koji/taskinfo?taskID=77575442

I did try to build and run the benchmarks (again, as a one-off, since even if the runtime isn’t too long, I don’t want to package three other queue implementations). It seems like I can’t run them at all without the hugeadm comands recommended in README.md.

For f2a36a392e02cc2643b07292e639c4a2ae7d4a35, however, I’m getting a compiler error on the two new platforms: https://koji.fedoraproject.org/koji/taskinfo?taskID=77575138

g++ -Itests.p -I. -I.. -I../include -I/usr/include -fdiagnostics-color=always -DNDEBUG -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wnon-virtual-dtor -std=gnu++14 -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mcpu=power8 -mtune=power8 -fasynchronous-unwind-tables -fstack-clash-protection -DBOOST_TEST_DYN_LINK=1 -DBOOST_ALL_NO_LIB -pthread -MD -MQ tests.p/src_tests.cc.o -MF tests.p/src_tests.cc.o.d -o tests.p/src_tests.cc.o -c ../src/tests.cc
In file included from ../src/tests.cc:8:
../include/atomic_queue/atomic_queue.h: In instantiation of ‘class atomic_queue::AtomicQueueB2<std::unique_ptr<int> >’:
../src/tests.cc:136:42:   required from here
../include/atomic_queue/atomic_queue.h:518:19: error: static assertion failed: Unexpected SHUFFLE_BITS.
  518 |     static_assert(SHUFFLE_BITS, "Unexpected SHUFFLE_BITS.");
      |                   ^~~~~~~~~~~~
../include/atomic_queue/atomic_queue.h:518:19: note: ‘atomic_queue::AtomicQueueB2<std::unique_ptr<int> >::SHUFFLE_BITS’ evaluates to false
../include/atomic_queue/atomic_queue.h: In instantiation of ‘class atomic_queue::AtomicQueueB2<float, std::allocator<float>, true, true, true>’:
../src/tests.cc:160:21:   required from here
../include/atomic_queue/atomic_queue.h:518:19: error: static assertion failed: Unexpected SHUFFLE_BITS.
../include/atomic_queue/atomic_queue.h:518:19: note: ‘atomic_queue::AtomicQueueB2<float, std::allocator<float>, true, true, true>::SHUFFLE_BITS’ evaluates to false
../include/atomic_queue/atomic_queue.h: In instantiation of ‘class atomic_queue::AtomicQueueB2<float>’:
../include/atomic_queue/atomic_queue.h:593:8:   required from ‘struct atomic_queue::RetryDecorator<atomic_queue::AtomicQueueB2<float> >’
../src/tests.cc:179:71:   required from here
../include/atomic_queue/atomic_queue.h:518:19: error: static assertion failed: Unexpected SHUFFLE_BITS.
../include/atomic_queue/atomic_queue.h:518:19: note: ‘atomic_queue::AtomicQueueB2<float>::SHUFFLE_BITS’ evaluates to false
max0x7ba commented 2 years ago

It looks like the larger cache sizes weren't anticipated in other parts of the code. I will look into that in the next few days.

musicinmybrain commented 2 years ago

Ok, thanks for looking into this at all. I’m happy to test any further changes you come up with.

max0x7ba commented 2 years ago

e50e567ff86b029c6680e696d509c4b7bed13095 adds support for larger cache sizes.

max0x7ba commented 2 years ago

"Unit tests with gcc" in GitHub CI runs make -rj2 TOOLSET=gcc example run_tests, it is intended to be invoked by continious integrations systems.

That should be the same as what I’m running via meson test, I think.

That's right.

I did try to build and run the benchmarks (again, as a one-off, since even if the runtime isn’t too long, I don’t want to package three other queue implementations). It seems like I can’t run them at all without the hugeadm comands recommended in README.md.

It should run even if no hudeadm is installed and only issue a warning.

musicinmybrain commented 2 years ago

e50e567 adds support for larger cache sizes.

This seems to work on ppc64le, but I see a test failure on s390x (which did work with the cache line size set to 64 bytes):

+ /usr/bin/meson test -C redhat-linux-build --num-processes 3 --print-errorlogs --verbose
ninja: Entering directory `/builddir/build/BUILD/atomic_queue-e50e567ff86b029c6680e696d509c4b7bed13095/redhat-linux-build'
ninja: no work to do.
1/1 tests RUNNING       
>>> MALLOC_PERTURB_=23 /builddir/build/BUILD/atomic_queue-e50e567ff86b029c6680e696d509c4b7bed13095/redhat-linux-build/tests
1/1 tests FAIL            0.95s   (exit status 201 or signal 73 SIGinvalid)
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stdout:
Running 10 test cases...
../src/tests.cc(67): error: in "stress_BlockingAtomicQueue": check r > (expected_result / CONSUMERS) / 10 has failed [871616 <= 16666683333]
stderr:
*** 1 failure is detected in the test module "atomic_queue"

――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――
Summary of Failures:
1/1 tests FAIL            0.95s   (exit status 201 or signal 73 SIGinvalid)
Ok:                 0   
Expected Fail:      0   
Fail:               1   
Unexpected Pass:    0   
Skipped:            0   
Timeout:            0   
Full log written to /builddir/build/BUILD/atomic_queue-e50e567ff86b029c6680e696d509c4b7bed13095/redhat-linux-build/meson-logs/testlog.txt
musicinmybrain commented 2 years ago

It should run even if no hudeadm is installed and only issue a warning.

I thought that was the intent, but it doesn’t seem to be true in practice:

+ redhat-linux-build/benchmarks
Warning: Failed to allocate 1GB huge pages. Run "sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1".
Warning: Failed to allocate 2MB huge pages. Run "sudo hugeadm --pool-pages-min 2MB:16 --pool-pages-max 2MB:16".
terminate called after throwing an instance of 'std::system_error'
  what():  mmap: Resource temporarily unavailable
/var/tmp/rpm-tmp.ZcvJSj: line 35:   356 Aborted                 (core dumped) redhat-linux-build/benchmarks

Okay, the warnings are just warnings, but looking further at the system error:

$ gdb ./benchmarks
[blah blah blah]
(gdb) run
Starting program: /home/ben/fedora/mine/atomic-queue/atomic_queue-f2a36a392e02cc2643b07292e639c4a2ae7d4a35/x86_64-redhat-linux-gnu/benchmarks 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Warning: Failed to allocate 1GB huge pages. Run "sudo hugeadm --pool-pages-min 1GB:1 --pool-pages-max 1GB:1".
Warning: Failed to allocate 2MB huge pages. Run "sudo hugeadm --pool-pages-min 2MB:16 --pool-pages-max 2MB:16".
terminate called after throwing an instance of 'std::system_error'
  what():  mmap: Resource temporarily unavailable

Program received signal SIGABRT, Aborted.
0x00007ffff7b732a2 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.33-20.fc34.x86_64 libgcc-11.2.1-1.fc34.x86_64 libstdc++-11.2.1-1.fc34.x86_64 tbb-2020.3-7.fc34.x86_64
(gdb) bt
#0  0x00007ffff7b732a2 in raise () from /lib64/libc.so.6
#1  0x00007ffff7b5c8a4 in abort () from /lib64/libc.so.6
#2  0x00007ffff7de2a46 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#3  0x00007ffff7dee29c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4  0x00007ffff7dee307 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007ffff7dee5a9 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x0000555555560801 in atomic_queue::HugePages::HugePages (t=<optimized out>, size=33554432, this=0x7fffffffd730)
    at ../src/huge_pages.cc:55
#7  main () at ../src/benchmarks.cc:564
max0x7ba commented 2 years ago

e50e567 adds support for larger cache sizes.

This seems to work on ppc64le, but I see a test failure on s390x (which did work with the cache line size set to 64 bytes):

../src/tests.cc(67): �[1;31;49merror: in "stress_BlockingAtomicQueue": check r > (expected_result / CONSUMERS) / 10 has failed [871616 <= 16666683333]�[0;39;49m
stderr:
�[1;31;49m*** 1 failure is detected in the test module "atomic_queue"
�[0;39;49m

False positives are possible in this particular test. It requires 6 participating threads to be running simultenously uninterrupted.

I made this check a warning now. 7013a8b46c9a89ffeb0f9ebd20c9b4ff2c3bd47e

max0x7ba commented 2 years ago

It should run even if no hudeadm is installed and only issue a warning.

I thought that was the intent, but it doesn’t seem to be true in practice:

terminate called after throwing an instance of 'std::system_error'
  what():  mmap: Resource temporarily unavailable

Program received signal SIGABRT, Aborted.
0x00007ffff7b732a2 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.33-20.fc34.x86_64 libgcc-11.2.1-1.fc34.x86_64 libstdc++-11.2.1-1.fc34.x86_64 tbb-2020.3-7.fc34.x86_64
(gdb) bt
#0  0x00007ffff7b732a2 in raise () from /lib64/libc.so.6
#1  0x00007ffff7b5c8a4 in abort () from /lib64/libc.so.6
#2  0x00007ffff7de2a46 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /lib64/libstdc++.so.6
#3  0x00007ffff7dee29c in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4  0x00007ffff7dee307 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007ffff7dee5a9 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x0000555555560801 in atomic_queue::HugePages::HugePages (t=<optimized out>, size=33554432, this=0x7fffffffd730)
    at ../src/huge_pages.cc:55
#7  main () at ../src/benchmarks.cc:564

It falls back to default page size. However, it uses MAP_LOCKED and EAGAIN says that it cannot lock the pages. That's ulimit -l which may need to be set to a higher value. On my systems I have it set to unlimited.

musicinmybrain commented 2 years ago

False positives are possible in this particular test. It requires 6 participating threads to be running simultenously uninterrupted.

I made this check a warning now. 7013a8b

This makes sense, especially as the available s390x hardware has only three physical cores.

Some of the other builders in the Fedora infrastructure have fewer than six cores available—five is typical for both 32-bit and 64-bit ARM—but that seems not to have been a problem so far.

musicinmybrain commented 2 years ago

I successfully built 7013a8b46c9a89ffeb0f9ebd20c9b4ff2c3bd47e in Fedora.

musicinmybrain commented 2 years ago

It falls back to default page size. However, it uses MAP_LOCKED and EAGAIN says that it cannot lock the pages. That's ulimit -l which may need to be set to a higher value. On my systems I have it set to unlimited.

That makes sense. I should have looked at it more closely.

The default is 64 on most systems, which at 4k pages should be 128kiB, and the benchmarks want 32MiB.

I’m able to run the benchmarks locally with the default resource limits by patching out the MAP_LOCKED flag, which could affect benchmark accuracy but at least shows that they can build and start to run. (I’m still waiting for them to complete.)

On the other hand, there are various problems that keep the benchmarks from building on anything but x86_64, most obviously:

../../xenium/xenium/utils.hpp:80:4: error: #error "Unsupported compiler"
   80 |   #error "Unsupported compiler"
      |    ^~~~~

and

../../xenium/xenium/utils.hpp: In function ‘uint64_t xenium::utils::random()’:
../../xenium/xenium/utils.hpp:84:12: error: ‘getticks’ was not declared in this scope
   84 |     return getticks() >> 4;
      |            ^~~~~~~~

and

../src/benchmarks.cc:49:27: error: ‘__builtin_ia32_rdtsc’ was not declared in this scope; did you mean ‘__builtin_va_list’?
   49 | using cycles_t = decltype(__builtin_ia32_rdtsc());
      |                           ^~~~~~~~~~~~~~~~~~~~
      |                           __builtin_va_list

I think that indicates I should be satisfied with the tests and example building and running successfully.


Thanks again for adding these implementations! I will go ahead and close the issue, but feel free to reopen it if there are any more details you want to discuss, or ideas you would like me to test.