jbaldwin / libcoro

C++20 coroutine library
Apache License 2.0

Changes to be able to build to webassembly with emscripten. #201

Closed bruno-j-nicoletti closed 11 months ago

bruno-j-nicoletti commented 1 year ago

The main changes to get it to build with emscripten were...

To build it you need to download the emsdk and set up your shell's paths appropriately. Then go...

mkdir build-emscripten-release
cd build-emscripten-release
emcmake cmake -GNinja -DCMAKE_BUILD_TYPE=Release ..
ninja
jbaldwin commented 11 months ago

I've ignored the file in codacy: include/coro/detail/tl_expected.hpp

jbaldwin commented 11 months ago
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh

build per bruno's steps and then run tests via: node ./test/libcoro_test.js

jbaldwin commented 11 months ago

@bruno-j-nicoletti can you add one more item to this PR which is adding a CI run to build and compile with emscripten and run the tests please? I was able to get it to work locally so it shouldn't be too difficult to automate it.

bruno-j-nicoletti commented 11 months ago

I've added a CI pass for webassembly, and I've had to tweak test_task.cpp as em++ seems to be adding padding to std::detail::promise.

jbaldwin commented 11 months ago

I'll rebase once it completes and run the test suite one more time. This is an awesome change :)

jbaldwin commented 11 months ago

There seems to be a new flaky test that's been introduced. It doesn't really seem to be anything related to your changes... but clang-16 just doesn't want to finish, it seems.

bruno-j-nicoletti commented 11 months ago

The only change I could have made that might have done this is the flip to use std::thread over std::jthread. I could put the compile time switch back in to see if that works.
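For context, a minimal sketch of the behavioural difference, with illustrative names rather than libcoro's actual worker code: std::jthread requests stop and joins in its destructor via its stop_token, while std::thread needs an explicit shutdown flag and a manual join.

#include <atomic>
#include <stop_token>
#include <thread>

// std::jthread: the destructor calls request_stop() and join() for us.
void jthread_worker_sketch()
{
    std::jthread worker{[](std::stop_token token) {
        while (!token.stop_requested()) { /* process work */ }
    }};
} // worker is stopped and joined here automatically

// std::thread: we must signal shutdown ourselves and join explicitly.
void thread_worker_sketch()
{
    std::atomic<bool> shutdown_requested{false};
    std::thread worker{[&]() {
        while (!shutdown_requested.load()) { /* process work */ }
    }};
    shutdown_requested = true; // defaults to seq_cst
    worker.join();
}

int main() { jthread_worker_sketch(); thread_worker_sketch(); }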

jbaldwin commented 11 months ago

Yeah, I really don't see why your change would be causing this. I'm going to try to replicate it locally...

jbaldwin commented 11 months ago

Running Ubuntu 22.04 with g++-12 doesn't hang like it does here. I might try a docker container with more constrained resources... but it's weird too, because other PRs based off main are passing just fine.

jbaldwin commented 11 months ago

TEST_CASE("ring_buffer many elements many producers many consumers", "[ring_buffer]") This is most likely the test case failing, its using a ring_buffer with a thread_pool and has a high level of concurrency.

jbaldwin commented 11 months ago

I'll try running CI on main and see if it replicates but so far I haven't seen this on anything merged recently which makes me think it's something with the std::thread change.

jbaldwin commented 11 months ago

Main CI has passed twice with no problems, going to run it a 3rd time.

Fedora 36 has a dnf package manager problem atm, so until they fix that it will fail every time.

I've looked over the new failing ones and they've failed on a new spot... I'll try and take a deeper look soon.

jbaldwin commented 11 months ago

3rd time also passed, just weird package manager problems :roll_eyes:

jbaldwin commented 11 months ago

So I had a bunch of fun with this today and I think I found the bug. It was what I originally thought, but acquire and release don't quite work how I thought they did :frowning:. If you switch to seq_cst I believe your PR should start passing, but there are two other places that need it as well: in io_scheduler::shutdown() where it exchanges m_shutdown_requested, and the same place in thread_pool::shutdown() with the same variable name. So those two, plus the two I already had you change in the executor function. I can push a mirrored PR of your branch with the changes if that is easier.

I believe the jthread stop_token was issuing a memory barrier that propagated the updated variables across threads, so we just couldn't see this until now, when that stopped happening.
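As a rough illustration (not the exact libcoro code; the member and function names follow the comment above), the change amounts to dropping the explicit acquire/release orderings on the shutdown flag and letting the exchange/load default to seq_cst:

#include <atomic>

struct shutdown_flag_sketch
{
    std::atomic<bool> m_shutdown_requested{false};

    // Called from io_scheduler::shutdown() / thread_pool::shutdown().
    void shutdown()
    {
        // Previously an acquire/release ordering was passed here (per the discussion above).
        if (m_shutdown_requested.exchange(true) == false) // defaults to seq_cst
        {
            // wake and join the worker threads...
        }
    }

    // Checked from the executor loop.
    bool should_stop()
    {
        return m_shutdown_requested.load(); // defaults to seq_cst
    }
};

int main() { shutdown_flag_sketch f; f.shutdown(); }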

bruno-j-nicoletti commented 11 months ago

I've just made the memory order changes you suggested and pushed it. Getting relaxed memory ordering right is hard, which is why most things I've read say stick to the default of seq_cst.

bruno-j-nicoletti commented 11 months ago

I've made the changes you suggested (could you double check please) as well as rebased. I made the tests spit out XML so we can see which tests are hanging. This caused a SEGV on top of hanging. It still hangs in the following tests...

On Fedora there is a SEGV during the coverage tests in "benchmark tcp_server echo server thread pool".

I can't see a pattern in any of that.

jbaldwin commented 11 months ago

I stepped through it in gdb and the m_shutdown_requested field was getting set from the test, but it wasn't propagating to the other thread_pool threads for the tests that hang, so the thread_pools never shut down. That's why using seq_cst should in theory fix it... but clearly there's something deeper wrong in the solution that was being hidden.

The changes you made look right for what I tested locally. I was able to reproduce it sometimes if I ran it enough times, and seq_cst seemed to fix it, but clearly the GitHub runners disagree.

I'm going to open another PR that only does the seq_cst and std::thread change so we can maybe isolate it a bit more.

jbaldwin commented 11 months ago

I've got a working version of coro::thread_pool using std::thread here. We can merge that in and then you can rebase; I'm pretty confident this is the issue and it should resolve the problems on your PR.
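Roughly, the std::thread version of the pool has to own the join itself. A condensed sketch, simplified and not the merged coro::thread_pool implementation:

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

class thread_pool_sketch
{
public:
    explicit thread_pool_sketch(std::size_t thread_count)
    {
        for (std::size_t i = 0; i < thread_count; ++i)
            m_threads.emplace_back([this]() { executor(); });
    }

    ~thread_pool_sketch() { shutdown(); }

    void shutdown()
    {
        // seq_cst exchange so every worker observes the flag.
        if (m_shutdown_requested.exchange(true) == false)
        {
            for (auto& t : m_threads)
                if (t.joinable())
                    t.join(); // std::thread never joins itself in its destructor
        }
    }

private:
    void executor()
    {
        while (!m_shutdown_requested.load())
        {
            // pop queued coroutines and resume them...
        }
    }

    std::vector<std::thread> m_threads;
    std::atomic<bool>        m_shutdown_requested{false};
};

int main()
{
    thread_pool_sketch pool{4};
    pool.shutdown();
}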

bruno-j-nicoletti commented 11 months ago

Do you want me to formally review your code? It looks sane to me (and even better, looks sane to the tests.)

jbaldwin commented 11 months ago

Great, thanks for looking. I've merged it if you want to rebase and :crossed_fingers: this PR starts working!

bruno-j-nicoletti commented 11 months ago

Almost there! There is still a SEGV failure in the coverage test on Fedora. This only appeared when I tweaked the unit test to output via an XML reporter. I could change the reporter back to 'console', but it really shouldn't be crashing if you change the reporter.

jbaldwin commented 11 months ago

The failed test on Fedora 32 is the SSL/TLS bug I'm working through, so I retriggered that run.

*edit: I just saw your message, we can remove the XML reporting, or make it an option.

jbaldwin commented 11 months ago

I also left just a few comments to really clean it up if you don't mind taking a look at them.

bruno-j-nicoletti commented 11 months ago

I've removed the XML reporter and fixed the bits you requested. I can't see for the life of me why pre-commit keeps doing weird things to the readme.

jbaldwin commented 11 months ago

Heck yeah, it's passing :D :D

jbaldwin commented 11 months ago

There are a couple more seq_cst changes still, can you undo those across all the files please?

bruno-j-nicoletti commented 11 months ago

Removed all the extraneous occurrences of seq_cst as requested.

Parallel programming is hard.

jbaldwin commented 11 months ago

apt failed on Ubuntu 22.04, I'll re-run it.

bruno-j-nicoletti commented 11 months ago

22.04 dies on installing requirements via apt.

jbaldwin commented 11 months ago

Yeah, it's a weird but routine failure; I don't know why the package managers are so flaky, but I see this all the time on lots of projects. The only way to really get around it is either self-hosting or baking the installs into a Dockerfile (which I don't really want to maintain).

bruno-j-nicoletti commented 11 months ago

Looks like it's done then. Thanks for all the effort you put into figuring it out, and for the library in the first place.

jbaldwin commented 11 months ago

And we're merged! Nice job, it's a great addition to support this.

bruno-j-nicoletti commented 11 months ago

Hurrah!