How does this project work?

jcbhmr commented 1 year ago

Hello! 👋

First, some context: I want to have a C++ => WASM demo that runs in the browser. All the options (list below) seem to be in various states of decay.

My benchmark of "working" is that tool $X can compile this and maybe run it (depending on the tool).

#include <string>
#include <iostream>

int main() {
  std::string message = "Hello world!";
  std::cout << message << "\n";
}

Here's my non-exhaustive list of C++ => WASM things that I found and tried

https://wasdk.github.io/WasmFiddle/ -- can't do C++ STL things?
https://mbebenita.github.io/WasmExplorer/ -- seems stuck on Clang v5; produces WAT
https://wapm.io/taybenlor/runno-clang -- couldn't get working in wapm.io's shell
https://wapm.io/syrusakbary/clang -- I think this is the "original"; also couldn't get to work on wapm.io
https://tbfleming.github.io/cib/ -- works for builtin examples; fails with my std::cout + std::string test
http://kripken.github.io/llvm.js/demo.html -- doesn't do C++, only LLVM IR
https://binji.github.io/wasm-clang/ -- works! is quite old though (3 years last update)
https://github.com/soedirgo/llvm-wasm -- demo works, my #include <string> causes it to choke
https://github.com/jprendes/emception (this project) -- works!

I am interested in learning more about how this project works in order to (hopefully) get some kind of higher-level API working. #7

How does this project work?

This is the big question that I have. Here's what I've gathered so far. I'd love more input and commentary of how this all works, the pain points, all of it. Tell me everything you tell your rubber ducky. 🦆

Working backwards from the demo:

There's an index.html file that imports JS from main.bundle.js
That index.html was just autogenerated by Webpack, so we focus on the webpack config
The webpack config says that demo/index.js is the entry
demo/index.js imports emception.worker.js that uses Comlink to expose WebWorker functions
emception.worker.js starts some other processes namely llvm-box, binaryen-box, node, python, main-python
emception.worker.js exposes a .run() function that starts a Python em++ (it's a variable, but em++ is the only thing that index.js requests)
emception.worker.js also has some _run_process_impl() magic?
emception.worker.js imports the xyzProcess classes from "emception", so we look in package.json
Turns out the demo/ folder is the package "emception", so I guess we look for things that copy files into demo/?
Actually it turns out when you search for "demo" in all files with VS Code, there's a webpack alias for "emception" to point to ../build/emception!
Now we go looking for build/emception references...
Start looking in build-emception.sh which does a lot of fs stuff with cp
build-emception.sh only calls one other interesting script: build-packs.sh
build-packs.sh looks like it just goes down each dir in packs/ and executes the package.sh script in each
The first packs/emscripten/package.sh runs its own make.sh script
The make.sh script downloads v3.1.24 of emscripten-core and cds into the dir
It patches emscripten with some custom magic
It pulls the .cache folder out of the emsdk docker image
It creates a lazy cache module of some kind?
Then we move back to emception/package.sh which runs a wasm-package binary to presumably pack emception
I don't know what the wasm-package.cpp file does?
The next one in build-packs.sh is packs/usr_bin/package.sh, which appears to just somehow group a bunch of llvm-related binaries into a single exe? Is that what the wasm-package thing does?
Next is packs/wasm/package.sh which just copies some stuff?
Next is packs/cpython/package.sh which should be interesting... Nope it just does some cp, then runs the same wasm-package thing.
Last packs/working/package.sh which I don't know what it does? Does it compile example C++ code?
Turns out build-emception.sh was the last in a line of other build-xyz scripts to get executed.
The most interesting one is build-quicknode.sh which appears to build a super hacky version of the Node.js runtime into a wasm binary.
That's as far as I can get...

I also did some digging and found this quote on reddit:

Emscripten runs (mainly) on Python and Node, and internally uses clang/llvm and binaryen.

For Python, Emception uses a patched version of Pyodide, which is based on Cpython compiled with Emscripten.

For Node, Emception uses very hacky JS code in the browser.

For clang/llvm and binaryen, they are compiled to WebAssembly with Emscripten, with some hacks to reduce binary size.

— u/jprendes from I made Emception

jprendes commented 1 year ago

Hi @jcbhmr ,

Thank you for the detailed analysis!

I'll try to explain how emception works:

The reddit comment is correct. But since then a few things changed:

Emception now uses upstream CPython instead of Pyodide.
Emception is now using a build of QuickJS instead of (very) hacky JS code to emulate NodeJS. It implements the minimum required libraries to get Emscripten running.

Now, the high level overview.

Emception is basically Emscripten running in the browser. Emscripten's entry point (at least for the demo) is em++. It's a python script. This script invokes other programs (subprocesses). It invokes the following programs:

other python scripts
clang, lld, and a few other programs from llvm
some NodeJS scripts
wasm-opt, wasm-metadce and a few other programs from binaryen

The interaction between all these processes is through their standard output, and files in the filesystem. The only process that spawns subprocesses (and captures their standard output) is python.
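To make that interaction concrete, here is a minimal Python sketch of a driver script that spawns a tool, captures its standard output, and then reads a file the tool left on disk. The tool is a throwaway Python one-liner and the file name `out.txt` is made up for the example; it is not one of Emscripten's real subprocesses.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in "tool": writes its result to a file and reports on stdout.
# (Hypothetical; a real run would invoke clang, wasm-opt, and so on.)
TOOL_SOURCE = (
    "import sys, pathlib\n"
    "pathlib.Path(sys.argv[1]).write_text('compiled')\n"
    "print('wrote', pathlib.Path(sys.argv[1]).name)\n"
)


def drive(workdir):
    """Spawn the tool, capture its stdout, then read the file it produced."""
    out_file = Path(workdir) / "out.txt"
    result = subprocess.run(
        [sys.executable, "-c", TOOL_SOURCE, str(out_file)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip(), out_file.read_text()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(drive(tmp))
```

The parent only ever sees two things from the child: what it printed and what it wrote to the shared directory, which is why a shared filesystem and a working subprocess mechanism are both needed.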

To be able to run Emscripten in the browser we need:

1. all the required llvm and binaryen programs.
2. a python interpreter.
3. NodeJS. Or a JS runtime with a subset of NodeJS's libraries.
4. a way to allow python to spawn subprocesses and capture their output.
5. a way to share a filesystem between all the processes.
6. a way to populate the filesystem with the required files (i.e., all the python scripts, NodeJS scripts, all C++ header libraries, etc.)

The solution to 1 is easy, "just" compile all the llvm and binaryen programs to WebAssembly using Emscripten. A detail is that there's a lot of shared code between all these programs. To reduce the binary size, it makes sense to compile all of them to a unique binary similar to what busybox does. That's what llvm-box and binaryen-box do. There are a few technical considerations to do that, but that's not relevant now.

The solution to 2 is easy as well given all the upstream effort of Pyodide before, and more recently of CPython to add WebAssembly as a compilation target using Emscripten.

The solution to 3 is easy as well using QuickJS. The main challenge is to identify the minimum subset of NodeJS libraries required to run the Emscripten JS scripts. That's what quicknode does.

Point 4 is a bit more challenging. Emscripten doesn't try to emulate a multiprocess environment. This means that the system call to start a subprocess (popen) is not available. In turn, that means that the python interpreter (which is, like everything else, compiled using Emscripten) won't be able to start subprocesses. To work around this, Emception adds a new native module to CPython to execute JavaScript code. Then a sitecustomize.py script patches python's Popen class to execute JavaScript code to start a subprocess instead of using the popen system call as it would normally do. To run a new process, the JavaScript code basically checks, based on the command line, what program needs to be run, and executes the corresponding WebAssembly module. It also does a bit of setup, like populating argc and argv, and populating the environment variables inherited from the parent process.
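The general shape of that Popen patch can be sketched in plain Python. This is only an illustration of the idea, not the contents of Emception's sitecustomize.py: the `PROGRAMS` registry, `InProcessPopen`, and `run_in_process` are hypothetical names, and ordinary Python callables stand in for the JavaScript code that would run a WebAssembly module.

```python
import io
import os
import subprocess
from unittest import mock

# Hypothetical registry: program name -> callable(argv, env, stdout) -> exit code.
PROGRAMS = {}


def program(name):
    """Register a callable as the in-process implementation of `name`."""
    def register(func):
        PROGRAMS[name] = func
        return func
    return register


@program("echo")
def echo(argv, env, stdout):
    stdout.write(" ".join(argv[1:]) + "\n")
    return 0


@program("printenv")
def printenv(argv, env, stdout):
    stdout.write(env.get(argv[1], "") + "\n")
    return 0


class InProcessPopen:
    """Minimal stand-in for subprocess.Popen that never forks.

    The "process" runs to completion inside the constructor.
    """

    def __init__(self, args, env=None, **kwargs):
        self.args = list(args)
        # Like the real Popen, inherit the parent's environment by default.
        self.env = dict(os.environ if env is None else env)
        self._stdout = io.StringIO()
        # Pick the implementation from the command line (argv[0]).
        handler = PROGRAMS.get(self.args[0])
        if handler is None:
            self.returncode = 127  # conventional "command not found" code
        else:
            self.returncode = handler(self.args, self.env, self._stdout)

    def communicate(self, input=None, timeout=None):
        return self._stdout.getvalue(), ""

    def wait(self, timeout=None):
        return self.returncode


def run_in_process(argv, env=None):
    """Temporarily swap subprocess.Popen for the in-process version."""
    with mock.patch.object(subprocess, "Popen", InProcessPopen):
        proc = subprocess.Popen(argv, env=env, stdout=subprocess.PIPE)
        out, _ = proc.communicate()
    return proc.returncode, out
```

Callers keep using `subprocess.Popen(...)` and `communicate()` as usual; only the thing behind the name changes, which is what lets unmodified Emscripten scripts keep working.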

Point 5 is a problem because each Emscripten module (i.e., the python interpreter, quicknode, llvm-box, etc.) will execute using its own virtual file system. For all of this to work, they need to share the same file system. The solution is to run an initial module to create a virtual file system. All other modules use an Emscripten JavaScript library (emlib/fsroot.js) to mount the initial module's file system as their root file system.
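As a toy model of that arrangement (the real mechanism lives in Emscripten's JavaScript file system layer, not in Python), the sketch below gives every module a private in-memory file system unless it is handed the root module's one. `MemoryFS` and `Module` are invented for the illustration.

```python
class MemoryFS:
    """Toy in-memory file system: a flat mapping from path to bytes."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]

    def exists(self, path):
        return path in self.files


class Module:
    """Stand-in for one Emscripten module (python, quicknode, llvm-box, ...)."""

    def __init__(self, name, root_fs=None):
        self.name = name
        # Without a shared root, each module gets its own private file system.
        self.fs = root_fs if root_fs is not None else MemoryFS()


# The initial module creates the file system; the others mount it as their root.
root = Module("root")
python = Module("python", root_fs=root.fs)
llvm_box = Module("llvm-box", root_fs=root.fs)
isolated = Module("isolated")  # what happens without the shared root

python.fs.write("/working/main.cpp", b"int main() {}")
```

After the write, `llvm_box` can read `/working/main.cpp` because it shares the root, while `isolated` cannot see the file at all.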

Finally, point 6 is where all the packs come in. You can think of a pack as a homebrew zip file (more like a tar file). That's what wasm-package does. It "packs" the files in the host, and then it "unpacks" them in the browser. In the host it creates a package containing all the files and directories needed to run Emception, mainly:

All of Emscripten's python and JS scripts
All of python's standard library files
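The pack/unpack round trip can be sketched with Python's standard `tarfile` module. This is a stand-in: wasm-package uses its own format, which is not described here, and the file names below are invented for the example.

```python
import io
import tarfile


def pack(files):
    """Bundle a {relative_path: bytes} mapping into a single tar blob."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path, data in sorted(files.items()):
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack(blob):
    """Recover the {relative_path: bytes} mapping from a tar blob."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                files[member.name] = tar.extractfile(member).read()
    return files


if __name__ == "__main__":
    blob = pack({"emscripten/em++.py": b"print('hi')"})
    print(sorted(unpack(blob)))
```

In this picture `pack` runs on the host at build time and `unpack` runs in the browser, writing each entry into the shared virtual file system instead of returning a dictionary.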

To save time when compiling, Emception also ships the precompiled standard libraries. That's the Emscripten cache you mentioned. But the cache takes a lot of space, and most likely you won't need every single cached library. To work around that problem Emception uses a lazy cache. The files are only downloaded when they are needed. The lazy cache code is based on Emscripten's own createLazyFile function.
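The lazy behaviour itself is simple to sketch: a file entry that holds no data until the first read. The Python below shows only that idea and is not Emception's code or Emscripten's createLazyFile; `LazyFile`, `make_counting_fetcher`, and the library names are hypothetical.

```python
class LazyFile:
    """File entry whose contents are fetched on first read, then kept."""

    def __init__(self, path, fetch):
        self.path = path
        self._fetch = fetch  # callable(path) -> bytes, e.g. a download
        self._data = None

    @property
    def loaded(self):
        return self._data is not None

    def read(self):
        if self._data is None:
            # First access: this is where the download would happen.
            self._data = self._fetch(self.path)
        return self._data


def make_counting_fetcher(contents):
    """Return a fake fetcher plus the list of paths it was asked for."""
    calls = []

    def fetch(path):
        calls.append(path)
        return contents[path]

    return fetch, calls


if __name__ == "__main__":
    fetch, calls = make_counting_fetcher({"libc.a": b"A", "libcxx.a": b"B"})
    cache = {path: LazyFile(path, fetch) for path in ("libc.a", "libcxx.a")}
    cache["libc.a"].read()
    print(calls)
```

Registering every cached library costs nothing up front; only the ones a given compilation actually opens are ever fetched.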

I think the working and usr_bin packages could be removed. The usr_bin package just links paths with arguments for the "boxed" programs, but that can be easily embedded in the JavaScript glue code. The only remaining package is the wasm package, and the reason for that is compression.

There's a lot of cping around when creating the packages. That can certainly be improved. The current design is that the make.sh scripts create the folder structure that should go in the package. The package.sh wraps make.sh, and also creates the package using wasm-package.

Emception uses brotli for compression. Brotli is a compression algorithm (like zip), but brotli can achieve a much higher compression ratio in this case. Emception is hosted on GitHub Pages. Unfortunately, GitHub Pages doesn't support brotli-precompressed assets. Because of that, Emception tries to ship as many assets as possible inside a brotli-compressed package file. This includes the WebAssembly files, and that's why the wasm package exists.

Finally, since the brotli compression doesn't come from the webserver, the native decompressor in your browser won't decompress the package. That's why Emception ships a brotli decompressor.

All of this is brought together in the demo project.

Another point is that Emception executes in a blocking manner, and the execution can take a little while. To avoid blocking the main browser thread, it runs the code in a WebWorker, and uses comlink to simplify the interaction with it.

I hope that was helpful and answered your questions! I'll keep the issue open in case you have further questions.

jcbhmr commented 1 year ago

Thanks for your detailed response!

The solution to 1 is easy, "just" compile all the llvm and binaryen programs to WebAssembly using Emscripten. A detail is that there's a lot of shared code between all these programs. To reduce the binary size, it makes sense to compile all of them to a unique binary similar to what busybox does. That's what llvm-box and binaryen-box do. There are a few technical considerations to do that, but that's not relevant now.

If I were to, say, fork the llvm-project repo and add in the llvm-project.patch changes, what else would I need to do in the pipeline to get an llvm-box.wasm file? Is there even an llvm-box.wasm file that gets generated somehow? That seems like a good place to start breaking things apart into constituent projects. https://github.com/jcbhmr/llvm-box#readme

jprendes commented 1 year ago

First, the patch.

Under some circumstances clang will create a subprocess. This is not great for WebAssembly. The patch removes the if (!InProcess) check, which prevents that behaviour.

I don't know if there are any negative side effects to doing this. It seems to work fine in the way emscripten uses clang. See the comments around that code; maybe they will shed some light on it.

I also don't know if there are other invocations of clang that could create a subprocess. It seems there are none in the way emscripten uses clang. At least I haven't found any so far.

The second part of the patch is less important. Clang likes to append its major version to the binary name. When running with emcmake that meant it would generate files named clang.js-14 or clang.js-15. The patch simply removes the version number from there so that it becomes clang.js

Then, on to building llvm. Some parts of llvm's code are autogenerated during the build process. To be able to do that llvm uses llvm-tblgen and clang-tblgen. You first need to compile these two tools for the host before compiling for WebAssembly. You can tell llvm where these tools are using the corresponding CMake variables.

If you try to compile now, after quite a while it will fail. That's because llvm uses a system call called wait4. But for some reason, instead of including the header that declares the function, they decided to predeclare the function themselves. This is a problem because on Emscripten this function is a define to __syscall_wait4. You can patch the source to fix that. Emception takes the hacky approach of defining wait4=__syscall_wait4 in the compiler flags.

If you try to build now, it should work. You should get one .wasm and one .js file per executable. Yay!

This is where the part of bundling it as one binary starts. After configuring llvm with cmake using Ninja as the build system, Emception runs the script called patch-ninja.sh. That script creates the rules to build llvm-box.

Each llvm executable has its own main function. If we try to compile them all together, there would be multiple conflicting definitions of main. After compiling the .cpp source to WebAssembly object files and before linking everything together, Emception runs the wasm-transform tool. That tool analyses the object file and renames main to add a hash so that each executable will have a uniquely named main function.
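A toy model of why the rename is needed, with symbol tables reduced to plain Python lists. Everything here is illustrative: the real wasm-transform works on WebAssembly object files, and what it actually feeds into the hash is not stated above, so hashing the object name is an assumption.

```python
import hashlib


def unique_main(object_name):
    """Derive a per-executable replacement name for main (assumed scheme)."""
    digest = hashlib.sha256(object_name.encode()).hexdigest()[:8]
    return f"main_{digest}"


def rename_main(symbols, object_name):
    """Return a copy of the symbol list with main replaced by a unique name."""
    return [unique_main(object_name) if s == "main" else s for s in symbols]


def link(objects):
    """Toy linker: merge symbol lists, rejecting duplicate definitions."""
    defined_in = {}
    for object_name, symbols in objects.items():
        for symbol in symbols:
            if symbol in defined_in:
                raise ValueError(
                    f"duplicate symbol {symbol!r} in "
                    f"{defined_in[symbol]} and {object_name}"
                )
            defined_in[symbol] = object_name
    return defined_in


# Two executables that each define main cannot be linked as they are.
objects = {
    "clang.o": ["main", "clang_helper"],
    "lld.o": ["main", "lld_helper"],
}
renamed = {name: rename_main(syms, name) for name, syms in objects.items()}
```

Calling `link(objects)` raises on the second `main`, while `link(renamed)` succeeds because each entry point now has its own name.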

Moreover, each executable uses some global state, which is initialized even before main is called. This is done through functions specially tagged to run before main. wasm-transform removes that tag, turning them into standard functions.

Finally, a new main function is created for llvm-box. This function uses argv[0] to identify which executable you actually wanted to call. It then calls, in order, the (no longer specially tagged) functions that initialize the global state for that executable, and finally it calls the renamed main for the target executable.
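The dispatch logic of such a combined main can be sketched as follows. The real llvm-box entry point is native code produced by the build; this Python version only mirrors the control flow, and the tables, function names, and `LOG` list are hypothetical.

```python
import os

LOG = []  # records what ran and in which order, for illustration only


def init_clang_options():
    LOG.append("init clang options")


def init_clang_targets():
    LOG.append("init clang targets")


def clang_main(argv):
    LOG.append("clang main")
    return 0


def init_lld_drivers():
    LOG.append("init lld drivers")


def lld_main(argv):
    LOG.append("lld main")
    return 0


# Formerly "run before main" functions, now ordinary functions, per tool.
INITIALIZERS = {
    "clang": [init_clang_options, init_clang_targets],
    "lld": [init_lld_drivers],
}

# The renamed main of each bundled tool.
ENTRY_POINTS = {"clang": clang_main, "lld": lld_main}


def box_main(argv):
    """Pick the tool from argv[0], run its initializers in order, then its main."""
    tool = os.path.basename(argv[0])
    entry = ENTRY_POINTS.get(tool)
    if entry is None:
        return 1  # unknown tool name
    for init in INITIALIZERS.get(tool, ()):
        init()
    return entry(argv)
```

Only the selected tool's initializers run, so invoking the box as `clang` never touches lld's global state.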

All of this should work as long as the object files are WebAssembly files. This is not true if you build with lto enabled. That's not a problem with llvm as it's not enabled by default, but it is a problem with binaryen, which enables lto by default, producing llvm-ir object files instead of WebAssembly. For that case, we simply patch the ninja file to disable lto.

If you want an independent llvm-box, I think the most sensible thing to do would be to also split out wasm-transform, since you will need that tool.

The patching of the ninja files could be avoided by adding some custom cmake build steps that generate equivalent rules. That would be much cleaner.

Again, hope that helps! Your llvm fork is looking neat!

jcbhmr commented 1 year ago

In the build-llvm.sh script, there's this proxyfs.js thing; what is that?

https://github.com/jprendes/emception/blob/366065547b1a59cb58011ed19aedce70c3bcbd2b/build-llvm.sh#L56

pmp-p commented 1 year ago

@jcbhmr it's coming from this one https://emscripten.org/docs/api_reference/Filesystem-API.html#proxyfs

jcbhmr commented 1 year ago

@jprendes What does the patch-ninja.sh script do? https://github.com/jprendes/emception/blob/master/patch-ninja.sh The gist that I was able to read from it is that it dynamically somehow creates a new llvm-box target in the generated Ninja stuff that comes from CMake.

If I wanted to add a target to CMakeLists.txt, what would I need to do to replicate the llvm-box target from Ninja in CMake? How would I go about doing that? I assume I'd add something to the bottom of this file https://github.com/jcbhmr/llvm-box/blob/get-it-working/llvm/CMakeLists.txt#L1301 like

add_custom_target(my_custom_target
    DEPENDS
        "${CMAKE_CURRENT_BINARY_DIR}/generated_file"
)

add_custom_command(
    OUTPUT
        "${CMAKE_CURRENT_BINARY_DIR}/generated_file"
    COMMENT
        "This is generating my custom command"
    COMMAND
        ${CMAKE_COMMAND} -E touch ${CMAKE_CURRENT_BINARY_DIR}/generated_file
    DEPENDS
        ${CMAKE_CURRENT_SOURCE_DIR}/source_file
)

I got the gist of how to make a cmake custom target thing from this https://dev.to/iblancasa/learning-cmake-3-understanding-addcustomcommand-and-addcustomtarget-43gp

jcbhmr commented 1 year ago

@jcbhmr it's coming from this one https://emscripten.org/docs/api_reference/Filesystem-API.html#proxyfs

@pmp-p btw I never thanked you; thank you this answered that question!

jprendes commented 1 year ago

@jcbhmr , I've created the new-build-system branch. The branch shows you how to get rid of patch-ninja.sh and do everything from CMake. The changes there are still missing some compiler and linker flags, but should give you a good starting point.

jprendes / emception

How does this project work? #15