Open jcbhmr opened 1 year ago
Hi @jcbhmr ,
Thank you for the detailed analysis!
I'll try to explain how Emception works:
The Reddit comment is correct. But a few things have changed since then:
Now, the high level overview.
Emception is basically Emscripten running in the browser.
Emscripten's entry point (at least for the demo) is em++. It's a Python script.
This script invokes other programs (subprocesses). It invokes the following programs:
clang, lld, and a few other programs from llvm
wasm-opt, wasm-metadce, and a few other programs from binaryen
The interaction between all these processes is through their standard output and files in the filesystem. The only process that spawns subprocesses (and captures their standard output) is Python.
To be able to run Emscripten in the browser we need:
llvm and binaryen programs.
The solution to 1 is easy: "just" compile all the llvm and binaryen programs to WebAssembly using Emscripten. A detail is that there's a lot of shared code between all these programs. To reduce the binary size, it makes sense to compile all of them into a single binary, similar to what busybox does. That's what llvm-box and binaryen-box do. There are a few technical considerations in doing that, but they're not relevant now.
The solution to 2 is easy as well given all the upstream effort of Pyodide before, and more recently of CPython to add WebAssembly as a compilation target using Emscripten.
The solution to 3 is easy as well using QuickJS. The main challenge is to identify the minimum subset of NodeJS libraries required to run the Emscripten JS scripts. That's what quicknode does.
Point 4 is a bit more challenging. Emscripten doesn't try to emulate a multiprocess environment. This means that the system call to start a subprocess (popen) is not available. In turn, that means that the Python interpreter (which is, like everything else, compiled using Emscripten) won't be able to start subprocesses. To work around this, Emception adds a new native module to CPython to execute JavaScript code. Then a sitecustomize.py script patches Python's Popen class to execute JavaScript code to start a subprocess instead of using the popen system call as it normally would. To run a new process, the JavaScript code basically checks, based on the command line, which program needs to be run, and executes the corresponding WebAssembly module. It also does a bit of setup, like populating argc and argv, and populating the environment variables inherited from the parent process.
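The Popen patching described above can be sketched roughly as follows. This is a minimal illustration and not Emception's actual sitecustomize.py: run_in_host, FAKE_MODULES, PatchedPopen and install are hypothetical names, and a plain Python dictionary stands in for the JavaScript side that would pick and run the right WebAssembly module.

```python
import os
import subprocess

# Hypothetical registry standing in for the JavaScript side: it maps a program
# name to a callable that "runs" the corresponding WebAssembly module.
FAKE_MODULES = {
    "clang": lambda argv, env: (0, f"clang ran with {argv[1:]}\n"),
}


def run_in_host(argv, env):
    """Stand-in for the native CPython module that executes JavaScript."""
    program = argv[0].rsplit("/", 1)[-1]  # choose the module from the command line
    handler = FAKE_MODULES.get(program)
    if handler is None:
        return 127, f"{program}: command not found\n"
    return handler(argv, env)


class PatchedPopen:
    """Minimal Popen look-alike covering only what this sketch needs."""

    def __init__(self, args, env=None, **_ignored):
        argv = args.split() if isinstance(args, str) else list(args)
        self.args = argv
        # Inherit the parent's environment unless one is given explicitly.
        child_env = dict(os.environ if env is None else env)
        self.returncode, self._stdout = run_in_host(argv, child_env)

    def communicate(self, input=None, timeout=None):
        return self._stdout, ""

    def wait(self, timeout=None):
        return self.returncode

    def poll(self):
        return self.returncode


def install():
    """What a sitecustomize.py could do at interpreter start-up (assumption)."""
    subprocess.Popen = PatchedPopen
```

The "process" here runs synchronously inside the constructor, which matches the blocking execution model described in this thread; a real replacement would have to cover much more of the Popen interface.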
Point 5 is a problem because each Emscripten module (i.e., the Python interpreter, quicknode, llvm-box, etc.) will execute using its own virtual file system. For all of this to work, they need to share the same file system. The solution is to run an initial module to create a virtual file system. All other modules use an Emscripten JavaScript library (emlib/fsroot.js) to mount the initial module's file system as their root file system.
Finally, point 6 is where all the packs come in. You can think of a pack as a homebrew zip file (more like a tar file). That's what wasm-package does. It "packs" the files on the host, and then it "unpacks" them in the browser. On the host it creates a package containing all the files and directories needed to run Emception, mainly:
To save time when compiling, Emception also ships the precompiled standard libraries. That's the Emscripten cache you mentioned. But the cache takes a lot of space, and most likely you won't need every single cached library. To work around that problem Emception uses a lazy cache: the files are only downloaded when they are needed. The lazy cache code is based on Emscripten's own createLazyFile function.
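The idea behind the lazy cache can be illustrated with a small sketch. It is not Emception's code (which is JavaScript built on createLazyFile); LazyFile, LazyCache and the fetch callback are hypothetical names, and fetch stands in for whatever performs the actual download.

```python
from typing import Callable, Dict, Iterable, Optional


class LazyFile:
    """File entry whose bytes are fetched on first read and then kept."""

    def __init__(self, path: str, fetch: Callable[[str], bytes]):
        self.path = path
        self._fetch = fetch
        self._data: Optional[bytes] = None

    def read(self) -> bytes:
        if self._data is None:  # first access triggers the download
            self._data = self._fetch(self.path)
        return self._data


class LazyCache:
    """Registers every cached library up front but downloads none of them."""

    def __init__(self, paths: Iterable[str], fetch: Callable[[str], bytes]):
        self._files: Dict[str, LazyFile] = {p: LazyFile(p, fetch) for p in paths}

    def read(self, path: str) -> bytes:
        return self._files[path].read()
```

Every path is known from the start, so the compiler can see that a library exists, but only the libraries a given compilation actually opens cost a download.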
I think the working and usr_bin packages could be removed. The usr_bin package just links paths with arguments for the "boxed" programs, but that can be easily embedded in the JavaScript glue code. The only remaining package is the wasm package, and the reason for that is compression.
There's a lot of cp-ing around when creating the packages. That can certainly be improved. The current design is that the make.sh scripts create the folder structure that should go in the package. The package.sh script wraps make.sh, and also creates the package using wasm-package.
Emception uses brotli for compression. Brotli is a compression algorithm (like zip), but brotli can achieve a much higher compression ratio in this case. Emception is hosted on GitHub Pages. Unfortunately, GitHub Pages doesn't support brotli-precompressed assets. Because of that, Emception tries to ship as many assets as possible inside a brotli-compressed package file. This includes the WebAssembly files, and that's why the wasm package exists.
Finally, since the brotli compression doesn't come from the web server, the native decompressor in your browser won't decompress the package. That's why Emception ships a brotli decompressor.
All of this is brought together in the demo project.
Another point is that Emception executes in a blocking manner, and the execution can take a little while. To avoid blocking the main browser thread, it runs the code in a WebWorker, and uses comlink to simplify the interaction with it.
I hope that was helpful and answered your questions! I'll keep the issue open in case you have further questions.
Thanks for your detailed response!
The solution to 1 is easy, "just" compile all the llvm and binaryen programs to WebAssembly using Emscripten. A detail is that there's a lot of shared code between all these programs. To reduce the binary size, it makes sense to compile all of them to a unique binary similar to what busybox does. That's what llvm-box and binaryen-box do. There are a few technical considerations to do that, but that's not relevant now.
If I were to, say, fork the llvm-project repo and add in the llvm-project.patch changes, what else would I need to do in the pipeline to get an llvm-box.wasm file? Is there even an llvm-box.wasm file that gets generated somehow? That seems like a good place to start breaking things apart into constituent projects. https://github.com/jcbhmr/llvm-box#readme
First, the patch.
Under some circumstances clang will create a subprocess. This is not great for WebAssembly. The patch removes the if (!InProcess) check to prevent that behaviour.
I don't know if there are any negative side effects to doing this. It seems to work fine with the way Emscripten uses clang. See the comments around that code; maybe they will shed some light on it.
I also don't know if there are other invocations of clang that could create a subprocess. It seems there are none in the way emscripten uses clang. At least I haven't found any so far.
The second part of the patch is less important. Clang likes to append its major version to the binary name. When running with emcmake that meant it would generate files named clang.js-14 or clang.js-15. The patch simply removes the version number from there so that it becomes clang.js.
Then, on to building llvm. Some parts of llvm's code are autogenerated during the build process. To be able to do that, llvm uses llvm-tblgen and clang-tblgen. You first need to compile these two tools on the host before compiling for WebAssembly. You can tell llvm where these tools are using the corresponding CMake variables.
If you try to compile now, after quite a while it will fail. That's because llvm uses a system call called wait4. But for some reason, instead of including the header that declares the function, they decided to predeclare the function themselves. This is a problem because on Emscripten this function is a define for __syscall_wait4. You can patch the source to fix that. Emception takes the hacky approach of defining wait4=__syscall_wait4 in the compiler flags.
If you try to build now, it should work. You should get one .wasm and one .js file per executable. Yay!
This is where the part of bundling it as one binary starts. After configuring llvm with cmake using Ninja as the build system, Emception runs the script called patch-ninja.sh. That script creates the rules to build llvm-box.
Each llvm executable has its own main function. If we try to compile them all together, there would be multiple conflicting definitions of main. After compiling the .cpp sources to WebAssembly object files and before linking everything together, Emception runs the wasm-transform tool. That tool analyses the object file and renames main to add a hash so that each executable will have a uniquely named main function.
Moreover, each executable uses some global state, which is initialized even before main is called. This is done through functions specially tagged to run before main. wasm-transform removes that tag, turning them into standard functions.
Finally, a new main function is created for llvm-box. This function uses argv[0] to identify which executable you actually wanted to call. It then calls, in order, the (no longer specially tagged) functions that initialize the global state for that executable, and finally it calls the renamed main for the target executable.
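The dispatch logic of the new main can be sketched as follows. This only illustrates the control flow; the real work happens on WebAssembly object files at link time. Box, register and unique_main_name are hypothetical names, and the SHA-256 prefix is an arbitrary stand-in, since the thread doesn't say which hash wasm-transform uses.

```python
import hashlib
import posixpath


def unique_main_name(tool: str) -> str:
    """Derive a per-tool name for main (the hashing scheme is an assumption)."""
    digest = hashlib.sha256(tool.encode()).hexdigest()[:8]
    return f"main_{digest}"


class Box:
    """Busybox-style multiplexer: one binary, many tools selected by argv[0]."""

    def __init__(self):
        self._mains = {}  # tool name -> (renamed main symbol, main callable)
        self._inits = {}  # tool name -> former "run before main" functions

    def register(self, tool, main, inits=()):
        self._mains[tool] = (unique_main_name(tool), main)
        self._inits[tool] = list(inits)

    def run(self, argv):
        tool = posixpath.basename(argv[0])  # argv[0] selects the executable
        if tool not in self._mains:
            return 127
        for init in self._inits[tool]:  # global state first, in order
            init()
        _symbol, main = self._mains[tool]
        return main(argv)  # then the renamed main of the target tool
```

Only the selected tool's initializers run, so the global state of the other bundled executables is never set up.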
All of this should work as long as the object files are WebAssembly files. This is not true if you build with lto enabled. That's not a problem with llvm, as it's not enabled by default, but it is a problem with binaryen, which enables lto by default, producing llvm-ir object files instead of WebAssembly. For that case, we simply patch the ninja file to disable lto.
If you want an independent llvm-box, I think the most sensible thing to do would be to also split out wasm-transform, since you will need that tool.
The patching of the ninja files could be avoided by adding some custom cmake build steps that generate equivalent rules. That would be much cleaner.
Again, hope that helps!
Your llvm fork is looking neat!
In the build-llvm.sh script, there's this proxyfs.js thing; what is that?
@jcbhmr it's coming from this one https://emscripten.org/docs/api_reference/Filesystem-API.html#proxyfs
@jprendes What does the patch-ninja.sh script do? https://github.com/jprendes/emception/blob/master/patch-ninja.sh The gist that I was able to read from it is that it dynamically somehow creates a new llvm-box target in the generated Ninja stuff that comes from CMake.
If I wanted to add a target to CMakeLists.txt, what would I need to do to replicate the llvm-box target from Ninja in CMake? How would I go about doing that? I assume I'd add something to the bottom of this file https://github.com/jcbhmr/llvm-box/blob/get-it-working/llvm/CMakeLists.txt#L1301 like
add_custom_target(my_custom_target
  DEPENDS
    "${CMAKE_CURRENT_BINARY_DIR}/generated_file"
)
add_custom_command(
  OUTPUT
    "${CMAKE_CURRENT_BINARY_DIR}/generated_file"
  COMMENT
    "This is generating my custom command"
  COMMAND
    ${CMAKE_COMMAND} -E touch ${CMAKE_CURRENT_BINARY_DIR}/generated_file
  DEPENDS
    ${CMAKE_CURRENT_SOURCE_DIR}/source_file
)
I got the gist of how to make a cmake custom target thing from this https://dev.to/iblancasa/learning-cmake-3-understanding-addcustomcommand-and-addcustomtarget-43gp
@jcbhmr it's coming from this one https://emscripten.org/docs/api_reference/Filesystem-API.html#proxyfs
@pmp-p btw I never thanked you; thank you this answered that question!
@jcbhmr, I've created the new-build-system branch.
The branch shows you how to get rid of patch-ninja.sh and do everything from CMake.
The changes there are still missing some compiler and linker flags, but should give you a good starting point.
Hello! 👋
First, some context: I want to have a C++ => WASM demo that runs in the browser. All the options (listed below) seem to be in various states of decay.
My benchmark of "working" is that tool $X can compile this and maybe run it (depending on the tool).
Here's my non-exhaustive list of C++ => WASM things that I found and tried
std::cout + std::string test
#include <string> causes it to choke
I am interested in learning more about how this project works in order to (hopefully) get some kind of higher-level API working. #7
How does this project work?
This is the big question that I have. Here's what I've gathered so far. I'd love more input and commentary of how this all works, the pain points, all of it. Tell me everything you tell your rubber ducky. 🦆
Working backwards from the demo:
I also did some digging and found this quote on reddit:
— u/jprendes from I made Emception