Cheerp / Emscripten size comparison

leaningtech / cheerp-meta

Cheerp - a C/C++ compiler for Web applications - compiles to WebAssembly and JavaScript

https://labs.leaningtech.com/cheerp


Cheerp / Emscripten size comparison #76

Closed kripken closed 5 years ago

kripken commented 6 years ago

Hi Cheerp devs :)

I see your website says "30% smaller than Emscripten" so I was curious to measure that. Running emscripten's tests/test_benchmark.py script, I see the following results which are very different:

Emscripten output size is smaller than Cheerp's on all the testcases. (The script in that link measures size by adding the size of the .wasm and .js files together, but it's also true when just looking at the wasm.)
For example, emscripten is 43% smaller on test_primes and 27% smaller on test_memops.
Looking at the largest comparable test, test_box2d, emscripten is 23% smaller there.

That's for size. Speed-wise, the results are a mix (some a little faster, some a little slower, some about the same).

Several of the tests hit problems when running the Cheerp output:

test_base64 hits Error: this should be unreachable.
test_fasta_float hits RuntimeError: index out of bounds.
test_havlak hangs.
test_box2d hits RuntimeError: indirect call signature mismatch.

I also couldn't get some tests to build in Cheerp: test_linpack, test_bullet, test_lua_binarytrees, test_lua_scimark, test_zlib. For example test_zlib says

Intrinsic name not mangled correctly for type arguments! Should be: llvm.cheerp.cast.user.p0struct._Z14internal_state.1.p0i8
%struct._Z14internal_state.1* (i8*)* @llvm.cheerp.cast.user.p0struct._Z14internal_state.p0i8

Maybe the script doesn't build them right? It calls Cheerp's llvm-ar etc., and the commands work with emscripten and native builds, but maybe something more needs to be done for Cheerp? Or am I hitting Cheerp limitations - would zlib, Lua, etc. need to be ported to Cheerp first?

This is on Cheerp 1516806243-1~xenial (which is the latest nightly build I see for Xenial, from Jan 24) and emscripten 1.37.36 (last tagged release, from Mar 13).

These results are very different from the ones you reported, perhaps we are not measuring the same thing somehow? (All the details of how I got the measurements mentioned above are in the linked script that runs those benchmarks, I basically just ran that script as-is except for uncommenting the line to enable running Cheerp.) Or maybe our results are about different versions?

alexp-sssup commented 6 years ago

Hello Alon. Sorry for the delay, we needed some time to reproduce results and gather all required data.

We have been pleasantly surprised that Cheerp is now integrated in emscripten's benchmarks. We have been maintaining our own branch of emscripten to run tests, which we have now rebased on 1.37.36 and released. You can find it here.

About size results, the differences arise from the test cases being slightly different. I will focus on test_primes and test_memops in this discussion; other benchmarks are available in the branch linked above.

To reproduce our results you need to simply remove printf statements from the test source. Both in test_primes and test_memops printf is used to report an invalid test size and to print the test results. In our modified version we remove the printf calls and directly return the numerical result of the test. As an example here is the source of memops with our changes:

      #include <stdio.h>
      #include <string.h>
      #include <stdlib.h>
      int main(int argc, char **argv) {
        int N, M;
        int arg = argc > 1 ? argv[1][0] - '0' : 3;
        switch(arg) {
          case 0: return 0; break;
          case 1: N = 1024*1024; M = 55; break;
          case 2: N = 1024*1024; M = 400; break;
          case 3: N = 1024*1024; M = 800; break;
          case 4: N = 1024*1024; M = 4000; break;
          case 5: N = 1024*1024; M = 8000; break;
          default: /*printf("error: %d\\n", arg);*/ return -1;
        }

        int final = 0;
        char *buf = (char*)malloc(N);
        for (int t = 0; t < M; t++) {
          for (int i = 0; i < N; i++)
            buf[i] = (i + final)%256;
          for (int i = 0; i < N; i++)
            final += buf[i] & 1;
          final = final % 1000;
        }
        //printf("final: %d.\\n", final);
        return final;
      }

Of course, we use the same modified sources when building with Emscripten and with Cheerp.

About larger scale tests, llvm-ar is not supported, we link libraries using llvm-link as documented here.

Various test cases need some patching to run with Cheerp. The branch we published contains all required patches. It should be noted that the tests were originally ported to be compiled to plain JS (Cheerp genericjs mode), so the patches could be heavily reduced if only the wasm target is of interest.

alexp-sssup commented 6 years ago

As a side note, the webMain code currently generated is invalid. We recommend changing it to something like this: https://bitbucket.org/apignotti/emscripten/commits/60620e0860099f20b8fe5855d7e6272ef4b14b6a?at=cheerp-fixes-2018mar-rebased

kripken commented 6 years ago

Thanks @alexp-sssup !

To reproduce our results you need to simply remove printf statements from the test source

I see, so we are indeed measuring something different.

Yes, good point, when using printf in a benchmark that is just a few lines of code like primes, probably most of the output code size is due to printf itself. printf is still interesting in a way (it is real-world C code), but maybe not that interesting in general.

However, without printf, output from tiny benchmarks like primes ends up being dominated by the runtime overhead, which is also not that interesting in general (since people compiling just a few lines of code is pretty rare - still, I added a primes_nocheck benchmark to our suite to measure that).

Overall, I'm more interested in moderate or large code size projects (the common case that I see among users), like say Box2D. Do you see the same as what I reported on that one (emscripten being 23% smaller than cheerp)?

About larger scale tests, llvm-ar is not supported, we link libraries using llvm-link

I see. So configure/make like say zlib, lua, etc. benchmarks require won't work on cheerp, and I'd need to write a makefile manually if I want those tests to run?

As a side note, the webMain code currently generated is invalid. We recommend changing it to something like this: https://bitbucket.org/apignotti/emscripten/commits/60620e0860099f20b8fe5855d7e6272ef4b14b6a?at=cheerp-fixes-2018mar-rebased

That link requires me to log in, so I can't view it.

alexp-sssup commented 6 years ago

Like primes and memops our version of Box2D is patched. Many of the patches are there for genericjs type safety, but printf is also disabled. From my tests it seems that most of the size difference you measure comes from printf indeed. https://github.com/alexp-sssup/emscripten/commit/8b686069720902aa50ea5b14b69d3af7b4fd393e#diff-d6aa119c750aae03c61eea396a82d07b

With this patch, in my tests, the size between emscripten and cheerp becomes roughly the same. There is a ~2% difference either up or down depending on whether you choose the compressed or uncompressed version. As usual, the patched version is used when compiling both the emscripten and cheerp builds.

configure/make should actually work, by using a wrapper script. This is documented here.

About the link, I pasted the one from our private repos instead of the public one. I apologize. Here is the correct one: https://github.com/alexp-sssup/emscripten/commit/60620e0860099f20b8fe5855d7e6272ef4b14b6a

kripken commented 6 years ago

Like primes and memops our version of Box2D is patched. Many of the patches are there for genericjs type safety, but printf is also disabled. From my tests it seems that most of the size difference you measure comes from printf indeed. alexp-sssup/emscripten@8b68606#diff-d6aa119c750aae03c61eea396a82d07b

Interesting, yes, I see that when I just remove the printf from box2d (using your patch

// Disable printing to stdout for Cheerp and Emscripten.
#define printf(fmt, ...) (0)

) then the cheerp and emscripten sizes become close.

But this seems odd. Why does cheerp go from 150K to 122K just by removing printf - is that expected?

Also, I'm not sure the benchmark is valid without the printing. Without printf, the LLVM optimizer may be able to remove code that we want to execute, but now has no side effects.
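The concern about dead code can be made concrete with a small sketch. It assumes a hypothetical BENCH_PRINTF macro and a hypothetical g_sink variable (neither exists in the emscripten or Cheerp test suites). The macro discards its arguments the way the quoted printf define does, and the volatile store keeps the computed result observable so the optimizer cannot drop the loops that produce it.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical switch (not part of either benchmark suite): set to 0 to
// route BENCH_PRINTF to the real printf again.
#define BENCH_NO_PRINTF 1

#if BENCH_NO_PRINTF
// The preprocessor discards the arguments, so they are never evaluated.
#define BENCH_PRINTF(fmt, ...) ((void)0)
#else
#define BENCH_PRINTF(fmt, ...) std::printf(fmt, __VA_ARGS__)
#endif

// A store to a volatile object is an observable side effect, so the
// optimizer has to keep the computation that produces the stored value.
volatile int g_sink = 0;

// Counts how often counted() actually runs.
int g_calls = 0;
int counted(int v) { ++g_calls; return v; }

// Scaled-down version of the memops loops quoted earlier in the thread.
int memops_like(int n, int rounds) {
  static unsigned char buf[4096];
  assert(n > 0 && n <= 4096);
  int total = 0;
  for (int t = 0; t < rounds; t++) {
    for (int i = 0; i < n; i++) buf[i] = (i + total) % 256;
    for (int i = 0; i < n; i++) total += buf[i] & 1;
    total %= 1000;
  }
  return total;
}

// Runs the loops, "prints" through the macro, and records the result.
int run_and_record(int n, int rounds) {
  int result = memops_like(n, rounds);
  BENCH_PRINTF("final: %d.\n", counted(result));  // dropped: counted() never runs
  g_sink = result;                                // kept: volatile store
  return result;
}
```

With BENCH_NO_PRINTF set to 1, the call to counted() disappears together with the format string, which is the kind of silent removal the comment above worries about. The store to g_sink is one way to keep the result live without pulling in printf.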

If printf is a problem for cheerp, is there some other way to print stuff, that is efficient for you?

configure/make should actually work, by using a wrapper script. This is documented here.

Thanks, but I still can't get it to work. First, --host=cheerp-unknown-none is a specific flag that I guess some projects support? But e.g. zlib (the first I tried) does not. Second, even removing that flag, cheerpwrap doesn't help with the problem of the configure script emitting stuff that uses llvm-ar and other things that don't work in cheerp. (Does cheerpwrap do anything more than the emscripten benchmark runner already does, which is to point CC and CXX at the various cheerp binaries?)

Anyhow, maybe I missed or misunderstood something there. In general, it would be great to have a shared script for these comparisons so we know and agree they are fair - perhaps you want to upstream some of the changes in your fork?

kripken commented 6 years ago

I found some time this weekend to dive into the box2d differences here in more detail.

A large source of differences is in system library code:

The biggest factor is actually c++ rtti. Is that off in Cheerp by default? I don't see a difference when disabling it there. In emscripten, rtti is on by default (same as clang/gcc/etc.), and disabling it shrinks the output from 115k to 96k, at which point emscripten is 21% smaller than Cheerp.
A smaller but still significant source of difference is in malloc. Emscripten by default uses dlmalloc, which is larger than what Cheerp has. Emscripten also has an option to use a smaller custom allocator, which shrinks the output from 115k to 108k. Using the smaller allocator does impact performance on malloc-heavy benchmarks, though. Given Cheerp's malloc is smaller (either because it's compiled differently, or it's something simpler than dlmalloc) I'd guess it would do less well on such benchmarks, like say Havlak. I wasn't able to test that theory though because as mentioned above, Cheerp's output in Havlak hits an infinite loop. Another thing I tried was to make the two compilers use the same dlmalloc for a more direct comparison, but Cheerp fails to compile dlmalloc (with Intrinsic name not mangled correctly for type arguments! Should be: llvm.memcpy.p0union._ZN10_mbstate_tUt_E.p0union._ZN10_mbstate_tUt_E.i64 void (%union._ZN10_mbstate_tUt_E*, %union._ZN10_mbstate_tUt_E*, i64, i32, i1)* @llvm.memcpy.p0union._ZN10_mbstate_tUt74_3_E.p0union._ZN10_mbstate_tUt74_3_E.i64).
As mentioned about malloc, how system libs are compiled also matters. Emscripten's are mostly -O2 and -Os. Looking at Cheerp output, I would guess it is optimized differently, but I'm not sure how.

Overall, system lib differences account for a lot of the size differences between the compilers, but even though that's interesting to know, it's always going to be a tradeoff between compiling for size or speed - if one compiler started to build system libs with -Oz it might emit smaller code but eventually users would notice it isn't as fast, etc. So maybe this isn't that important.

Because of that I also did a dive into the wasm binaries themselves, looking function by function. I focused on the largest functions in box2d, which are

_ZN12b2EPCollider7CollideEP10b2ManifoldPK11b2EdgeShapeRK11b2TransformPK14b2PolygonShapeS7_
_ZN7b2World8SolveTOIERK10b2TimeStep
_ZN8b2Island5SolveEP9b2ProfileRK10b2TimeStepRK6b2Vec2b

Looking at their binary sizes, Emscripten is smaller on all of them, by 18%, 10%, and 23% respectively.

Another way to look at that is to run the Binaryen optimizer on Cheerp output, and it shrinks it by 15%. That's pretty close to the per-function results, which makes sense if the two compilers' output is mostly similar, except that emscripten also runs the Binaryen optimizer.

To summarize,

System lib differences (which ones are used, how they are optimized, etc.) lead to different binary sizes, but that mostly reflects the tradeoffs chosen between size and speed, there is no "better" decision there.
Looking at the exact same code compiled by the two compilers, emscripten's output is significantly smaller (around 15%), largely due to emscripten running the Binaryen optimizer and Cheerp not.
A more complete comparison was limited by some programs not compiling or not running properly on Cheerp, like dlmalloc and Havlak as mentioned above - is there documentation for what the current limitations are of Cheerp's wasm (not genericjs) output?

alexp-sssup commented 6 years ago

Hello Alon, keep in mind that there have been significant changes in our Wasm backend since my last comment, so you will need to use updated packages to reproduce our exact results.

I will try to answer all the issues you raised.

Influence of printf on size: The impact that printf has on the compiled size is indeed a bit surprising. I suspect that the issue is not caused by printf itself, but rather by the additional C library infrastructure that it brings in. Is the implementation of printf shipped with Emscripten complete or is it simplified?
Influence of printf on benchmark validity: This is a good point. Still, execution time barely changes when disabling printf, which suggests that the code is being executed nevertheless. To verify this I have modified the benchmark to print out (using a low overhead method, cheerp::console_log) the 4 values which are usually included in the printf-ed string. There is a <1% change in size and no measurable change in speed, which confirms the validity of the printf-less test.
Updated box2d WASM size with and without printf: All sizes are after compression. Note, the script has been changed to enable -frtti as per your later comment. More details are below.

Compiler      With printf   Without printf   With console_log
Emscripten       48328          48076              N/A
Cheerp           57851          46270             46380

Low overhead output: As mentioned above Cheerp defines in its headers a convenience method to output to console from all execution modes: cheerp::console_log. This is implemented using the normal JavaScript interoperability support of Cheerp and it is not special cased.
Zlib/Configure support: The issue here is that zlib does not provide an actual autoconf configure, but a custom build script which happens to be named configure. --host=cheerp-unknown-none is not a special flag which needs to be manually supported, but a way to tell configure that we are cross compiling. Configure will then use cheerp-unknown-none- prefixed tools that we provide in CHEERP_PREFIX/libexec. These tools provide an (incomplete) emulation layer for ar and ranlib which are used by configure. The good news is that you should be able to point the appropriate env variables to these tools to get zlib to build. In our branch we have fixed zlib building with this patch https://github.com/alexp-sssup/emscripten/commit/f1f9c3adcbd600113e8b3a472e1a88762591e69a#diff-431a828f43294d297117e82dbaf768ba
Shared build script: I agree that it would make a lot of sense to have a shared/unified benchmark script, and I think that the work you have done so far to integrate Cheerp already helps a lot. We will try to contribute back the minimal patches that are required for wasm/asmjs support. About the type safety patches that are needed for plain JS generation, are you open to discuss integrating them as well?
RTTI support: Cheerp disables RTTI support by default, that is a deliberate choice. Of the large scale codebases we have seen so far there are many not using RTTI features so disabling the flag by default seems like a good idea, especially on the Web where size is an important metric. RTTI support can be enabled with the standard -frtti command line flag. This said, enabling RTTI barely changes the compiled size.

Box2D without -frtti   Box2D with -frtti
       45226                 46380

Malloc implementation: Cheerp uses newlib for its C library. The implementation of malloc is also dlmalloc. https://github.com/leaningtech/cheerp-newlib/blob/master/newlib/libc/stdlib/mallocr.c
Additional test cases: We will investigate why Havlak is not working.
Size/speed tradeoff: In general we do not favor code size over other metrics. All our libs are built with O2/O3. Os is only used during the LTO phase to avoid excessive inlining. All the improvements we have made to reduce code size have been on the backend side (i.e. JS and WASM codegen).
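On the RTTI point above, a minimal sketch of why many codebases build fine with RTTI disabled: virtual dispatch does not depend on RTTI, and a downcast can go through an explicit type tag plus static_cast instead of dynamic_cast or typeid. The Shape, Circle and Box types here are hypothetical and only illustrate the pattern; they are not taken from Box2D or Cheerp.

```cpp
#include <cassert>

// Hypothetical shape hierarchy that carries its own type tag.
struct Shape {
  enum Kind { kCircle, kBox };
  explicit Shape(Kind k) : kind(k) {}
  virtual ~Shape() {}
  virtual float area() const = 0;  // virtual calls work without RTTI
  const Kind kind;
};

struct Circle : Shape {
  explicit Circle(float r) : Shape(kCircle), radius(r) { assert(r >= 0.0f); }
  float area() const override { return 3.14159265f * radius * radius; }
  float radius;
};

struct Box : Shape {
  Box(float w, float h) : Shape(kBox), width(w), height(h) {
    assert(w >= 0.0f && h >= 0.0f);
  }
  float area() const override { return width * height; }
  float width, height;
};

// Returns the radius for circles and -1 for any other shape. The tag check
// plus static_cast stands in for dynamic_cast<const Circle*>, which would
// need RTTI.
float radius_or_negative(const Shape& s) {
  if (s.kind != Shape::kCircle) return -1.0f;
  return static_cast<const Circle&>(s).radius;
}
```

Code written this way should compile the same with -frtti and with -fno-rtti. Only code that actually uses dynamic_cast on polymorphic types or typeid needs the flag enabled.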

kripken commented 6 years ago

Thanks for the detailed response!

the script has been changed to enable -frtti

Was that a typo perhaps, and you meant -fno-rtti? (Box2D doesn't need rtti or exceptions, so that's really how it should be built, and how game engines use it in practice. I updated the makefile in emscripten and opened an issue to update box2d.js as well.)

Updated box2d WASM size with and without printf:

Which emscripten version was that with? On the latest of both (Cheerp 1523865001-1, emscripten 1.37.37), here is what I see:

Compiler    With printf   Without printf
Emscripten     47089          40454
Cheerp         54965          46084

The Cheerp results are similar to yours, except a little better - maybe since I tested on a newer version. But your emscripten results without printf are surprisingly poor - maybe also part of the difference is I'm using a newer version, but I don't think we landed any major optimizations recently, so that is strange.
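For reference, the "X% smaller" figures used in this thread can be recomputed from the table above. This is a small sketch of that arithmetic; the function name percent_smaller and the constant names are made up for illustration, and the byte counts are copied from the table.

```cpp
#include <cassert>

// Percentage by which `candidate` is smaller than `baseline`.
double percent_smaller(long candidate, long baseline) {
  assert(baseline > 0);
  return 100.0 * static_cast<double>(baseline - candidate) /
         static_cast<double>(baseline);
}

// Sizes in bytes, copied from the table above.
const long kEmscriptenWithPrintf = 47089;
const long kCheerpWithPrintf = 54965;
const long kEmscriptenWithoutPrintf = 40454;
const long kCheerpWithoutPrintf = 46084;
```

With these numbers, percent_smaller(kEmscriptenWithPrintf, kCheerpWithPrintf) comes out at roughly 14%, and the same call on the two "without printf" sizes gives roughly 12%.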

Aside from measuring size in bytes, I also gave more in-depth details above, that I don't think you responded to, curious to get your perspective on them, and to check if I got something wrong:

Comparing individual Box2D functions one-to-one in the wasm files (which minimizes the effect of rtti, toolchain choices like what libc is used, etc.), the emscripten ones are significantly smaller.
Emscripten's advantage there is primarily because of the binaryen optimizer, which is confirmed by seeing that binaryen shrinks Cheerp binaries by a significant amount, showing that you have room to improve (or run that optimizer too).

Is the implementation of printf shipped with Emscripten complete or is it simplified?

It's the musl libc printf implementation - should be complete AFAIK.

In our branch we have fixed zlib building with this patch

Thanks for the link. I'm conflicted on testing with patches like these, though: on one hand, more comparisons are good, but on the other, I want to test on real-world code, without special porting to emscripten or cheerp.

About the type safety patches that are needed for plain JS generation, are you open to discuss integrating them as well?

Continuing my last response, I am open to code to run Cheerp with the right flags etc., and maybe minimal benchmark changes make sense (like removing printf), but I'd rather not modify zlib, bullet, box2d etc. significantly, since emscripten's goal is to run them well without porting (and the version in the test suite is used both for benchmarking and for testing).

Perhaps, instead, we could create a separate repo for cross-compiler comparisons?

Cheerp disables RTTI support by default, that is a deliberate choice.

I see, thanks. Makes sense now.

Cheerp uses newlib for its C library. The implementation of malloc is also dlmalloc

Interesting. Perhaps we use different versions of dlmalloc then, or build it differently - we use -O2, which flag do you use?

Additional test cases: We will investigate why Havlak is not working.

I see that fasta_float has been fixed in Cheerp recently, nice! Aside from Havlak, though, I still see base64 fail as mentioned above, and also Box2D without special changes, i.e. it fails in emscripten's box2d which is unmodified from upstream, but works in yours - is it expected that Cheerp's wasm support needs code to be ported for it to work?

alexp-sssup commented 6 years ago

I apologize for taking so long to reply. To answer with the appropriate level of detail and precision I needed to dedicate significant time, which I could not find until now.

-frtti: This is not a typo. Most compilers enable RTTI by default and you can disable it with -fno-rtti. Since Cheerp disables RTTI by default, you can enable it with the opposite option. In the updated comparisons above -frtti was enabled for consistency with Emscripten's settings at the time of the previous post.
Emscripten 1.37.37: In my tests it does seem like 1.37.37 generates much smaller code compared to 1.37.36. If that is not expected, I would recommend verifying whether a lot of code is being removed due to lack of side effects now that printf is disabled.
Analysis of size difference on specific functions: After inspecting the biggest functions of Box2D I think that roughly 50% of the difference in size for these functions is caused by a single missed optimization in Cheerp, namely reordering of operations to favor tee_local vs. get_local/set_local pairs. We don't foresee any particular problem in improving the codegen to support this case. We will keep studying these functions as they are indeed a good source of ideas on what can be done to shrink the code even further.
Creating a common repo for benchmark: We are absolutely open to discuss this further.
Build flags for malloc: We use -O2 to build the whole C library. This is the relevant build log from our nightly PPA https://launchpadlibrarian.net/372351599/buildlog_ubuntu-artful-amd64.cheerp-newlib_1527620385-1~artful_BUILDING.txt.gz
Building base64 test with Cheerp: base64 actually builds correctly. It fails at runtime because it uses the clock function, which we do not provide as part of our libc. The following snippet of code will fix the problem:

#ifdef __CHEERP__
#include <cheerp/client.h>
inline clock_t clock()
{
        double t = cheerp::date_now();
        return (long long)(t*CLOCKS_PER_SEC/1000);
}
#endif
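The unit conversion in the snippet above can be tried outside Cheerp with a portable stand-in. This sketch assumes that cheerp::date_now() returns a time in milliseconds, as the division by 1000 suggests, and swaps in std::chrono::steady_clock as the millisecond source. The names ms_to_clock_ticks, now_ms and wall_clock_ticks are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Same arithmetic as the shim above: milliseconds to clock ticks.
inline std::clock_t ms_to_clock_ticks(double ms) {
  assert(ms >= 0.0);
  return static_cast<std::clock_t>(ms * CLOCKS_PER_SEC / 1000);
}

// Stand-in millisecond timer (an assumption, not Cheerp's API): time since
// the steady clock's epoch, in milliseconds.
inline double now_ms() {
  using namespace std::chrono;
  return duration<double, std::milli>(
             steady_clock::now().time_since_epoch()).count();
}

// Wall-clock based replacement for clock(), built from the two helpers.
inline std::clock_t wall_clock_ticks() { return ms_to_clock_ticks(now_ms()); }
```

One difference to keep in mind: the standard clock() reports processor time, while a timer-based shim like this reports elapsed wall-clock time, so the two can diverge for programs that sleep or wait.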

Building unmodified Box2D test with Cheerp: Same as before, you need to provide an implementation of clock.
Required porting when compiling with Cheerp in Wasm mode: Cheerp in Wasm mode is very robust to all sorts of unsafe code and no porting on the language side is expected to be necessary. On the other hand some amount of porting on the platform side (i.e. system APIs) may be required. This was indeed the case for both base64 and box2d. We actually provide extended support for POSIX APIs to our commercial customers with an add-on library.
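To illustrate the tee_local point in the size analysis above, here is a toy peephole pass. It is a sketch only: it is neither Cheerp's nor Binaryen's implementation, it handles just the simplest case of an adjacent set_local/get_local pair on the same local, and it does no reordering. The opcode values are the ones from the WebAssembly binary format; the LocalOp struct and the function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Opcodes from the WebAssembly binary format.
const std::uint8_t kGetLocal = 0x20;
const std::uint8_t kSetLocal = 0x21;
const std::uint8_t kTeeLocal = 0x22;

// Toy instruction: one opcode byte plus a local index.
struct LocalOp {
  std::uint8_t opcode;
  std::uint32_t index;
};

// Number of bytes an unsigned LEB128 encoding of `v` takes.
std::size_t leb128_size(std::uint32_t v) {
  std::size_t n = 1;
  while (v >= 0x80) { v >>= 7; ++n; }
  return n;
}

// Encoded size of a sequence: opcode byte plus LEB128 index per instruction.
std::size_t encoded_size(const std::vector<LocalOp>& ops) {
  std::size_t total = 0;
  for (const LocalOp& op : ops) total += 1 + leb128_size(op.index);
  return total;
}

// Replaces each adjacent "set_local x; get_local x" with "tee_local x",
// which stores the value and leaves it on the stack in one instruction.
std::vector<LocalOp> fuse_set_get(const std::vector<LocalOp>& in) {
  std::vector<LocalOp> out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    bool fusable = i + 1 < in.size() &&
                   in[i].opcode == kSetLocal &&
                   in[i + 1].opcode == kGetLocal &&
                   in[i].index == in[i + 1].index;
    if (fusable) {
      out.push_back(LocalOp{kTeeLocal, in[i].index});
      ++i;  // skip the get_local that was folded in
    } else {
      out.push_back(in[i]);
    }
  }
  assert(out.size() <= in.size());
  return out;
}
```

For a local index below 128, each fused pair saves two bytes (four bytes down to two). That is small per site but adds up in large functions with many locals, such as the Box2D ones discussed here.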

kripken commented 5 years ago

Oh sorry, I missed that there was a reply here...

I do still think these comparisons are useful, but as you said too, it's hard to find time given all the other priorities we have I guess.