Closed kripken closed 5 years ago
Hello Alon. Sorry for the delay, we needed some time to reproduce results and gather all required data.
We have been pleasantly surprised that Cheerp is now integrated in emscripten's benchmarks. We have been maintaining our own branch of emscripten to run tests, which we have now rebased on 1.37.36 and released. You can find it here.
About size results, the differences arise from the test cases being slightly different. I will focus on test_primes and test_memops in this discussion; other benchmarks are available in the branch linked above.
To reproduce our results you need to simply remove printf statements from the test source. In both test_primes and test_memops, printf is used to report an invalid test size and to print the test results. In our modified version we remove the printf calls and directly return the numerical result of the test. As an example, here is the source of memops with our changes:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
int main(int argc, char **argv) {
  int N, M;
  int arg = argc > 1 ? argv[1][0] - '0' : 3;
  switch(arg) {
    case 0: return 0; break;
    case 1: N = 1024*1024; M = 55; break;
    case 2: N = 1024*1024; M = 400; break;
    case 3: N = 1024*1024; M = 800; break;
    case 4: N = 1024*1024; M = 4000; break;
    case 5: N = 1024*1024; M = 8000; break;
    default: /*printf("error: %d\n", arg);*/ return -1;
  }
  int final = 0;
  char *buf = (char*)malloc(N);
  for (int t = 0; t < M; t++) {
    for (int i = 0; i < N; i++)
      buf[i] = (i + final)%256;
    for (int i = 0; i < N; i++)
      final += buf[i] & 1;
    final = final % 1000;
  }
  //printf("final: %d.\n", final);
  return final;
}
Of course, we use the same modified sources when building with Emscripten and with Cheerp.
About larger scale tests, llvm-ar is not supported; we link libraries using llvm-link as documented here.
Various test cases need some patching to run with Cheerp. The branch we published contains all required patches. It should be noted that the tests were originally ported to be compiled to plain JS (Cheerp genericjs mode), so the patches could be heavily reduced if only the wasm target is of interest.
As a side note, the webMain code currently generated is invalid. We recommend changing it to something like this: https://bitbucket.org/apignotti/emscripten/commits/60620e0860099f20b8fe5855d7e6272ef4b14b6a?at=cheerp-fixes-2018mar-rebased
Thanks @alexp-sssup !
> To reproduce our results you need to simply remove printf statements from the test source
I see, so we are indeed measuring something different.
Yes, good point: when using printf in a benchmark that is just a few lines of code, like primes, probably most of the output code size is due to printf itself. printf is still interesting in a way (it is real-world C code), but maybe not that interesting in general.
However, without printf, output from tiny benchmarks like primes ends up being dominated by the runtime overhead, which is also not that interesting in general (since people compiling just a few lines of code is pretty rare - still, I added a primes_nocheck benchmark to our suite to measure that).
Overall, I'm more interested in moderate or large code size projects (the common case that I see among users), like say Box2D. Do you see the same as what I reported on that one (emscripten being 23% smaller than cheerp)?
> About larger scale tests, llvm-ar is not supported, we link libraries using llvm-link
I see. So the configure/make builds that benchmarks like zlib, lua, etc. require won't work on cheerp, and I'd need to write a makefile manually if I want those tests to run?
> As a side note, the webMain code currently generated is invalid. We recommend changing it to something like this: https://bitbucket.org/apignotti/emscripten/commits/60620e0860099f20b8fe5855d7e6272ef4b14b6a?at=cheerp-fixes-2018mar-rebased
That link requires me to log in, so I can't view it.
Like primes and memops, our version of Box2D is patched. Many of the patches are there for genericjs type safety, but printf is also disabled. From my tests it seems that most of the size difference you measure does indeed come from printf. https://github.com/alexp-sssup/emscripten/commit/8b686069720902aa50ea5b14b69d3af7b4fd393e#diff-d6aa119c750aae03c61eea396a82d07b
With this patch, in my tests, the size of the emscripten and cheerp builds becomes roughly the same. There is a ~2% difference either up or down depending on whether you choose the compressed or uncompressed version. As usual, the patched version is used when compiling both the emscripten and cheerp builds.
configure/make should actually work, by using a wrapper script. This is documented here.
About the link, I pasted the one from our private repos instead of the public one. I apologize. Here is the correct one: https://github.com/alexp-sssup/emscripten/commit/60620e0860099f20b8fe5855d7e6272ef4b14b6a
> Like primes and memops our version of Box2D is patched. Many of the patches are there for genericjs type safety, but printf is also disabled. From my tests it seems that most of the size difference you measure comes from printf indeed. alexp-sssup/emscripten@8b68606#diff-d6aa119c750aae03c61eea396a82d07b
Interesting, yes, I see that when I just remove the printf from box2d (using your patch
// Disable printing to stdout for Cheerp and Emscripten.
#define printf(fmt, ...) (0)
) then the cheerp and emscripten sizes become close.
But this seems odd. Why does cheerp go from 150K to 122K just by removing printf - is that expected?
Also, I'm not sure the benchmark is valid without the printing. Without printf, the LLVM optimizer may be able to remove code that we want to execute, but now has no side effects.
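One common way to keep a printf-free benchmark observable is to route the result through something the optimizer must treat as a side effect. The following is only a minimal sketch of that idea, using a reduced memops-style kernel; `memops_kernel`, `run_benchmark` and `g_sink` are hypothetical names and are not part of either test suite.

```cpp
#include <cstddef>
#include <vector>

// Volatile sink: a store to it is an observable side effect, so the
// optimizer cannot discard the computation that produced the value.
static volatile int g_sink = 0;

// Reduced memops-style kernel: fill a buffer, then fold it into a checksum.
int memops_kernel(std::size_t n, int passes) {
    std::vector<unsigned char> buf(n);
    int final_value = 0;
    for (int t = 0; t < passes; t++) {
        for (std::size_t i = 0; i < n; i++)
            buf[i] = static_cast<unsigned char>((i + final_value) % 256);
        for (std::size_t i = 0; i < n; i++)
            final_value += buf[i] & 1;
        final_value %= 1000;
    }
    return final_value;
}

// Publish the result without printf. Returning it from main() as the
// exit code, as the modified memops source does, has a similar effect.
int run_benchmark(std::size_t n, int passes) {
    int result = memops_kernel(n, passes);
    g_sink = result;
    return result;
}
```

Whether a given toolchain actually removes the kernel without such a sink depends on its optimizer, so this is a precaution rather than a measured result.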
If printf is a problem for cheerp, is there some other way to print stuff, that is efficient for you?
> configure/make should actually work, by using a wrapper script. This is documented here.
Thanks, but I still can't get it to work. First, --host=cheerp-unknown-none is a specific flag that I guess some projects support? But e.g. zlib (the first I tried) does not. Second, even removing that flag, cheerpwrap doesn't help with the problem of the configure script emitting stuff that uses llvm-ar and other things that don't work in cheerp. (Does cheerpwrap do anything more than the emscripten benchmark runner already does, which is to point CC, CXX to the various cheerp binaries?)
Anyhow, maybe I missed or misunderstood something there. In general, it would be great to have a shared script for these comparisons so we know and agree they are fair - perhaps you want to upstream some of the changes in your fork?
I found some time this weekend to dive into the box2d differences here in more detail.
A large source of differences is in system library code:
Intrinsic name not mangled correctly for type arguments! Should be: llvm.memcpy.p0union._ZN10_mbstate_tUt_E.p0union._ZN10_mbstate_tUt_E.i64 void (%union._ZN10_mbstate_tUt_E*, %union._ZN10_mbstate_tUt_E*, i64, i32, i1)* @llvm.memcpy.p0union._ZN10_mbstate_tUt74_3_E.p0union._ZN10_mbstate_tUt74_3_E.i64
).
-O2 and -Os. Looking at Cheerp output, I would guess it is optimized differently, but I'm not sure how.
Overall, system lib differences account for a lot of the size differences between the compilers, but even though that's interesting to know, it's always going to be a tradeoff between compiling for size or speed - if one compiler started to build system libs with -Oz it might emit smaller code but eventually users would notice it isn't as fast, etc. So maybe this isn't that important.
Because of that I also did a dive into the wasm binaries themselves, looking function by function. I focused on the largest functions in box2d, which are
Looking at their binary sizes, Emscripten is smaller on all of them, by 18%, 10%, and 23% respectively.
Another way to look at that is to run the Binaryen optimizer on Cheerp output, and it shrinks it by 15%. That's pretty close to the per-function results, which makes sense if the two compilers' output is mostly similar, except that emscripten also runs the Binaryen optimizer.
To summarize,
Hello Alon, keep in mind that there have been significant changes in our Wasm backend since my last comment, so you will need to use updated packages to reproduce our exact results.
I will try to answer all the issues you raised.
Influence of printf on size: The impact that printf has on the compiled size is indeed a bit surprising. I suspect that the issue is not caused by printf itself, but rather by the additional C library infrastructure that it brings in. Is the implementation of printf shipped with Emscripten complete or is it simplified?
Influence of printf on benchmark validity: This is a good point. Still, execution time barely changes when disabling printf, which suggests that the code is being executed nevertheless. To verify this I have modified the benchmark to print out (using a low overhead method, cheerp::console_log) the 4 values which are usually included in the printf-ed string. There is a <1% change in size and no measurable change in speed, which confirms the validity of the printf-less test.
Updated box2d WASM size with and without printf: All sizes are after compression. Note, the script has been changed to enable -frtti as per your later comment. More details are below.
Compiler | With printf | Without printf | With console_log |
---|---|---|---|
Emscripten | 48328 | 48076 | N/A |
Cheerp | 57851 | 46270 | 46380 |
Low overhead output: As mentioned above, Cheerp defines in its headers a convenience method to output to console from all execution modes: cheerp::console_log. This is implemented using the normal JavaScript interoperability support of Cheerp and it is not special cased.
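As an illustration of printing a benchmark result without pulling in printf-style formatting, here is a minimal sketch. `format_int` and `report_result` are hypothetical helpers, and the sketch assumes that `cheerp::console_log` accepts a C string and is declared in `<cheerp/client.h>`; outside Cheerp it falls back to `fputs` so the same source builds natively.

```cpp
#include <cstdio>
#ifdef __CHEERP__
#include <cheerp/client.h>
#endif

// Write the decimal form of `value` into `out` (needs at least 12 bytes)
// without using printf-style formatting. Returns the string length.
int format_int(int value, char* out) {
    char tmp[12];
    int len = 0;
    // Work on an unsigned magnitude so the most negative int does not overflow.
    unsigned int mag = value < 0 ? 0u - static_cast<unsigned int>(value)
                                 : static_cast<unsigned int>(value);
    do {
        tmp[len++] = static_cast<char>('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);
    int pos = 0;
    if (value < 0) out[pos++] = '-';
    while (len > 0) out[pos++] = tmp[--len];  // digits were produced in reverse
    out[pos] = '\0';
    return pos;
}

// Print one result line through the lightest output path available.
void report_result(int value) {
    char line[12];
    format_int(value, line);
#ifdef __CHEERP__
    cheerp::console_log(line);  // assumption: takes a C string
#else
    std::fputs(line, stdout);
    std::fputc('\n', stdout);
#endif
}
```

Calling `report_result(final)` at the end of a test would then replace the final printf call.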
Zlib/Configure support: The issue here is that zlib does not provide an actual autoconf configure, but a custom build script which happens to be named configure. --host=cheerp-unknown-none is not a special flag which needs to be manually supported, but a way to tell configure that we are cross compiling. Configure will then use the cheerp-unknown-none- prefixed tools that we provide in CHEERP_PREFIX/libexec. These tools provide an (incomplete) emulation layer for ar and ranlib, which are used by configure. The good news is that you should be able to point the appropriate env variables to these tools to get zlib to build. In our branch we have fixed zlib building with this patch: https://github.com/alexp-sssup/emscripten/commit/f1f9c3adcbd600113e8b3a472e1a88762591e69a#diff-431a828f43294d297117e82dbaf768ba
Shared build script: I agree that it would make a lot of sense to have a shared/unified benchmark script, and I think that the work you have done so far to integrate Cheerp already helps a lot. We will try to contribute back the minimal patches that are required for wasm/asmjs support. About the type safety patches that are needed for plain JS generation, are you open to discussing integrating them as well?
RTTI support: Cheerp disables RTTI support by default; that is a deliberate choice. Many of the large scale codebases we have seen so far do not use RTTI features, so disabling it by default seems like a good idea, especially on the Web where size is an important metric. RTTI support can be enabled with the standard -frtti command line flag. That said, enabling RTTI barely changes the compiled size.
Box2D without -frtti | Box2D with -frtti |
---|---|
45226 | 46380 |
Malloc implementation: Cheerp uses newlib for its C library. The implementation of malloc is also dlmalloc. https://github.com/leaningtech/cheerp-newlib/blob/master/newlib/libc/stdlib/mallocr.c
Additional test cases: We will investigate why Havlak is not working.
Size/speed tradeoff: In general we do not favor code size over other metrics. All our libs are built with O2/O3. Os is only used during the LTO phase to avoid excessive inlining. All the improvements we have made to reduce code size have been on the backend side (i.e. JS and WASM codegen).
Thanks for the detailed response!
> the script has been changed to enable -frtti
Was that a typo perhaps, and you meant -fno-rtti? (Box2D doesn't need rtti or exceptions, so that's really how it should be built, and how game engines use it in practice. I updated the makefile in emscripten and opened an issue to update box2d.js as well.)
> Updated box2d WASM size with and without printf:
Which emscripten version was that with? On the latest of both (Cheerp 1523865001-1, emscripten 1.37.37), here is what I see:
Compiler | With printf | Without printf |
---|---|---|
Emscripten | 47089 | 40454 |
Cheerp | 54965 | 46084 |
The Cheerp results are similar to yours, except a little better - maybe since I tested on a newer version. But your emscripten results without printf are surprisingly poor - maybe also part of the difference is I'm using a newer version, but I don't think we landed any major optimizations recently, so that is strange.
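To put the table above in the same "percent smaller" terms used elsewhere in this thread, the gap can be computed relative to the larger build. A minimal sketch; `percent_smaller` is a hypothetical helper name, and the sizes are the ones from the table.

```cpp
// Relative size reduction of `smaller` with respect to `larger`, in percent.
double percent_smaller(double smaller, double larger) {
    return (larger - smaller) / larger * 100.0;
}

// With printf:    percent_smaller(47089, 54965) is roughly 14.3
// Without printf: percent_smaller(40454, 46084) is roughly 12.2
```

By that measure the Emscripten build in this table is about 14% smaller with printf and about 12% smaller without it.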
Aside from measuring size in bytes, I also gave more in-depth details above that I don't think you responded to. I'm curious to get your perspective on them, and to check if I got something wrong:
> Is the implementation of printf shipped with Emscripten complete or is it simplified?
It's the musl libc printf implementation - should be complete AFAIK.
> In our branch we have fixed zlib building with this patch
Thanks for the link. I'm conflicted on testing with patches like these, though: on one hand, more comparisons are good, but on the other, I want to test on real-world code, without special porting to emscripten or cheerp.
> About the type safety patches that are needed for plain JS generation, are you open to discuss integrating them as well?
Continuing my last response, I am open to code to run Cheerp with the right flags etc., and maybe minimal benchmark changes make sense (like removing printf), but I'd rather not modify zlib, bullet, box2d etc. significantly, since emscripten's goal is to run them well without porting (and the version in the test suite is used both for benchmarking and for testing).
Perhaps, instead, we could create a separate repo for cross-compiler comparisons?
> Cheerp disables RTTI support by default, that is a deliberate choice.
I see, thanks. Makes sense now.
> Cheerp uses newlib for its C library. The implementation of malloc is also dlmalloc
Interesting. Perhaps we use different versions of dlmalloc then, or build it differently - we use -O2, which flag do you use?
> Additional test cases: We will investigate why Havlak is not working.
I see that fasta_float has been fixed in Cheerp recently, nice! Aside from Havlak, though, I still see base64 fail as mentioned above, and also Box2D without special changes, i.e. it fails in emscripten's box2d which is unmodified from upstream, but works in yours - is it expected that Cheerp's wasm support needs code to be ported for it to work?
I apologize for taking so long to reply. To answer with the appropriate level of detail and precision I needed to dedicate significant time, which I could not find until now.
-frtti: This is not a typo. Most compilers enable RTTI by default and you can disable it with -fno-rtti; since Cheerp disables RTTI by default, you can enable it with the opposite option. In the updated comparisons above, -frtti was enabled for consistency with Emscripten's settings at the time of the previous post.
Emscripten 1.37.37: In my tests it does seem like 1.37.37 generates much smaller code compared to 1.37.36. If that is not expected, I would recommend verifying whether a lot of code is being removed due to lack of side effects, since printf is disabled.
Analysis of size difference on specific functions: After inspecting the biggest functions of Box2D I think that roughly 50% of the difference in size for these functions is caused by a single missed optimization in Cheerp, namely reordering of operations to favor tee_local vs. get_local/set_local pairs. We don't foresee any particular problem in improving the codegen to support this case. We will keep studying these functions as they are indeed a good source of ideas on what can be done to shrink the code even further.
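To see why this particular optimization matters for size, the byte cost of the two instruction sequences can be worked out directly. The sketch below assumes the standard wasm binary encoding, where each of these instructions is a one-byte opcode followed by the local index as unsigned LEB128; the function names are hypothetical.

```cpp
#include <cstdint>

// Number of bytes needed to encode `value` as unsigned LEB128,
// the encoding wasm uses for local indices.
int leb128_size(std::uint32_t value) {
    int bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        bytes++;
    }
    return bytes;
}

// set_local x followed by get_local x: two opcodes, two index operands.
int bytes_set_then_get(std::uint32_t local_index) {
    return 2 * (1 + leb128_size(local_index));
}

// tee_local x: stores the value and leaves it on the stack.
int bytes_tee(std::uint32_t local_index) {
    return 1 + leb128_size(local_index);
}
```

For a local index below 128 the pair costs 4 bytes and the tee costs 2, so each pair that can be rewritten saves 2 bytes before compression.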
Creating a common repo for benchmark: We are absolutely open to discuss this further.
Build flags for malloc: We use -O2 to build the whole C library. This is the relevant build log from our nightly PPA https://launchpadlibrarian.net/372351599/buildlog_ubuntu-artful-amd64.cheerp-newlib_1527620385-1~artful_BUILDING.txt.gz
Building base64 test with Cheerp: base64 actually builds correctly. It fails at runtime because it uses the clock function, which we do not provide as part of our libc. The following snippet of code will fix the problem:
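#ifdef __CHEERP__
#include <time.h>
#include <cheerp/client.h>
// Cheerp's libc does not provide clock(), so emulate it with the JavaScript clock.
inline clock_t clock()
{
    // cheerp::date_now() reports milliseconds; convert to clock ticks.
    double t = cheerp::date_now();
    return (long long)(t*CLOCKS_PER_SEC/1000);
}
#endif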
Building unmodified Box2D test with Cheerp: Same as before, you need to provide an implementation of clock.
Required porting when compiling with Cheerp in Wasm mode: Cheerp in Wasm mode is very robust to all sorts of unsafe code and no porting on the language side is expected to be necessary. On the other hand, some amount of porting on the platform side (i.e. system APIs) may be required. This was indeed the case for both base64 and box2d. We actually provide extended support for POSIX APIs to our commercial customers with an add-on library.
Oh sorry, I missed that there was a reply here...
I do still think these comparisons are useful, but as you said too, it's hard to find time given all the other priorities we have I guess.
Hi Cheerp devs :)
I see your website says "30% smaller than Emscripten" so I was curious to measure that. Running emscripten's tests/test_benchmark.py script, I see the following results which are very different:
.wasm and .js files together, but it's also true when just looking at the wasm.)
test_primes and 27% smaller on test_memops.
test_box2d, emscripten is 23% smaller there.
That's for size. Speed-wise, the results are a mix (some a little faster, some a little slower, some about the same).
Several of the tests hit problems when running the Cheerp output:
test_base64 hits Error: this should be unreachable.
test_fasta_float hits RuntimeError: index out of bounds.
test_havlak hangs.
test_box2d hits RuntimeError: indirect call signature mismatch.
I also couldn't get some tests to build in Cheerp: test_linpack, test_bullet, test_lua_binarytrees, test_lua_scimark, test_zlib. For example test_zlib says
Maybe the script doesn't build them right? It calls Cheerp's llvm-ar etc., and the commands work with emscripten and native builds, but maybe something more needs to be done for Cheerp? Or am I hitting Cheerp limitations - would zlib, Lua, etc. need to be ported to Cheerp first?
This is on Cheerp 1516806243-1~xenial (which is the latest nightly build I see for Xenial, from Jan 24) and emscripten 1.37.36 (last tagged release, from Mar 13).
These results are very different from the ones you reported, perhaps we are not measuring the same thing somehow? (All the details of how I got the measurements mentioned above are in the linked script that runs those benchmarks, I basically just ran that script as-is except for uncommenting the line to enable running Cheerp.) Or maybe our results are about different versions?