jerryscript-project / jerryscript

Ultra-lightweight JavaScript engine for the Internet of Things.
https://jerryscript.net
Apache License 2.0

V8 benchmark engines comparison #4386

Open lygstate opened 3 years ago

lygstate commented 3 years ago

Refer to https://bellard.org/quickjs/bench.html

I am using the following branch to benchmark JerryScript:

https://github.com/lygstate/jerryscript/tree/benchmark

Currently, the Splay (splay tree) benchmark case is exceptionally slow, which looks more like a JerryScript issue.

Bench result:

Compile/run options:

JerryScript ESNext: python tools/build.py --clean --lto=OFF --jerry-debugger=ON --jerry-cmdline=ON --jerry-cmdline-snapshot=ON --jerry-math=ON --jerry-ext=ON --amalgam=ON --snapshot-exec=ON --stack-limit=512 --gc-mark-limit=64 --cpointer-32bit=ON --system-allocator=ON --external-context=ON --regexp-strict-mode=ON --js-parser=ON --line-info=ON --error-messages=ON --logging=ON --cmake-param=-GNinja --cmake-param=-DJERRY_LCACHE=1 --cmake-param=-DJERRY_PROPRETY_HASHMAP=1 --profile=es.next
QuickJS: qjs combined.js
Duktape: gcc -O3 -DDUK_CMDLINE_CONSOLE_SUPPORT duktape.c duk_cmdline.c duk_console.c -lm -lc -o duk
NodeJs Jitless: node --jitless combined.js
Node: node combined.js

| | JerryScript ESNext | QuickJS | Duktape | NodeJs Jitless | Node |
|---|---|---|---|---|---|
| Engine - Peak Memory Consumption (KB) | | | | | |
| Engine - Peak Stack Usage (KB) | | | | | |
| Binary size (Byte) | 239664 | 5523792 | 582240 | 73873464 | 73873464 |
| Standard compatibility | ES 2020 | ES 2020 | Part ES 2015 | ES 2020 | ES 2020 |
| Richards | 173 | 876 | 223 | 1188 | 36562 |
| DeltaBlue | 180 | 866 | 282 | 1345 | 67302 |
| Crypto | 182 | 1016 | 385 | 923 | 46090 |
| RayTrace | 291 | 1173 | 617 | 2880 | 52043 |
| EarleyBoyer | 408 | 1841 | 686 | 5101 | 55366 |
| RegExp | 155 | 255 | 221 | 2513 | 8866 |
| Splay | 365 | 1919 | 1296 | 4673 | 32091 |
| NavierStokes | 398 | 1701 | 1073 | 1657 | 51740 |
| TotalScore | 250 | 1042 | 487 | 2129 | 38483 |
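
The TotalScore row appears to be the geometric mean of the eight per-benchmark scores, which is how the V8 benchmark suite combines results. A minimal sketch that checks this against the JerryScript column of the table above (the helper name `geometric_mean` is illustrative and not part of the benchmark harness):

```python
import math


def geometric_mean(scores):
    """Return the geometric mean of a collection of positive scores."""
    scores = list(scores)
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# JerryScript ESNext column, copied from the table in this thread.
jerryscript_scores = {
    "Richards": 173, "DeltaBlue": 180, "Crypto": 182, "RayTrace": 291,
    "EarleyBoyer": 408, "RegExp": 155, "Splay": 365, "NavierStokes": 398,
}

total = geometric_mean(jerryscript_scores.values())
print(round(total))  # rounds to the reported TotalScore of 250
```

Because the mean is geometric, a single near-zero score (such as a Splay run that barely progresses) pulls the total down far more than it would with an arithmetic mean.
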
lygstate commented 3 years ago

With the following CMake configure options: ["-DJERRY_LINE_INFO=ON", "-DJERRY_GLOBAL_HEAP_SIZE=102400", "-DJERRY_GC_MARK_LIMIT=8"], the tests run, but the Splay benchmark cannot complete because it is too slow.

JerryScript:

PROGRESS Richards
RESULT Richards 205
PROGRESS DeltaBlue
RESULT DeltaBlue 179
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 250
PROGRESS RayTrace
RESULT RayTrace 349
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 141
PROGRESS RegExp
RESULT RegExp 19.6
PROGRESS NavierStokes
RESULT NavierStokes 544
SCORE 174

QuickJS:

PROGRESS Richards
RESULT Richards 868
PROGRESS DeltaBlue
RESULT DeltaBlue 872
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 1014
PROGRESS RayTrace
RESULT RayTrace 1144
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 1789
PROGRESS RegExp
RESULT RegExp 244
PROGRESS NavierStokes
RESULT NavierStokes 1681
SCORE 940

NodeJS jitless:

PROGRESS Richards
RESULT Richards 1139
PROGRESS DeltaBlue
RESULT DeltaBlue 1300
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 898
PROGRESS RayTrace
RESULT RayTrace 2751
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 4831
PROGRESS RegExp
RESULT RegExp 3387
PROGRESS NavierStokes
RESULT NavierStokes 1599
SCORE 1919

NodeJs:

PROGRESS Richards
RESULT Richards 38059
PROGRESS DeltaBlue
RESULT DeltaBlue 62548
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 45647
PROGRESS RayTrace
RESULT RayTrace 51651
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 56458
PROGRESS RegExp
RESULT RegExp 8929
PROGRESS NavierStokes
RESULT NavierStokes 51198
SCORE 39303
rerobika commented 3 years ago

First of all, I strongly suggest turning JERRY_LINE_INFO off, since decoding the source info during bytecode execution causes a serious slowdown.

My two concerns:
1. I don't see the results of Duktape.
2. This comparison in itself does not tell us anything.

Comparing a low-end engine with a high-end engine (V8) is quite unfair. The engines follow different design goals. In high-end engines the main concern is performance, which comes with significant memory/stack usage and an enormous binary size. JerryScript was designed for low-end devices with restricted resources, so its main priorities are low memory usage and a small binary footprint. However, keeping these numbers low works against performance.

So I suggest comparing only the low-end engines. What I'd call a fair and detailed comparison is:

| | Engine - Score | Engine - Peak Memory Consumption (KB) | Engine - Peak Stack Usage (KB) |
|---|---|---|---|
| Richards | | | |
| DeltaBlue | | | |
| ... | | | |

Where Engine is one of Jerry, Duktape or QuickJS.

It is also good to mention which revision of the standard each tested engine supports, since the language features introduced in ES6 and later are a real challenge for developers to support without a serious engine slowdown.

| | Binary size (KB) | Standard compatibility |
|---|---|---|
| Jerry | | |
| Duktape | | |
| QuickJS | | |
lygstate commented 3 years ago

Regarding JERRY_LINE_INFO: it has little effect, and Splay is toooo slow; this should be a bug.

akosthekiss commented 3 years ago

Let's get just a wee bit more professional. "too** slow" is not anything helpful. Same goes for "this should be a bug." Analysis is welcome.

rerobika commented 3 years ago

Splay is a test aimed specifically at the GC. It was designed to exercise a generational GC. Since JerryScript uses a simple mark-and-sweep model, it does not perform well in this test. Increasing JERRY_GC_MARK_LIMIT can help, but it will also increase the stack usage.

lygstate commented 3 years ago

What JERRY_GC_MARK_LIMIT value do you suggest for benchmarking?

rerobika commented 3 years ago

I have no optimal number to give. Increasing the recursion limit will increase the score and the stack usage at the same time. So keep fine-tuning it; I suggest starting from the current value and doubling it repeatedly.
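
The doubling approach can be sketched as a small script that lists the candidate limits and the corresponding build command for each. This is only an illustration: the starting value 8 is the JERRY_GC_MARK_LIMIT used earlier in this thread, the cap of 1024 is an arbitrary assumption, and the script only prints commands using the `--clean` and `--gc-mark-limit` flags already shown in this thread rather than running a build.

```python
def doubling_sequence(start, cap):
    """Yield start, 2*start, 4*start, ... while the value does not exceed cap."""
    value = start
    while value <= cap:
        yield value
        value *= 2


# Assumed sweep range: 8 was used earlier in the thread, 1024 is an arbitrary cap.
limits = list(doubling_sequence(8, 1024))

for limit in limits:
    # Only prints the command; building and benchmarking is left to the reader.
    print(f"python tools/build.py --clean --gc-mark-limit={limit}")
```

For each limit, the benchmark score and the peak stack usage would need to be recorded together, since the two are expected to grow together.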

lygstate commented 3 years ago

@rerobika For the record, increasing it to 1024 has no effect:

lygstate@DESKTOP-94PU0GB:/mnt/c/work/study/languages/typescript/jerryscript/build/linux$ cmake -GNinja ../.. -DJERRY_LINE_INFO=OFF -DCMAKE_BUILD_TYPE=Release -DJERRY_EXTERNAL_CONTEXT=OFF -DJERRY_SYSTEM_ALLOCATOR=OFF -DJERRY_GLOBAL_HEAP_SIZE=512000 -DJERRY_GC_MARK_LIMIT=1024
-- The C compiler identification is GNU 9.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- CMAKE_BUILD_TYPE               Release
-- CMAKE_C_COMPILER_ID            GNU
-- CMAKE_SYSTEM_NAME              Linux
-- CMAKE_SYSTEM_PROCESSOR         x86_64
-- BUILD_SHARED_LIBS              OFF
-- ENABLE_LTO                     ON
-- ENABLE_STRIP                   ON
-- JERRY_VERSION                  2.4.0
-- JERRY_CMDLINE                  ON
-- JERRY_CMDLINE_TEST             OFF
-- JERRY_CMDLINE_SNAPSHOT         OFF
-- JERRY_LIBFUZZER                OFF (FORCED BY COMPILER)
-- JERRY_PORT_DEFAULT             ON (FORCED BY CMDLINE OR LIBFUZZER OR TESTS)
-- JERRY_EXT                      ON (FORCED BY CMDLINE OR TESTS)
-- JERRY_LIBM                     ON
-- UNITTESTS                      OFF
-- DOCTESTS                       OFF
-- ENABLE_ALL_IN_ONE              OFF
-- JERRY_CPOINTER_32_BIT          ON (FORCED BY HEAP SIZE)
-- JERRY_DEBUGGER                 OFF
-- JERRY_ERROR_MESSAGES           OFF
-- JERRY_EXTERNAL_CONTEXT         OFF
-- JERRY_PARSER                   ON
-- JERRY_LINE_INFO                OFF
-- JERRY_LOGGING                  OFF
-- JERRY_MEM_STATS                OFF
-- JERRY_MEM_GC_BEFORE_EACH_ALLOC OFF
-- JERRY_PARSER_DUMP_BYTE_CODE    OFF
-- JERRY_PROFILE                  es.next
-- JERRY_REGEXP_STRICT_MODE       OFF
-- JERRY_REGEXP_DUMP_BYTE_CODE    OFF
-- JERRY_SNAPSHOT_EXEC            OFF
-- JERRY_SNAPSHOT_SAVE            OFF
-- JERRY_SYSTEM_ALLOCATOR         OFF
-- JERRY_VALGRIND                 OFF
-- JERRY_VM_EXEC_STOP             OFF
-- JERRY_GLOBAL_HEAP_SIZE         512000
-- JERRY_GC_LIMIT                 (0)
-- JERRY_STACK_LIMIT              (0)
-- JERRY_GC_MARK_LIMIT            1024
-- FEATURE_INIT_FINI              OFF
-- Performing Test HAVE_TM_GMTOFF
-- Performing Test HAVE_TM_GMTOFF - Success
-- Looking for include file time.h
-- Looking for include file time.h - found
-- Looking for include file unistd.h
-- Looking for include file unistd.h - found
-- ENABLE_LINK_MAP                OFF
-- JERRY_TEST_STACK_MEASURE       OFF
-- Configuring done
-- Generating done
-- Build files have been written to: /mnt/c/work/study/languages/typescript/jerryscript/build/linux
lygstate@DESKTOP-94PU0GB:/mnt/c/work/study/languages/typescript/jerryscript/build/linux$ ninja
[244/244] Linking C executable bin/jerry
lygstate@DESKTOP-94PU0GB:/mnt/c/work/study/languages/typescript/jerryscript/build/linux$ ^C
lygstate@DESKTOP-94PU0GB:/mnt/c/work/study/languages/typescript/jerryscript/build/linux$ ^C
lygstate@DESKTOP-94PU0GB:/mnt/c/work/study/languages/typescript/jerryscript/build/linux$ ./bin/jerry ../../tests/benchmarks/v8/combined.js
PROGRESS Richards
RESULT Richards 262
PROGRESS DeltaBlue
RESULT DeltaBlue 211
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 306
PROGRESS RayTrace
RESULT RayTrace 391
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 148
PROGRESS RegExp
RESULT RegExp 36.7
PROGRESS Splay
RESULT Splay 0.106
PROGRESS NavierStokes
RESULT NavierStokes 480
SCORE 80.8
lygstate commented 3 years ago

I have now found the cause of Splay being too slow:

python tools/build.py ^
    --clean ^
    --lto=OFF ^
    --jerry-debugger=ON ^
    --jerry-cmdline=ON ^
    --jerry-cmdline-snapshot=ON ^
    --jerry-math=ON ^
    --jerry-ext=ON ^
    --amalgam=ON ^
    --snapshot-exec=ON ^
    --stack-limit=512 ^
    --gc-mark-limit=64 ^
    --mem-heap=2048 ^
    --cpointer-32bit=ON ^
    --system-allocator=ON ^
    --external-context=ON ^
    --regexp-strict-mode=ON ^
    --js-parser=ON ^
    --line-info=ON ^
    --error-messages=ON ^
    --logging=ON ^
    --cmake-param=-GNinja ^
    --cmake-param=-DJERRY_LCACHE=1 ^
    --cmake-param=-DJERRY_PROPRETY_HASHMAP=1 ^
    --profile=es.next

I guess mainly because of -DJERRY_LCACHE=1 and -DJERRY_PROPRETY_HASHMAP=1

benchmark result:

C:\work\study\languages\typescript\jerryscript>build\bin\jerry.exe tests\benchmarks\v8\combined.js
PROGRESS Richards
RESULT Richards 173
PROGRESS DeltaBlue
RESULT DeltaBlue 180
PROGRESS Encrypt
PROGRESS Decrypt
RESULT Crypto 182
PROGRESS RayTrace
RESULT RayTrace 291
PROGRESS Earley
PROGRESS Boyer
RESULT EarleyBoyer 408
PROGRESS RegExp
RESULT RegExp 155
PROGRESS Splay
RESULT Splay 365
PROGRESS NavierStokes
RESULT NavierStokes 398
SCORE 250
ossy-szeged commented 3 years ago

cc @dbatyai

rerobika commented 3 years ago

"I guess mainly because of -DJERRY_LCACHE=1 and -DJERRY_PROPRETY_HASHMAP=1"

What does this mean exactly? If you consider these options to be the source of the slowness, a with/without comparison would be great.

Moreover, I've tested the engine on Splay both with and without lcache and hashmap, and there was no significant difference. But feel free to share how you came to this conclusion.
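
A with/without comparison could be scripted roughly as follows. This is a sketch that assumes only the `RESULT <name> <score>` output format shown in this thread; the function names are illustrative. The two sample inputs are excerpts of runs posted above, whose builds differ in many options, so the ratios here only demonstrate the mechanics and are not a controlled comparison.

```python
def parse_results(output):
    """Map benchmark name to score from 'RESULT <name> <score>' lines."""
    results = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[0] == "RESULT":
            results[fields[1]] = float(fields[2])
    return results


def score_ratios(baseline, candidate):
    """Return candidate / baseline for every benchmark present in both runs."""
    return {name: candidate[name] / baseline[name]
            for name in baseline if name in candidate}


# Excerpts from two runs posted in this thread (not a controlled comparison).
run_a = """\
PROGRESS Richards
RESULT Richards 262
PROGRESS RegExp
RESULT RegExp 36.7
PROGRESS Splay
RESULT Splay 0.106
"""
run_b = """\
PROGRESS Richards
RESULT Richards 173
PROGRESS RegExp
RESULT RegExp 155
PROGRESS Splay
RESULT Splay 365
"""

ratios = score_ratios(parse_results(run_a), parse_results(run_b))
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

For a meaningful result, the two builds should differ only in the options under test, for example the lcache and hashmap settings.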

dbatyai commented 3 years ago

As it was said before, the splay test was created specifically to test garbage collection, and thus uses a lot of memory and creates a lot of fragmentation. The fact that it is slow has nothing to do with the lcache, and very little with hashmaps (these can have a slight effect on fragmentation).

The reason this test is slow is that the jerry allocator was not designed to handle this much memory, and maintaining the free block list gets more and more costly as the memory gets fragmented. I have a few ideas on how to improve the logic behind the allocator to make it less affected by fragmentation, but can't really give anything specific for now.

However, the fact remains the same: the core idea behind the allocator will still be handling smaller amounts of memory with as little overhead as possible, not high performance. When larger amounts of memory are required or performance is more critical, the system allocator should be used instead.
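
To illustrate why a fragmented free list gets costly, here is a toy model. It is not JerryScript's actual allocator: the first-fit policy, the address-ordered list, and all block sizes are assumptions chosen only to show how the number of free blocks visited per allocation grows with fragmentation.

```python
def first_fit(free_list, size):
    """Allocate `size` bytes from an address-ordered list of (offset, length).

    Returns (offset, blocks_visited); offset is None when no block fits.
    """
    for index, (offset, length) in enumerate(free_list):
        if length >= size:
            if length == size:
                free_list.pop(index)  # block is consumed entirely
            else:
                free_list[index] = (offset + size, length - size)  # shrink block
            return offset, index + 1
    return None, len(free_list)


def fragmented_heap(small_holes):
    """Build a free list of `small_holes` 8-byte holes followed by one big block."""
    holes = [(i * 16, 8) for i in range(small_holes)]
    return holes + [(small_holes * 16, 4096)]


# A 32-byte request must skip every 8-byte hole before reaching the big block.
for holes in (10, 100, 1000):
    _, visited = first_fit(fragmented_heap(holes), 32)
    print(f"{holes} small holes -> {visited} free blocks visited")
```

In this model the work per allocation grows linearly with the number of unusable holes, which is the kind of cost the comment above attributes to fragmentation.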

lygstate commented 3 years ago

Hi, verified, you are right. However, in the current situation, on a 64-bit processor we cannot use the system allocator. That is why I was forced to use the jerry memory allocator, which leads to significant fragmentation.

excitedbox commented 3 years ago

Sorry to bring up an old topic, but lvgl switched memory allocators recently, which gave a pretty big speed boost. It could be worth having a look at what they used. I forget the name of the library right now.