Open mame opened 2 years ago
I used perf to investigate the cause of this performance issue.
At first glance, the main bottleneck appears to be rb_check_typeddata.
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 4K of event 'cycles'
# Event count (approx.): 4044660818
#
# Overhead Command Shared Object Symbol
# ........ ............... .................. ....................................................
#
7.02% ruby libruby.so.3.0.2 [.] rb_check_typeddata
4.35% swapper [kernel.kallsyms] [k] 0xffffffff8f3faac9
3.88% ruby narray.so [.] ndloop_run
3.65% ruby narray.so [.] ndloop_set_stepidx.isra.0
3.53% ruby narray.so [.] ndloop_init_args.isra.0
3.45% ruby libruby.so.3.0.2 [.] rb_obj_is_kind_of
2.80% ruby libruby.so.3.0.2 [.] rb_typeddata_inherited_p
2.69% ruby narray.so [.] ndloop_alloc
2.60% ruby libruby.so.3.0.2 [.] gc_sweep_step
2.60% ruby narray.so [.] ndloop_set_output_narray
2.57% ruby libruby.so.3.0.2 [.] vm_exec_core
2.38% ruby libruby.so.3.0.2 [.] vm_call0_body
2.13% ruby libruby.so.3.0.2 [.] rb_funcallv
1.52% ruby libruby.so.3.0.2 [.] ary_memcpy0
1.49% ruby narray.so [.] ndloop_release
1.44% ruby libruby.so.3.0.2 [.] rb_gc_writebarrier
1.36% ruby narray.so [.] na_ndloop_main
1.33% ruby libruby.so.3.0.2 [.] rb_yield_1
1.30% ruby libruby.so.3.0.2 [.] rb_typeddata_inherited_p@plt
1.19% ruby libruby.so.3.0.2 [.] ruby_yyparse
1.17% ruby libruby.so.3.0.2 [.] rb_obj_class
1.14% ruby narray.so [.] na_release_lock
1.10% swapper [kernel.kallsyms] [k] 0xffffffff8f3fa754
1.00% ruby narray.so [.] loop_narray
0.99% ruby libruby.so.3.0.2 [.] vm_call_cfunc_with_frame
0.95% ruby libruby.so.3.0.2 [.] rb_ensure
0.92% ruby narray.so [.] nary_get_pointer_for_read_write
0.87% ruby narray.so [.] iter_dfloat_add
0.84% ruby libruby.so.3.0.2 [.] rb_class_real
0.81% ruby libruby.so.3.0.2 [.] ary_ensure_room_for_push
0.81% ruby narray.so [.] nary_get_pointer_for_read
0.80% ruby libruby.so.3.0.2 [.] rb_wb_protected_newobj_of
0.74% ruby narray.so [.] na_ndloop
0.73% ruby narray.so [.] dfloat_add
0.72% ruby libruby.so.3.0.2 [.] vm_yield_setup_args
0.71% ruby libruby.so.3.0.2 [.] rb_ary_push
0.71% ruby libruby.so.3.0.2 [.] rb_ary_tmp_new_from_values
0.70% ruby libruby.so.3.0.2 [.] ruby_sized_xfree
0.68% ruby narray.so [.] nary_test_reduce
0.65% ruby libruby.so.3.0.2 [.] rb_vm_exec
0.65% ruby libruby.so.3.0.2 [.] rb_obj_alloc
0.63% ruby libc-2.31.so [.] malloc
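For a rough per-library view of the flat profile, the rows can be summed by shared object. The following is only a sketch: it uses just the first ten rows of the report above, and the names PERF_ROWS, ROW, and overhead_by_object are made up for illustration (they are not part of perf or narray).

```ruby
# First ten rows copied verbatim from the flat profile above.
PERF_ROWS = <<~REPORT
  7.02% ruby libruby.so.3.0.2 [.] rb_check_typeddata
  4.35% swapper [kernel.kallsyms] [k] 0xffffffff8f3faac9
  3.88% ruby narray.so [.] ndloop_run
  3.65% ruby narray.so [.] ndloop_set_stepidx.isra.0
  3.53% ruby narray.so [.] ndloop_init_args.isra.0
  3.45% ruby libruby.so.3.0.2 [.] rb_obj_is_kind_of
  2.80% ruby libruby.so.3.0.2 [.] rb_typeddata_inherited_p
  2.69% ruby narray.so [.] ndloop_alloc
  2.60% ruby libruby.so.3.0.2 [.] gc_sweep_step
  2.60% ruby narray.so [.] ndloop_set_output_narray
REPORT

# Hypothetical pattern for one row: overhead, command, shared object, [type], symbol.
ROW = /\A\s*(?<pct>\d+\.\d+)%\s+(?<command>\S+)\s+(?<object>\S+)\s+\[.\]\s+(?<symbol>\S+)/

# Sum the "Overhead" column per shared object; lines that do not match are skipped.
def overhead_by_object(report)
  totals = Hash.new(0.0)
  report.each_line do |line|
    m = ROW.match(line)
    next unless m
    totals[m[:object]] += m[:pct].to_f
  end
  totals.transform_values { |v| v.round(2) }
end

totals = overhead_by_object(PERF_ROWS)
totals.sort_by { |_, pct| -pct }.each do |object, pct|
  puts format("%6.2f%%  %s", pct, object)
end
```

On these ten rows, narray.so and libruby.so.3.0.2 account for similar shares, so the flat profile alone does not single out one library as the hot spot.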
I also profiled with a non-optimized build of Ruby to get the full call stack. The result is:
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 6K of event 'cycles'
# Event count (approx.): 5610555644
#
# Children Self Command Shared Object Symbol
# ........ ........ ............... ........................ ....................................................................
#
66.33% 0.00% ruby [unknown] [.] 0x000055756b40b6f0
|
---0x55756b40b6f0
|
|--60.11%--vm_call_cfunc_with_frame
| |
| --58.78%--dfloat_add
| |
| --58.54%--dfloat_add_self
| |
| --57.81%--na_ndloop
| |
| |--41.13%--rb_ensure
| | |
| | |--33.67%--ndloop_run
| | | |
| | | |--12.31%--ndloop_set_output
| | | | |
| | | | --10.42%--ndloop_set_output_narray
| | | | |
| | | | |--5.20%--ndloop_set_stepidx
| | | | | |
| | | | | --4.58%--nary_get_pointer_for_read_write
| | | | | |
| | | | | --3.45%--na_get_pointer_for_rw
| | | | | |
| | | | | --2.22%--RB_OBJ_FROZEN
| | | | | |
| | | | | --0.92%--RB_TYPE_P
| | | | | |
| | | | | --0.82%--rb_type
| | | | |
| | | | |--1.78%--ndloop_find_inplace
| | | | |
| | | | --1.14%--rbimpl_size_mul_or_raise
| | | |
| | | |--7.89%--ndloop_init_args
| | | | |
| | | | |--1.91%--ndloop_set_stepidx
| | | | | |
| | | | | --1.13%--nary_get_pointer_for_read
| | | | |
| | | | |--0.78%--ndfunc_set_bufcp
| | | | |
| | | | |--0.66%--rb_type
| | | | |
| | | | |--0.65%--rbimpl_size_mul_or_raise
| | | | |
| | | | |--0.64%--iter_dfloat_add
| | | | |
| | | | --0.52%--ndfunc_set_user_loop
| | | |
| | | |--2.55%--loop_narray
| | | | |
| | | | --0.61%--iter_dfloat_add
| | | |
| | | |--0.93%--ndloop_cast_args
| | | |
| | | --0.59%--ndloop_alloc
| | |
| | |--2.45%--ndloop_release
| | | |
| | | --1.67%--na_release_lock
| | | |
| | | |--0.76%--na_release_lock
| | | | |
| | | | --0.64%--ndloop_alloc
| | | |
| | | --0.62%--ndloop_alloc
| | |
| | --1.82%--ndloop_alloc
| |
| |--13.89%--na_ndloop_main
| | |
| | |--4.67%--ndloop_alloc
| | | |
| | | --2.46%--ndloop_find_max_dimension
| | | |
| | | --0.72%--rb_array_len
| | |
| | |--2.11%--ndloop_cast_args
| | | |
| | | --0.85%--rb_type
| | |
| | |--0.93%--ndloop_set_output_narray
| | |
| | |--0.60%--ndloop_set_stepidx
| | |
| | --0.55%--ndloop_find_inplace
| |
| --0.93%--rbimpl_size_mul_or_raise
|
--0.95%--ndloop_alloc
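Judging from the result with the non-optimized Ruby, rb_check_typeddata does not seem to be the main bottleneck after all. Most of the time (57.81%) is spent under na_ndloop, largely in loop setup such as ndloop_run (33.67%) and na_ndloop_main (13.89%), while iter_dfloat_add itself accounts for well under 1% at each place it appears.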
I found that x.inplace + i is much slower than x.add!(i).
I have no idea if this is a bug, but @mrkn asked me to create a ticket.