Closed: SamSaffron closed this issue 3 years ago.
I'd suggest turning off the cache in the short term. I noticed the crash location is in a different spot than the previous post in the PR. Is that what you see in the various crashes?
How large is the JSON being parsed and what is the rate? Just wondering about the interaction with GC.
Not sure how large the JSON is, because it is happening in background processes (in Sidekiq).
How do we turn off the cache?
The previous backtrace was from a stuck process; this one is from a segfaulted process. I have not done a full analysis of all the core dumps we got, but will see if we have any more data here.
Thanks. Turn off the cache with Oj.default_options = { cache_keys: false, cache_strings: 0 }
It's the end of the day here, so I'll have to call it a night pretty soon, but I will pick it up tomorrow. If turning the cache off fixes the issue, that at least narrows it down. I'll also look into whether Ruby 2.7.x GC occurs during normal evaluations.
It would be helpful if you could provide steps and JSON that reproduce the issue.
I have a backtrace from the same background process (running on a different machine) that segfaulted as in the OP; however, this was in a different location. I can't see any reference to Oj in the full backtrace, but this occurred around the same time after upgrading to 3.13.3.
It looks like the crash occurred during GC.
(gdb) bt 25
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007fd4d981d535 in __GI_abort () at abort.c:79
#2 0x00007fd4d9e03756 in die () at error.c:647
#3 rb_bug (fmt=fmt@entry=0x7fd4da06b6ae "try to mark T_NONE object") at error.c:647
#4 0x00007fd4d9e03b27 in gc_mark_ptr (objspace=0x7fd4d8c8a100, obj=140551651948288) at gc.c:5175
#5 0x00007fd4d9eb31eb in gc_mark (obj=<optimized out>, objspace=0x7fd4d8c8a100) at gc.c:5205
#6 gc_mark_values (values=<optimized out>, n=<optimized out>, objspace=0x7fd4d8c8a100) at gc.c:4713
#7 gc_mark_imemo (obj=140551651945640, objspace=0x7fd4d8c8a100) at gc.c:5251
#8 gc_mark_children (objspace=0x7fd4d8c8a100, obj=140551651945640) at gc.c:5334
#9 0x00007fd4d9ebab71 in gc_mark_stacked_objects (incremental=1, count=3693, objspace=0x7fd4d8c8a100) at gc.c:5506
#10 gc_mark_stacked_objects_incremental (count=3693, objspace=0x7fd4d8c8a100) at gc.c:5540
#11 gc_marks_step (slots=<optimized out>, objspace=0x7fd4d8c8a100) at gc.c:6445
#12 gc_marks_continue (heap=0x7fd4d8c8a128, objspace=0x7fd4d8c8a100) at gc.c:6505
#13 heap_prepare (heap=0x7fd4d8c8a128, objspace=0x7fd4d8c8a100) at gc.c:2020
#14 heap_get_freeobj_from_next_freepage (heap=<optimized out>, objspace=<optimized out>) at gc.c:2037
#15 heap_get_freeobj (heap=0x7fd4d8c8a128, objspace=0x7fd4d8c8a100) at gc.c:2076
#16 newobj_slowpath (wb_protected=1, objspace=0x7fd4d8c8a100, v3=0, v2=0, v1=0, flags=5, klass=140552148301240) at gc.c:2220
#17 newobj_slowpath_wb_protected (klass=140552148301240, flags=5, v1=0, v2=0, v3=0, objspace=0x7fd4d8c8a100) at gc.c:2232
#18 0x00007fd4d9fd5fa5 in str_alloc (klass=<optimized out>) at string.c:750
#19 str_new0 (klass=<optimized out>,
ptr=0x7fd475d3c600 "{\"retry\":true,\"queue\":\"default\",\"class\":\"Jobs::UnpinTopic\",\"args\":[{\"topic_id\":5875,\"current_site_id\":\"akimbo26\"}],\"jid\":\"cd286b99ac003d8fe731fefe\",\"created_at\":1584936702.2181027}\r\n$18\r\n33141873599.2"..., len=180, termlen=1) at string.c:750
#20 0x00007fd4d9fe91e1 in str_substr (empty=1, len=<optimized out>, beg=140550486607360, str=<optimized out>) at string.c:2607
#21 rb_str_substr (str=str@entry=140551056786840, beg=beg@entry=0, len=<optimized out>) at string.c:2586
#22 0x00007fd4d9feeefd in rb_str_aref_m (argc=2, str=140551056786840, argv=0x7fd48fefb360) at string.c:4565
#23 rb_str_aref_m (str=140551056786840, argv=0x7fd48fefb360, argc=2) at string.c:4556
#24 rb_str_slice_bang (argc=2, argv=<optimized out>, str=140551056786840) at string.c:4873
If you need more information, I will have to try to run the background task under gdb and catch a fault as it occurs.
We also had some SEGVs after updating to 3.13.3 (cc @etiennebarrie @adrianna-chang-shopify); here's the backtrace (very similar to Sam's):
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f64c3a0e859 in __GI_abort () at abort.c:79
#2 0x000055a89436dd31 in die () at error.c:773
#3 rb_bug (fmt=fmt@entry=0x55a8945f8960 "try to mark T_NONE object") at error.c:773
#4 0x000055a89438f5a8 in gc_mark_ptr (objspace=0x7f64c326c000, obj=140070595921664) at gc.c:6114
#5 0x000055a894390ee0 in gc_mark_children (objspace=0x7f64c326c000, obj=140066757234120) at gc.c:6378
#6 0x000055a8943919f8 in gc_mark_stacked_objects (incremental=1, count=2147483647, objspace=0x7f64c326c000) at gc.c:6471
#7 gc_mark_stacked_objects_incremental (count=2147483647, objspace=0x7f64c326c000) at gc.c:6505
#8 gc_marks_rest (objspace=0x7f64c326c000) at gc.c:7448
#9 0x000055a894395b42 in gc_marks_continue (objspace=0x7f64c326c000, heap=0x7f64c326c028) at gc.c:7492
#10 0x000055a894395ddb in heap_prepare (heap=<optimized out>, objspace=0x7f64c326c000) at gc.c:2044
#11 heap_next_freepage (heap=<optimized out>, objspace=<optimized out>) at gc.c:2200
#12 ractor_cache_slots (cr=<optimized out>, cr=<optimized out>, objspace=<optimized out>) at gc.c:2219
#13 newobj_slowpath (wb_protected=1, cr=0x7f64c3377000, objspace=0x7f64c326c000, flags=8193, klass=140070569509720) at gc.c:2265
#14 newobj_slowpath_wb_protected (klass=140070569509720, flags=8193, objspace=0x7f64c326c000, cr=0x7f64c3377000) at gc.c:2284
#15 0x000055a894395fdd in newobj_of0 (cr=<optimized out>, wb_protected=1, flags=8193, klass=<optimized out>) at gc.c:2322
#16 newobj_of (wb_protected=1, v3=52, v2=52, v1=52, flags=8193, klass=<optimized out>) at gc.c:2333
#17 rb_class_allocate_instance (klass=<optimized out>) at gc.c:2488
#18 0x000055a894400113 in class_call_alloc_func (klass=140070569509720, allocator=<optimized out>) at object.c:2064
#19 rb_class_alloc (klass=140070569509720) at object.c:2036
#20 rb_class_new_instance_pass_kw (argc=2, argv=0x7f64c3277c60, klass=140070569509720) at object.c:2109
#21 0x000055a89451a37b in vm_call_cfunc_with_frame (ec=0x7f64c3271050, reg_cfp=0x7f64c3374928, calling=<optimized out>) at vm_insnhelper.c:2926
#22 0x000055a89452f360 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:336
#23 vm_exec_core (ec=0x7f64c3271050, initial=140724985337088) at insns.def:789
#24 0x000055a894524c46 in rb_vm_exec (ec=<optimized out>, mjit_enable_p=<optimized out>) at vm.c:2163
#25 0x000055a894525682 in invoke_bmethod (ec=0x7f64c3271050, iseq=<optimized out>, self=140066684037480, me=0x7f64bab61770, type=<optimized out>, opt_pc=0, captured=<optimized out>) at vm.c:1293
#26 0x000055a894525e36 in vm_call_bmethod_body (argv=0x7ffd16c31e80, calling=0x7ffd16c31f40, ec=0x7f64c3271050) at vm_callinfo.h:329
#27 vm_call_bmethod (ec=0x7f64c3271050, cfp=0x7f64c3374b58, calling=0x7ffd16c31f40) at vm_insnhelper.c:3005
#28 0x000055a89452f360 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:336
#29 vm_exec_core (ec=0x7f64c3271050, initial=140724985337088) at insns.def:789
#30 0x000055a894524c46 in rb_vm_exec (ec=<optimized out>, mjit_enable_p=<optimized out>) at vm.c:2163
#31 0x000055a894525682 in invoke_bmethod (ec=0x7f64c3271050, iseq=<optimized out>, self=140066684034840, me=0x7f63d341cab0, type=<optimized out>, opt_pc=0, captured=<optimized out>) at vm.c:1293
#32 0x000055a894529d03 in vm_call_bmethod_body (argv=0x7f64c3277948, calling=0x7ffd16c322a0, ec=0x7f64c3271050) at vm_callinfo.h:329
#33 vm_call0_body (ec=0x7f64c3271050, calling=0x7ffd16c322a0, argv=0x7f64c3277948) at vm_eval.c:205
#34 0x000055a89452c899 in vm_call0_cc (kw_splat=<optimized out>, cc=<optimized out>, argv=0x7f64c3277948, argc=0, id=2348191, recv=140066684034840, ec=0x7f64c3271050) at vm_eval.c:72
#35 rb_call0 (ec=0x7f64c3271050, recv=140066684034840, mid=2348191, argc=0, argv=0x7f64c3277948, call_scope=<optimized out>, self=52) at vm_eval.c:505
#36 0x000055a89452ce82 in send_internal (argc=0, argv=0x7f64c3277948, recv=140066684034840, scope=<optimized out>) at vm_eval.c:1245
#37 0x000055a89451a37b in vm_call_cfunc_with_frame (ec=0x7f64c3271050, reg_cfp=0x7f64c3374d18, calling=<optimized out>) at vm_insnhelper.c:2926
#38 0x000055a89452f360 in vm_sendish (block_handler=<optimized out>, method_explorer=<optimized out>, cd=<optimized out>, reg_cfp=<optimized out>, ec=<optimized out>) at vm_callinfo.h:336
#39 vm_exec_core (ec=0x7f64c3271050, initial=140724985337088) at insns.def:789
#40 0x000055a89452547d in rb_vm_exec (ec=<optimized out>, mjit_enable_p=<optimized out>) at vm.c:2172
#41 0x000055a8945285df in invoke_block (captured=<optimized out>, opt_pc=<optimized out>, type=<optimized out>, cref=0x0, self=140070626928760, iseq=0x7f64b684bf80, ec=0x7f64c3271050) at vm.c:1263
#42 invoke_iseq_block_from_c (me=0x0, is_lambda=<optimized out>, cref=0x0, passed_block_handler=0, kw_splat=0, argv=0x7ffd16c326d8, argc=1, self=140070626928760, captured=<optimized out>, ec=0x7f64c3271050) at vm.c:1335
#43 invoke_block_from_c_bh (argc=<optimized out>, kw_splat=<optimized out>, passed_block_handler=<optimized out>, cref=<optimized out>, is_lambda=<optimized out>, force_blockarg=<optimized out>, argv=<optimized out>, block_handler=<optimized out>, ec=<optimized out>) at vm.c:1353
#44 vm_yield (argc=1, kw_splat=0, argv=0x7ffd16c326d8, ec=0x7f64c3271050) at vm.c:1398
Note that we're on Ruby 3.0.2.
I personally doubt https://github.com/ohler55/oj/pull/695 is the cause, because we already use rb_enc_interned_str() extensively via msgpack-ruby, and we've never seen this.
The backtrace seems to indicate that some Ruby object is in an invalid state; I'm not familiar enough with Oj's codebase to tell which one it could be.
We'll try to reproduce this with GC.stress = true.
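Something along these lines (a minimal sketch; the payload and iteration count are arbitrary, and any JSON with repeated keys should exercise the cache):

require 'oj'

GC.stress = true                # run GC as aggressively as possible
json = '{"retry":true,"queue":"default","args":[{"topic_id":5875}]}'
10_000.times { Oj.load(json) }  # a cache/GC race should surface as a [BUG] abort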
That is helpful, and it is another indicator of what I was starting to suspect. The issue is around the GC thread and the value threads: either GC is running at the same time as the value threads, or registers are being used in parts of the code that are not getting updated across threads. The fixes I'll put in later today (after work) will be a mutex protecting the GC thread (marking) and the use of volatile for variables that are shared by both threads. Will someone be able to try the new branch once it is ready?
Here's another snapshot of this happening...
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:84: [BUG] Segmentation fault at 0x0000000000000000
ruby 2.7.4p191 (2021-07-07 revision a21a3b7d23) [x86_64-linux]
-- Control frame information -----------------------------------------------
c:0033 p:---- s:0154 e:000153 CFUNC :parse
c:0032 p:0027 s:0149 e:000148 METHOD /app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:84
c:0031 p:0024 s:0145 e:000144 METHOD /app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:63
c:0030 p:0003 s:0141 e:000140 METHOD /app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:70
c:0029 p:0027 s:0136 e:000135 METHOD /app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:44
c:0028 p:0011 s:0130 e:000129 METHOD /app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:49
c:0027 p:0028 s:0124 e:000122 METHOD /app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/helper.rb:22
c:0026 p:0020 s:0117 e:000116 BLOCK /app/lib/source_map_uploader.rb:34 [FINISH]
c:0025 p:---- s:0113 e:000112 CFUNC :map
c:0024 p:0043 s:0109 e:000108 METHOD /app/lib/source_map_uploader.rb:34
c:0023 p:0003 s:0103 e:000102 METHOD /app/lib/source_map_uploader.rb:17
c:0022 p:0026 s:0099 e:000098 BLOCK /app/lib/tasks/upload_sourcemaps.rake:6
<snip> ... rake ... </snip>
c:0002 p:0025 s:0006 e:000005 EVAL /app/bin/rake:4 [FINISH]
c:0001 p:0000 s:0003 E:001890 (none) [FINISH]
-- Ruby level backtrace information ----------------------------------------
<snip> ... rake ... </snip>
/app/lib/tasks/upload_sourcemaps.rake:6:in `block (2 levels) in <top (required)>'
/app/lib/source_map_uploader.rb:17:in `upload'
/app/lib/source_map_uploader.rb:34:in `files_to_upload'
/app/lib/source_map_uploader.rb:34:in `map'
/app/lib/source_map_uploader.rb:34:in `block in files_to_upload'
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/helper.rb:22:in `asset_pack_path'
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:49:in `lookup!'
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:44:in `lookup'
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:70:in `find'
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:63:in `data'
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:84:in `load'
/app/vendor/bundle/ruby/2.7.0/gems/webpacker-5.4.2/lib/webpacker/manifest.rb:84:in `parse'
-- Machine register context ------------------------------------------------
RIP: 0x00007f4520271998 RBP: 0x0000000000000006 RSP: 0x00007ffc88502fc0
RAX: 0x00007f4524339800 RBX: 0x4cf2894cfa1e0ff3 RCX: 0x0000000000000000
RDX: 0x0000000000000070 RDI: 0x00007f4517206ce4 RSI: 0x0001f921b62bea6d
R8: 0x0000000226eb797e R9: 0x00007f4517206ce2 R10: 0x00007f4517206ce4
R11: 0x0000000000000004 R12: 0x0000000000000006 R13: 0x00007f4517206cde
R14: 0x00007f4524339b80 R15: 0x00007f4524408250 EFL: 0x0000000000010206
-- C level backtrace information -------------------------------------------
/app/vendor/ruby-2.7.4/bin/../lib/libruby.so.2.7(rb_print_backtrace+0x19) [0x7f452c50c4d9] vm_dump.c:755
/app/vendor/ruby-2.7.4/bin/../lib/libruby.so.2.7(rb_vm_bugreport+0x21c) [0x7f452c50c70c] vm_dump.c:941
[0x7f452c3861fb]
/app/vendor/ruby-2.7.4/bin/../lib/libruby.so.2.7(sigsegv+0x4d) [0x7f452c489f2d] signal.c:946
/lib/x86_64-linux-gnu/libc.so.6(0x7f452c123210) [0x7f452c123210]
/app/vendor/bundle/ruby/2.7.0/gems/oj-3.13.3/lib/oj/oj.so(locking_intern+0xb8) [0x7f4520271998] cache.c:197
/app/vendor/bundle/ruby/2.7.0/gems/oj-3.13.3/lib/oj/oj.so(oj_calc_hash_key+0xc5) [0x7f45202a5cb5] strict.c:47
[0x7f4520273055]
/app/vendor/bundle/ruby/2.7.0/gems/oj-3.13.3/lib/oj/oj.so(add_value+0x93) [0x7f45202918b3] parse.c:83
/app/vendor/bundle/ruby/2.7.0/gems/oj-3.13.3/lib/oj/oj.so(oj_parse2+0x82a) [0x7f4520293cda] parse.c:618
/app/vendor/bundle/ruby/2.7.0/gems/oj-3.13.3/lib/oj/oj.so(protect_parse+0xd) [0x7f452029406d] parse.c:959
/app/vendor/ruby-2.7.4/bin/../lib/libruby.so.2.7(rb_protect+0x163) [0x7f452c38d533] eval.c:1087
/app/vendor/bundle/ruby/2.7.0/gems/oj-3.13.3/lib/oj/oj.so(oj_pi_parse+0x169) [0x7f45202941e9] parse.c:1068
/app/vendor/bundle/ruby/2.7.0/gems/oj-3.13.3/lib/oj/oj.so(mimic_parse_core+0x36e) [0x7f4520287b1e] mimic_json.c:595
[0x7f452c4e7d78]
[0x7f452c4fc27b]
/app/vendor/ruby-2.7.4/bin/../lib/libruby.so.2.7(vm_sendish+0x76) [0x7f452c4e2ef6] vm_insnhelper.c:4023
[0x7f452c4f3a8a]
[0x7f452c4f9e80]
/app/vendor/ruby-2.7.4/bin/../lib/libruby.so.2.7(rb_yield+0x271) [0x7f452c506e21] vm.c:1044
/app/vendor/ruby-2.7.4/bin/../lib/libruby.so.2.7(rb_ary_collect+0x5c) [0x7f452c3127ec] array.c:3065
@ohler55 I may be able to run a test version.
The JSON being parsed was a Webpacker manifest file, so reasonably chunky.
Thanks. All good information.
The issue was that the cache takes a lock, as it must for a multithreaded environment such as what is coming with Ruby Ractors. But that also means that memory cannot be allocated with the Ruby memory allocation functions/macros, since they can trigger a GC, which the cache uses to expunge not-recently-used strings. Long story short, the "thread-protect" branch has what I believe is a fix. It uses volatile in the places that might be an issue and uses stdlib memory functions. One more thing that could be done would be to mutex-protect the marking, but I don't believe that is needed, as the GVL should handle that.
I'd like to question the usefulness of this cache, though: isn't it redundant with rb_enc_interned_str and rb_intern? Both of these are able to look up existing objects with zero allocation and a simple hash lookup. So I feel like this cache shouldn't be used with Ruby 3.0. Or am I missing something?
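(For illustration: rb_enc_interned_str is the C API behind Ruby's deduplicated frozen strings. A quick Ruby-level sketch of the zero-allocation lookup I mean:)

a = -"some_json_key"     # String#-@ returns the interned, frozen copy
b = -"some_json_key"
a.equal?(b)              # => true: same object, no new allocation
:a_key.equal?(:a_key)    # => true: symbols are always interned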
The cache gives a significant boost in performance over just calling the Ruby intern functions. There is an option not to use the cache, and that is appropriate when loading something like a large dictionary just one time. The use of the Ruby intern functions is similar in that respect. Intern or cache is for multiple uses of the same string or symbol.
The cache gives a significant boost in performance over just calling the Ruby intern functions.
Do you have any specifics on this? Reading https://github.com/ohler55/oj/pull/695 I don't see actual numbers. How much faster are we talking?
Depending on the JSON, I was seeing from 10 to 20%.
Thanks. Definitely not worth it for us; we'll make sure to turn that cache off.
Not everyone needs the performance boost; that's why it is an option.
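For a one-time load of a large document, the cache can also be turned off per call rather than globally, e.g. (a sketch; this uses the :cache_str spelling that is confirmed below):

require 'oj'

# manifest.json stands in for any large document that is parsed only once
data = Oj.load(File.read('manifest.json'), cache_keys: false, cache_str: 0)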
BTW, I do have some raw numbers comparing rb_enc_interned_str() and the Oj cache. For strings they are about the same, with less memory use by Oj. For symbols there is a significant difference, as these results demonstrate. (The code is ugly right now and in a branch on the Ox repo, but I'll clean it up and provide it at a later date. I figured since I had it I'd share it.)
intern cache 0: 749457 strings in 1.717807 seconds (436287.126687/sec)
intern cache 1: 749457 strings in 0.156665 seconds (4783830.888387/sec)
rb_intern 0: 749457 strings in 0.781864 seconds (958551.118986/sec)
rb_intern 1: 749457 strings in 0.843268 seconds (888752.625990/sec)
Note that memory use is the difference from before to after, not total use. The test runs once (0) to fill the cache or intern table, and then a second time (1) as if the system has been warmed up. Roughly 3/4 of a million unique strings are used.
Update: I don't trust the memory-use numbers, so they were removed; they were not consistent and changed depending on the order of the runs. The times are a bit different too. Here are some more accurate timing results for symbol keys.
oj cache 0: 749457 in 1.657558 seconds (452145.143959/sec)
oj cache 1: 749457 in 0.158449 seconds (4729966.344150/sec)
rb_intern 0: 749457 in 0.503076 seconds (1489749.543224/sec)
rb_intern 1: 749457 in 0.507033 seconds (1478121.710994/sec)
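The harness itself is C code against the cache internals, but the warm-up pattern can be approximated from Ruby with something like this (a rough sketch, not the actual benchmark; the key count and document shape are made up):

require 'oj'
require 'benchmark'

docs = Array.new(100_000) { |i| %({"key_#{i}":1}) }  # unique key per document

[true, false].each do |cached|
  Oj.default_options = { cache_keys: cached, symbol_keys: true }
  2.times do |run|  # run 0 fills the cache, run 1 measures the warmed-up case
    t = Benchmark.realtime { docs.each { |d| Oj.load(d) } }
    puts format('cache_keys=%-5s run %d: %.6f seconds', cached, run, t)
  end
end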
@ohler55, above it says that the cache can be disabled with Oj.default_options = { cache_keys: false, cache_strings: 0 }. Is the second option cache_strings or cache_str?
On 3.13.3:
Oj.default_options.select{|k,_| k.to_s=~/cache_/}
# => {:cache_str=>0, :cache_keys=>true}
Likewise, setting :cache_str works, while trying to set :cache_strings seems to be ignored?
Side note: what about adding a brief note to Options.md that setting cache_str/cache_strings to 0 disables that feature? Also, I'm guessing that cache_keys affects keys regardless of length, and cache_strings affects values only? Would it be worth clarifying that too?
Right you are, it is :cache_str. I'll make both work and add more documentation.
There is a maximum length for cached keys and strings of 34 characters.
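In other words (an illustration of the intended behavior as described above, not verified output):

Oj.default_options = { cache_keys: true }
Oj.load('{"id":1}')            # the short key "id" can be served from the cache
Oj.load(%({"#{'k' * 40}":1}))  # a 40-character key exceeds the 34-character cap and is not cached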
Pushed changes to the 'thread-protect' branch. http://www.ohler.com/oj/doc/file.Options.html updated as well.
I have one report that the thread-protect branch fixed the issue. Anyone else tried it out yet?
Released v3.13.4 for the fix.
Possibly related to code changed in https://github.com/ohler55/oj/pull/695
After we upgraded to 3.13.3 we started getting some new alerts from our infra:
The backtraces point back to oj_calc_hash_key.
We are using Ruby 2.7.2p137 (we are going to upgrade very shortly to 2.7.4).
The issue is quite extensive; we opted to downgrade to 3.13.2 for now to see if that resolves it, per:
https://github.com/discourse/discourse/commit/0183d51070d33abc66c8681b43a5ce7571333b13
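The pin itself is a one-line Gemfile change (sketch; see the linked commit for the actual diff):

gem 'oj', '3.13.2'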
The segfault is the primary issue. You can see the scale here; usually there are no segfaults for weeks across hundreds of processes.
Will update here to confirm if 3.13.2 resolves the issue.