jonathanmetzman opened 4 months ago
Two bots that have been running their tasks for weeks are stuck here: `Testing for crash`
I've looked at two instances suffering from this issue. Their stacktraces looked roughly similar:
```
(gdb) bt
#0 0x00005cbe7d5d4b34 in sre_ucs1_match (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7c014, toplevel=toplevel@entry=0) at ./Modules/sre_lib.h:590
#1 0x00005cbe7d5db30d in sre_ucs1_search (pattern=<optimized out>, state=0x7ffee26f5bd0) at ./Modules/sre_lib.h:1443
#2 sre_search (state=state@entry=0x7ffee26f5bd0, pattern=pattern@entry=0x5cbe7fc7bfe8) at ./Modules/_sre.c:578
#3 0x00005cbe7d5dd414 in pattern_subx (self=self@entry=0x5cbe7fc7bf90, ptemplate=<optimized out>, string=0x5cbe840925b0, count=0, subn=subn@entry=0) at ./Modules/_sre.c:1060
#4 0x00005cbe7d5ddbf5 in _sre_SRE_Pattern_sub_impl (count=<optimized out>, string=<optimized out>, repl=<optimized out>, self=0x5cbe7fc7bf90) at ./Modules/_sre.c:1181
#5 _sre_SRE_Pattern_sub (self=0x5cbe7fc7bf90, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at ./Modules/clinic/_sre.c.h:416
#6 0x00005cbe7d4b9ea7 in _PyMethodDef_RawFastCallKeywords (method=0x5cbe7d741320 <pattern_methods+96>, self=self@entry=0x5cbe7fc7bf90, args=args@entry=0x7c43d107cf10, nargs=nargs@entry=3, kwnames=kwnames@entry=0x0) at Objects/call.c:660
#7 0x00005cbe7d6346be in _PyMethodDescr_FastCallKeywords (descrobj=descrobj@entry=0x7c43d311a550, args=0x7c43d107cf08, nargs=nargs@entry=4, kwnames=kwnames@entry=0x0) at Objects/descrobject.c:288
#8 0x00005cbe7d4a01b2 in call_function (pp_stack=pp_stack@entry=0x7ffee26f5e90, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4593
#9 0x00005cbe7d4a119c in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3110
#10 0x00005cbe7d568560 in PyEval_EvalFrameEx (throwflag=0, f=0x7c43d107cd70) at Python/ceval.c:547
```
This looks like Python is stuck in catastrophic regex backtracking (an effectively never-terminating match).
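Nested quantifiers like `(a+)+` are the classic trigger for this kind of hang: on a non-matching input the backtracking engine explores exponentially many ways to partition the string before giving up. A small illustrative demo (this is not the actual ClusterFuzz regex):

```python
import re
import time

# Illustrative catastrophic-backtracking pattern (NOT the actual ClusterFuzz
# regex): nested quantifiers followed by a literal that never appears.
PATHOLOGICAL = re.compile(r"(a+)+b")

def failing_match_time(n):
    """Time a match that must fail after exploring ~2**(n-1) partitions."""
    start = time.perf_counter()
    # No trailing 'b', so the engine backtracks through every partition.
    assert PATHOLOGICAL.fullmatch("a" * n) is None
    return time.perf_counter() - start

# Each extra character roughly doubles the work; by n ~ 40-50 the match
# effectively never terminates, consistent with a bot pegged in sre_ucs1_match.
print(f"n=10: {failing_match_time(10):.6f}s")
print(f"n=20: {failing_match_time(20):.6f}s")
```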
I got a stacktrace from a third instance:
```
Traceback (most recent call last):
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 249, in <module>
    main()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 212, in main
    error_stacktrace, clean_exit, task_payload = task_loop()
  File "/mnt/scratch0/clusterfuzz/src/python/bot/startup/run_bot.py", line 146, in task_loop
    commands.process_command(task)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 249, in process_command
    task.high_end, task.is_command_override)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 159, in wrapper
    return func(task_name, task_argument, job_name, *args, **kwargs)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 431, in process_command_impl
    preprocess)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/commands.py", line 218, in run_command
    result = task.execute(task_argument, job_name, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 127, in execute
    self.execute_locally(task_argument, job_type, uworker_env)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/task_types.py", line 63, in execute_locally
    uworker_output = utasks.uworker_main_no_io(self.module, uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py", line 194, in uworker_main_no_io
    uworker_output = utask_module.utask_main(uworker_input)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 368, in utask_main
    fuzz_target, testcase, testcase_file_path, test_timeout)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/tasks/utasks/analyze_task.py", line 197, in test_for_crash_with_retries
    compare_crash=False)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 801, in test_for_crash_with_retries
    testcase.flaky_stack)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 688, in reproduce_with_retries
    state = self._get_crash_state(round_number, crash_result)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/bot/testcase_manager.py", line 664, in _get_crash_state
    state = crash_result.get_symbolized_data()
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/crash_result.py", line 48, in get_symbolized_data
    self.output, symbolize_flag=True)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/_internal/crash_analysis/stack_parsing/stack_analyzer.py", line 113, in get_crash_data
    result = stack_parser.parse(crash_stacktrace_without_inlines)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 472, in parse
    self.match_assert(line, state, ASSERT_REGEX_GLIBC_SUFFIXED)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 322, in match_assert
    regex, line, state, new_type='ASSERT', new_frame_count=1)
  File "/mnt/scratch0/clusterfuzz/src/clusterfuzz/stacktraces/__init__.py", line 184, in update_state_on_match
    match = compiled_regex.match(line)
KeyboardInterrupt
root@clusterfuzz-
```
I think a few hundred bots are blocked by this.
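One way to guard the stack parser against a runaway match like the one in the traceback above is to run each match under a deadline. A minimal sketch, assuming a hypothetical `match_with_timeout` helper (not ClusterFuzz's actual fix); it uses a child process so the stuck regex engine can be killed from outside:

```python
import multiprocessing
import re

def _match_worker(pattern, line, queue):
    # Runs in a child process so a runaway match can be killed from outside.
    queue.put(re.match(pattern, line) is not None)

def match_with_timeout(pattern, line, seconds=5.0):
    """Return True/False for re.match(pattern, line), or None on timeout.

    Heavyweight but robust: terminating the child works even while the
    engine is spinning inside the C matching loop.
    """
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(
        target=_match_worker, args=(pattern, line, queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()  # kill the stuck regex engine
        proc.join()
        return None  # treat a runaway match as "no match"
    return queue.get()
```

A cheaper alternative would be rewriting the offending regex to avoid nested quantifiers, which is the more direct fix.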
Another thing we should have is better task killing: a bot should kill a process that runs a task for too long. @oliverchang, do you know if we already have this feature? I think we do, right?
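For the task-killing idea, the stdlib already gives us most of it: `subprocess.run` kills the child on timeout. A minimal sketch (hypothetical `run_task` helper, not the existing ClusterFuzz feature being asked about):

```python
import subprocess

def run_task(cmd, timeout_seconds):
    """Run a task command, killing it if it exceeds its deadline.

    Returns the exit code, or None if the task was killed for running
    too long.
    """
    try:
        completed = subprocess.run(
            cmd, timeout=timeout_seconds, capture_output=True)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return None
```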
Another lesson we can learn is to have alerts when queue backlogs get too long. A long backlog is both a symptom of other problems and an issue in and of itself.
This has been mitigated in ClusterFuzz, and fuzztest no longer does this.
But we should still handle backlogs better.
There is a backlog of ~100k unacked messages. Interestingly, many of these tasks appear to have been taken but not finished.
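The backlog-alerting idea could be as simple as a sustained-threshold check over periodic samples of the unacked-message count. A sketch with a hypothetical `should_alert` helper (a real deployment would read the metric from the monitoring system rather than take it as a list):

```python
def should_alert(samples, threshold=100_000, sustained=3):
    """Alert only if the last `sustained` samples all exceed `threshold`.

    Requiring several consecutive bad samples keeps a momentary spike
    from paging anyone; a stuck fleet, like this incident, stays above
    the threshold for hours and trips the alert.
    """
    recent = samples[-sustained:]
    return len(recent) == sustained and all(s > threshold for s in recent)

# Sustained backlog -> alert; a single recovered dip -> no alert.
assert should_alert([5_000, 120_000, 130_000, 150_000])
assert not should_alert([150_000, 4_000, 130_000, 150_000])
```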