Linux amber does not get properly killed

hevrard commented 5 years ago

We observed several amber processes still running (and eating all the memory), although only one should be running at a time. I suspect the Python subprocess timeout does not actually kill the amber process when it expires.

We should double check that a real SIGKILL is sent to amber when it times out.
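A minimal sketch of such a check, assuming a POSIX system and using a sleeping Python child as a stand-in for amber (the helper name `run_with_timeout` is hypothetical, not taken from the GraphicsFuzz scripts). It only verifies how the direct child ended; it says nothing about processes that child may have spawned.

```python
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout_sec):
    """Run cmd; on timeout, SIGKILL the direct child and report how it ended.

    Returns (timed_out, returncode). On POSIX, a negative returncode -N
    means the child was terminated by signal N.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.communicate(timeout=timeout_sec)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()         # sends SIGKILL on POSIX
        proc.communicate()  # reap the child so that returncode is set
        return True, proc.returncode


if __name__ == "__main__":
    # Stand-in for amber: a child that sleeps well past the timeout.
    child = [sys.executable, "-c", "import time; time.sleep(30)"]
    timed_out, code = run_with_timeout(child, timeout_sec=0.5)
    print("timed out:", timed_out)
    print("killed by SIGKILL:", code == -signal.SIGKILL)
```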

afd commented 5 years ago

@hevrard or @paulthomson , can one of you take this?

afd commented 5 years ago

Tentatively assigned @hevrard

hevrard commented 5 years ago

Thanks @paulthomson for the joint investigation!

It looks like the Python subprocess timeout does not mix well with catchsegv, which spawns new processes: when Python kills the subprocess after a timeout, the top-level catchsegv process gets killed, but its children (another catchsegv and amber) happily continue to run.

We can replace catchsegv with gdb -batch -ex run -ex backtrace --args amber ..., which produces a stack trace when a segfault occurs. Thanks Paul for this one-liner!
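As a rough sketch, the one-liner could be built from Python like this. The helper name `wrap_with_gdb` and the amber arguments in the usage line are hypothetical; only the gdb flags come from the command quoted above.

```python
def wrap_with_gdb(cmd, gdb="gdb"):
    """Return an argv list that runs cmd under gdb in batch mode.

    gdb runs the program, then executes "backtrace", which prints the
    stack if the program stopped on a signal such as SIGSEGV.
    """
    return [gdb, "-batch", "-ex", "run", "-ex", "backtrace", "--args"] + list(cmd)


if __name__ == "__main__":
    # Hypothetical amber invocation, for illustration only.
    print(wrap_with_gdb(["amber", "script.shader_test"]))
```

The resulting list can be passed straight to `subprocess.Popen` or `subprocess.run` in place of the original command.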

In a small experiment, though, gdb adds a significant delay (+4 sec per run), probably due to loading symbols from amber. I'll try to see if we can work around this.

hevrard commented 5 years ago

Running amber release and debug builds, with and without gdb, on a single non-trivial shader gives these results:

amber-release: 0.3 sec
amber-debug: 0.7 sec
gdb amber-release: 1.3 sec
gdb amber-debug: 4.9 sec

Using gdb leads to a significant slowdown, 1 second in the best case.
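The thread does not say how these timings were taken. One way to reproduce such a comparison is a small wall-clock harness like the sketch below, where `mean_wall_time` is a hypothetical helper and the timed command is a stand-in for the amber and gdb command lines.

```python
import subprocess
import sys
import time


def mean_wall_time(cmd, runs=3):
    """Return the mean wall-clock time in seconds of running cmd `runs` times."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        total += time.perf_counter() - start
    return total / runs


if __name__ == "__main__":
    # Stand-in command; substitute the plain and gdb-wrapped amber commands.
    print("%.2f sec" % mean_wall_time([sys.executable, "-c", "pass"]))
```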

I suggest we use core dumps instead, and retrieve the backtrace afterwards only when a crash actually occurred.
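A minimal sketch of the core-dump idea, assuming Linux, a gdb binary on the PATH, and that the kernel writes core files somewhere the script can find them (the location is governed by /proc/sys/kernel/core_pattern and is not known here). All helper names are hypothetical.

```python
import resource
import subprocess
import sys


def allow_core_dumps():
    """Raise this process's soft core-size limit to its hard limit.

    Meant to be passed as preexec_fn, so that only the child is affected.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_allowing_core(cmd, timeout_sec, **kwargs):
    """Run cmd directly, with no debugger attached, and core dumps allowed."""
    return subprocess.run(
        cmd, timeout=timeout_sec, preexec_fn=allow_core_dumps, **kwargs)


def died_from_signal(returncode):
    """On POSIX, subprocess reports death by signal N as returncode -N."""
    return returncode is not None and returncode < 0


def backtrace_from_core(binary, core_path, gdb="gdb"):
    """Ask gdb for a backtrace from an existing core file (post-mortem)."""
    result = subprocess.run(
        [gdb, "-batch", "-ex", "backtrace", binary, core_path],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True)
    return result.stdout


if __name__ == "__main__":
    # Demo with a stand-in child that prints its own core-size limits.
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_CORE))"
    run_allowing_core([sys.executable, "-c", probe], timeout_sec=30)
```

With this shape, the normal path pays no gdb startup cost: gdb is only invoked via `backtrace_from_core` when `died_from_signal(result.returncode)` is true.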

I'll wait for the PR tweaking runspv subprocess() usage to be merged before working on a fix.

hevrard commented 5 years ago

The gdb wrapper leads to a weird problem: after a test times out, the next test gets suspended as if it had received SIGSTOP (similar to pressing Ctrl-Z in a terminal):

#### Image job: variant_100
Exec:['/gpu/graphicsfuzz/graphicsfuzz/target/graphicsfuzz/python/drivers/../../bin/Linux/glslangValidator', '-V', 'variant_100.frag', '-o', '/gpu/release-1-1/host-amber/variant_100.frag.spv']
RETURNCODE: 0
Exec:['/gpu/graphicsfuzz/graphicsfuzz/target/graphicsfuzz/python/drivers/../../bin/Linux/spirv-dis', '/gpu/release-1-1/host-amber/variant_100.frag.spv']
RETURNCODE: 0
Exec (verbose):['/usr/bin/gdb', '-return-child-result', '-batch-silent', '-ex', 'run', '-ex', 'backtrace', '-ex', 'set confirm off', '-ex', 'quit', '--args', '/usr/local/google/home/hevrard/bin/amber', '-i', '/gpu/release-1-1/host-amber/image_0.png', '/gpu/release-1-1/host-amber/tmpscript.shader_test']
STDOUT:

STDERR:

STATUS TIMEOUT

Send back, results status: 60
No job
#### Image job: variant_101
Exec:['/gpu/graphicsfuzz/graphicsfuzz/target/graphicsfuzz/python/drivers/../../bin/Linux/glslangValidator', '-V', 'variant_101.frag', '-o', '/gpu/release-1-1/host-amber/variant_101.frag.spv']
RETURNCODE: 0
Exec:['/gpu/graphicsfuzz/graphicsfuzz/target/graphicsfuzz/python/drivers/../../bin/Linux/spirv-dis', '/gpu/release-1-1/host-amber/variant_101.frag.spv']
RETURNCODE: 0
Exec (verbose):['/usr/bin/gdb', '-return-child-result', '-batch-silent', '-ex', 'run', '-ex', 'backtrace', '-ex', 'set confirm off', '-ex', 'quit', '--args', '/usr/local/google/home/hevrard/bin/amber', '-i', '/gpu/release-1-1/host-amber/image_0.png', '/gpu/release-1-1/host-amber/tmpscript.shader_test']

[1]+  Stopped                 glsl-to-spv-worker host-amber host

I suggest we drop support for stack traces on Linux for this release.

hevrard commented 5 years ago

See #351 for a proper solution to this, where we keep using catchsegv but take extra care to kill it and all its children in case of timeout.
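The change in #351 is not reproduced in this thread. As an illustration only, one common way to kill a wrapper and all its children on timeout is to start it in its own session and signal the whole process group. The sketch below assumes POSIX, and `run_killing_group` is a hypothetical name.

```python
import os
import signal
import subprocess
import sys


def run_killing_group(cmd, timeout_sec):
    """Run cmd in its own session; on timeout, SIGKILL its whole process group.

    Returns (timed_out, returncode) for the direct child.
    """
    # start_new_session=True makes the child a session and group leader,
    # so children it spawns stay in its process group by default.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        proc.communicate(timeout=timeout_sec)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        # The child leads its group, so its pid is also the group id.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.communicate()  # reap the direct child
        return True, proc.returncode


if __name__ == "__main__":
    # Stand-in for catchsegv: a wrapper that spawns a long-running child.
    wrapper = (
        "import subprocess, sys, time;"
        "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(30)']);"
        "time.sleep(30)"
    )
    print(run_killing_group([sys.executable, "-c", wrapper], timeout_sec=1.0))
```

A child that moves itself into another session or process group would escape this signal, so this is a sketch of the general technique rather than a guarantee.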

google / graphicsfuzz

Linux amber does not get properly killed #338