marxin / cvise

Super-parallel Python port of the C-Reduce

cvise stops intermittently #113

Open avikivity opened 1 year ago

avikivity commented 1 year ago

Occasionally I see

[1]+  Stopped                 cvise --clang-delta-std=c++20 --print-diff ./check.sh sstable_datafile_test.cc

And I have to restart the job with fg. This is of course problematic for unattended runs.

The interestingness test runs gdb in batch mode (gdb also runs the program). Perhaps gdb signals interfere with cvise?
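The kind of interestingness test described here might look roughly like the sketch below (a guess at the setup; the binary name comes from the cvise command line above, and the matched assertion text is hypothetical). gdb in batch mode runs the inferior itself, and it manipulates terminal and signal state while doing so, which is the suspected interference:

```shell
#!/bin/sh
# Sketch of a gdb-based check.sh: the variant is "interesting" (exit 0)
# if the crash still reproduces under gdb. Names and the grep pattern
# are assumptions, not the reporter's actual script.
gdb --batch -ex run -ex bt ./sstable_datafile_test_g >gdb.log 2>&1
grep -q 'Assertion .* failed' gdb.log
```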

avikivity commented 1 year ago

A workaround is to send SIGCONT in a loop from some shell script, but I'd like to understand and fix it.
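The SIGCONT-in-a-loop workaround might be sketched like this (the `pgrep`/`pkill` pattern is an assumption about how the master shows up in the process list; sending SIGCONT to a running process is harmless):

```shell
#!/bin/sh
# Hypothetical watchdog: periodically resume the cvise master in case
# something stopped it. Run alongside the unattended reduction.
while :; do
    pkill -CONT -f 'cvise ' 2>/dev/null
    sleep 10
done
```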

marxin commented 1 year ago

Hmm, I haven't seen this behavior during my cvise use. So you're saying the master process (cvise ...) got moved to the background and was therefore stopped? All the interestingness tests run in separate sub-processes (using the Pebble library) and should not interact with the master process at all. Can you get a Python back-trace of the master process when it gets moved to the background? Does it really happen only if the int. test uses gdb?

avikivity commented 1 year ago

I don't know if it was moved into the background, or if something else happened.

I don't use cvise very often (but when I do, it's for multi-day reductions), so I can't tell if it's related to gdb or not. It seems likely since gdb plays with signals.

I don't know how to generate a Python backtrace (and if it's stopped, I'm sure I won't get one).

I guess a workaround is to package the interestingness test into a container; that should isolate any signal leakage. Still, it would be nice if cvise protected itself from this.

marxin commented 1 year ago

> I don't know if it was moved into the background, or if something else happened.

Well, the described behavior seems pretty unusual.

> I don't use cvise very often (but when I do, it's for multi-day reductions), so I can't tell if it's related to gdb or not. It seems likely since gdb plays with signals.

Anyway, can you please attach a reproducer so I can run it locally and try to reproduce it?

> I guess a workaround is to package the interestingness test into a container, this should isolate any signals leakage. Still, it would be nice if cvise protected itself from this.

Well, that sounds like a solution, but C-Vise should not behave the way you described ;)

avikivity commented 1 year ago

https://github.com/avikivity/scylladb/commits/bug-13730-investigation

Steps to reproduce:

  1. clone the repo into a Fedora 38 installation (or anything with clang 16 + all the dependencies)
  2. run ./cvise.sh
  3. wait for long, long hours

There's a container image with all the dependencies: docker.io/scylladb/scylla-toolchain:fedora-38-20230517. However, I did not try reproducing within the container, only on my Fedora 38 host. Note you'll need to run the container with --privileged since ptrace isn't available otherwise.

The problem reproduces rarely. I have a feeling it happens when the pass changes, but it hasn't happened often enough to tell, and usually I wasn't looking when it did.

marxin commented 1 year ago

Thanks for the reproducer. Note I'm changing jobs right now, so I'll get to it a month from now, when I'll have a reasonably powerful machine to reproduce it on. Hope that's fine?

marxin commented 1 year ago

All right, so I've changed jobs and got a reasonably fast desktop machine.

Looking at your reproducer: can you please create a container (the provided one, docker.io/scylladb/scylla-toolchain:fedora-38-20230517, seems to be unavailable), add your git branch to it, and send me a link? Thanks! Note it seems one needs to have built things like /home/avi/scylla/build/release/seastar/libseastar.a (and probably others) in order to link the sstable_datafile_test_g binary.