kragniz / python-etcd3

Python client for the etcd API v3
Apache License 2.0

Flaky CI #371

Open kragniz opened 6 years ago

kragniz commented 6 years ago

We're seeing this error message intermittently in CI, causing most jobs to fail:

subprocess.CalledProcessError: Command '['etcdctl', '-w', 'json', 'get', '/doot/put_1']' returned non-zero exit status -11

niconorsk commented 6 years ago

I've been looking at what flakiness is still there in the current state.

The main cause of failure these days seems to be the etcd process itself (managed by pifpaf) returning a non-zero exit code even though all the tests pass. Not sure what the best approach is, though.

One option is this in tox.ini:

- pifpaf -e PYTHON run etcd
py.test --cov=etcd3 --cov-report= --basetemp={envtmpdir} {posargs}
- kill -9 $PIFPAF_PID

This would ignore the etcd errors, but it does feel like we may be brushing potential problems under the carpet.

Edit: Upon actual testing, the above doesn't work, because it doesn't properly clean up the etcd process. The pifpaf docs suggest using eval to get the appropriate env set, but that is not available within tox.

jd commented 6 years ago

@niconorsk do you have a log file around that shows this flakiness?

As for the eval thing, I don't think it'd change anything. That being said, if you want to test this approach, you need an intermediate shell script indeed.

niconorsk commented 6 years ago

@jd Upon further testing, I am no longer so sure that it's the etcd process crashing. It's possible that it's something going wrong during interpreter shutdown. This only seems to happen on Travis runs and not when running locally.

Here are some failures after I turned on debug logging: https://travis-ci.org/niconorsk/python-etcd3/jobs/404304316 https://travis-ci.org/niconorsk/python-etcd3/jobs/404304314

Some without debug logs on other PRs: https://travis-ci.org/kragniz/python-etcd3/jobs/403444461 https://travis-ci.org/kragniz/python-etcd3/jobs/402909233

After some poking around, I found this issue, which I suspect may be the culprit: https://github.com/grpc/grpc/issues/12531

I'm currently testing this change, which I believe will fix the problem: https://github.com/niconorsk/python-etcd3/commit/551055bbdbb72fd0c4b51cbe8953de73780e2eb9 After 15 or so runs, I have yet to reproduce the problem.

I am now considering whether to add __enter__ and __exit__ methods that call close so that this can be done: with etcd3.Client() as client: client.do_stuff
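
For illustration, the context-manager idea can be sketched as below. This is only a minimal sketch, not the library's actual code: the class name ClosingClient and its closed flag are hypothetical stand-ins for a client that owns a gRPC channel and releases it in close.

```python
class ClosingClient:
    """Hypothetical stand-in for a client that holds a resource (e.g. a channel)."""

    def __init__(self):
        # Assumption: a real client would open its channel here.
        self.closed = False

    def close(self):
        # Release the underlying resource explicitly, before interpreter shutdown.
        self.closed = True

    def __enter__(self):
        # Return the client itself so it can be bound with "as".
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always close, whether or not the block raised.
        self.close()
        # Returning False lets any exception from the block propagate.
        return False


if __name__ == "__main__":
    with ClosingClient() as client:
        print("inside block, closed =", client.closed)
    print("after block, closed =", client.closed)
```

The point of the pattern is that close runs deterministically when the block ends, rather than relying on cleanup at interpreter shutdown.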

jd commented 6 years ago

You seem to be on to something, @niconorsk. The process pifpaf launches (pytest) exits with an exit code of 245, which is -11 as a signed byte, and as someone smart said on the Internet, that's the signal value for SIGSEGV. So while I don't see any "Segmentation fault" being printed, it's possible that the subprocess.Popen().wait() that's used by pifpaf returns an exit code of -11 because pytest segfaults.

The cause of that segfault could be the gRPC bug you pointed to. Anyway, I'd be pretty sure it's something in gRPC, since that's the only lib doing nasty low-level C stuff.
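
The exit-code arithmetic above can be checked with a short sketch. It assumes a POSIX system, where subprocess reports a child killed by signal N as return code -N. The helper names and the self-signalling child are hypothetical, standing in for a crashing pytest run.

```python
import signal
import subprocess
import sys


def to_signed_byte(status: int) -> int:
    """Reinterpret an 8-bit exit status (0..255) as a signed value."""
    return status - 256 if status > 127 else status


def signal_name(returncode: int) -> str:
    """Name the signal behind a negative subprocess return code (POSIX only)."""
    if returncode >= 0:
        return "no signal"
    return signal.Signals(-returncode).name


# Child process that sends SIGSEGV to itself (an assumption for the demo,
# not what pytest or gRPC actually does).
CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def run_crashing_child() -> int:
    """Run the child and return the code subprocess reports for it."""
    return subprocess.run([sys.executable, "-c", CHILD]).returncode


if __name__ == "__main__":
    print(to_signed_byte(245))                # 245 - 256
    print(signal_name(to_signed_byte(245)))   # signal number 11
    print(run_crashing_child())               # negative signal number on POSIX
```

Note that the child may die without printing "Segmentation fault" itself; that message normally comes from the shell that launched the process, which would be consistent with it not showing up in the CI logs.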

niconorsk commented 6 years ago

@jd I have created https://github.com/kragniz/python-etcd3/pull/474, which seems to help with this issue and seems like good resource management anyway.