Create a pytest plugin that is the official methodology for managing resource lifecycles

gsmethells commented 3 years ago

What's the problem this feature will solve?

Partitioned off of https://github.com/pytest-dev/pytest/issues/5243

When a fixture creates an external resource, this plugin would provide a mechanism to handle gracefully tearing down that resource when the process terminates due to a SIGTERM so that there is no resource leak bugs in a user's test suite. An official pytest subproject would save everyone from re-inventing the wheel to deal with such a common bug.

Describe the solution you'd like

A method to register a resource manager instance (perhaps using a Command design pattern) within a fixture.

A suggested solution might go like this:

There is a resource manager that is an abstraction requiring definitions by the user for how to setup and how to teardown the resource it will manage. Instances of the resource manager can be used in fixtures to easily setup, yield, and teardown resource -- and, in addition, will gracefully teardown any resources when a SIGTERM is sent to the process.

When the fixture is invoked:

the resource manager instance is automagically registered with the SIGTERM handler
the resource manager instance is used to setup its resource

When the fixture returns from yielding:

the resource manager instance is used to teardown its resource
the resource manager instance is automagically unregistered from the SIGTERM handler

When a SIGTERM is caught:

each resource manager instance registered with the SIGTERM handler is used to teardown its resource

Race Conditions and other issues are left to the implementor.

Alternative Solutions

There are no supported alternatives. Especially none that would scale out to N resources. Nor any that would follow best practices for terminating processes gracefully.

Additional context

The following proposal is not considered suitable by pytest developers:

https://github.com/pytest-dev/pytest/issues/5243

gsmethells commented 3 years ago

cc @simonfontana

asottile commented 3 years ago

here's a sketch for the minimum that's needed: https://github.com/pytest-dev/pytest/issues/5243#issuecomment-491522595 -- I don't believe this is possible in a cross-platform manner, especially on windows

gsmethells commented 3 years ago

I am working on a solution for a large test suite that leverages Kubernetes runners in GitLab CI. So far, I will just note that the suggested solution has not worked out for us and I am investigating as to why.

The fixtures in place to open GCS resources leave a per spawned job bucket that ought to be cleaned up given that pod deletion sends a SIGTERM according to Google docs on the matter. More later.

RonnyPfannschmidt commented 3 years ago

@gsmethells is the pytest procss by chance the containers init process?

RonnyPfannschmidt commented 3 years ago

@gsmethells pytest is under no circumstances suitable as a container PID 0

gsmethells commented 3 years ago

@RonnyPfannschmidt I don't believe so, but I'll check. We use https://docs.gitlab.com/runner/executors/kubernetes.html

gsmethells commented 3 years ago

Nope:

pid is 22
============================= test session starts ==============================

gsmethells commented 3 years ago

conftest.py:

@pytest.fixture(scope = 'session', autouse = True)
def termHandler():
  from signal import getsignal, signal, SIGTERM, SIGINT

  orig = signal(SIGTERM, getsignal(SIGINT))

  yield

  signal(SIGTERM, orig)

Snippet of gcstorage_test.py (the sleep() calls are just so I have a chance to cancel the job run manually if desired):

...

@pytest.fixture
def testBucket(request, bucketName, adminGCS):
  print(f'creating {bucketName}')
  adminGCS.createBucket(bucketName)
  print(f'created {bucketName}')
  sleep(3)

  yield bucketName

  def deleteBucket():
    print(f'about to delete {bucketName}')
    sleep(3)
    print(f'deleting {bucketName}')
    adminGCS.deleteBucket(bucketName)
    print(f'deleted {bucketName}')

  request.addfinalizer(deleteBucket)

@pytest.fixture
def testObj(request, bucketName, gcs):
  objName = 'foo'

  print(f'creating {objName}')
  gcs.putObject(bucketName, objName, 'contents')
  print(f'created {objName}')
  sleep(3)

  yield objName

  def deleteObject():
    print(f'about to delete {objName}')
    sleep(3)
    print(f'deleting {objName}')
    gcs.deleteObject(bucketName, objName)
    print(f'deleted {objName}')

  request.addfinalizer(deleteObject)

...

@pytest.mark.integration
def test_copyObjectSuccess(gcs, testBucket, testObj):
  objName = 'copyObjectSuccess'

  try:
    with pytest.raises(IOError):
      gcs.headObject(testBucket, objName)

    gcs.copyObject(testBucket, testObj, testBucket, objName)

    header = gcs.headObject(testBucket, objName)

    assert header.md5 == '98bf7d8c15784f0a3d63204441e1e2aa'
  finally:
    gcs.deleteObject(testBucket, objName)

I see the truncated output when the job is canceled (either manually or due to a new commit being pushed which cancels a previous pipeline and all its jobs ... I've checked out both scenarios). This is from GitLab's job console when I hit the cancel button while the bucket had been created but not yet torn down:

and in the Google console, I can see the bucket being left behind (the numeric prefix to the bucket name is the job ID, so that jobs do not step on each other's toes ... therefore, a 1000 canceled jobs, perhaps due to pushing new commits, means a 1000 leaked buckets 😬):

Even from a year ago, GitLab says its Kubernetes runner already had graceful termination, when I was bringing up the concept of graceful termination with their team:

https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/987#note_272629038

For the Kubernetes executor we already have graceful termination because in the delete we don't set a timeout in DeleteOptions and looking at the documentation if left nil it would terminate using the default values.

The runner doesn't use anything, it just deletes the pod. At that point it's up to the Kubernetes implementation for what happens when a delete pod is performed and Google says it uses SIGTERM. This job ran in a newly spawned container in a newly spawned pod for this commit's CI pipeline. If SIGTERM is sent, then I would expect termHandler() to be called, but it apparently is not or it is but it's not functioning how it is believed it ought to function.

Any help is appreciated. I'll link in GitLab folks as well, if it is useful.

asottile commented 3 years ago

my guess is gitlab is brutally killing the container rather than SIGTERM, or the root of your process tree is not acting as a proper init system and forwarding the signal along

I've verified the fixture I posted earlier with this simple script:

import signal

import pytest

@pytest.fixture(scope='session', autouse=True)
def term_handler():
    orig = signal.signal(signal.SIGTERM, signal.getsignal(signal.SIGINT))
    yield
    signal.signal(signal.SIGTERM, orig)

@pytest.fixture
def fix():
    print('setup!')
    yield
    print('teardown!')

def test(fix):
    import time
    time.sleep(10000)

gsmethells commented 2 years ago

Just to close the loop on previous comments: this appears to be a known issue with the Kubernetes gitlab-runner project.

https://gitlab.com/gitlab-org/gitlab-runner/-/issues/28162

In any case, I believe the above fixture would be useful to broadcast as a documented suggested workaround. Just my two cents. Thanks.

jxramos commented 2 years ago

Beyond resource management there is actually another angle to handling sigterms, and this is in the context of bazel as a test runner to pytest via a py_test target. Bazel keeps tabs on execution time and forces test targets to terminate if they exceed a default or specified time limit--and they do so with sigterms: https://github.com/bazelbuild/bazel/blob/c5ec07ba3f2184c99b6468d905177cc20ecce00a/tools/test/test-setup.sh#L273

If a test execution should exceed that timeout in bazel the experience is pretty lacking, it seems like you only get truncated console output from pytest rather than the full orderly log and standard output capture sections and all that is normally presented on test error or failure. It would be amazing if there was some orderly way to wrap up a pytest session and flush whatever information was retained up to that point. I'm curious if I define a signal handler and call pytest.exit() if I'd get lucky and have the expected console output rendered to give me a little bit more diagnostics into what's going on with the test case when things timed out.

asottile commented 2 years ago

iirc you can control which signal bazel sends through a test wrapper -- at stripe we used int then term then kill

RonnyPfannschmidt commented 2 years ago

im incline to close this issue as the particular key here is incorrect termination from the outside, we might want to add a doc section on correct termination setup for "bad environments"

asottile commented 2 years ago

what I would suggest is document this: https://github.com/pytest-dev/pytest/issues/9142#issuecomment-932623158 and then maybe link a blurb about container init systems (dumb-init has a decent blog post on this) or signal changers (dumb-init also does this)

jxramos commented 2 years ago

Couldn't find anything documenting bazel signal redirecting behavior. Some of the guys at my work couldn't locate any undocumented features matching this behavior either, but we are a few major versions behind the latest release so maybe this functionality came later.

The dumb-init stuff may have an avenue to do this. Ultimately I couldn't get the signal rerouting handler above to make pytest trigger a keyboard interrupt when bazel timed out the test. It did seem to do the rerouting when executed directly from pytest however. This could be due to bazel wrapping everything in a test.sh runner file rather than a direct python child process or whatever is going on.

What I'm going to wind up leaning on instead is not to have pytest flush everything I'd want it to but to instead lean on the --log-file features to have an artifact left on the machine to diagnose where the timeout budget is being eaten up. That's a surefire way to diagnose and should give us what we need and which happened to operate as expected under my artificial timeout experiments.

asottile commented 2 years ago

iirc bazel by default is very aggressive and uses SIGKILL (uncatchable)

a-recknagel commented 1 year ago

fyi, pycharm also sends a SIGTERM when pressing the red square button to stop a running process. I do this a lot when debugging because it's faster then searching for F9 and/or waiting for the test to finish. A plugin that runs fixture cleanup for SIGTERM would be nice.

What's the recommendation? Just paste the term_handler snippet into our test suite?

pytest-dev / pytest