pulumi / pulumi-kubernetes-operator

A Kubernetes Operator that automates the deployment of Pulumi Stacks
Apache License 2.0

Pulumi operator process gets killed after running pip install #440

Open JonCholas opened 1 year ago

JonCholas commented 1 year ago

What happened?

I'm pushing a stack to the Pulumi operator, but it always fails after running pip install with a signal: killed error. The dependencies do get installed, though: when the reconcile runs again and again, pip shows that the dependencies are cached.

Pip install command

{
  "insertId": "8li49cy20nfwvele",
  "jsonPayload": {
    "Args": [
      "../../venv/bin/python",
      "-m",
      "pip",
      "install",
      "-r",
      "requirements.txt"
    ],
    "Request.Name": "deployment-beta-beta",
    "Dir": "/tmp/pulumi-working/beta-neara-f515a5a8/deployment-beta-beta/pulumi/projects/gcp-neara-app",
    "ts": "2023-04-28T04:27:27.411Z",
    "Request.Namespace": "beta-neara-f515a5a8",
    "Path": "../../venv/bin/python",
    "msg": "Pip Install",
    "Stdout": "Installing collected packages: types-PyYAML, arpeggio, urllib3, typing-extensions, tomli, six, semver, pyyaml, pyflakes, pycodestyle, protobuf, platformdirs, pathspec, packaging, mypy-extensions, mccabe, isort, idna, grpcio, dill, click, charset-normalizer, certifi, attrs, requests, pulumi, parver, mypy, flake8, black, pulumi-random, pulumi-kubernetes, pulumi-gcp, pulumi-aws, pulumi-eks, pulumi-neara",
    "logger": "controller_stack",
    "level": "debug"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "location": "us-west1",
      "namespace_name": "beta-neara-f515a5a8",
      "project_id": "customer-sce-beta-to40kc",
      "container_name": "pulumi-kubernetes-operator",
      "cluster_name": "neara-3d6f0d0",
      "pod_name": "beta-pulumi-operator-0606f596-6599f8bb94-6lrqp"
    }
  },
  "timestamp": "2023-04-28T04:27:27.412081049Z",
  "severity": "DEBUG",
  "labels": {
    "k8s-pod/service_istio_io/canonical-revision": "latest",
    "compute.googleapis.com/resource_name": "gke-neara-3d6f0d0-neara-spot-08f9e96-d753d3c7-fhnj",
    "k8s-pod/name": "pulumi-kubernetes-operator",
    "k8s-pod/service_istio_io/canonical-name": "beta-pulumi-operator-0606f596",
    "k8s-pod/security_istio_io/tlsMode": "istio",
    "k8s-pod/pod-template-hash": "6599f8bb94"
  },
  "logName": "projects/customer-sce-beta-to40kc/logs/stderr",
  "receiveTimestamp": "2023-04-28T04:27:29.982596615Z"
}

Next log message

{
  "insertId": "nakws6p7r41n1rjz",
  "jsonPayload": {
    "Request.Namespace": "beta-neara-f515a5a8",
    "msg": "Failed to setup Pulumi workdir",
    "Request.Name": "deployment-beta-beta",
    "ts": "2023-04-28T04:27:46.377Z",
    "level": "error",
    "Stack.Name": "gcp-neara-app__us-west1__sce-beta-beta",
    "error": "installing project dependencies: signal: killed",
    "logger": "controller_stack",
    "stacktrace": "sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.0/pkg/internal/controller/controller.go:214"
  },
  "resource": {
    "type": "k8s_container",
    "labels": {
      "namespace_name": "beta-neara-f515a5a8",
      "container_name": "pulumi-kubernetes-operator",
      "pod_name": "beta-pulumi-operator-0606f596-6599f8bb94-6lrqp",
      "cluster_name": "neara-3d6f0d0",
      "project_id": "customer-sce-beta-to40kc",
      "location": "us-west1"
    }
  },
  "timestamp": "2023-04-28T04:27:46.377844302Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/pod-template-hash": "6599f8bb94",
    "k8s-pod/service_istio_io/canonical-name": "beta-pulumi-operator-0606f596",
    "compute.googleapis.com/resource_name": "gke-neara-3d6f0d0-neara-spot-08f9e96-d753d3c7-fhnj",
    "k8s-pod/service_istio_io/canonical-revision": "latest",
    "k8s-pod/name": "pulumi-kubernetes-operator",
    "k8s-pod/security_istio_io/tlsMode": "istio"
  },
  "logName": "projects/customer-sce-beta-to40kc/logs/stderr",
  "receiveTimestamp": "2023-04-28T04:27:50.020413977Z"
}

Expected Behavior

The workdir is set up correctly and pulumi up runs without problems.

Steps to reproduce

Push a Python stack whose dependencies are installed with pip.

Output of pulumi about

CLI
Version      3.64.0
Go Version   go1.20.3
Go Compiler  gc

Host
OS       debian
Version  11.6
Arch     x86_64

Pulumi locates its logs in /tmp by default
warning: Failed to read project: no Pulumi.yaml project file found (searching upwards from /). If you have not created a project yet, use pulumi new to do so: no project file found
warning: A new version of Pulumi is available. To upgrade from version '3.64.0' to '3.65.1', visit https://pulumi.com/docs/reference/install/ for manual instructions and release notes.

Additional context

I have seen this before, but it used to retry and then succeed. I'm not sure what has changed so that it now always fails. I'm running this in a GKE cluster.

Contributing

Vote on this issue by adding a 👍 reaction. To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

rquitales commented 1 year ago

@JonCholas Thanks for reporting this issue. Would you also be able to provide the configuration and specs of your GKE cluster and nodes? Could you also provide a sample of your Python stack, or describe which resources you are using?

It appears the pip install process is being killed because there isn't enough memory (and no swap) for it to install all the required dependencies. I tried a simple minimal Python stack that creates a GCP Storage Bucket and am not seeing any signal: killed issues.
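
For reference, the stack I tested was along these lines (a minimal sketch; the resource name and location here are illustrative):

import pulumi
from pulumi_gcp import storage

# A single bucket is enough to exercise the operator's pip-install step.
bucket = storage.Bucket("test-bucket", location="US")

pulumi.export("bucket_url", bucket.url)

with a requirements.txt containing just pulumi and pulumi-gcp.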

JonCholas commented 1 year ago

I did try giving the pod more resources, and it worked. The odd thing is that the resource-consumption graphs weren't showing any spike, so I'm guessing the memory ramp-up is too fast for the metrics to capture anything.
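
For anyone hitting the same wall: if you deploy the operator with Pulumi itself, raising its memory limit looks roughly like this (a sketch; the image tag and the resource figures are illustrative guesses, not tested recommendations):

import pulumi_kubernetes as k8s

# Redeploy the operator with explicit memory headroom. pip's peak usage
# while resolving and building wheels can far exceed its steady-state
# footprint, which may be why the spike never shows on coarse metrics.
operator = k8s.apps.v1.Deployment(
    "pulumi-kubernetes-operator",
    spec=k8s.apps.v1.DeploymentSpecArgs(
        replicas=1,
        selector=k8s.meta.v1.LabelSelectorArgs(
            match_labels={"name": "pulumi-kubernetes-operator"},
        ),
        template=k8s.core.v1.PodTemplateSpecArgs(
            metadata=k8s.meta.v1.ObjectMetaArgs(
                labels={"name": "pulumi-kubernetes-operator"},
            ),
            spec=k8s.core.v1.PodSpecArgs(
                service_account_name="pulumi-kubernetes-operator",
                containers=[
                    k8s.core.v1.ContainerArgs(
                        name="pulumi-kubernetes-operator",
                        image="pulumi/pulumi-kubernetes-operator:v1.12.0",  # assumed tag
                        resources=k8s.core.v1.ResourceRequirementsArgs(
                            requests={"cpu": "250m", "memory": "512Mi"},
                            limits={"memory": "1Gi"},  # illustrative figure
                        ),
                    )
                ],
            ),
        ),
    ),
)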

Is there a way to improve the operator to show more details on why the process was killed?

mikhailshilkov commented 1 year ago

@rquitales Could you please adjust the labels and respond to Jon again? Thank you.

rquitales commented 1 year ago

@JonCholas Thanks for the update, and apologies for the delayed response on this. There currently isn't a way to show more details on why this process is killed.

The kill signal is coming from pip itself (see https://stackoverflow.com/questions/43245196/pip-install-killed). We shell out to pip to install the required Python dependencies, and return any errors that may occur. Unfortunately, the error message returned from pip isn't descriptive, so we can only deduce that the issue is due to resource starvation based on context.
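
To illustrate the mechanics with a Python sketch (the operator itself is written in Go, so this is an analogy, not its actual code): when the kernel's OOM killer terminates pip, the parent process only sees the signal, so there is nothing more descriptive to surface than the signal name.

import signal
import subprocess

# Shell out to pip the same way the operator does, keeping stdout/stderr
# so they can be attached to any error that is reported.
result = subprocess.run(
    ["python", "-m", "pip", "install", "-r", "requirements.txt"],
    capture_output=True,
    text=True,
)

if result.returncode < 0:
    # A negative return code means the process died from a signal; -9
    # (SIGKILL) with no Python traceback is the classic OOM-killer signature.
    sig = signal.Signals(-result.returncode)
    print(f"pip was killed by {sig.name}; check dmesg for OOM-killer records")
elif result.returncode != 0:
    print(f"pip failed with exit code {result.returncode}:\n{result.stderr}")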

We can try to improve logging within our operator with more info messages, but I'm not sure if we can provide more descriptive errors for pip being killed.

rquitales commented 1 year ago

As next steps, I'll look into possible ways to improve the current logging we have to better surface this issue instead of having a simple "signal killed" message.

JonCholas commented 1 year ago

@rquitales Sounds good. If no more detail is available from pip, I think even an FAQ or troubleshooting doc covering common problems with the operator would address this.

jeduden commented 4 months ago

How about logging the stdout and stderr of the pip install command? In general, the operator's log output and status updates should contain enough information to point the user of the controller to the cause of the error.

I currently have the same issue. I was able to exec into the operator container; running pip install from the command line there works and exits with error code 0.

In my case, the operator's container hasn't been restarted. There are also no memory limits set on the container.

jeduden commented 4 months ago

Update: I was able to troubleshoot the issue by redeploying the operator with --zap-level=debug. The stdout/stderr of the pip install execution is then printed in the operator's logs. It turns out requirements.txt wasn't found. However, I checked the local workspace on the operator's volume, and the requirements.txt file is present.
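
In case it helps others: the operator invokes pip with relative paths (compare the Args and Dir fields in the debug log at the top of this issue), so both the venv interpreter and requirements.txt resolve only against that working directory. This is just my interpretation of what I'm seeing, not a confirmed root cause; a Python sketch mirroring the logged invocation:

import subprocess

# Paths copied from the Args/Dir fields of the debug log above. If cwd does
# not match the project directory, pip reports requirements.txt as missing
# even though the file exists elsewhere on the volume.
workdir = "/tmp/pulumi-working/beta-neara-f515a5a8/deployment-beta-beta/pulumi/projects/gcp-neara-app"
subprocess.run(
    ["../../venv/bin/python", "-m", "pip", "install", "-r", "requirements.txt"],
    cwd=workdir,
    check=True,  # raise on a non-zero exit so the failure is visible
)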