operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.
GNU General Public License v3.0

seldon deployment failing, pods getting OOM killed #598

Closed oindrillac closed 2 years ago

oindrillac commented 2 years ago

Describe the Problem

Our Seldon deployment is failing at the model deserialization step without any descriptive log.

I tested this locally and on JupyterHub, and it works in both.

In the ds-ml-workflows-ws namespace, the pods error out and get OOMKilled after downloading the model.

I wonder if this is happening because of the memory limit on the namespace. Can we increase the limit?

cc: @chauhankaranraj @suppathak

Steps to Reproduce

  1. Go to the Seldon Operator and create a deployment from the config.
  2. Go to the created ttm-model-test2-ttm-model-test2-predictor-ttm-model-test2-clf pod; the pod fails with an OOM kill error.

Expected behaviour

The pod should spin up successfully, and the model deployment should return predictions as expected.

Screenshots

image

image

Additional context

related: https://github.com/open-services-group/community/issues/174

HumairAK commented 2 years ago

The resource cap for this namespace can be updated via: https://github.com/operate-first/apps/blob/master/cluster-scope/base/core/namespaces/ds-ml-workflows-ws/resourcequota.yaml
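
For readers unfamiliar with namespace caps: a quota like this is normally expressed as a standard Kubernetes `ResourceQuota` object. The following is only a sketch of what such a file might look like. The object name and every value are placeholders chosen for illustration, not the contents of the linked file.

```yaml
# Hypothetical ResourceQuota sketch; the name and all values are assumptions.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example-quota            # placeholder name
  namespace: ds-ml-workflows-ws
spec:
  hard:
    # Sum of memory requests across all pods in the namespace
    requests.memory: 16Gi        # placeholder value
    # Sum of memory limits across all pods in the namespace
    limits.memory: 32Gi          # placeholder value
    requests.cpu: "8"            # placeholder value
    limits.cpu: "16"             # placeholder value
```

Note that a quota caps the total for the namespace. A single pod can still be OOM killed well below that total if its own container memory limit is too small.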

oindrillac commented 2 years ago

The fix was to allocate more resources to the pod by increasing its resources in the Seldon deployment config.
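
For anyone hitting the same error, here is a rough sketch of where container resources usually go in a `SeldonDeployment`. It is a fragment, not the actual config from this issue: the container name `clf` is guessed from the pod name, the memory values are placeholders, and all other fields (metadata, graph, image or model URI) are left out.

```yaml
# Fragment of a SeldonDeployment; names and values are assumptions.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
spec:
  predictors:
    - componentSpecs:
        - spec:
            containers:
              - name: clf              # assumed; should match the graph node name
                resources:
                  requests:
                    memory: 2Gi        # placeholder value
                  limits:
                    memory: 4Gi        # placeholder; must cover peak memory while loading the model
      # graph, name, replicas, and other predictor fields omitted
```

The memory limit has to cover the peak usage during model deserialization, which can be noticeably higher than the size of the model file on disk.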