seldon deployment failing, pods getting OOM killed

oindrillac commented 2 years ago

Our seldon deployment is failing without any descriptive log at the model de-serialization step.

Tested this locally and on JupyterHub, and it works in both places.

In the ds-ml-worklows-ws namespace, the pods error out and get OOMKilled after downloading the model.

Wondering if this is happening because of the memory limit on the namespace. Can we increase the limit?

cc: @chauhankaranraj @suppathak

Go to the Seldon Operator and create a deployment from the config.
Go to the created ttm-model-test2-ttm-model-test2-predictor-ttm-model-test2-clf pod; the pod fails with an OOM kill error.

The pod should have spun up successfully, and the model deployment should return predictions as expected.

HumairAK commented 2 years ago

oindrillac commented 2 years ago

The fix was to allocate more resources to the pod by increasing the resources in the Seldon deployment config.
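For anyone hitting the same thing, here is a rough sketch of what that kind of change can look like. It assumes the model container is declared under `spec.predictors[].componentSpecs[].spec.containers[]` of the SeldonDeployment, which is where Seldon Core takes ordinary Kubernetes `resources` settings. The manifest is a trimmed, hypothetical stand-in, not the config from this issue: the container name `clf`, the predictor name, and the `2Gi`/`4Gi` values are placeholders, and `with_container_resources` is an illustrative helper, not part of any Seldon API.

```python
import copy
import json


def with_container_resources(manifest, container_name, requests, limits):
    """Return a copy of a SeldonDeployment manifest with resources set on one container."""
    updated = copy.deepcopy(manifest)
    for predictor in updated["spec"]["predictors"]:
        for component in predictor.get("componentSpecs", []):
            for container in component["spec"]["containers"]:
                if container["name"] == container_name:
                    # Standard Kubernetes container resources block.
                    container["resources"] = {
                        "requests": dict(requests),
                        "limits": dict(limits),
                    }
    return updated


# Hypothetical, trimmed manifest: names are placeholders and fields such as
# the container image or model URI are omitted for brevity.
manifest = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "ttm-model-test2"},
    "spec": {
        "predictors": [
            {
                "name": "predictor",
                "replicas": 1,
                "graph": {"name": "clf", "type": "MODEL"},
                "componentSpecs": [
                    {"spec": {"containers": [{"name": "clf"}]}}
                ],
            }
        ]
    },
}

# Placeholder memory values; size these to what the model needs once loaded.
patched = with_container_resources(
    manifest,
    "clf",
    requests={"memory": "2Gi"},
    limits={"memory": "4Gi"},
)

print(json.dumps(patched, indent=2))
```

The printed JSON shows where the `resources` block ends up in the spec. Note that the pod limit still has to fit within whatever quota or limit range the namespace enforces, so a namespace-level increase may also be needed if the model is larger than that allows.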

operate-first / support