oceanprotocol / operator-engine

Python library allowing to interact with the Kubernetes infrastructure
Apache License 2.0

Algorithm job OOMKilled will forever deserted and block further new job from the same wallet #78

Closed soonhuat closed 5 months ago

soonhuat commented 1 year ago

Describe the bug When an algorithm job is OOMKilled because the allocated memory was not enough for the whole process, the job is left abandoned in the namespace. The associated ConfigMap and other Kubernetes objects are not removed either. In the database table jobs, the status column stays at 40: Running algorithm forever. As a result, whenever operator-engine calls the db function announce_and_get_sql_pending_jobs, it always returns the OOMKilled algorithm job, and no new job from this wallet can be started.

To Reproduce Steps to reproduce the behavior:

  1. Set up operator-engine with the env vars configured to nCPU: 1 and ramGB: 1
  2. Publish an algorithm and a dataset such that the job runs for more than 10 min and progressively uses more memory
  3. Order and start the compute job
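
For step 2, an algorithm along the following lines might be enough to trigger the problem. This is only a minimal sketch, not the algorithm used in the original report: the function name `grow_memory` and its parameters are made up for illustration, and the sizes are assumptions to be tuned against the configured ramGB.

```python
import time


def grow_memory(step_mb=5, steps=300, pause_s=6.0):
    """Allocate step_mb MiB more on every step and keep all of it alive.

    Hypothetical helper for reproducing the issue; returns the total
    number of bytes held when the loop finishes without being killed.
    """
    held = []
    for _ in range(steps):
        # Non-zero fill so the pages are really written, not just reserved.
        held.append(b"x" * (step_mb * 1024 * 1024))
        time.sleep(pause_s)
    return sum(len(chunk) for chunk in held)
```

With the defaults, the process holds about 1 GiB after roughly 20 minutes, so with ramGB: 1 the container should be OOMKilled well before the loop ends.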

Expected behavior The job's pod is terminated gracefully and the next job from the same wallet can run.
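
One possible way to notice the condition is to look at the pod's container statuses, where Kubernetes reports `OOMKilled` as the `reason` of a terminated container. The sketch below works on a pod in plain dict form (the JSON returned by the Kubernetes API). The function name `oom_killed_containers` is hypothetical and not part of operator-engine, and what to do afterwards (which status to write to the jobs table, which objects to clean up) is deliberately left out because the issue does not specify it.

```python
def oom_killed_containers(pod):
    """Return the names of containers in `pod` that were OOMKilled.

    `pod` is a dict shaped like the Kubernetes Pod JSON. Both the current
    state and the last state are checked, since a restarted container
    only keeps the OOMKilled reason in `lastState`.
    """
    names = []
    status = pod.get("status") or {}
    for cs in status.get("containerStatuses") or []:
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs.get("name"))
                break  # do not list the same container twice
    return names
```

A non-empty result could then be treated as a failed job instead of leaving it at 40: Running algorithm.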

Screenshots Running pod

image

Configmap

image

Database jobs

image
LoznianuAnamaria commented 1 year ago

Sorry for the late reply. This is on our radar and we will address it in C2D V2 (upgrades, fixes and refactor).