Describe the bug
When an algorithm job is OOMKilled because its allocated memory is not enough for the whole process, the job is left abandoned in the namespace.
The associated configmap and other Kubernetes objects are also not removed.
In the database table jobs, the status column stays at 40: Running algorithm forever.
As a result, whenever operator-engine calls the db function announce_and_get_sql_pending_jobs, it always returns the OOMKilled algorithm job, and no new job from this wallet can be started.
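For reference, an OOMKilled container can be recognised from the pod status that the Kubernetes API returns: the container's `state.terminated.reason` (or `lastState.terminated.reason`) is `OOMKilled`. The sketch below checks a pod object in its JSON/dict form. The function name, the pod name and the container name are made up for illustration, and this is not how operator-engine currently inspects pods.

```python
from typing import Any, Dict, List


def oom_killed_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers whose current or last state is OOMKilled.

    `pod` is a Pod object as a plain dict (the JSON form of the Kubernetes API).
    """
    names: List[str] = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        # A container that was killed shows up under "state" if it did not
        # restart, or under "lastState" if it was restarted afterwards.
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(cs.get("name", ""))
                break
    return names


# Hypothetical pod status, trimmed to the relevant fields.
example_pod = {
    "metadata": {"name": "algorithm-job-example"},
    "status": {
        "containerStatuses": [
            {
                "name": "algorithm",
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                "lastState": {},
            }
        ]
    },
}

print(oom_killed_containers(example_pod))
```

A non-empty result for the algorithm pod would be the point at which the job could be marked as failed and its objects cleaned up, instead of staying at status 40.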
To Reproduce
Steps to reproduce the behavior:
Set up operator-engine with env vars configured as nCPU: 1 and ramGB: 1
Publish an algorithm and dataset that will run for more than 10 min and progressively use more memory
Order and start the compute job
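A minimal sketch of an algorithm that behaves like the one in the steps above, assuming a Python algorithm image. It is not the algorithm that was actually published; the function name and all parameter values are illustrative.

```python
import time
from typing import List


def grow_memory(step_mb: int, interval_s: float, max_steps: int) -> int:
    """Allocate `step_mb` MiB every `interval_s` seconds and keep every chunk alive.

    Returns the number of chunks held when the loop finishes (if the process
    is not killed first).
    """
    held: List[bytes] = []
    for _ in range(max_steps):
        # Non-zero fill so the pages are really written and count towards
        # the container's memory usage.
        held.append(b"\x01" * (step_mb * 1024 * 1024))
        time.sleep(interval_s)
    return len(held)
```

Calling, for example, `grow_memory(step_mb=50, interval_s=30, max_steps=40)` adds 50 MiB every 30 seconds for up to 20 minutes, so with ramGB: 1 the pod should exceed its limit roughly halfway through.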
Expected behavior
The job pod is killed gracefully, and the next job is able to run.
Screenshots
Running pod
Configmap
Database jobs