Open Gkrumbach07 opened 2 years ago
/assign @fridex
/priority important-soon
@Gkrumbach07 is this something still happening?
I haven't seen it in a while, but I have been avoiding large runs. It is related to https://github.com/thoth-station/adviser/issues/1525 and https://github.com/thoth-station/adviser/pull/2204.
If the issue above is resolved, then this issue can be resolved.
I was thinking about a solution for this. We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM. I don't think that's the right way to do this type of thing though (moreover, we can have results computed even if an OOM happens). I think the underlying platform (OpenShift) should provide us a way to be notified when a pod is going to be killed. See https://github.com/thoth-station/adviser/issues/1525. We had discussions with Vasek a while back about this implementation; I'm not sure if it continued and whether the OpenShift team is planning to add such support. It is not just our use case: any memory-expensive workloads run on Kubernetes/OpenShift could benefit from such a feature.
@goern Do you know people to reach out to about supporting this feature?
Don't we see 137 errors on OOM'd pods? Like:

    State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled

So we can check for this in the Argo workflow?
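For reference, this is roughly how that terminated state could be inspected programmatically. A minimal sketch, assuming the official `kubernetes` Python client is available and that the pod name and namespace are known (both are placeholders here):

```python
# A sketch of checking for an OOM-killed container programmatically, assuming the
# official `kubernetes` Python client; pod name and namespace are placeholders.
from kubernetes import client, config


def was_oom_killed(pod_name: str, namespace: str = "thoth") -> bool:
    """Return True if any container in the pod terminated with OOMKilled / exit code 137."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for status in pod.status.container_statuses or []:
        terminated = (status.last_state and status.last_state.terminated) or (
            status.state and status.state.terminated
        )
        if terminated and (terminated.reason == "OOMKilled" or terminated.exit_code == 137):
            return True
    return False
```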
The problem here is that users will not get any recommendations if an OOM happens. What we could do: if there is a risk of OOM, we could stop the resolution process and show users the results we have computed so far.
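One way to detect that risk from inside the container is to compare the cgroup memory usage against the cgroup limit. A minimal sketch, assuming cgroup v1 paths; the 0.9 threshold and the resolver hooks mentioned in the trailing comments are illustrative only:

```python
# A sketch of an "OOM risk" check from inside the container: compare cgroup memory
# usage to the cgroup limit. Assumes cgroup v1 paths; threshold and resolver hooks
# are illustrative only.

_CGROUP_USAGE = "/sys/fs/cgroup/memory/memory.usage_in_bytes"
_CGROUP_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"


def oom_risk(threshold: float = 0.9) -> bool:
    """Return True when the container uses more than `threshold` of its memory limit."""
    try:
        with open(_CGROUP_USAGE) as f:
            usage = int(f.read())
        with open(_CGROUP_LIMIT) as f:
            limit = int(f.read())
    except OSError:
        return False  # not running with a cgroup v1 memory controller
    return usage / limit > threshold


# In the resolver loop, the check could gate an early stop (hypothetical hooks):
#   if oom_risk():
#       resolver.stop_resolution()          # stop expanding new states
#       report = resolver.partial_report()  # return stack_info computed so far
```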
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/sig stack-guidance
Given the memory optimizer implementation and the default values used in adviser deployments (the optimizer kicks in above 5.75 GB, below the 6 GB currently allocated), this issue should normally be solved, unless a "burst" in memory usage happens in between batches of resolver iterations. The batch size is defined by the THOTH_ADVISER_MEM_OPTIMIZER_ITERATION environment variable in deployments, and such a burst is unlikely with the current value of 100, which is small.
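For illustration, the batch mechanism described above could look roughly like the sketch below; the optimizer body is a placeholder (a plain gc.collect()), not the adviser's actual code:

```python
# Illustration of a batch-based check driven by THOTH_ADVISER_MEM_OPTIMIZER_ITERATION;
# the optimizer body is a placeholder, not the adviser's implementation.
import gc
import os

_BATCH_SIZE = int(os.getenv("THOTH_ADVISER_MEM_OPTIMIZER_ITERATION", "100"))


def run_memory_optimizer() -> None:
    """Placeholder for the memory optimizer; here it only triggers a GC pass."""
    gc.collect()


def resolve(states) -> None:
    """Iterate over resolver states, running the optimizer every _BATCH_SIZE iterations."""
    for iteration, _state in enumerate(states, start=1):
        # ... one resolver iteration over `_state` would happen here ...
        if iteration % _BATCH_SIZE == 0:
            run_memory_optimizer()
```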
Should we still keep this issue open anyway and investigate an existing OpenShift feature to protect against OOM errors? Would a liveness probe be too expensive or not suited?
cc @goern @harshad16
Following a discussion with @harshad16, a way to improve the limit would be to
In parallel, we can continue looking for an OpenShift feature to check on the pod memory consumption regularly.
Might be related: https://github.com/thoth-station/metrics-exporter/issues/725
> We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM.
I'm pretty sure this is impossible. The OOM killer (whether from the cgroup or the kernel) sends a SIGKILL, which cannot be handled in any way by the process itself.
However, we might check things like resource.setrlimit and then catch MemoryError?
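A minimal sketch of that idea: lower the address-space limit with resource.setrlimit so a large allocation raises a catchable MemoryError instead of triggering the kernel OOM killer. The 5.75 GiB value is illustrative, and allocations made inside C extensions may still abort rather than raise:

```python
# Cap the address space below the container limit so big allocations raise a
# catchable MemoryError instead of triggering the kernel OOM killer (SIGKILL).
# The 5.75 GiB cap is illustrative; C-extension allocations may still abort.
import resource


def cap_memory(limit_bytes: int) -> None:
    """Lower the address-space soft limit; keep the current hard limit."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))


def main() -> None:
    cap_memory(int(5.75 * 1024 ** 3))  # stay below a 6 GiB container limit
    try:
        _ = [0] * (10 ** 9)  # stand-in for the memory-hungry resolution
    except MemoryError:
        # Here the adviser could flush the partially computed stack_info
        # and exit cleanly instead of being SIGKILLed.
        print("memory limit reached, reporting partial results")


if __name__ == "__main__":
    main()
```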
> Should we still keep this issue open anyway and investigate an existing OpenShift feature to protect against OOM errors? Would a liveness probe be too expensive or not suited?
A liveness probe won't help; it's more for the "blocked" kind of problem.
> In parallel, we can continue looking for an OpenShift feature to check on the pod memory consumption regularly.
I think the Prometheus node-exporter exposes the containers' memory usage? Not sure if it's included in OpenShift?
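If container-level metrics are scraped (cAdvisor exposes container_memory_working_set_bytes, which cluster monitoring stacks commonly collect), a query could look like the sketch below. The Prometheus URL and label names are assumptions that would need to match the cluster's setup:

```python
# A sketch of querying Prometheus for a container's working-set memory. Assumes
# cAdvisor metrics are scraped and that the URL/label names match the cluster;
# both are assumptions, not verified against this deployment.
from typing import Optional

import requests

PROMETHEUS_URL = "http://prometheus.example.svc:9090"  # hypothetical endpoint


def container_memory_bytes(namespace: str, pod: str) -> Optional[float]:
    """Return the first matching working-set sample in bytes, or None if no sample."""
    query = (
        "container_memory_working_set_bytes"
        f'{{namespace="{namespace}", pod="{pod}", container!=""}}'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else None
```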
I think we should turn this into an operational issue: let's observe and alarm, then adjust.
If we can see how many pods get OOM-killed, and at which memory limit, we should be able to manually adjust the limit, using Dependency Monkey to generate a larger set of observations. Having statistics about the frequency of OOM kills in relation to memory consumption should give us a good priority for further actions/implementation/improvements.
If we see an OOM-killed advise, can we adjust the memory limit and rerun it?
> I think we should turn this into an operational issue: let's observe and alarm, then adjust.

+1
> If we see an OOM-killed advise, can we adjust the memory limit and rerun it?

Sure, if we are the producer (aka talking to the k8s API) of the pod resource or its parent resource (workflow, job, whatever). (In that case we should probably have a RestartPolicy: Never to handle failures ourselves.)
So we have three separate things:
From what I understood, @goern, we would do 1 to see if we need 2 and/or 3?
> > If we see an OOM-killed advise, can we adjust the memory limit and rerun it?
>
> Sure, if we are the producer (aka talking to the k8s API) of the pod resource or its parent resource (workflow, job, whatever). (In that case we should probably have a RestartPolicy: Never to handle failures ourselves.)
To clarify a bit of my earlier statement: technically we would not rerun it, but create another one (pod/parent resource) with adjusted resources. But it's functionally equivalent.
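A sketch of what creating such a follow-up Job with a bumped memory limit could look like, using the kubernetes Python client; the namespace, image, and the 1.5x bump factor are hypothetical:

```python
# A sketch of "create another one with adjusted resources": a fresh Job with a
# bumped memory limit and restartPolicy: Never. Namespace, image, and the 1.5x
# bump factor are hypothetical.
from kubernetes import client, config


def resubmit_with_more_memory(name: str, image: str, old_limit_gi: float, namespace: str = "thoth") -> None:
    """Create a new Job running `image` with a memory limit 1.5x the old one."""
    config.load_incluster_config()
    new_limit = f"{old_limit_gi * 1.5:.2f}Gi"
    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(limits={"memory": new_limit}),
    )
    pod_spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"{name}-retry"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=0,  # we decide when to resubmit instead of letting k8s retry
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```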
Describe the bug
When I run an advise report on a large stack, I get an expected OOM error. However, there is no stack_info returned; instead, report is null and I only get an error message.

To Reproduce
https://khemenu.thoth-station.ninja/api/v1/advise/python/adviser-211209191901-3de39ce57cac2f07

Expected behavior
To see a populated stack_info object under report in the document.

Additional context
This may or may not be a bug, but it would be helpful to see how the report errored out in more detail besides looking at the logs.