Open Gkrumbach07 opened 2 years ago
/assign @fridex
/priority important-soon
@Gkrumbach07 is this something still happening?
I haven't seen it in a while, but I have been avoiding large runs. It is related to https://github.com/thoth-station/adviser/issues/1525 and https://github.com/thoth-station/adviser/pull/2204.
If the issue above is resolved, then this issue can be resolved.
I was thinking about a solution for this. We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM. I don't think that's the right way to do this type of thing though (moreover, we can have results computed even if an OOM happens). I think the underlying platform (OpenShift) should provide us a way to be notified when a pod is going to be killed. See https://github.com/thoth-station/adviser/issues/1525. We had discussions with Vasek a while back about this implementation; I'm not sure if it continued and whether the OpenShift team is planning to add such support. It is not just our use case: any memory-expensive workloads run on Kubernetes/OpenShift could benefit from such a feature.
@goern Do you know people to reach out to about supporting this feature?
Don't we see 137 errors on OOM'd pods? Like:

    State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled

So we can check for this in the Argo workflow?
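For reference, this is roughly how that terminated state could be inspected programmatically. A minimal sketch, assuming the official `kubernetes` Python client is available and that the pod name and namespace are known (both are placeholders here):

```python
# A sketch of checking for an OOM-killed container programmatically, assuming the
# official `kubernetes` Python client; pod name and namespace are placeholders.
from kubernetes import client, config


def was_oom_killed(pod_name: str, namespace: str = "thoth") -> bool:
    """Return True if any container in the pod terminated with OOMKilled / exit code 137."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
    for status in pod.status.container_statuses or []:
        terminated = (status.last_state and status.last_state.terminated) or (
            status.state and status.state.terminated
        )
        if terminated and (terminated.reason == "OOMKilled" or terminated.exit_code == 137):
            return True
    return False
```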
The problem here is that users will not get any recommendations if an OOM happens. What we could do: if there is a risk of OOM, we could stop the resolution process and show users the results we have computed so far.
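One way to detect that risk from inside the container is to compare the cgroup memory usage against the cgroup limit. A minimal sketch, assuming cgroup v1 paths; the 0.9 threshold and the resolver hooks mentioned in the trailing comments are illustrative only:

```python
# A sketch of an "OOM risk" check from inside the container: compare cgroup memory
# usage to the cgroup limit. Assumes cgroup v1 paths; threshold and resolver hooks
# are illustrative only.

_CGROUP_USAGE = "/sys/fs/cgroup/memory/memory.usage_in_bytes"
_CGROUP_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"


def oom_risk(threshold: float = 0.9) -> bool:
    """Return True when the container uses more than `threshold` of its memory limit."""
    try:
        with open(_CGROUP_USAGE) as f:
            usage = int(f.read())
        with open(_CGROUP_LIMIT) as f:
            limit = int(f.read())
    except OSError:
        return False  # not running with a cgroup v1 memory controller
    return usage / limit > threshold


# In the resolver loop, the check could gate an early stop (hypothetical hooks):
#   if oom_risk():
#       resolver.stop_resolution()          # stop expanding new states
#       report = resolver.partial_report()  # return stack_info computed so far
```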
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/sig stack-guidance
Given the memory optimizer implementation and the default values used in adviser deployments (the optimizer kicks in above 5.75 GB, below the 6 GB currently allocated), this issue should normally be solved, unless a "burst" in memory usage happens in between batches of resolver iterations. The batch size is defined by the THOTH_ADVISER_MEM_OPTIMIZER_ITERATION environment variable in deployments, and such a burst is unlikely with the current value of 100, which is small.
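For illustration, the batch mechanism described above could look roughly like the sketch below; the optimizer body is a placeholder (a plain gc.collect()), not the adviser's actual code:

```python
# Illustration of a batch-based check driven by THOTH_ADVISER_MEM_OPTIMIZER_ITERATION;
# the optimizer body is a placeholder, not the adviser's implementation.
import gc
import os

_BATCH_SIZE = int(os.getenv("THOTH_ADVISER_MEM_OPTIMIZER_ITERATION", "100"))


def run_memory_optimizer() -> None:
    """Placeholder for the memory optimizer; here it only triggers a GC pass."""
    gc.collect()


def resolve(states) -> None:
    """Iterate over resolver states, running the optimizer every _BATCH_SIZE iterations."""
    for iteration, _state in enumerate(states, start=1):
        # ... one resolver iteration over `_state` would happen here ...
        if iteration % _BATCH_SIZE == 0:
            run_memory_optimizer()
```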
Should we still keep this issue open anyway and investigate an existing OpenShift feature to protect against OOM errors? Would a liveness probe be too expensive or not suited?
cc @goern @harshad16
Following a discussion with @harshad16, a way to improve the limit would be to
In parallel, we can continue looking for an OpenShift feature to check on the pod memory consumption regularly.
Might be related: https://github.com/thoth-station/metrics-exporter/issues/725
> We could add inter-process communication that would guarantee that stack_info is propagated correctly to the parent process once the child process is killed on OOM.
I'm pretty sure this is impossible. The OOM killer (whether from the cgroup or the kernel) sends a SIGKILL, which cannot be handled in any way by the process itself.
However, we might check things like resource.setrlimit and then catch MemoryError?
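A minimal sketch of that idea: lower the address-space limit with resource.setrlimit so a large allocation raises a catchable MemoryError instead of triggering the kernel OOM killer. The 5.75 GiB value is illustrative, and allocations made inside C extensions may still abort rather than raise:

```python
# Cap the address space below the container limit so big allocations raise a
# catchable MemoryError instead of triggering the kernel OOM killer (SIGKILL).
# The 5.75 GiB cap is illustrative; C-extension allocations may still abort.
import resource


def cap_memory(limit_bytes: int) -> None:
    """Lower the address-space soft limit; keep the current hard limit."""
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))


def main() -> None:
    cap_memory(int(5.75 * 1024 ** 3))  # stay below a 6 GiB container limit
    try:
        _ = [0] * (10 ** 9)  # stand-in for the memory-hungry resolution
    except MemoryError:
        # Here the adviser could flush the partially computed stack_info
        # and exit cleanly instead of being SIGKILLed.
        print("memory limit reached, reporting partial results")


if __name__ == "__main__":
    main()
```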
> Should we still keep this issue open anyway and investigate an existing OpenShift feature to protect against OOM errors? Would a liveness probe be too expensive or not suited?
A liveness probe won't help; it's more for the "blocked" kind of problem.
> In parallel, we can continue looking for an OpenShift feature to check on the pod memory consumption regularly.
I think the Prometheus node-exporter exposes the containers' memory usage? Not sure if it's included in OpenShift?
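If container-level metrics are scraped (cAdvisor exposes container_memory_working_set_bytes, which cluster monitoring stacks commonly collect), a query could look like the sketch below. The Prometheus URL and label names are assumptions that would need to match the cluster's setup:

```python
# A sketch of querying Prometheus for a container's working-set memory. Assumes
# cAdvisor metrics are scraped and that the URL/label names match the cluster;
# both are assumptions, not verified against this deployment.
from typing import Optional

import requests

PROMETHEUS_URL = "http://prometheus.example.svc:9090"  # hypothetical endpoint


def container_memory_bytes(namespace: str, pod: str) -> Optional[float]:
    """Return the first matching working-set sample in bytes, or None if no sample."""
    query = (
        "container_memory_working_set_bytes"
        f'{{namespace="{namespace}", pod="{pod}", container!=""}}'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else None
```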
I think we should turn this into an operational issue: let's observe and alarm, then adjust.
If we can see how many pods get OOM-killed, and at which memory limit, we should be able to manually adjust the limit, using Dependency Monkey to generate a larger set of observations. Having statistics about the frequency of OOM kills in relation to memory consumption should give us a good priority for further actions/implementation/improvements.
If we see an OOM-killed advise, can we adjust the memory limit and rerun it?
> I think we should turn this into an operational issue: let's observe and alarm, then adjust.

+1
> If we see an OOM-killed advise, can we adjust the memory limit and rerun it?

Sure, if we are the producer (aka talking to the k8s API) of the pod resource or its parent resource (workflow, job, whatever). (In that case we should probably have a RestartPolicy: Never to handle failures ourselves.)
So we have three separate things:
From what I understood, @goern, we would do 1 to see if we need 2 and/or 3?
> > If we see an OOM-killed advise, can we adjust the memory limit and rerun it?
>
> Sure, if we are the producer (aka talking to the k8s API) of the pod resource or its parent resource (workflow, job, whatever). (In that case we should probably have a RestartPolicy: Never to handle failures ourselves.)
To clarify a bit of my earlier statement: technically we would not rerun it, but create another one (pod/parent resource) with adjusted resources. But it's functionally equivalent.
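A sketch of what creating such a follow-up Job with a bumped memory limit could look like, using the kubernetes Python client; the namespace, image, and the 1.5x bump factor are hypothetical:

```python
# A sketch of "create another one with adjusted resources": a fresh Job with a
# bumped memory limit and restartPolicy: Never. Namespace, image, and the 1.5x
# bump factor are hypothetical.
from kubernetes import client, config


def resubmit_with_more_memory(name: str, image: str, old_limit_gi: float, namespace: str = "thoth") -> None:
    """Create a new Job running `image` with a memory limit 1.5x the old one."""
    config.load_incluster_config()
    new_limit = f"{old_limit_gi * 1.5:.2f}Gi"
    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(limits={"memory": new_limit}),
    )
    pod_spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"{name}-retry"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=0,  # we decide when to resubmit instead of letting k8s retry
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```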
Describe the bug
When I run an advise report on a large stack, I get an expected OOM error. However, there is no stack_info returned; instead, report is null and I only get an error message.

To Reproduce
https://khemenu.thoth-station.ninja/api/v1/advise/python/adviser-211209191901-3de39ce57cac2f07

Expected behavior
To see a populated stack_info object under report in the document.

Additional context
This may or may not be a bug, but it would be helpful to see how the report errored out in more detail besides looking at the logs.