rsignell opened 6 months ago
Could the problem be that we have a minimum of 1 on the general node:
node_groups:
  general:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 5
I seem to remember a config shared by @dharhas had:
node_groups:
  general:
    instance: m5.2xlarge
    min_nodes: 2
    max_nodes: 5
And if that was the issue, what would I look for in the logs (or AWS console) to indicate that in fact that was the problem?
Uhm, this is interesting. The fact that we can see keycloak-0 being terminated means that either someone manually killed it and it was being rescheduled, or the whole general instance was scaled up and down, which caused the services on the terminating node to be rescheduled elsewhere. (That is what it looks like to me.)
Which could explain this part of the previous failure:
unable to log in to the hub for about 20 minutes, but it eventually fixed itself
As the new node was scaled up, the hub pod might have been rescheduled to another node (this process can sometimes take up to 15 minutes), and after that, another ~30 seconds for the jupyterhub container to start up (assuming the pod was already running) -- this assumes the hub pod was present on the terminating node 10-10-26-23.
Is there a reason why you would need more than a single general node? If you do want more nodes, I suggest keeping the min_nodes key set to your desired amount. While the auto-scaler does work, there is currently no way to schedule when that scaling happens (it is technically possible, though), which could lead to situations like this.
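For illustration only, here is a minimal sketch of the same node_groups block with min_nodes pinned equal to max_nodes so the general node group never scales up or down (the counts below are placeholders, not a sizing recommendation for your workload):
node_groups:
  general:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 1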
As a follow-up on our side as well, we should check the scaling behaviour of the general node group; in theory, as long as its compute resources are in use, the node should not scale down.
And if that was the issue, what would I look for in the logs (or AWS console) to indicate that in fact that was the problem?
@rsignell in your AWS console, if you look for the Auto Scaling groups, you will find the general one, which has its own logs/events. There you can track its scale-up/scale-down requests and try to match them with the time window you've isolated.
Okay, we will check that out!
Here's what the general autoscaler activity looks like:

So it looks like at 9:31:24 this morning, the one instance supporting the general node group failed, and while another instance quickly replaced it, that caused some problems that took a while to recover from? Is that about right?
I don't really know what to make of it beyond that, hoping you guys do!
OBVI if there is anything we should change to decrease the likelihood of having this issue again, please let us know!
@rsignell did you experience another crash at 9:31:24 where you had to wait 20 minutes again?
Yes
@rsignell can you try setting a bigger instance for the general node group? One potential issue is that your instance is running out of memory, and all the services need to be restarted after the instance has been replaced, which takes a relatively long time (~15-20 minutes) before everything is running again.
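For example, a hedged sketch of the node_groups block with a larger instance type (m5.4xlarge is used here only as an illustrative next size up, not a tested recommendation; pick whatever fits your actual memory needs):
node_groups:
  general:
    instance: m5.4xlarge
    min_nodes: 1
    max_nodes: 5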
@rsignell, @marcelovilla's comment seems like a good approach to avoid this happening again, as the scaling should not be triggered if the instance has enough resources to keep itself running.
So it looks like at 9:31:24 this morning, the one instance supporting the general node group failed, and while another instance quickly replaced it, that caused some problems that took a while to recover from? Is that about right?
Yes, you are right. To find the cause of the instance dying, though, you might be able to look for the previous instance ID in the Prometheus logs under Grafana (Explore page) -- if there is any message containing the eviction keyword, then it's definitely a resource-exhaustion problem. You could also try checking the CloudTrail logs, as they might contain the error column; however, that usually only covers API call exceptions, so anything related to Kubernetes might not show up there.
Ugh. We already have an m5.2xlarge, and since it runs 24/7 we will try to look into the memory use before we make that change.
Hi @rsignell,
Ugh. We already have an m5.2xlarge, and since it runs 24/7 we will try to look into the memory use before we make that change.
That makes sense. The very first thing I would do is restrict the general node group from scaling, so that you avoid having the services become unreachable, at least until you get more details on the resource consumption.
@andreall can you provide more details here? I believe you told me it crashed for the third time during your class at the exact same day/time again (Tuesday, 9:19 am CET), right?
Hi @rsignell, I'm sorry for the delay in responding; we were organizing some action items for the next release's roadmap. I think you've talked with @marcelovilla recently -- was the problem solved?
No, it was not solved. It didn't crash last Tuesday during class, though, so we can drop the theory of it crashing at the same time each week, which is kind of reassuring. That would have been really strange.
Do you need me to send the logs or more info?
Context
We have Nebari 2024.3.3 deployed on AWS and we are using it for two classes of 30 and 20 students each (the classes don't meet at the same time). We've been using the deployment for three weeks and mostly it's working fine, but it has crashed twice (once during class last week, and once when nobody was using it and I tried to start a server).
In both cases we were unable to log in to the hub for about 20 minutes, but it eventually fixed itself. Here's what k9s looked like during the failures:
We have Loki and I can isolate the exact minute the system failed, but I don't see anything interesting in the Hub or Autoscaler logs. They just stop:
Value and/or benefit
See above.
Anything else?
Any suggestions on what to look for in the pod logs? Or something else?
Here's the nebari-config.yaml in case it's of interest.
We have sufficient quota: