nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

Our 2024.3.3 deployment crashes, then fixes itself in 20 minutes #2430

Open rsignell opened 4 months ago

rsignell commented 4 months ago

Context

We have Nebari 2024.3.3 deployed on AWS and we are using it for two classes of 30 and 20 students each (the classes don't meet at the same time). We've been using the deployment for three weeks and mostly it's working fine, but it has crashed twice (once during class last week, and once when nobody was using it and I tried to start a server).

In both cases we were unable to log in to the hub for about 20 minutes, but it eventually fixed itself. Here's what k9s looked like during the failures: [screenshot: k9s view during the failure]

We have Loki and I can isolate the exact minute the system failed, but I don't see anything interesting in the Hub or Autoscaler logs. They just stop:

[screenshot: Hub and autoscaler logs in Loki, stopping at the failure]

Value and/or benefit

See above.

Anything else?

Any suggestions on what to look at in the pod logs? Or somewhere else?

Here's the nebari-config.yaml in case it's of interest.

We have sufficient quota:

(pangeo) rsignell@OSC:~/nebari-meteocean$ coiled setup aws --quotas --region eu-west-1
╭────────────────────────────────────────────────────────────────────────────────────────╮
│ Current AWS Quotas                                                                     │
│                                                                                        │
│ Standard On-Demand                    3840 vCPU                                        │
│ Standard Spot                          640 vCPU                                        │
│ G4dn (NVIDIA T4 GPU) On-Demand        1108 vCPU                                        │
│ G4dn (NVIDIA T4 GPU) Spot               64 vCPU                                        │
│ P (NVIDIA V100/A100 GPU) On-Demand     192 vCPU                                        │
│ P (NVIDIA V100/A100 GPU) Spot           64 vCPU                                        │
│                                                                                        │
│ Standard includes:                                                                     │
│     general purpose M and T families (e.g., M6i, T3),                                  │
│     compute optimized C families (e.g., C6i),                                          │
│     memory optimized R families (e.g., R6i).                                           │
│                                                                                        │
│ GPU instances have separate quotas based on GPU type.                                 │
╰────────────────────────────────────────────────────────────────────────────────────────╯

rsignell commented 4 months ago

Could the problem be that we have a minimum of 1 on the general node group:

  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 5

I seem to remember that a config shared by @dharhas had:

  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 2
      max_nodes: 5

And if that was the issue, what would I look for in the logs (or the AWS console) to confirm that this was in fact the problem?

viniciusdc commented 4 months ago

Hmm, this is interesting. The fact that we can see keycloak-0 being terminated means that either someone manually killed it and it was being rescheduled, or the whole general instance was scaled down and back up, which forced the services on the terminating node to be rescheduled elsewhere. (That is what it looks like to me.)

Which could explain this part of the previous failure:

unable to login to the hub for about 20 minutes, but it eventually fixed itself

As the new node was scaled up, the hub pod might have been rescheduled onto another node (this process can sometimes take up to 15 min), and then it takes another ~30 s for the jupyterhub container to start up once the pod is running -- this assumes the hub pod was on the terminating node 10-10-26-23.

Is there a reason why you would need more than a single general node? If you do want more nodes, I suggest keeping the min_nodes key set to your desired amount. While the autoscaler does work, this scaling process can't be scheduled right now (though it is possible), which can lead to situations like this one.

As a follow-up on our side as well, we should check the scaling behavior of the general node, since in theory, as long as its compute resources are in use, the node should not scale down.
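
For reference, something like the following should show whether keycloak-0 and the hub pod were evicted/rescheduled rather than deleted manually -- just a sketch assuming kubectl access, with "dev" standing in for whatever namespace your Nebari deployment uses:

    # Recent events in the Nebari namespace; look for Killing/Evicted/Scheduled
    # entries around the failure window you isolated in Loki.
    kubectl get events -n dev --sort-by=.lastTimestamp

    # Why keycloak-0 restarted: the previous container's last state and exit reason.
    kubectl describe pod keycloak-0 -n dev

    # Pressure conditions (MemoryPressure, DiskPressure) on the general node
    # (replace the placeholder with the node's actual name).
    kubectl describe node <general-node-name>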

viniciusdc commented 4 months ago

And if that was the issue, what would I look for in the logs (or the AWS console) to confirm that this was in fact the problem?

@rsignell in your AWS console, if you look under Auto Scaling groups, you will find the one backing the general node group; it has its own activity log/events, where you can track its scale-up/scale-down requests and try to match them against the time window you've isolated.
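
If it helps, roughly the same history is available from the AWS CLI -- a minimal sketch, assuming you have CLI access and substituting the real Auto Scaling group name for the placeholder:

    # List the Auto Scaling groups backing the cluster and identify the general one.
    aws autoscaling describe-auto-scaling-groups \
      --query 'AutoScalingGroups[].AutoScalingGroupName'

    # Scaling activity (instance launches/terminations) for that group;
    # match the timestamps against the failure window from Loki.
    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name <general-asg-name> \
      --max-items 20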

rsignell commented 4 months ago

Okay, we will check that out!

rsignell commented 4 months ago

Here's what the general autoscaler activity looks like: [screenshot: general node group Auto Scaling activity history]

So it looks like at 9:31:24 this morning, the one instance supporting the general node group failed, and while another instance quickly replaced it, that caused some problems that took a while to recover from? Is that about right?

I don't really know what to make of it beyond that, hoping you guys do!

Obviously, if there is anything we should change to decrease the likelihood of having this issue again, please let us know!

marcelovilla commented 4 months ago

@rsignell did you experience another crash at 9:31:24 where you had to wait 20 minutes again?

rsignell commented 4 months ago

Yes

marcelovilla commented 4 months ago

@rsignell can you try setting a bigger instance for the general node group? One potential issue is that your instance is running out of memory, so all the services need to be restarted after the instance is replaced, which takes a relatively long time (~15-20 minutes) before everything is running again.

viniciusdc commented 4 months ago

@rsignell, @marcelovilla's comment seems like a good approach to avoid this happening again, as scaling should not be triggered if the instance has enough resources to keep itself running.

So it looks like at 9:31:24 this morning, the one instance supporting the general node group failed, and while another instance quickly replaced it, that caused some problems that took a while to recover from? Is that about right?

Yes, you are right. To find the cause of the instance dying, though, you might be able to look for the previous instance ID in the Loki logs under Grafana (Explore page) -- if there is any message containing the eviction keyword, then it's definitely a resource-exhaustion problem. You could also try checking the CloudTrail logs, as they include an error column. However, that usually only captures API call exceptions, so anything related to Kubernetes itself might not show up there.
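
As a complementary check from inside the cluster (Kubernetes events are only retained for about an hour by default, so this works best right after an incident; "dev" is a placeholder for your namespace):

    # Pods evicted by the kubelet, typically due to memory or disk pressure.
    kubectl get events -A --field-selector reason=Evicted

    # Containers whose previous run ended with OOMKilled, in the Nebari namespace.
    kubectl describe pods -n dev | grep -B 5 -i oomkilled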

rsignell commented 4 months ago

Ugh. We already have an m5.2xlarge, and since it runs 24/7 we will try to look into the memory use before we make that change.

viniciusdc commented 4 months ago

Hi @rsignell,

Ugh. We already have an m5.2xlarge, and since it runs 24/7 we will try to look into the memory use before we make that change.

That makes sense. The very first thing I would do is restrict the general node from scaling, so that you can avoid having the services become unreachable, at least until you get more details on the resource consumption.
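
In the meantime, a rough way to see where the memory on the general instance is going (assuming metrics-server is installed; otherwise the same numbers are on the Grafana node dashboards):

    # Per-node usage; the general node is the one running hub, keycloak, conda-store, etc.
    kubectl top nodes

    # Heaviest pods across the cluster, sorted by memory.
    kubectl top pods -A --sort-by=memory | head -20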

rsignell commented 4 months ago

@andreall can you provide more details here? I believe you told me it crashed for the third time during your class, on the exact same day and time again (Tuesday, 9:19 am CET), right?

viniciusdc commented 4 months ago

Hi @rsignell, I'm sorry for the delay in responding; we were organizing some action items for the next release's roadmap. I think you've talked with @marcelovilla recently; was the problem solved?

rsignell commented 4 months ago

No, it was not solved. It didn't crash last Tuesday during class, though, so we can drop the theory that it crashes at the same time each week, which is kind of reassuring. That would have been really strange.

andreall commented 4 months ago

Do you need me to send the logs or more info?