nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

[BUG] - Conda-store environments still reporting "building" after 24 hours #1559

Open rsignell-usgs opened 1 year ago

rsignell-usgs commented 1 year ago

Describe the bug

I have several environments that still say "building" after 24 hours (screenshot: 2022-11-18_16-55-46).

If I ssh into the conda-store-worker pod, this is what I see when I run top (screenshot: 2022-11-18_17-00-08).
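For anyone trying to reproduce this, a rough sketch of reaching the worker pod with kubectl rather than ssh; the namespace and pod name are placeholders, not taken from this deployment:

# Find the conda-store pods (Nebari commonly deploys into a namespace such as "dev";
# adjust to whatever namespace this deployment uses)
kubectl get pods -n dev | grep conda-store

# Open a shell in the worker pod (substitute the real pod name) and inspect running processes
kubectl exec -it -n dev <conda-store-worker-pod> -- /bin/bash
top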

I'm not sure if this is a bug. How best to proceed?

Expected behavior

environments build successfully or fail

OS and architecture in which you are running Nebari

Linux, AWS

How to Reproduce the problem?

Not sure

Command output

No response

Versions and dependencies used.

Nebari (qhub) 0.4.4

Compute environment

AWS

Integrations

Keycloak, conda-store, Dask, Argo

Anything else?

No response

iameskild commented 1 year ago

Hi @rsignell-usgs, thanks for bringing this to our attention! I've not seen this issue before so I'm unsure what the underlying cause might be. Can you please provide logs for the conda-store-worker and conda-store-server pods? Do you also happen to know how much space the conda-store storage has remaining?
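For reference, one way to gather that information with kubectl (namespace and pod names are placeholders):

# List the conda-store pods
kubectl get pods -n dev | grep conda-store

# Logs for the worker and server pods
kubectl logs -n dev <conda-store-worker-pod> --tail=200
kubectl logs -n dev <conda-store-server-pod> --tail=200

# Remaining space on the volumes mounted in the worker
kubectl exec -n dev <conda-store-worker-pod> -- df -h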

cc @costrouc

rsignell-usgs commented 1 year ago

@iameskild thanks for taking a look at this.

conda-store is only 25% full (I've been regularly keeping a watch on this and deleting old environments)

conda-store-worker log

[2022-11-21 14:10:00,746: ERROR/ForkPoolWorker-4] Task task_build_conda_docker[4f2b0176-2e99-4c69
Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/celery/app/trace.py", lin
    R = retval = fun(*args, **kwargs)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/celery/app/trace.py", lin
    return self.run(*args, **kwargs)
  File "/opt/conda-store-server/conda_store_server/worker/tasks.py", line 114, in task_build_cond
    build_conda_docker(conda_store, build)
  File "/opt/conda-store-server/conda_store_server/build.py", line 296, in build_conda_docker
    image = build_docker_environment_image(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
    add_conda_layers(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
    add_conda_package_layers(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
    with open(meta_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp8ixguert/opt/conda/conda-meta/_l

conda-store-server pod logs seem fine.

So I shouldn't just kill the conda-store processes running in the container, right?

aktech commented 1 year ago

@rsignell-usgs can you also share the environment, in case we can reproduce the problem from it?

pavithraes commented 1 year ago

@rsignell-usgs Thanks again for reporting this! As @aktech mentioned, it'll be helpful to be able to reproduce it.

Since it has been a while, could you please share your environment file if you see this happening again?

We can leave this issue open for a couple of weeks, and then close it if nobody sees this behavior. We can always re-open if needed. :)

rsignell-usgs commented 1 year ago

Sorry, this one slipped off my radar and we ended up destroying this deploy due to other issues.

rsignell commented 6 months ago

I'm hitting this problem again with my Open Science Computing Nebari deployment (2024.3.2).
I had successfully built a large environment in conda-store in about 17 minutes. I then added one additional package to install via pip and it's been "building" for the last 24 hours (still building).

In k9s, I see a second conda-store-worker pod with "ContainerStatusUnknown" (screenshot).

How best to handle this?

rsignell commented 6 months ago

I pressed "d" (describe) on the pod shown in red above, and this is what it reports (screenshot).

I checked the space available on the conda-store pod and there is plenty:

Filesystem      Size  Used Avail Use% Mounted on
overlay          50G   32G   19G  63% /
tmpfs            64M     0   64M   0% /dev
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1   50G   32G   19G  63% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs            30G  4.0K   30G   1% /var/lib/conda-store
tmpfs            30G   12K   30G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            16G     0   16G   0% /proc/acpi
tmpfs            16G     0   16G   0% /sys/firmware

rsignell commented 6 months ago

After deleting the duplicate pending conda-store-worker pod (the pod shown in red above), things seemed to be okay in k9s (no red pods).
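For anyone without k9s handy, roughly the equivalent kubectl steps (namespace and pod name are placeholders); pods stuck in ContainerStatusUnknown sometimes need a force delete:

# Find the stuck worker pod
kubectl get pods -n dev | grep conda-store-worker

# Delete it; add --grace-period=0 --force only if the normal delete hangs
kubectl delete pod -n dev <stuck-conda-store-worker-pod>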

I then tried building my environment again, and it said "building". But then I noticed that the previous build also still said "building". And then neither one finished building (I waited a few hours).

When I tried logging in 24 hours later to see if anything had finished or changed, I found I could not even log in to conda-store, and k9s looked like the attached screenshot.

I guess it's time to destroy.... :(

viniciusdc commented 6 months ago

Oh, that's something you don't see every day.

There are two main things going on right now:

For the conda-store:

actions:

Now, the red services. Usually the main causes for this are:

When this happens, the important thing is to make sure it's in a running state, which is the case for almost all of the services there. Good thing.

I am concerned about the ebs-csi-controller though, as it's usually associated with scaling and node provisioning on AWS/Kubernetes.
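If it's useful, those controller logs can usually be pulled directly; this sketch assumes the standard EKS addon layout where the driver runs in kube-system (pod name is a placeholder):

# Locate the EBS CSI controller pod and dump recent logs from all of its containers
kubectl get pods -n kube-system | grep ebs-csi
kubectl logs -n kube-system <ebs-csi-controller-pod> --all-containers --tail=100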

rsignell commented 6 months ago

@viniciusdc I tried to get the logs from k9s, but every "red" pod I tried refused to return logs. I didn't see anything useful either when pressing "d" (describe) on the pods.
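A couple of standard kubectl fallbacks that sometimes surface information when a crashed pod won't return logs (namespace and pod name are placeholders):

# Logs from the previous (crashed) container instance
kubectl logs -n dev <pod-name> --previous

# Recent cluster events, newest last
kubectl get events -n dev --sort-by=.lastTimestamp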

From the console, I did notice that the ebs-csi-controller referenced something about Kubernetes 1.39, while my Nebari 2024.03.02 config is still specifying 1.26.

So my plan is to:

rsignell commented 6 months ago

I destroyed the 2024.03.02 deployment, upgraded the config and changed Kubernetes from 1.26 to 1.29 in it (the upgrade process apparently didn't do that), and deployed 2024.03.03.

The deployment went fine, but when I tried to create the environment again, the build in conda-store stayed pending for 4 hours (when I logged into the conda-store worker, CPU was at 100% and memory use at 5%).
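A quick way to watch that from outside the pod, assuming the cluster has metrics-server installed (namespace is a placeholder):

# Per-pod CPU/memory usage for the conda-store pods
kubectl top pod -n dev | grep conda-store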

So I destroyed again.

Here is the environment.yml that I'm trying to build.

The crazy thing is that it built fine without the last package (the cmip6-downscaling package).

viniciusdc commented 6 months ago

Thanks for the details @rsignell. The only thing that comes to mind would be conda trying to prepare a dependency report. Were you able to build the env locally?

rsignell commented 6 months ago

I will try to build locally. That's a good idea!
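For anyone trying the same experiment, a minimal local-build sketch; the environment.yml here is illustrative (only the pip package named above), not the file attached earlier in the thread:

# Illustrative spec, not the real environment.yml from this thread
cat > environment.yml <<'EOF'
name: local-repro
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - cmip6-downscaling
EOF

# Time the solve/install to see whether the pip step is where the build stalls
time conda env create -f environment.yml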