nebari-dev / nebari

🪴 Nebari - your open source data science platform
https://nebari.dev
BSD 3-Clause "New" or "Revised" License

[BUG] - Conda-store environments still reporting "building" after 24 hours #1559

Open rsignell-usgs opened 1 year ago

rsignell-usgs commented 1 year ago

Describe the bug

I have several environments that still say "building" after 24 hours (screenshot: 2022-11-18_16-55-46).

If I ssh into the conda-store-worker pod, this is what I see when I run top (screenshot: 2022-11-18_17-00-08).
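For anyone trying to reproduce this, a rough sketch of reaching the worker pod with kubectl rather than ssh; the namespace and pod name are placeholders, not taken from this deployment:

# Find the conda-store pods (Nebari commonly deploys into a namespace such as "dev";
# adjust to whatever namespace this deployment uses)
kubectl get pods -n dev | grep conda-store

# Open a shell in the worker pod (substitute the real pod name) and inspect running processes
kubectl exec -it -n dev <conda-store-worker-pod> -- /bin/bash
top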

I'm not sure if this is a bug. How best to proceed?

Expected behavior

environments build successfully or fail

OS and architecture in which you are running Nebari

Linux, AWS

How to Reproduce the problem?

Not sure

Command output

No response

Versions and dependencies used.

Nebari (qhub) 0.4.4

Compute environment

AWS

Integrations

Keycloak, conda-store, Dask, Argo

Anything else?

No response

iameskild commented 1 year ago

Hi @rsignell-usgs, thanks for bringing this to our attention! I've not seen this issue before so I'm unsure what the underlying cause might be. Can you please provide logs for the conda-store-worker and conda-store-server pods? Do you also happen to know how much space the conda-store storage has remaining?
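For reference, one way to gather that information with kubectl (namespace and pod names are placeholders):

# List the conda-store pods
kubectl get pods -n dev | grep conda-store

# Logs for the worker and server pods
kubectl logs -n dev <conda-store-worker-pod> --tail=200
kubectl logs -n dev <conda-store-server-pod> --tail=200

# Remaining space on the volumes mounted in the worker
kubectl exec -n dev <conda-store-worker-pod> -- df -h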

cc @costrouc

rsignell-usgs commented 1 year ago

@iameskild thanks for taking a look at this.

conda-store is only 25% full (I've been regularly keeping a watch on this and deleting old environments)

conda-store-worker log

[2022-11-21 14:10:00,746: ERROR/ForkPoolWorker-4] Task task_build_conda_docker[4f2b0176-2e99-4c69
Traceback (most recent call last):
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/celery/app/trace.py", lin
    R = retval = fun(*args, **kwargs)
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/celery/app/trace.py", lin
    return self.run(*args, **kwargs)
  File "/opt/conda-store-server/conda_store_server/worker/tasks.py", line 114, in task_build_cond
    build_conda_docker(conda_store, build)
  File "/opt/conda-store-server/conda_store_server/build.py", line 296, in build_conda_docker
    image = build_docker_environment_image(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
    add_conda_layers(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
    add_conda_package_layers(
  File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
    with open(meta_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp8ixguert/opt/conda/conda-meta/_l

conda-store-server pod logs seem fine.

So I shouldn't just kill the conda-store processes running in the container, right?

aktech commented 1 year ago

@rsignell-usgs can you also share the environment, in case we can reproduce the problem from it?

pavithraes commented 1 year ago

@rsignell-usgs Thanks again for reporting this! As @aktech mentioned, it'll be helpful to be able to reproduce it.

Since it has been a while, could you please share your environment file if you see this happening again?

We can leave this issue open for a couple of weeks, and then close it if nobody sees this behavior. We can always re-open if needed. :)

rsignell-usgs commented 1 year ago

Sorry, this one slipped off my radar and we ended up destroying this deploy due to other issues.

rsignell commented 6 months ago

I'm hitting this problem again with my Open Science Computing Nebari deployment (2024.3.2).
I had successfully built a large environment in conda-store in about 17 minutes. I then added one additional package to install via pip and it's been "building" for the last 24 hours (still building).

In k9s, I see a second conda-store-worker pod with "ContainerStatusUnknown" (screenshot).

How best to handle this?

rsignell commented 6 months ago

I pressed "d" (describe) on the pod shown in red above, and this is what it reports (screenshot).

I checked the space available on the conda-store pod and there is plenty:

Filesystem      Size  Used Avail Use% Mounted on
overlay          50G   32G   19G  63% /
tmpfs            64M     0   64M   0% /dev
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1   50G   32G   19G  63% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs            30G  4.0K   30G   1% /var/lib/conda-store
tmpfs            30G   12K   30G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            16G     0   16G   0% /proc/acpi
tmpfs            16G     0   16G   0% /sys/firmware

rsignell commented 6 months ago

After deleting the duplicate pending conda-store-worker pod (the pod shown in red above), things seemed to be okay in k9s (no red pods).
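For anyone without k9s handy, roughly the equivalent kubectl steps (namespace and pod name are placeholders); pods stuck in ContainerStatusUnknown sometimes need a force delete:

# Find the stuck worker pod
kubectl get pods -n dev | grep conda-store-worker

# Delete it; add --grace-period=0 --force only if the normal delete hangs
kubectl delete pod -n dev <stuck-conda-store-worker-pod>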

I then tried building my environment again, and it said "building". But then I noticed that the previous build also still said "building". And then neither one finished building (I waited a few hours).

When I tried logging in 24 hours later to see if anything had finished or changed, I found I could not even log in to conda-store, and k9s looked like the attached screenshot.

I guess it's time to destroy.... :(

viniciusdc commented 6 months ago

Oh, that's something you don't see every day.

There are two main things going on right now:

For the conda-store:

actions:

Now, the red services. Usually the main causes for this are:

When this happens, the important thing is to make sure it's in a running state, which is the case for almost all of the services there. Good thing.

I am concerned about the ebs-csi-controller though, as it's usually associated with scaling and node provisioning on AWS/Kubernetes.
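If it's useful, those controller logs can usually be pulled directly; this sketch assumes the standard EKS addon layout where the driver runs in kube-system (pod name is a placeholder):

# Locate the EBS CSI controller pod and dump recent logs from all of its containers
kubectl get pods -n kube-system | grep ebs-csi
kubectl logs -n kube-system <ebs-csi-controller-pod> --all-containers --tail=100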

rsignell commented 6 months ago

@viniciusdc I tried to get the logs from k9s, but every "red" pod I tried refused to return logs. I didn't see anything useful either when pressing "d" (describe) on the pods.
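A couple of standard kubectl fallbacks that sometimes surface information when a crashed pod won't return logs (namespace and pod name are placeholders):

# Logs from the previous (crashed) container instance
kubectl logs -n dev <pod-name> --previous

# Recent cluster events, newest last
kubectl get events -n dev --sort-by=.lastTimestamp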

From the console, I did notice that the ebs-csi-controller referenced something about Kubernetes 1.39, while my Nebari 2024.03.02 config is still specifying 1.26.

So my plan is to:

rsignell commented 6 months ago

I destroyed the 2024.03.02 deployment, upgraded the config and changed Kubernetes from 1.26 to 1.29 in it (the upgrade process apparently didn't do that), and deployed 2024.03.03.

The deployment went fine, but when I tried to create the environment again, the build in conda-store stayed pending for 4 hours (when I logged into the conda-store worker, CPU was at 100% and memory use at 5%).
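A quick way to watch that from outside the pod, assuming the cluster has metrics-server installed (namespace is a placeholder):

# Per-pod CPU/memory usage for the conda-store pods
kubectl top pod -n dev | grep conda-store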

So I destroyed again.

Here is the environment.yml that I'm trying to build.

The crazy thing is that it built fine without the last package (the cmip6-downscaling package).

viniciusdc commented 6 months ago

Thanks for the details @rsignell. The only thing that comes to mind would be conda trying to prepare a dependency report. Were you able to build the env locally?

rsignell commented 6 months ago

I will try to build locally. That's a good idea!
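For anyone trying the same experiment, a minimal local-build sketch; the environment.yml here is illustrative (only the pip package named above), not the file attached earlier in the thread:

# Illustrative spec, not the real environment.yml from this thread
cat > environment.yml <<'EOF'
name: local-repro
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - cmip6-downscaling
EOF

# Time the solve/install to see whether the pip step is where the build stalls
time conda env create -f environment.yml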