Open rsignell-usgs opened 1 year ago
Hi @rsignell-usgs, thanks for bringing this to our attention! I've not seen this issue before, so I'm unsure what the underlying cause might be. Can you please provide logs for the conda-store-worker and conda-store-server pods? Do you also happen to know how much space the conda-store storage has remaining?
cc @costrouc
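For anyone else gathering the logs requested above, here is a minimal sketch of one way to do it with `kubectl` from Python. It assumes `kubectl` is already configured for the cluster and that the deployment lives in a namespace called `dev`; the namespace, the pod-name fragments, and the helper names (`matching_pods`, `logs_command`, `collect_logs`) are illustrative assumptions, not part of Nebari or conda-store.

```python
import json
import subprocess


def matching_pods(pod_list, name_fragment):
    """Return names of pods whose name contains name_fragment.

    pod_list is the parsed JSON from `kubectl get pods -o json`.
    """
    return [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if name_fragment in item["metadata"]["name"]
    ]


def logs_command(pod, namespace="dev", tail=500):
    """Build the kubectl command that prints the last `tail` log lines of a pod."""
    return ["kubectl", "logs", pod, "--namespace", namespace, f"--tail={tail}"]


def collect_logs(name_fragment, namespace="dev", tail=500):
    """Fetch recent logs for every pod matching name_fragment (requires kubectl)."""
    raw = subprocess.run(
        ["kubectl", "get", "pods", "--namespace", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    logs = {}
    for pod in matching_pods(json.loads(raw), name_fragment):
        result = subprocess.run(
            logs_command(pod, namespace, tail), capture_output=True, text=True
        )
        # Pods in a broken state may refuse to return logs; keep stderr in that case.
        logs[pod] = result.stdout or result.stderr
    return logs


# Example (assumed namespace "dev"):
# for fragment in ("conda-store-worker", "conda-store-server"):
#     for pod, text in collect_logs(fragment).items():
#         print(f"===== {pod} =====\n{text}")
```

Pasting the combined output for both pods into the issue keeps the worker and server logs side by side.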
@iameskild thanks for taking a look at this.
conda-store is only 25% full (I've been regularly keeping a watch on this and deleting old environments)
conda-store-worker log
conda-store-worker [2022-11-21 14:10:00,746: ERROR/ForkPoolWorker-4] Task task_build_conda_docker[4f2b0176-2e99-4c69
conda-store-worker Traceback (most recent call last):
conda-store-worker   File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/celery/app/trace.py", lin
conda-store-worker     R = retval = fun(*args, **kwargs)
conda-store-worker   File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/celery/app/trace.py", lin
conda-store-worker     return self.run(*args, **kwargs)
conda-store-worker   File "/opt/conda-store-server/conda_store_server/worker/tasks.py", line 114, in task_build_cond
conda-store-worker     build_conda_docker(conda_store, build)
conda-store-worker   File "/opt/conda-store-server/conda_store_server/build.py", line 296, in build_conda_docker
conda-store-worker     image = build_docker_environment_image(
conda-store-worker   File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
conda-store-worker     add_conda_layers(
conda-store-worker   File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
conda-store-worker     add_conda_package_layers(
conda-store-worker   File "/opt/conda/envs/conda-store-server/lib/python3.10/site-packages/conda_docker/conda.py", l
conda-store-worker     with open(meta_path) as f:
conda-store-worker FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp8ixguert/opt/conda/conda-meta/_l
conda-store-server pod logs seem fine.
So I shouldn't just kill the conda-store processes running in the container, right?
@rsignell-usgs can you also share the environment, in case we can reproduce the issue from it?
@rsignell-usgs Thanks again for reporting this! As @aktech mentioned, it'll be helpful to be able to reproduce it.
Since it has been a while, could you please share your environment file if you see this happening again?
We can leave this issue open for a couple of weeks, and then close it if nobody sees this behavior. We can always re-open if needed. :)
Sorry, this one slipped off my radar and we ended up destroying this deploy due to other issues.
I'm hitting this problem again with my Open Science Computing Nebari deployment (2024.3.2).
I had successfully built a large environment in conda-store in about 17 minutes. I then added one additional package to install via pip and it's been "building" for the last 24 hours (still building).
In k9s, I see a 2nd conda-store worker pod with "ContainerStatusUnknown":
How best to handle this?
I clicked "d" on the pod above shown in red, and it says:
I checked the space available on the conda-store pod and there is plenty:
Filesystem      Size  Used Avail Use% Mounted on
overlay          50G   32G   19G  63% /
tmpfs            64M     0   64M   0% /dev
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1   50G   32G   19G  63% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs            30G  4.0K   30G   1% /var/lib/conda-store
tmpfs            30G   12K   30G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            16G     0   16G   0% /proc/acpi
tmpfs            16G     0   16G   0% /sys/firmware
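The disk-space check above can also be scripted, which helps when it needs repeating over time. The following is a rough sketch that parses `df -h` style output and reports any mount point at or above a usage threshold; the function names (`parse_df`, `nearly_full`) and the 90% threshold are arbitrary choices for illustration, and the sample text is a subset of the output shown above.

```python
SAMPLE_DF = """\
Filesystem Size Used Avail Use% Mounted on
overlay 50G 32G 19G 63% /
tmpfs 64M 0 64M 0% /dev
/dev/nvme0n1p1 50G 32G 19G 63% /etc/hosts
tmpfs 30G 4.0K 30G 1% /var/lib/conda-store
"""


def parse_df(text):
    """Parse `df -h` output into a list of dicts, skipping the header row."""
    rows = []
    lines = [line for line in text.strip().splitlines() if line.strip()]
    for line in lines[1:]:
        # Split into at most 6 fields so mount points containing spaces survive.
        parts = line.split(None, 5)
        if len(parts) != 6 or not parts[4].endswith("%"):
            continue  # ignore rows that do not look like df data
        filesystem, size, used, avail, use_pct, mounted_on = parts
        rows.append(
            {
                "filesystem": filesystem,
                "size": size,
                "used": used,
                "avail": avail,
                "use_percent": int(use_pct.rstrip("%")),
                "mounted_on": mounted_on,
            }
        )
    return rows


def nearly_full(rows, threshold=90):
    """Return mount points whose usage is at or above threshold percent."""
    return [row["mounted_on"] for row in rows if row["use_percent"] >= threshold]


rows = parse_df(SAMPLE_DF)
print(nearly_full(rows))  # empty list here: nothing in the sample is close to full
```

With the numbers from this pod, no mount point is flagged, which matches the observation that space is not the limiting factor.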
After deleting the duplicate conda-store worker pod (the pod shown in red above), things seemed to be okay in k9s (no red pods).
I then tried building my environment again, and it said "building". But then I noticed that the previous build also still said "building". And then neither one finished building (I waited a few hours).
When I logged in 24 hours later to see if anything had finished or changed, I found I could not even log in to conda-store, and k9s looks like this:
I guess it's time to destroy.... :(
Oh, that's something you don't see every day.
There are two main things going on right now:
For the conda-store:
actions:
Now, the red services. Usually the main causes for this are:
When this happens, the important thing is to make sure it's in a running state, which is the case for almost all of the services there. Good thing.
I am concerned about the ebs-csi-controller though, as it's usually associated with scaling and node management on AWS/k8s.
@viniciusdc I tried to get the logs from k9s, but every "red" pod I tried refused to return logs. I didn't see anything either doing "d" on the pods.
From the console, I did notice that the ebs-csi-controller referenced something about kubernetes 1.39, while my nebari 2024.03.02 config is still specifying 1.26.
So my plan is to:
I destroyed the 2024.03.02 deployment, upgraded the config, changed kubernetes from 1.26 to 1.29 in the config (the upgrade process apparently didn't do that), and deployed 2024.03.03.
The deployment went fine, but when I tried to create the environment again, the build in conda-store stayed pending for 4 hours (when I logged into the conda-store worker, the cpu was 100% and memory use 5%).
So I destroyed again.
Here is the environment.yml that I'm trying to build.
The crazy thing is it built okay without the last package (the cmip6-downscaling package).
Thanks for the details @rsignell. The only thing that comes to mind would be conda trying to prepare a dependency report. Were you able to build the env locally?
I will try to build locally. That's a good idea!
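One way to try the local build without waiting indefinitely is to wrap `conda env create` in a timeout, so a solve that hangs is reported as a timeout rather than left running. This is only a sketch: the environment name, the one-hour limit, and the helper names (`create_command`, `timed_build`) are assumptions for illustration, and it expects `conda` to be on the PATH.

```python
import subprocess
import time


def create_command(env_file, env_name):
    """Build the conda command that creates env_name from env_file."""
    return ["conda", "env", "create", "--file", env_file, "--name", env_name]


def timed_build(env_file, env_name, timeout_s=3600, runner=subprocess.run):
    """Run the build and return (status, elapsed_seconds).

    status is "ok", "failed", or "timeout". `runner` is injectable so the
    logic can be exercised without conda installed.
    """
    start = time.monotonic()
    try:
        result = runner(create_command(env_file, env_name), timeout=timeout_s)
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process once the timeout expires.
        status = "timeout"
    return status, time.monotonic() - start


# Example (hypothetical environment name):
# status, elapsed = timed_build("environment.yml", "local-repro-test")
# print(f"{status} after {elapsed / 60:.1f} minutes")
```

Running it once with and once without the last package would show whether the hang reproduces outside conda-store.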
Describe the bug
I have several environments that still say "building" after 24 hours:
If I ssh into the conda-store-worker pod, I see this when I run top:
I'm not sure if this is a bug. How best to proceed?
Expected behavior
Environments either build successfully or fail.
OS and architecture in which you are running Nebari
Linux, AWS
How to Reproduce the problem?
Not sure
Command output
No response
Versions and dependencies used.
Nebari (qhub) 0.4.4
Compute environment
AWS
Integrations
Keycloak, conda-store, Dask, Argo
Anything else?
No response