isaac-sim / IsaacLab

Unified framework for robot learning built on NVIDIA Isaac Sim
https://isaac-sim.github.io/IsaacLab

[Bug Report] destination /storage doesn't exist in container #192

Closed LarsDoorenbos closed 3 weeks ago

LarsDoorenbos commented 9 months ago

Describe the bug

Following these steps to deploy Orbit on a Slurm cluster, container creation fails with a fatal error stating that "destination /storage doesn't exist in container".

Steps to reproduce

Follow the steps in the Cluster guide to run Orbit on an HPC cluster with Slurm.

My local machine starts a job on the cluster, which in turn starts the job that builds the Apptainer container. Container creation fails due to a mounting error. The log and error output from the job that builds the container look as follows:

(run_singularity.py): Called on compute node with arguments
WARNING: nv files may not be bound with --writable
WARNING: By using --writable, Apptainer can't create /storage destination automatically without overlay or underlay
FATAL:   container creation failed: mount hook function failure: mount /var/apptainer/mnt/session/storage->/storage error: while mounting /var/apptainer/mnt/session/storage: destination /storage doesn't exist in container
(run_singularity.py): Return

This error is mentioned in some docs, where a note says to "add directories in the container for each of the bind mounts explicitly", but it is unclear to me how to apply that fix in this context.
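
If the note means creating the missing directory inside the image itself, a minimal version would be something like the following, though I am not sure this is the intended fix for this setup (orbit.sif is assumed here to be a writable sandbox directory, not a read-only SIF file):

# Sketch only: pre-create the bind destination inside the image so Apptainer
# does not have to create it via overlay/underlay when running with --writable.
mkdir -p orbit.sif/storage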

My docker/.env does not specify /storage as a path anywhere, only as a prefix:

# Accept the NVIDIA Omniverse EULA by default
ACCEPT_EULA=Y
# NVIDIA Isaac Sim version to use (e.g. 2022.2.1)
ISAACSIM_VERSION=2023.1.0-hotfix.1
# Derived from the default path in the NVIDIA provided Isaac Sim container
DOCKER_ISAACSIM_PATH=/isaac-sim
# Docker user directory - by default this is the root user's home directory
DOCKER_USER_HOME=/root

###
# Cluster specific settings
###

# Docker cache dir for Isaac Sim (has to end on docker-isaac-sim)
# e.g. /cluster/scratch/$USER/docker-isaac-sim
CLUSTER_ISAAC_SIM_CACHE_DIR=/storage/workspaces/a*****/w****/lars/docker-isaac-sim
# Orbit directory on the cluster (has to end on orbit)
# e.g. /cluster/home/$USER/orbit
CLUSTER_ORBIT_DIR=/storage/homefs/l******/orbit
# Cluster login
CLUSTER_LOGIN=*****@****.ch
# Cluster scratch directory to store the SIF file
# e.g. /cluster/scratch/$USER
CLUSTER_SIF_PATH=/storage/workspaces/a*****/w****/lars
# Python executable within orbit directory to run with the submitted job
CLUSTER_PYTHON_EXECUTABLE=source/standalone/tutorials/00_sim/create_empty.py

System Info

Describe the characteristics of your environment:

Checklist

masoudmoghani commented 9 months ago

@AutonomousHansen would you be able to help with this? Thanks

hhansen-bdai commented 9 months ago

@pascal-roth Do you have any ideas here? The only place I can see that a bind mount could potentially be causing problems is here, but the error message says that the directory doesn't exist in the container rather than on the host?

pascal-roth commented 9 months ago

@AutonomousHansen I agree that line is the most probable cause. @LarsDoorenbos, can you ensure that the logs directory exists within your orbit directory on the cluster? It won't be synced there automatically and can be missing the first time you run the code.
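
For example, something like this (assuming the CLUSTER_ORBIT_DIR value from your .env):

# Run on the cluster; CLUSTER_ORBIT_DIR is the path from docker/.env.
mkdir -p "$CLUSTER_ORBIT_DIR/logs"
ls -ld "$CLUSTER_ORBIT_DIR/logs"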

LarsDoorenbos commented 9 months ago

@pascal-roth Yes, the logs directory exists in the cluster Orbit directory.

LarsDoorenbos commented 9 months ago

In the line mentioned by @AutonomousHansen, this logs directory is mounted to /workspace/orbit/logs, but adding ls /workspace to the run script gives ls: cannot access /workspace: No such file or directory. Maybe it should be mounted to a different place?

EDIT: changing /workspace/orbit to /storage/homefs/l******/orbit where orbit is located still gives the same error.

pascal-roth commented 9 months ago

That is expected: /workspace is defined within your Docker image and cannot be accessed from outside. It is the directory into which Orbit is copied and installed during the Docker build process (see here).
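
You can check this by listing the path through the container instead of on the host, for example (assuming the image is the orbit.sif built by the cluster scripts):

# Inside the container the path exists; on the host it does not.
apptainer exec orbit.sif ls /workspace/orbit
ls /workspace   # fails on the host, as you observed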

Try commenting out the line where the logs directory is bound into the image; then we can be certain whether this is causing the issue.

LarsDoorenbos commented 9 months ago

Removing the -B $CLUSTER_ORBIT_DIR/logs:/workspace/orbit/logs:rw \ line still gives the same error.

LarsDoorenbos commented 9 months ago

Manually adding the /storage folder inside the orbit.sif folder does work, but now it gives an error for another folder:

(run_singularity.py): Called on compute node with arguments
WARNING: nv files may not be bound with --writable
WARNING: By using --writable, Apptainer can't create /root/.cache/ov destination automatically without overlay or underlay
FATAL:   container creation failed: mount hook function failure: mount /scratch/local/4319483/docker-isaac-sim/cache/ov->/root/.cache/ov error: while mounting /scratch/local/4319483/docker-isaac-sim/cache/ov: destination /root/.cache/ov doesn't exist in container
(run_singularity.py): Return

However, unlike before, /root/.cache/ov does exist in the orbit.sif folder, so I cannot apply the same trick again...

Removing some of the binds gives the same error for a different bind, e.g. FATAL: container creation failed: mount hook function failure: mount /scratch/local/4319775/docker-isaac-sim/documents->/root/Documents error: while mounting /scratch/local/4319775/docker-isaac-sim/documents: destination /root/Documents doesn't exist in container, so something seems to be going wrong with the mounting in general.
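
To narrow it down, I could probably test each bind on its own against the image, roughly like this (illustrative only; the actual bind flags come from the run script, and the scratch paths are shortened):

# Test one bind at a time with --writable, as in the real run, to see which
# destinations fail; <jobid> stands in for the Slurm job's scratch directory.
apptainer exec --writable -B /scratch/local/<jobid>/docker-isaac-sim/cache/ov:/root/.cache/ov orbit.sif true
apptainer exec --writable -B /scratch/local/<jobid>/docker-isaac-sim/documents:/root/Documents orbit.sif true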

pascal-roth commented 8 months ago

Which Apptainer version are you using on the cluster?

LarsDoorenbos commented 8 months ago

apptainer version 1.1.3-1.el7. Maybe I should ask for an update ;)

pascal-roth commented 8 months ago

I agree; it seems like a general mounting error. It is difficult to reproduce from our side as we are running apptainer version 1.2.5-1.el7.
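
For reference, a quick way to check what is installed and whether underlay support is enabled (the config path assumes a default Apptainer installation and may differ on your cluster):

apptainer --version
# The warnings in your log show that with --writable, Apptainer cannot create
# missing destinations via overlay or underlay; the underlay setting itself
# lives in the system-wide config on a default install.
grep -i underlay /etc/apptainer/apptainer.conf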

LarsDoorenbos commented 8 months ago

For now, we found a different machine on which to run Orbit. Thanks anyway!

pascal-roth commented 3 weeks ago

Closing this issue for now, as it seems to be resolved.