Closed LarsDoorenbos closed 3 weeks ago
@AutonomousHansen would you be able to help with this? Thanks
@pascal-roth Do you have any ideas here? The only place I can see that a bind mount could potentially be causing problems is here, but the error message is saying that the directory doesn't exist on the container instead of the host?
@AutonomousHansen I agree that line is the most probable cause. @LarsDoorenbos, can you check that the `logs` directory exists within your orbit directory on the cluster? It is not synced to the cluster, so it can be missing the first time you run the code.
@pascal-roth Yes, the `logs` directory exists in the cluster Orbit directory.
In the line mentioned by @AutonomousHansen, this `logs` directory is mounted to `/workspace/orbit/logs`, but adding `ls /workspace` to the run script gives `ls: cannot access /workspace: No such file or directory`. Maybe it should be mounted to a different place?
EDIT: changing `/workspace/orbit` to `/storage/homefs/l******/orbit`, where orbit is located, still gives the same error.
That is clear: `/workspace` is defined within your docker image and cannot be accessed from outside. It is the directory where orbit is copied and installed during the docker build process (see here).
Try commenting out the line where logs are bound to the image; then we can be certain whether this is causing the issue.
Removing the `-B $CLUSTER_ORBIT_DIR/logs:/workspace/orbit/logs:rw \` line still gives the same error.
Manually adding the /storage folder to the orbit.sif folder does work, but now it gives an error with another folder:
(run_singularity.py): Called on compute node with arguments
WARNING: nv files may not be bound with --writable
WARNING: By using --writable, Apptainer can't create /root/.cache/ov destination automatically without overlay or underlay
FATAL: container creation failed: mount hook function failure: mount /scratch/local/4319483/docker-isaac-sim/cache/ov->/root/.cache/ov error: while mounting /scratch/local/4319483/docker-isaac-sim/cache/ov: destination /root/.cache/ov doesn't exist in container
(run_singularity.py): Return
However, unlike before, `/root/.cache/ov` does exist in the orbit.sif folder, so I cannot do the same trick again...
Removing some of the binds gives the same error for a different bind, e.g. `FATAL: container creation failed: mount hook function failure: mount /scratch/local/4319775/docker-isaac-sim/documents->/root/Documents error: while mounting /scratch/local/4319775/docker-isaac-sim/documents: destination /root/Documents doesn't exist in container`, so something seems to be going wrong with the mounting in general.
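Since the failing bind changes each time one is removed, it may help to check every bind destination in one pass instead of one error at a time. The following is a minimal sketch, not part of the Orbit scripts: it assumes the image is an unpacked sandbox directory (like the orbit.sif folder above) and that binds use Apptainer's `src[:dest[:opts]]` form. The function names `bind_destination` and `missing_destinations` are hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def bind_destination(spec: str) -> str:
    """Return the container-side path of a 'src[:dest[:opts]]' bind spec."""
    parts = spec.split(":")
    # With no explicit destination, the source path is reused in the container.
    return parts[1] if len(parts) > 1 and parts[1] else parts[0]


def missing_destinations(sandbox: Path, binds: list[str]) -> list[str]:
    """List bind destinations that are not directories inside the sandbox."""
    missing = []
    for spec in binds:
        dest = bind_destination(spec)
        # Container paths are absolute, so strip the leading slash before joining.
        if not (sandbox / dest.lstrip("/")).is_dir():
            missing.append(dest)
    return missing
```

Calling `missing_destinations(Path("orbit.sif"), binds)` with the bind strings from the run script would return the destinations that are absent from the sandbox. An empty list would suggest, as observed for `/root/.cache/ov`, that the problem lies elsewhere than a missing directory.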
Which apptainer version are you using on the cluster?
`apptainer version 1.1.3-1.el7`. Maybe I should ask for an update ;)
I agree; it seems like a general mounting error. It is difficult to reproduce from our side as we are running `apptainer version 1.2.5-1.el7`.
For now, we found a different machine on which to run Orbit. Thanks anyway!
Closing this issue for now, seems to be resolved.
Describe the bug
When following these steps to deploy Orbit on a Slurm cluster, container creation fails with a fatal error stating that the destination `/storage` doesn't exist in the container.
Steps to reproduce
Follow the steps in the Cluster guide to run Orbit on an HPC cluster with Slurm.
My local machine starts a job on the cluster, which in turn starts the job that builds the apptainer. The container creation fails due to an error with mounting. The log and error from the job that builds the apptainer look as follows:
This error is mentioned in some docs, where the note says to "add directories in the container for each of the bind mounts explicitly", but it is unclear to me how to fix it in this context.
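One possible reading of that note, sketched in Python: if the image is an unpacked sandbox directory, the container-side destination of each bind can be created inside it before the container starts. This is an illustration under that assumption, not the documented fix; the function name `ensure_bind_destinations` is hypothetical.

```python
from __future__ import annotations

from pathlib import Path


def ensure_bind_destinations(sandbox: Path, destinations: list[str]) -> list[Path]:
    """Create each container-side bind destination inside the sandbox directory."""
    created = []
    for dest in destinations:
        # Container paths are absolute; make them relative to the sandbox root.
        target = sandbox / dest.lstrip("/")
        if not target.exists():
            target.mkdir(parents=True)
            created.append(target)
    return created
```

For example, `ensure_bind_destinations(Path("orbit.sif"), ["/storage"])` would create `orbit.sif/storage` if it is missing and return the list of paths it created.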
My `docker/.env` does not specify `/storage` as a path anywhere, only as a prefix:
System Info
Describe the characteristic of your environment:
Checklist