Open christianjans opened 3 years ago
How can we install some additional packages in the container? There seem to be some permission issues to do so on Compute Canada.
@2017soft,
How can we install some additional packages in the container?
I believe if you also bind the certificates of the Compute Canada machine to the same directory of the Singularity container, then the certificates needed to install extra packages should be found. That is, run it like this:
$ singularity shell --bind /etc/pki/tls/certs/ca-bundle.crt,SMARTS/:/SMARTS smarts-0416_singularity.sif
And then try pip install ... or whatever.
This works for some packages, but I know when I tried to set up SMARTS this way, some packages wouldn't install. If this is the case for you, I would try changing the Dockerfile of SMARTS so that it also installs the packages you need (see https://github.com/huawei-noah/SMARTS/blob/c5a16088152e840dbaa9afe302e8bb9338bc6a93/Dockerfile#L62-L63). Then you can build and push this new image to your DockerHub (see http://jsta.github.io/r-docker-tutorial/04-Dockerhub.html), and then use this image instead of huaweinoah/smarts:v0.4.16.
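The rebuild-and-swap route described above could look roughly like the sketch below. The image name myuser/smarts-custom:v0.4.16 is a placeholder that does not come from this thread, and the helper only prints the commands so they can be reviewed first: the Docker steps are meant for a machine where Docker is available, and the pull step for Compute Canada.

```shell
# Sketch only: prints the commands for building a customized SMARTS image
# and pulling it as a Singularity image. Nothing is executed.
# "myuser/smarts-custom:v0.4.16" is a hypothetical placeholder name.

print_custom_image_steps() {
    # $1: Docker Hub image reference, e.g. myuser/smarts-custom:v0.4.16

    # On a machine with Docker, from the SMARTS checkout whose Dockerfile
    # was edited to install the extra packages:
    echo "docker build -t $1 ."
    echo "docker push $1"

    # On Compute Canada, convert the pushed image into a Singularity image:
    echo "module load singularity"
    echo "singularity pull docker://$1"
}

print_custom_image_steps "myuser/smarts-custom:v0.4.16"
```

The resulting Singularity image would then take the place of smarts-0416_singularity.sif in the commands used elsewhere in this thread.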
Thanks for the instructions! I ended up using the Docker method and it seems to work now. However, when I tried to run SMARTS using a different algorithm, the following error shows up:
RendererException: Error in initializing framework for opening graphical display
and creating scene graph. A typical reason is display not found. Try running
with different configurations of `export DISPLAY=` using `:0`, `:1`... . If this
does not work please consult the documentation.
Exception was: {e}
ERROR:RemoteAgentBuffer:Exception while tearing down buffered remote agent. ValueError('Cannot invoke RPC on closed channel!')
I have tried export DISPLAY= with :0, :1, ..., but they didn't seem to work. What might be happening here? There seems to be a similar issue mentioned here: https://github.com/huawei-noah/SMARTS/issues/786
Hi @2017soft, no worries, glad it seemed to work. Regarding your issue, yes we have seen that before. Unfortunately, I am not getting this error though when running SMARTS using Singularity on Compute Canada. Could you provide some more information? For example, what command did you use to SSH into Compute Canada? When running this experiment, did you use a compute node (instead of a login node) on Compute Canada? What command did you run to execute this experiment?
@Gamenot, do you have any comments or advice?
Hi @christianjans, I used ssh -Y userID@graham.computecanada.ca to log in to Compute Canada.
I have tried both a compute node and a login node, but they both gave the same error.
I was running code from another Git repository, but I imported some SMARTS modules into some of the files in that repository.
@2017soft would this work? singularity shell --bind $PWD:/SMARTS --env DISPLAY=$DISPLAY smarts-0416_singularity.sif
Hi @reneeleung @christianjans, I have just figured out the problem. It turns out that I have to downgrade XQuartz from 2.8.1 to 2.7.8 on my MacBook. The DISPLAY starts working after that.
Also, @reneeleung is right. We have to set DISPLAY inside Singularity to be equal to the DISPLAY outside the container as well.
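Since the fix depends on the container seeing the same DISPLAY as the login session, a small guard before launching can catch a missing X11 forward early. This is only a sketch; display_env_flag is a made-up helper name, not something provided by Singularity or SMARTS.

```shell
# Sketch: refuse to build the --env flag when X11 forwarding is not active.
# display_env_flag is a hypothetical helper name.

display_env_flag() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is empty; reconnect with 'ssh -Y' before starting the container" >&2
        return 1
    fi
    # Pass the host's DISPLAY value through to the container unchanged.
    echo "--env DISPLAY=${DISPLAY}"
}

# Usage (on Compute Canada, after 'module load singularity'):
#   flag="$(display_env_flag)" && \
#       singularity shell --bind "$PWD":/SMARTS $flag smarts-0416_singularity.sif
```

The flag is left unquoted in the usage line on purpose so that it splits into the two arguments that singularity expects.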
Okay great, thank you @reneeleung! I will update the instructions.
Hey! Is anyone encountering this error while executing supervisord:
ImportError: libmkl_rt.so: cannot open shared object file: No such file or directory
Thanks, @christianjans. Some cache files were stored in my home directory, causing some path issues and thus this error. Never mind, this was a mistake on my end. The steps given above work perfectly.
Hi @Dikshuy, okay great to hear! Glad the issue was resolved.
Thanks, @christianjans, for the help here. I can now run salloc and srun with X11 forwarding. However, it seems that sbatch does not support X11 forwarding. Is it possible to resolve this (hence running the training job for SMARTS using sbatch)?
@2017soft,
How can we install some additional packages in the container?
I believe if you also bind the /etc directory of the Compute Canada machine to the /etc directory of the Singularity container, then the certificates needed to install extra packages should be found. That is, run it like this:
$ singularity shell --bind /etc/,SMARTS/:/SMARTS smarts-0416_singularity.si
And then try pip install ... or whatever.
This works for some packages, but I know when I tried to set up SMARTS this way, some packages wouldn't install. If this is the case for you, perhaps I would try changing the Dockerfile of SMARTS so that it also installs the packages you need. Then you can build and push this new image to your DockerHub (see http://jsta.github.io/r-docker-tutorial/04-Dockerhub.html), and then use this image instead of huaweinoah/smarts:v0.4.16.
Hi! When I bind /etc/, my python and pip are no longer found:
Please help
Hi @mansur007, try binding just the certificate, something like:
singularity shell --bind /etc/pki/tls/certs/ca-bundle.crt,SMARTS:/SMARTS --env DISPLAY=$DISPLAY smarts-0416_singularity.sif
I will update the comment.
However, it seems that sbatch does not support X11 forwarding. Is it possible to resolve this (hence running the training job for SMARTS using sbatch)?
Good point @2017soft, I will look into this as well.
@2017soft, I am able to run a SMARTS experiment using sbatch with the following job script:
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=8
#SBATCH --ntasks=1
#SBATCH --output=slurm-%j.out
# This file should be in the /SMARTS directory, the Singularity file should
# be outside of the /SMARTS directory.
module load singularity
singularity exec --bind ../SMARTS/:/SMARTS --env DISPLAY=$DISPLAY,PYTHONPATH=/SMARTS:/src --home /SMARTS ../smarts-0416_singularity.sif supervisord
Feel free to let me know of any problems you encounter.
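For reference, submitting the job script above and locating its log might look like the sketch below. It assumes the script is saved as job.sh (a placeholder name) and relies on sbatch's usual "Submitted batch job <id>" message together with the --output=slurm-%j.out setting in the script. Both helper names are made up for illustration.

```shell
# Sketch: map sbatch's confirmation message to the log file written by the
# job script above (--output=slurm-%j.out). job.sh is a placeholder name.

job_id_from_sbatch_output() {
    # sbatch normally prints: "Submitted batch job <id>"
    # The job ID is the fourth whitespace-separated field of that line.
    echo "$1" | awk '/^Submitted batch job/ { print $4 }'
}

slurm_log_name() {
    # %j in --output expands to the job ID.
    echo "slurm-$1.out"
}

# Usage on Compute Canada:
#   msg="$(sbatch job.sh)"
#   id="$(job_id_from_sbatch_output "$msg")"
#   squeue -j "$id"                     # is the job pending or running?
#   tail -f "$(slurm_log_name "$id")"   # follow the output once it starts
```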
@christianjans I just tried it and unfortunately it does not work for my training script. I think the above command works because it does not render camera observations. If any of rgb, ogm, drivable_area_grid_map, and lidar are set to True in AgentInterface, then sbatch will no longer work because it does not have DISPLAY. My training script needs the images from the camera observations.
@2017soft, ohh okay I see what you mean. But you were able to get camera observations to work when using salloc and srun? I have just tried it:
$ ssh cjans@graham.computecanada.ca -Y -L localhost:8081:localhost:8081
$ cd ~/projects/def-<sponsor>/cjans/smarts_singularity/SMARTS/
$ salloc --time=0:30:0 --mem=16G --cpus-per-task=8 --ntasks=1 --x11
$ module load singularity
$ singularity shell --bind ../SMARTS:/SMARTS --env DISPLAY=$DISPLAY ../smarts-0416_singularity.sif
Singularity> cd /SMARTS/
Singularity> export PYTHONPATH=/SMARTS/:$PYTHONPATH
Singularity> supervisord
Where supervisord runs a custom example script similar to single_agent.py, but the agent has the TopDownRGB in its observation, and I get:
╭────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────╮
│ Episode │ Sim T / Wall T │ Total Steps │ Steps / Sec │ Scenario Map │ Scenario Routes │ Mission (Hash) │ Scores │
├────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┤
Retrying in 0.05 seconds
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
:display:x11display(error): GLXBadContext
:display:glxdisplay(error): Could not find a usable pixel format.
:ShowBase(warning): Unable to open 'offscreen' window.
ERROR:SMARTS:Simulation crashed with exception. Attempting to cleanly shutdown.
ERROR:SMARTS:Error in initializing framework for opening graphical display and creating scene graph. A typical reason is display not found. Try running with different configurations of `export DISPLAY=` using `:0`, `:1`... . If this does not work please consult the documentation.
Exception was: {e}
Traceback (most recent call last):
File "/SMARTS/smarts/core/renderer.py", line 95, in init
super().__init__(windowType="offscreen")
File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 338, in __init__
self.openDefaultWindow(startDirect = False, props=props)
File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 1020, in openDefaultWindow
self.openMainWindow(*args, **kw)
File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 1055, in openMainWindow
self.openWindow(*args, **kw)
File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 800, in openWindow
raise Exception('Could not open window.')
Exception: Could not open window.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/SMARTS/smarts/core/smarts.py", line 136, in step
return self._step(agent_actions)
File "/SMARTS/smarts/core/smarts.py", line 194, in _step
self._trap_manager.step(self)
File "/SMARTS/smarts/core/trap_manager.py", line 207, in step
sim, agent_id, trap.mission, trap.default_entry_speed
File "/SMARTS/smarts/core/trap_manager.py", line 289, in _make_vehicle
boid=False,
File "/SMARTS/smarts/core/vehicle_index.py", line 581, in build_agent_vehicle
hijacking=False,
File "/SMARTS/smarts/core/utils/cache.py", line 130, in wrapper
return func(self, *args, **kwargs)
File "/SMARTS/smarts/core/vehicle_index.py", line 602, in _enfranchise_actor
sim, vehicle, agent_interface, sensor_state.mission_planner
File "/SMARTS/smarts/core/vehicle.py", line 470, in attach_sensors_to_vehicle
if not sim.renderer:
File "/SMARTS/smarts/core/smarts.py", line 439, in renderer
self._renderer = Renderer(self._sim_id)
File "/SMARTS/smarts/core/renderer.py", line 161, in __init__
self._showbase_instance = _ShowBaseInstance()
File "/SMARTS/smarts/core/renderer.py", line 85, in __new__
it.init()
File "/SMARTS/smarts/core/renderer.py", line 110, in init
) from e
smarts.core.renderer.RendererException: Error in initializing framework for opening graphical display and creating scene graph. A typical reason is display not found. Try running with different configurations of `export DISPLAY=` using `:0`, `:1`... . If this does not work please consult the documentation.
Exception was: {e}
Were these the steps you took to get image observations to work? If so, no worries, I will continue to look into it. Otherwise, what did you do differently?
@christianjans Yes, salloc and srun do work for camera observations on my end.
And yes, I used similar commands to run experiments with camera observations as well. To run salloc and srun successfully, I think I have to downgrade my XQuartz from 2.8.1 to 2.7.8 as well on my MacBook.
Oh that's right, thanks! I just downgraded as well, and it seems to be working fine when allocating resources with salloc and srun 👍🏻.
@christianjans I wonder if it is possible to just turn off the DISPLAY. When I run the experiments using GPU, the following errors show up:
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
It just keeps showing this error :display:gsg:glgsg(error): Unable to detect OpenGL version
@2017soft,
I can now run salloc and srun with X11 forwarding. However, it seems that sbatch does not support X11 forwarding. Is it possible to resolve this (hence running the training job for SMARTS using sbatch)?
I have been looking into this too, and I think your assumption is correct: when resources are allocated through sbatch, X11 forwarding cannot be done. However, if this is needed, I would contact Compute Canada support (https://docs.computecanada.ca/wiki/Technical_support) to see if they have any solutions.
I wonder if it is possible to just turn off the DISPLAY.
I have talked to @Gamenot about this, and in future versions of SMARTS, this display issue will not be a problem if camera observations are not needed. However, if camera observations are required by agents, then rendering will be needed in order to produce these observations, and this rendering needs this X11 DISPLAY environment variable (not sure about the exact details here).
When I run the experiments using GPU, ...
I have been looking into running SMARTS/ULTRA using the GPU on Compute Canada as well, and in order to do so, I...
- Used FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 as the base image so that the image would have NVIDIA drivers.
- salloc'd GPU resources by setting the --gres argument (see https://docs.computecanada.ca/wiki/Using_GPUs_with_Slurm).
- Ran singularity shell with the --nv flag (https://sylabs.io/guides/3.5/user-guide/gpu.html) on the GPU compute node.
If your process to get SMARTS running with GPUs on Compute Canada is different, can you describe what you did?
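The allocation and container steps listed above can be sketched as follows. The time, memory, and CPU values are copied from the earlier salloc example in this thread, while gpu:1 and the image name smarts-gpu.sif are assumptions made for illustration. The helper only prints the commands; nothing is executed.

```shell
# Sketch: prints the interactive GPU workflow described above.
# "gpu:1" is an assumed request and "smarts-gpu.sif" is a hypothetical
# name for the Singularity image built from the CUDA base image.

print_gpu_session_steps() {
    # $1: path to the GPU-enabled Singularity image

    # Request an interactive allocation with one GPU and X11 forwarding.
    echo "salloc --time=0:30:0 --mem=16G --cpus-per-task=8 --ntasks=1 --gres=gpu:1 --x11"
    echo "module load singularity"

    # --nv exposes the host's NVIDIA driver libraries inside the container.
    echo "singularity shell --nv --bind ../SMARTS:/SMARTS --env DISPLAY=\$DISPLAY $1"
}

print_gpu_session_steps "../smarts-gpu.sif"
```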
When I try to run a script that has an agent with a PyTorch network, it sometimes succeeds, but I also sometimes get that :display:gsg:glgsg(error): Unable to detect OpenGL version error too. I found that if I get this error, I can exit out of the current Singularity container I am in, then simply restart the Singularity container (with singularity shell --nv ...), and then it sometimes works, but other times still does not.
So I am not sure what is going on here with this error. I will try to look into it more. And again, Compute Canada support might have some ideas too.
@christianjans I have already got SMARTS running with GPU on Compute Canada. I used more or less the same process except I used nvidia/cudagl:11.0.3-devel-ubuntu18.04 as the base Docker image. This has OpenGL installed and then we can install cudnn on top of the base image by using apt-get install -y cudnn8.
Ahh okay great @2017soft, are you still encountering this error?
Also, just for your information, I have submitted #993 to hopefully get more information and help on running SMARTS with camera observations through a non-interactive job (i.e. when submitting jobs through sbatch).
@christianjans Thanks a lot for submitting the issue for non-interactive Slurm jobs. I have emailed Compute Canada as well and got a response from them today:
Unfortunately, there is no way to run jobs with sbatch and X11 forwarding. You have to find a way to run your code in batch mode. If there is no way to disable this feature, your application is not suitable for batch mode and you have to use only interactive jobs via salloc.
It looks like it may not be possible to run sbatch jobs with X11 forwarding. As you mentioned in the new issue, I also wonder how possible it is to remove the X11 dependency from SMARTS.
Also, by using nvidia/cudagl:11.0.3-devel-ubuntu18.04 as the base image, I no longer get the OpenGL error :display:gsg:glgsg(error): Unable to detect OpenGL version. I can run the interactive training jobs without any issues.
Okay great, thanks for confirming with Compute Canada support. And oh awesome, so glad you resolved the OpenGL error! 👍
TODO: Introduction of this discussion
Feel free to put all questions, comments, etc. about running SMARTS and ULTRA on Compute Canada in this discussion issue.
Running SMARTS on Compute Canada
Running ULTRA on Compute Canada
Follow the steps above to obtain smarts-0416_singularity.sif and SMARTS/.