huawei-noah / SMARTS

Scalable Multi-Agent RL Training School for Autonomous Driving
MIT License

Running SMARTS/ULTRA on Compute Canada #963

Open christianjans opened 3 years ago

christianjans commented 3 years ago

TODO: Introduction of this discussion

Feel free to put all questions, comments, etc. about running SMARTS and ULTRA on Compute Canada in this discussion issue.

Running SMARTS on Compute Canada

# Login to Compute Canada with Trusted X11 Forwarding and the forwarded port for Envision.
$ ssh <user-name>@<cluster-name>.computecanada.ca -Y -L localhost:8081:localhost:8081

# On your Compute Canada login node, obtain the Docker image for SMARTS and compress it (taken from
# https://docs.computecanada.ca/wiki/Singularity#Creating_an_image_using_Docker_Hub_and_Dockerfile).
$ cd ~/scratch
$ wget https://raw.githubusercontent.com/moby/moby/master/contrib/download-frozen-image-v2.sh
$ sh download-frozen-image-v2.sh smarts-0416_docker huaweinoah/smarts:v0.4.16
$ cd smarts-0416_docker && tar cf ../smarts-0416_docker.tar * && cd ..

# Start an interactive job and build the Singularity container.
$ cd ~/scratch
$ salloc --mem-per-cpu=2000 --cpus-per-task=4 --time=2:0:0 --x11
$ module load singularity
$ singularity build smarts-0416_singularity.sif docker-archive://smarts-0416_docker.tar
$ exit  # Exit out of the interactive job once the Singularity container is built.

# Move the Singularity container back to your projects directory and clone SMARTS.
$ cd ~/scratch
$ mv smarts-0416_singularity.sif ~/projects/<sponsor-name>/<user-name>/
$ cd ~/projects/<sponsor-name>/<user-name>/
$ git clone https://github.com/huawei-noah/SMARTS.git

# Execute the Singularity container and bind your SMARTS directory to the /SMARTS directory in the container.
# Afterwards, go to the SMARTS directory in the container, modify the PYTHONPATH, and run an example!
$ cd ~/projects/<sponsor-name>/<user-name>/
$ singularity shell --bind SMARTS/:/SMARTS --env DISPLAY=$DISPLAY smarts-0416_singularity.sif
Singularity> cd /SMARTS
Singularity> export PYTHONPATH=/SMARTS:$PYTHONPATH
Singularity> supervisord

Running ULTRA on Compute Canada

Follow the steps above to obtain smarts-0416_singularity.sif and SMARTS/.

# Start an interactive job to run an ULTRA experiment.
$ salloc --time=1:0:0 --mem=16G --cpus-per-task=8 --ntasks=1
$ module load singularity
$ singularity shell --bind SMARTS/:/SMARTS --env DISPLAY=$DISPLAY smarts-0416_singularity.sif
Singularity> cd /SMARTS/ultra
Singularity> export PYTHONPATH=/SMARTS/ultra:/SMARTS/:$PYTHONPATH

# Follow instructions in https://github.com/huawei-noah/SMARTS/blob/master/ultra/docs/getting_started.md to
# run the experiment.

2017soft commented 3 years ago

How can we install some additional packages in the container? There seem to be some permission issues to do so on Compute Canada.

christianjans commented 3 years ago

@2017soft,

How can we install some additional packages in the container?

I believe if you also bind the certificates of the Compute Canada machine to the same directory of the Singularity container, then the certificates needed to install extra packages should be found. That is, run it like this:

$ singularity shell --bind /etc/pki/tls/certs/ca-bundle.crt,SMARTS/:/SMARTS smarts-0416_singularity.sif

And then try pip install ... or whatever.

This works for some packages, but when I tried to set up SMARTS this way, some packages wouldn't install. If that is the case for you, I would try changing the Dockerfile of SMARTS so that it also installs the packages you need: https://github.com/huawei-noah/SMARTS/blob/c5a16088152e840dbaa9afe302e8bb9338bc6a93/Dockerfile#L62-L63. Then you can build and push this new image to your DockerHub (see http://jsta.github.io/r-docker-tutorial/04-Dockerhub.html) and use it instead of huaweinoah/smarts:v0.4.16.

2017soft commented 3 years ago

Thanks for the instructions! I ended up using the Docker method and it seems to work now. However, when I tried to run SMARTS using a different algorithm, the following error shows up:

RendererException: Error in initializing framework for opening graphical display
and creating scene graph. A typical reason is display not found. Try running 
with different configurations of `export DISPLAY=` using `:0`, `:1`... . If this
does not work please consult the documentation.
Exception was: {e}
ERROR:RemoteAgentBuffer:Exception while tearing down buffered remote agent. ValueError('Cannot invoke RPC on closed channel!')

I have tried export DISPLAY= with :0, :1, ..., but none of them worked. What might be happening here? There seems to be a similar issue mentioned here: https://github.com/huawei-noah/SMARTS/issues/786

christianjans commented 3 years ago

Hi @2017soft, no worries, glad it seemed to work. Regarding your issue, yes, we have seen that before. Unfortunately, I do not get this error when running SMARTS using Singularity on Compute Canada. Could you provide some more information? For example, what command did you use to SSH into Compute Canada? When running this experiment, did you use a compute node (instead of a login node) on Compute Canada? What command did you run to execute this experiment?

@Gamenot, do you have any comments or advice?

2017soft commented 3 years ago

Hi @christianjans, I used ssh -Y userID@graham.computecanada.ca to log in to Compute Canada.

I have tried both compute node and login node, but they both gave the same error.

I was running code from another Git repository, but I imported some SMARTS modules into some of the files in that repository.

reneeleung commented 3 years ago

@2017soft would this work? singularity shell --bind $PWD:/SMARTS --env DISPLAY=$DISPLAY smarts-0416_singularity.sif

2017soft commented 3 years ago

Hi @reneeleung @christianjans, I have just figured out the problem. It turns out that I have to downgrade XQuartz from 2.8.1 to 2.7.8 on my MacBook. The DISPLAY starts working after that.

2017soft commented 3 years ago

Also, @reneeleung is right. We have to set DISPLAY inside Singularity to be equal to the DISPLAY outside the container as well.
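
The point above, that DISPLAY inside the container has to match the DISPLAY on the host, can be checked before launching the container. This is only a minimal sketch for a POSIX shell; `require_display` and `smarts_shell_cmd` are hypothetical helper names and are not part of SMARTS or Singularity.

```shell
# Hypothetical helper: fail early when X11 forwarding did not set DISPLAY.
require_display() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set; log in with 'ssh -Y' and pass '--x11' to salloc." >&2
        return 1
    fi
}

# Hypothetical helper: print the singularity command used in this thread,
# forwarding the host's current DISPLAY value into the container.
# $1 = host directory to bind at /SMARTS, $2 = path to the .sif image.
smarts_shell_cmd() {
    printf 'singularity shell --bind %s:/SMARTS --env DISPLAY=%s %s\n' \
        "$1" "${DISPLAY:-}" "$2"
}
```

One possible use is `require_display && smarts_shell_cmd "$PWD" smarts-0416_singularity.sif`, which prints the command for review instead of running it.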

christianjans commented 3 years ago

Okay great, thank you @reneeleung! I will update the instructions.

Dikshuy commented 3 years ago

Hey! Is anyone encountering this error while executing supervisord: ImportError: libmkl_rt.so: cannot open shared object file: No such file or directory

Dikshuy commented 3 years ago

Thanks, @christianjans. Some cache files were stored in my home directory, causing path issues and thus this error. Never mind, this was a mistake on my end. The steps given above work perfectly.

christianjans commented 3 years ago

Hi @Dikshuy, okay great to hear! Glad the issue was resolved.

2017soft commented 3 years ago

Thanks, @christianjans for the help here. I can now run salloc and srun with X11 forwarding. However, it seems that sbatch does not support X11 forwarding. Is it possible to resolve this (hence running the training job for SMARTS using sbatch)?

mansur007 commented 3 years ago

@2017soft,

How can we install some additional packages in the container?

I believe if you also bind the /etc directory of the Compute Canada machine to the /etc directory of the Singularity container, then the certificates needed to install extra packages should be found. That is, run it like this:

$ singularity shell --bind /etc/,SMARTS/:/SMARTS smarts-0416_singularity.si

And then try pip install ... or whatever.

This works for some packages, but I know when I tried to setup SMARTS this way, some packages wouldn't install. If this is the case for you, perhaps I would try changing the Dockerfile of SMARTS so that it also installs the packages you need.

https://github.com/huawei-noah/SMARTS/blob/c5a16088152e840dbaa9afe302e8bb9338bc6a93/Dockerfile#L62-L63

Then you can build and push this new image to your DockerHub (see http://jsta.github.io/r-docker-tutorial/04-Dockerhub.html), and then use this image instead of huaweinoah/smarts:v0.4.16.

Hi! When I bind /etc/, my python and pip are no longer found.

Please help

christianjans commented 3 years ago

Hi @mansur007, try binding just the certificate, something like:

singularity shell --bind /etc/pki/tls/certs/ca-bundle.crt,SMARTS:/SMARTS --env DISPLAY=$DISPLAY smarts-0416_singularity.sif

I will update the comment.
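
As a hedged sketch of the certificate-only bind suggested above: the bundle path is the one quoted in this thread and may differ on other clusters, and `smarts_bind_arg` is a hypothetical helper name. It relies on Singularity binding a source path to the same path inside the container when no destination is given.

```shell
# Path taken from the comment above; override CA_BUNDLE if your cluster differs.
CA_BUNDLE="${CA_BUNDLE:-/etc/pki/tls/certs/ca-bundle.crt}"

# Hypothetical helper: print a --bind value that maps the SMARTS checkout to
# /SMARTS and, only when the bundle file exists on the host, also binds the
# bundle at the same path inside the container.
# $1 = path to the SMARTS checkout on the host.
smarts_bind_arg() {
    if [ -f "$CA_BUNDLE" ]; then
        printf '%s,%s:/SMARTS\n' "$CA_BUNDLE" "$1"
    else
        printf '%s:/SMARTS\n' "$1"
    fi
}
```

It could be used as `singularity shell --bind "$(smarts_bind_arg SMARTS)" --env DISPLAY=$DISPLAY smarts-0416_singularity.sif`. Skipping the bundle when it is missing avoids a bind failure, at the cost of pip possibly not finding certificates.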

christianjans commented 3 years ago

However, it seems that sbatch does not support X11 forwarding. Is it possible to resolve this (hence running the training job for SMARTS using sbatch)?

Good point @2017soft, I will look into this as well.

christianjans commented 3 years ago

@2017soft, I am able to run a SMARTS experiment using sbatch with the following job script:

#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH --mem=16G
#SBATCH --cpus-per-task=8
#SBATCH --ntasks=1
#SBATCH --output=slurm-%j.out

# This file should be in the SMARTS directory; the Singularity image should be
# in its parent directory (both are referenced below via ../).

module load singularity

singularity exec --bind ../SMARTS/:/SMARTS --env DISPLAY=$DISPLAY,PYTHONPATH=/SMARTS:/src --home /SMARTS ../smarts-0416_singularity.sif supervisord

Feel free to let me know of any problems you encounter.
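
For submitting and following a job script like the one above, here is a minimal sketch. It assumes `sbatch --parsable`, which prints `<jobid>` or `<jobid>;<cluster>`, and the `--output=slurm-%j.out` pattern from the script. The helper names and the file name `job.sh` are hypothetical.

```shell
# Hypothetical helper: extract the job id from `sbatch --parsable` output,
# which is either "<jobid>" or "<jobid>;<cluster>".
job_id_from_parsable() {
    printf '%s\n' "${1%%;*}"
}

# Hypothetical helper: name of the log file that --output=slurm-%j.out produces
# for a given job id.
slurm_out_file() {
    printf 'slurm-%s.out\n' "$1"
}

# Possible usage on a login node (not executed here):
#   jid=$(job_id_from_parsable "$(sbatch --parsable job.sh)")
#   squeue -j "$jid"
#   tail -f "$(slurm_out_file "$jid")"
```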

2017soft commented 3 years ago

@christianjans I just tried it and unfortunately it does not work for my training script. I think the above command works because it does not render camera observations. If any of rgb, ogm, drivable_area_grid_map, and lidar is set to True in AgentInterface, then the sbatch job no longer works because it has no DISPLAY. My training script needs the images from the camera observations.

christianjans commented 3 years ago

@2017soft, ohh okay I see what you mean. But you were able to get camera observations to work when using salloc and srun? I have just tried it:

$ ssh cjans@graham.computecanada.ca -Y -L localhost:8081:localhost:8081
$ cd ~/projects/def-<sponsor>/cjans/smarts_singularity/SMARTS/
$ salloc --time=0:30:0 --mem=16G --cpus-per-task=8 --ntasks=1 --x11
$ module load singularity
$ singularity shell --bind ../SMARTS:/SMARTS --env DISPLAY=$DISPLAY ../smarts-0416_singularity.sif
Singularity> cd /SMARTS/
Singularity> export PYTHONPATH=/SMARTS/:$PYTHONPATH
Singularity> supervisord

Where supervisord runs a custom example script similar to single_agent.py, but the agent has the TopDownRGB in its observation, and I get:

╭────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────┬────────────────────╮
│            Episode │     Sim T / Wall T │        Total Steps │        Steps / Sec │       Scenario Map │    Scenario Routes │     Mission (Hash) │             Scores │
├────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┼────────────────────┤
 Retrying in 0.05 seconds
libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
:display:x11display(error): GLXBadContext
:display:glxdisplay(error): Could not find a usable pixel format.
:ShowBase(warning): Unable to open 'offscreen' window.
ERROR:SMARTS:Simulation crashed with exception. Attempting to cleanly shutdown.
ERROR:SMARTS:Error in initializing framework for opening graphical display and creating scene graph. A typical reason is display not found. Try running with different configurations of `export DISPLAY=` using `:0`, `:1`... . If this does not work please consult the documentation.
Exception was: {e}
Traceback (most recent call last):
  File "/SMARTS/smarts/core/renderer.py", line 95, in init
    super().__init__(windowType="offscreen")
  File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 338, in __init__
    self.openDefaultWindow(startDirect = False, props=props)
  File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 1020, in openDefaultWindow
    self.openMainWindow(*args, **kw)
  File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 1055, in openMainWindow
    self.openWindow(*args, **kw)
  File "/usr/local/lib/python3.7/dist-packages/direct/showbase/ShowBase.py", line 800, in openWindow
    raise Exception('Could not open window.')
Exception: Could not open window.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/SMARTS/smarts/core/smarts.py", line 136, in step
    return self._step(agent_actions)
  File "/SMARTS/smarts/core/smarts.py", line 194, in _step
    self._trap_manager.step(self)
  File "/SMARTS/smarts/core/trap_manager.py", line 207, in step
    sim, agent_id, trap.mission, trap.default_entry_speed
  File "/SMARTS/smarts/core/trap_manager.py", line 289, in _make_vehicle
    boid=False,
  File "/SMARTS/smarts/core/vehicle_index.py", line 581, in build_agent_vehicle
    hijacking=False,
  File "/SMARTS/smarts/core/utils/cache.py", line 130, in wrapper
    return func(self, *args, **kwargs)
  File "/SMARTS/smarts/core/vehicle_index.py", line 602, in _enfranchise_actor
    sim, vehicle, agent_interface, sensor_state.mission_planner
  File "/SMARTS/smarts/core/vehicle.py", line 470, in attach_sensors_to_vehicle
    if not sim.renderer:
  File "/SMARTS/smarts/core/smarts.py", line 439, in renderer
    self._renderer = Renderer(self._sim_id)
  File "/SMARTS/smarts/core/renderer.py", line 161, in __init__
    self._showbase_instance = _ShowBaseInstance()
  File "/SMARTS/smarts/core/renderer.py", line 85, in __new__
    it.init()
  File "/SMARTS/smarts/core/renderer.py", line 110, in init
    ) from e
smarts.core.renderer.RendererException: Error in initializing framework for opening graphical display and creating scene graph. A typical reason is display not found. Try running with different configurations of `export DISPLAY=` using `:0`, `:1`... . If this does not work please consult the documentation.
Exception was: {e}

Were these the steps you took to get image observations to work? If so, no worries, I will continue to look into it. Otherwise, what did you do differently?

2017soft commented 3 years ago

@christianjans Yes, salloc and srun do work for camera observations on my end.

And yes, I used similar commands to run experiments with camera observations as well. To run salloc and srun successfully, I think I have to downgrade my XQuartz from 2.8.1 to 2.7.8 as well on my MacBook.

christianjans commented 3 years ago

Oh that's right, thanks! I just downgraded as well, and it seems to be working fine when allocating resources with salloc and srun 👍🏻.

2017soft commented 3 years ago

@christianjans I wonder if it is possible to just turn off the DISPLAY. When I run the experiments using GPU, the following errors show up:

libGL error: No matching fbConfigs or visuals found
libGL error: failed to load driver: swrast
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version
:display:gsg:glgsg(error): Unable to detect OpenGL version

It just keeps showing this error :display:gsg:glgsg(error): Unable to detect OpenGL version

christianjans commented 3 years ago

@2017soft,

I can now run salloc and srun with X11 forwarding. However, it seems that sbatch does not support X11 forwarding. Is it possible to resolve this (hence running the training job for SMARTS using sbatch)?

I have been looking into this too, and I think your assumption is correct: when resources are allocated through sbatch, X11 forwarding is not available. However, if this is needed, I would contact Compute Canada support (https://docs.computecanada.ca/wiki/Technical_support) to see if they have any solutions.

I wonder if it is possible to just turn off the DISPLAY.

I have talked to @Gamenot about this, and in future versions of SMARTS, this display issue will not be a problem if camera observations are not needed. However, if camera observations are required by agents, then rendering will be needed in order to produce these observations, and this rendering needs this X11 DISPLAY environment variable (not sure about the exact details here).

When I run the experiments using GPU, ...

I have been looking into running SMARTS/ULTRA using the GPU on Compute Canada as well, and in order to do so, I...

  1. Had to modify the first line of the Dockerfile to be FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 so that the image would have NVIDIA drivers.
  2. Rebuilt the Docker image with this new Docker base image.
  3. Uploaded this new image to my DockerHub.
  4. Created the Singularity container of this Docker image on the Compute Canada machine.
  5. salloc'd GPU resources by setting the --gres argument (see https://docs.computecanada.ca/wiki/Using_GPUs_with_Slurm).
  6. Ran singularity shell with the --nv flag (https://sylabs.io/guides/3.5/user-guide/gpu.html) on the GPU compute node.
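
The steps above could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands that were run: the image name passed in, the `.sif` basename, and all helper names are placeholders, and `DRY_RUN=1` (the default) only prints each command.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Steps 2-3, on a machine with Docker: rebuild from the edited Dockerfile, push.
# $1 = image name on your DockerHub (placeholder).
build_and_push() {
    run docker build -t "$1" .
    run docker push "$1"
}

# Step 4, on Compute Canada: same download-and-build route as in the
# instructions at the top of this thread.
# $1 = image name, $2 = basename for the directory, .tar, and .sif files.
build_sif() {
    run sh download-frozen-image-v2.sh "${2}_docker" "$1"
    run sh -c "cd ${2}_docker && tar cf ../${2}_docker.tar *"
    run singularity build "${2}.sif" "docker-archive://${2}_docker.tar"
}

# Step 5 is interactive, so it is not wrapped; for example:
#   salloc --time=1:0:0 --mem=16G --cpus-per-task=8 --gres=gpu:1 --x11
# Step 6, inside that allocation: start the container with --nv.
# $1 = path to the .sif image.
gpu_shell() {
    run singularity shell --nv --bind SMARTS/:/SMARTS \
        --env "DISPLAY=${DISPLAY:-}" "$1"
}
```

Keeping dry-run as the default makes it possible to review the command sequence on a login node before committing build time or a GPU allocation.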

If your process to get SMARTS running with GPUs on Compute Canada is different, can you describe what you did?

When I try and run a script that has an agent with a PyTorch network, it sometimes succeeds, but I also sometimes get that :display:gsg:glgsg(error): Unable to detect OpenGL version error too. I found that if I get this error, I can exit out of the current Singularity container I am in, then simply restart the Singularity container (with singularity shell --nv ...) and then it sometimes works, but other times still does not.

So I am not sure what is going on here with this error, I will try and look into it more. And again, Compute Canada support might have some ideas too.

2017soft commented 3 years ago

@christianjans I have already got SMARTS running with GPU on Compute Canada. I used more or less the same process except I used nvidia/cudagl:11.0.3-devel-ubuntu18.04 as the base Docker image. This has OpenGL installed and then we can install cudnn on top of the base image by using apt-get install -y cudnn8.

christianjans commented 3 years ago

Ahh okay great @2017soft, are you still encountering this error?

Also, just for your information, I have submitted #993 to hopefully get more information and help on running SMARTS with camera observations through a non-interactive job (i.e. when submitting jobs through sbatch).

2017soft commented 3 years ago

@christianjans Thanks a lot for submitting the issue for non-interactive Slurm jobs. I have emailed Compute Canada as well and got a response from them today:

Unfortunately, there is no way to run jobs with sbatch and X11 forwarding. You have to find a way to run your code in batch mode. If there is no way to disable this feature, your application is not suitable for batch mode and you have to use only interactive jobs via salloc.

It looks like it may not be possible to run sbatch jobs with X11 forwarding. As you mentioned in the new issue, I also wonder how possible it is to remove the X11 dependency from SMARTS.

Also, by using nvidia/cudagl:11.0.3-devel-ubuntu18.04 as the base image, I no longer get the OpenGL error :display:gsg:glgsg(error): Unable to detect OpenGL version. I can run the interactive training jobs without any issues.

christianjans commented 3 years ago

Okay great, thanks for confirming with Compute Canada support. And oh awesome, so glad you resolved the OpenGL error! 👍