Warp Drive PyCuda Error

mhelabd commented 2 years ago

I am currently running a training script using warp-drive.

I have my environment initialized in this dockerfile.

When running my training_script, I get the following error:

python training_script.py --env simple_wood_and_stone

Inside training_script.py: 1 GPUs are available.
Inside env_wrapper.py: 1 GPUs are available.
/home/miniconda/lib/python3.7/site-packages/torch/cuda/__init__.py:120: UserWarning:
    Found GPU%d %s which is of cuda capability %d.%d.
    PyTorch no longer supports this GPU because it is too old.
    The minimum cuda capability supported by this library is %d.%d.

  warnings.warn(old_gpu_warn.format(d, name, major, minor, min_arch // 10, min_arch % 10))
Initializing the CUDA data manager...
Initializing the CUDA function manager...
WARNING:root:the destination header file /home/miniconda/lib/python3.7/site-packages/warp_drive/cuda_includes/env_config.h already exists; remove and rebuild.
WARNING:root:the destination runner file /home/miniconda/lib/python3.7/site-packages/warp_drive/cuda_includes/env_runner.cu already exists; remove and rebuild.
Traceback (most recent call last):
  File "training_script.py", line 109, in <module>
    customized_env_registrar=env_registry,
  File "/home/miniconda/lib/python3.7/site-packages/ai_economist/foundation/env_wrapper.py", line 208, in __init__
    self.cuda_function_manager.initialize_functions([step_function])
  File "/home/miniconda/lib/python3.7/site-packages/warp_drive/managers/function_manager.py", line 330, in initialize_functions
    self._cuda_functions[fname] = self._CUDA_module.get_function(fname)
pycuda._driver.LogicError: cuModuleGetFunction failed: named symbol not found

Has anyone run into this before, or does anyone have an idea how to fix it?

Emerald01 commented 2 years ago

Your runtime environment looks fine. The error comes from the lines below: WarpDrive cannot find the source code for your environment's step() function in a .cu file, so it cannot initialize your step function.

self._cuda_functions[fname] = self._CUDA_module.get_function(fname)
pycuda._driver.LogicError: cuModuleGetFunction failed: named symbol not found

In the code, this happens in the Foundation wrapper, shown below. You need a step kernel function named f"Cuda{self.name}Step" in a .cu source file, and that file must be registered with the environment registrar (passed in as customized_env_registrar) under its absolute path.

            self.cuda_function_manager.compile_and_load_cuda(
                env_name=self.name,
                template_header_file="template_env_config.h",
                template_runner_file="template_env_runner.cu",
                customized_env_registrar=customized_env_registrar,
            )
            print("initialize_functions...")

            step_function = f"Cuda{self.name}Step"
            self.cuda_function_manager.initialize_functions([step_function])
            self.env.cuda_step = self.cuda_function_manager.get_function(step_function)
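
To narrow down a "named symbol not found" error, it can help to check whether the .cu source actually defines a kernel with the name the wrapper looks up. The following is only a rough sketch, not part of WarpDrive: the helper names and the "MyEnv" environment name are hypothetical, and the regular expression assumes the kernel is declared in the usual `__global__ void Name(` form.

```python
import re


def expected_step_kernel(env_name):
    """Build the kernel name the wrapper requests: f"Cuda{self.name}Step"."""
    return f"Cuda{env_name}Step"


def defines_kernel(cu_source, kernel_name):
    """Return True if the CUDA source text declares `__global__ void <kernel_name>(`.

    This is a plain text search, so it only catches the common declaration
    style; it does not compile or parse the source.
    """
    pattern = rf"__global__\s+void\s+{re.escape(kernel_name)}\s*\("
    return re.search(pattern, cu_source) is not None


# Hypothetical source text standing in for the contents of a .cu file.
sample_source = """
extern "C" {
__global__ void CudaMyEnvStep(float* obs) {}
}
"""

kernel = expected_step_kernel("MyEnv")  # "MyEnv" is a placeholder env name
print(kernel, defines_kernel(sample_source, kernel))
```

In practice you would read the text of your own .cu file and pass in your environment's name attribute. If the check fails, the kernel is either missing or named differently from what the wrapper expects.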

Please let me know if you have any problem, I am more than happy to help.

sunil-s commented 2 years ago

Thanks for your question, @mhelabd. Adding to @Emerald01's response: to run the simple_wood_and_stone environment with WarpDrive, you would first need to create a CUDA version of the environment. To get started, please see our tutorial: https://github.com/salesforce/ai-economist/blob/master/tutorials/multi_agent_gpu_training_with_warp_drive.ipynb. It shows how to build and train your environment end-to-end with WarpDrive, and also points out nuances such as how to name your GPU kernels.

Also, in your current training script, you are pointing to "../foundation/scenarios/covid19/covid19_build.cu", which only contains the paths to the source files for the COVID and economy environment, not for simple_wood_and_stone.
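
A quick way to see which sources a build file pulls in is to list the paths it references. This is a minimal sketch under an assumption: that the build file references its sources through `#include "..."` lines. The helper names and the sample content are hypothetical and are not copied from the actual covid19_build.cu.

```python
import re
from typing import List


def included_sources(build_text: str) -> List[str]:
    """Collect the quoted paths from `#include "..."` lines, in file order."""
    return re.findall(r'^\s*#include\s+"([^"]+)"', build_text, flags=re.MULTILINE)


def covers_env(build_text: str, env_keyword: str) -> bool:
    """Return True if any included path contains the given keyword."""
    return any(env_keyword in path for path in included_sources(build_text))


# Hypothetical build-file content, for illustration only.
sample_build = '''
#include "covid19_env_step.cu"
#include "../../components/covid19_components_step.cu"
'''

print(included_sources(sample_build))
print(covers_env(sample_build, "wood_and_stone"))
```

If no included path mentions your environment, the compiled module will not contain its step kernel, and the lookup in initialize_functions will fail as in the traceback above.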

In fact, we do not yet have a CUDA C version of the wood-and-stone environment that can run on a GPU with WarpDrive. If you would like to contribute to that environment, we would love to add it to the repository. Happy to answer any other questions. Thanks.

salesforce / warp-drive

Warp Drive PyCuda Error #28