shacklettbp / madrona


Invalid argument during initCuda on sim re-initialization [GPUDrive] #36

Closed by aaravpandya 3 months ago

aaravpandya commented 3 months ago

Hi, GPUDrive users have asked for hyperparameter sweep support. The goal is to initialize the sim multiple times in different processes as part of a wandb sweep, but we are getting the following error: Error at /home/aarav/gpudrive/external/madrona/src/mw/cuda_exec.cpp:260 in void madrona::setCudaHeapSize() invalid argument

I am guessing the issue is in the line REQ_CUDA(cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_size)); where cudaDeviceSetLimit is called a second time during `initCUDA`. I am inferring this from the documentation.

The easiest way to reproduce this error in GPUDrive is to run pygpudrive/env/env_torch.py and add another sim init at the end:

    del env 

    env = GPUDriveTorchEnv(
        config=env_config,
        scene_config=scene_config,
        max_cont_agents=MAX_NUM_OBJECTS,  # Number of agents to control
        device="cuda",
        render_config=render_config,
    )

I am not sure if this requires changes in madrona or we could do something in GPUDrive. Please let me know what the options are.

Thanks

shacklettbp commented 3 months ago

I'm a bit confused because multiple copies of Madrona in different processes should work (especially on multiple GPUs).

The repro you suggest sounds like multiple copies of Madrona in the same process. There are multiple issues here. Like you said, in current GPUDrive I think this will call some initialization code multiple times (initCUDA in particular) that should only be called at process start. It would be straightforward for GPUDrive to expose the initCUDA call to Python separately.
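To illustrate the "only call it once per process" idea, here is a minimal Python sketch of a run-once guard. Everything in it is hypothetical: `OncePerProcess` and `fake_init_cuda` are illustrative names, not GPUDrive or Madrona APIs, and the stand-in function only records that it was called. It assumes the one-time setup has been exposed to Python as a plain callable.

```python
import threading


class OncePerProcess:
    """Wrap a setup function so it runs at most once in this process."""

    def __init__(self, init_fn):
        self._init_fn = init_fn
        self._lock = threading.Lock()
        self._done = False

    def __call__(self):
        # The lock keeps two threads from both running the setup.
        with self._lock:
            if not self._done:
                self._init_fn()
                # Only mark as done if the setup did not raise.
                self._done = True


# Stand-in for the real one-time setup (hypothetical, records calls only).
calls = []


def fake_init_cuda():
    calls.append("init")


ensure_init = OncePerProcess(fake_init_cuda)
ensure_init()  # runs the setup
ensure_init()  # no-op: setup already ran in this process
```

With a guard like this, each env constructor could call `ensure_init()` safely. It would only address the repeated initialization, not the memory that is left behind when an env is deleted.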

The bigger issue is that Madrona doesn't properly clean up all memory on deinitialization, so the call to del env in your example will leak a ton of memory. This is a known issue in Madrona that I haven't had time to fix. It's not technically challenging, but it will require a pretty thorough sweep through the ECS routines in the Madrona codebase to ensure everything is cleaned up when the executor is destroyed.

aaravpandya commented 3 months ago

You are correct. I am able to make multiple copies of Madrona in different processes, so I am closing this issue. I was under the impression that wandb initializes each sweep run in a different process (it does not, unless we use the CLI).
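For reference, a minimal sketch of running each trial in a fresh interpreter, so that process-level setup happens exactly once per trial and all memory is returned when the process exits. The child script, `run_trial`, and the `lr` parameter are hypothetical placeholders. A real trial would construct the env and train inside the child instead of just echoing its parameters.

```python
import json
import os
import subprocess
import sys

# Script executed by each child process. In a real sweep, this is where
# the simulator would be constructed and the training loop would run.
CHILD = """
import json, os, sys
params = json.loads(sys.argv[1])
print(json.dumps({"pid": os.getpid(), "lr": params["lr"]}))
"""


def run_trial(params):
    """Run one trial in a new Python process and return its JSON result."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, json.dumps(params)],
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return json.loads(proc.stdout)


# Each trial gets its own process, so nothing carries over between them.
results = [run_trial({"lr": lr}) for lr in (1e-3, 3e-4)]
```

Launching trials this way trades some startup time for isolation: nothing initialized by one trial is visible to the next.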

Thank you for the help. I'll look into exposing initCUDA separately and evaluate whether that would actually improve the GPUDrive experience. Given the memory leaks, I assume it would not be very practical.