real-stanford / diffusion_policy

[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
https://diffusion-policy.cs.columbia.edu/
MIT License

Question: virtual environment rendering/acceleration #17

Open · AlbertTan404 opened 10 months ago

AlbertTan404 commented 10 months ago

Hi there! Thanks for your impressive work and beautiful code :) I tried to run lift_image_abs with the transformer hybrid workspace HEADLESS, but it logged:

[root][INFO] Command '['/mambaforge/envs/robodiff/lib/python3.9/site-packages/egl_probe/build/test_device', '0']' returned non-zero exit status 1.
[root][INFO] - Device 0 is not available for rendering

and this message repeats for all 4 GPUs. Afterwards, I found that the "Eval LiftImage" process is really slow. Should I enable or install some driver for hardware acceleration?

nvidia-smi during eval (GPU-Util stays at 0%): [screenshot]

top during eval: [screenshot]

wandb monitor data: [screenshot]
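For context, the failing check in the log comes from the egl_probe package. A minimal way to query it directly and see which GPUs it considers usable for offscreen rendering (just a sketch, assuming egl_probe is importable in the robodiff env):

# Sketch: ask egl_probe (the package whose test_device binary fails in the
# log above) which GPUs it thinks can do offscreen EGL rendering.
# An empty list means none passed the probe.
import egl_probe

print('EGL-capable devices:', egl_probe.get_available_devices())

If the list is empty, the simulator could be falling back to much slower software rendering, which would be consistent with 0% GPU utilization during eval.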

cheng-chi commented 10 months ago

Hi @AlbertTan404, in my experience the eval process is CPU bound, so I'm surprised to find low CPU usage on your system during eval. I don't have experience dealing with this problem, but I suspect most of the time is spent inside the robomimic environments.
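One way to check whether the rollout workers are actually CPU bound is to watch per-core utilization while eval runs; a minimal sketch, assuming psutil is installed in the environment:

# Sketch: print per-core CPU utilization once per second while the eval
# rollout is running, to see whether any core is saturated.
import psutil

for _ in range(10):
    # interval=1.0 blocks for one second and reports usage over that window
    print(psutil.cpu_percent(interval=1.0, percpu=True))

If only one core is busy while the rest sit idle, that points to an affinity problem rather than slow environments.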

AlbertTan404 commented 10 months ago

> Hi @AlbertTan404, in my experience the eval process is CPU bound, so I'm surprised to find low CPU usage on your system during eval. I don't have experience dealing with this problem, but I suspect most of the time is spent inside the robomimic environments.

Thanks for your reply. I'll take a look at the inference process in the robomimic env.

cheng-chi commented 10 months ago

Hi @AlbertTan404, I recently encountered a similar issue on my machine as well. It turns out to be a bug in recent versions of PyTorch when installed through conda: https://github.com/pytorch/pytorch/issues/99625. This bug causes every subprocess created after import torch to inherit a CPU affinity pinned to the first CPU core, which squeezes all the dataloader workers and robomimic env workers onto that single core and drastically decreases performance. As described in the PyTorch issue, the solution is: conda install llvm-openmp=14

You can check if you are affected by running this script:

import multiprocessing as mp

def print_affinity(tag):
    import psutil
    print(tag, psutil.Process().cpu_affinity())

# Subprocess created before importing torch: should report all CPU cores.
p = mp.Process(target=print_affinity, args=('before import torch',))
p.start()
p.join()

import torch

# Subprocess created after importing torch: with the buggy llvm-openmp,
# the affinity collapses to a single core (e.g. [0]).
p = mp.Process(target=print_affinity, args=('after import torch',))
p.start()
p.join()

This is the result on my machine before and after the fix:

[Screenshots: cpu_affinity output before and after the fix]

I will pin the llvm-openmp version in this repo as well.
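If upgrading llvm-openmp is not immediately possible, one possible stopgap (a sketch, not part of this repo, Linux-only) is to reset the process affinity right after importing torch, before any workers are spawned:

import os
import torch  # importing torch is what (indirectly) shrinks the affinity

# Stopgap sketch: restore the full CPU set so that dataloader and robomimic
# env workers forked afterwards are not pinned to a single core.
os.sched_setaffinity(0, range(os.cpu_count()))

conda install llvm-openmp=14 remains the proper fix.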

AlbertTan404 commented 10 months ago

Great, thanks! I found it significantly speeds up the evaluation process. By the way, conda install llvm-openmp=14 takes a long time on my machine, while mamba install llvm-openmp=14 works much faster.

cheng-chi commented 10 months ago

@AlbertTan404 Great! I want to keep this issue open so that other people can find it as well.