namsan96 / SiMPL


hardware platform and runtime consumption #5

Open feriorior opened 1 year ago

feriorior commented 1 year ago

Thank you for bringing new ideas to skill-based work.

I have found that collecting task instances in the code consumes a lot of time, so I would like to ask about your hardware platform and runtime. Do you think it is possible to store the buffers for reuse? Would that affect the overall architecture of this project?

namsan96 commented 1 year ago

Hi, thank you for your interest in our work!

There is no strict hardware requirement other than GPU access, but more CPU resources will help.
Please note that our meta-training code can use multi-processing to collect rollouts from the environment.
You can set --worker-gpus (e.g. --worker-gpus 0,0,0,0,1,1,1,1 for 4 processes using GPU 0 plus 4 using GPU 1).
Our meta-training generally took less than 24 hours when we used 5-8 processes on a machine with a 20-core server processor and 1-2 consumer GPUs (e.g. RTX 2080).
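For reference, here is a minimal sketch of how a spec such as --worker-gpus 0,0,0,0,1,1,1,1 can be mapped to one device per rollout worker. Only the flag name comes from the answer above; the parsing code is illustrative, not the repository's actual argument handling:

```python
# Minimal sketch of mapping a --worker-gpus spec to per-worker devices.
# Only the flag name is taken from the answer above; the parsing is illustrative.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--worker-gpus', type=str, default='0',
                    help='comma-separated GPU index per rollout worker, e.g. 0,0,0,0,1,1,1,1')
args = parser.parse_args()

# '0,0,0,0,1,1,1,1' -> 8 workers: four pinned to cuda:0 and four to cuda:1
worker_devices = [f'cuda:{idx.strip()}' for idx in args.worker_gpus.split(',')]
print(f'{len(worker_devices)} rollout workers on devices: {worker_devices}')
```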

Regarding the question about the buffer: we already use a replay buffer to leverage the collected data more efficiently.
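For anyone unfamiliar with the idea, a generic replay buffer looks roughly like the sketch below. This is not SiMPL's implementation, only an illustration of why stored transitions can be reused across many updates instead of being re-collected:

```python
# Generic replay-buffer sketch (not SiMPL's implementation): transitions are
# collected once, stored, and sampled repeatedly for gradient updates.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)  # oldest transitions are evicted first

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # The same transition may appear in many minibatches, which is what
        # amortizes the cost of environment interaction.
        return random.sample(self.storage, batch_size)
```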

BrightMoonStar commented 7 months ago

I found that GPU utilization is very low when running python simpl_meta_train.py. Could you let me know why trainer.policy.to('cpu') places the policy on the CPU here? Can we put it on the GPU instead to speed up the process? Thank you very much!

namsan96 commented 7 months ago

ConcurrentCollector already uses GPU acceleration for policy rollouts if you specify --worker-gpus. The highlighted line is there to save GPU memory. If we kept the policy on a GPU, say cuda:0, before sending it to the workers, every worker process would initialize CUDA kernels on cuda:0, even processes assigned to other GPUs. To prevent this, the highlighted line first places the policy on the CPU and relies on each worker process to transfer the network to its allocated GPU, so CUDA is initialized only on that worker's allocated GPU.
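A minimal sketch of this pattern, assuming PyTorch multiprocessing (this is not the repository's ConcurrentCollector, and the policy network is a stand-in): the parent keeps the policy on the CPU before handing it to the workers, and each worker moves its own copy to its assigned device, so no worker accidentally creates a CUDA context on cuda:0.

```python
# Sketch of the CPU-first handoff described above (not the actual ConcurrentCollector).
import torch
import torch.multiprocessing as mp

def rollout_worker(policy, device):
    policy = policy.to(device)                   # the only CUDA context this worker creates
    obs = torch.randn(1, 4, device=device)       # placeholder observation
    with torch.no_grad():
        action = policy(obs)
    print(f'worker on {device}: action shape {tuple(action.shape)}')

if __name__ == '__main__':
    policy = torch.nn.Linear(4, 2)               # stand-in for the real policy network
    policy = policy.to('cpu')                    # analogous to trainer.policy.to('cpu')
    mp.set_start_method('spawn', force=True)
    devices = [f'cuda:{i}' for i in range(torch.cuda.device_count())] or ['cpu']
    procs = [mp.Process(target=rollout_worker, args=(policy, d)) for d in devices]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```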