potential bugs and improvements in v1.1. Encounter OOM when running simulation.

Yihanhu commented 1 year ago

When I run simulations, they always consume a lot of resources, especially memory and threads. Why do they consume so many resources, and is it possible to optimize this?

I also noticed that the simulation model is always pre-loaded into memory for every scenario. For example, if we run a simulation of 1000 scenarios, 1000 models are loaded into memory at the same time, resulting in huge memory usage at the beginning of the simulation. However, only a small portion of the simulations is actually running at any given time.

Moreover, when the number of Ray simulation workers exceeds 4, the raylet is killed because the machine runs out of threads. As a result, only 4 simulations can run at the same time on my machine, which makes the overall simulation very slow. This is completely different from nuPlan v1.0. In v1.0, the number of simulations running at the same time seemed to be unrelated to the number of Ray nodes, so many simulations could run simultaneously on my machine, which was much faster than v1.1 is now.

Is there a way to optimize the parallelism and resource utilization of the simulation, so that it runs much faster and works better on servers with limited memory?
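
To make the request concrete, here is a minimal sketch of the behavior being asked for: each simulation is built only when a worker slot is free, so the number of models in memory is bounded by the worker count rather than by the number of scenarios. This is not nuPlan code. `run_bounded`, `build_and_run` and `LiveCounter` are hypothetical names, and a plain list stands in for the model.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class LiveCounter:
    """Tracks how many stand-in models are alive at once (illustration only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.live = 0
        self.peak = 0

    def __enter__(self) -> "LiveCounter":
        with self._lock:
            self.live += 1
            self.peak = max(self.peak, self.live)
        return self

    def __exit__(self, *exc: object) -> None:
        with self._lock:
            self.live -= 1


counter = LiveCounter()


def build_and_run(scenario_id: int) -> int:
    """Hypothetical task: build the model, run one simulation, release the model."""
    with counter:
        model = [0.0] * 10_000  # stand-in for an expensive model
        return scenario_id * 2 + int(model[0])


def run_bounded(
    specs: Iterable[T], task: Callable[[T], R], max_in_flight: int
) -> List[R]:
    """Run task on every spec with at most max_in_flight tasks submitted at once."""
    results: List[R] = []
    pending = set()
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        for spec in specs:
            if len(pending) >= max_in_flight:
                # Block until at least one slot frees up before building more.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            pending.add(pool.submit(task, spec))
        done, _ = wait(pending)
        results.extend(f.result() for f in done)
    return results
```

Calling `run_bounded(range(1000), build_and_run, max_in_flight=4)` would keep at most 4 stand-in models alive at any time. Results come back in completion order, not submission order.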

michael-motional commented 1 year ago

Hi @Yihanhu, these are both good points. Regarding simulation memory usage, we're planning to address it soon. For the degree of parallelism, the 1.0 version had some issue with OOM/overwhelming the machine with too many processes, which motivated the change. I'll take another look at how this can be balanced.

gianmarco-motional commented 1 year ago

@Yihanhu You can try using the single_machine_thread_pool worker instead of the ray one. You have to enable use_process_pool to get multiprocessing behavior.
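
As a rough illustration of what such a switch typically does, the sketch below picks between a thread pool and a process pool from Python's standard library based on a boolean flag. It is not the devkit's implementation. `make_executor`, `simulate` and `run_all` are hypothetical names. With a Hydra-style command line, the setting would presumably be passed as an override along the lines of `worker=single_machine_thread_pool worker.use_process_pool=true`, but that exact key path is an assumption not confirmed in this thread.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Iterable, List, Optional


def make_executor(use_process_pool: bool, max_workers: Optional[int] = None) -> Executor:
    """Return a process pool when use_process_pool is set, else a thread pool."""
    pool_cls = ProcessPoolExecutor if use_process_pool else ThreadPoolExecutor
    return pool_cls(max_workers=max_workers)


def simulate(scenario_id: int) -> int:
    """Hypothetical CPU-bound stand-in for one simulation."""
    return sum(i * i for i in range(scenario_id * 1000))


def run_all(
    scenario_ids: Iterable[int], use_process_pool: bool, max_workers: int
) -> List[int]:
    """Run every scenario on the chosen pool and return results in input order."""
    with make_executor(use_process_pool, max_workers) as pool:
        return list(pool.map(simulate, scenario_ids))
```

Threads share one interpreter, so CPU-bound Python work is serialized by the GIL. A process pool sidesteps that at the cost of one interpreter per worker, which is why `max_workers` is the main knob for trading speed against memory. On platforms that spawn worker processes, `run_all(..., use_process_pool=True, ...)` should be called from under an `if __name__ == "__main__":` guard.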

gianmarco-motional commented 1 year ago

Also, the footprint of the model (in the case of an ML Planner) should be minimal prior to the initialize call, which only happens at simulation execution (not when we build the scenarios).

I'll check if anybody can look into it.
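
The pattern described here, a cheap constructor plus an expensive initialize step, can be sketched as follows. This is only an illustration and not the actual planner interface. `LazyPlanner`, `checkpoint_path`, `is_initialized` and `compute_trajectory` are hypothetical, and a list of floats stands in for the model weights.

```python
from typing import List, Optional


class LazyPlanner:
    """Hypothetical planner that defers loading weights until initialize()."""

    def __init__(self, checkpoint_path: str) -> None:
        # Only the path is stored here, so building many planners is cheap.
        self._checkpoint_path = checkpoint_path
        self._weights: Optional[List[float]] = None

    @property
    def is_initialized(self) -> bool:
        return self._weights is not None

    def initialize(self) -> None:
        """Load the weights once; repeated calls are no-ops."""
        if self._weights is None:
            # Stand-in for reading self._checkpoint_path from disk.
            self._weights = [0.0] * 100_000

    def compute_trajectory(self, state: float) -> float:
        """Toy planning step that requires initialized weights."""
        if self._weights is None:
            raise RuntimeError("initialize() must be called before planning")
        return state + self._weights[0]
```

With this layout, building 1000 planners up front costs only 1000 small objects. The large allocation happens per planner when `initialize()` is called at execution time, so peak memory depends on how many simulations execute concurrently.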

Yihanhu commented 1 year ago

> @Yihanhu You can try using the single_machine_thread_pool worker instead of the ray one. You have to enable use_process_pool to get multiprocessing behavior.

Thank you for your suggestion, I will give it a try.

> Also, the footprint of the model (in the case of an ML Planner) should be minimal prior to the initialize call, which only happens at simulation execution (not when we build the scenarios).
>
> I'll check if anybody can look into it.

Yes, you are right. It is during simulation execution that memory usage grows explosively.

polcomicepute commented 12 months ago

Has there been any progress in addressing this issue? I am unable to verify the performance of my model because I cannot run simulations for various scenarios (when num_scenarios>2 for each scenario type, single node). How can this be resolved? I would appreciate it if you could provide a solution.

jessapinkman commented 2 months ago

> Has there been any progress in addressing this issue? I am unable to verify the performance of my model because I cannot run simulations for various scenarios (when num_scenarios>2 for each scenario type, single node). How can this be resolved? I would appreciate it if you could provide a solution.

Hi, have you solved this problem yet? @Yihanhu @polcomicepute

motional / nuplan-devkit

potential bugs and improvements in v1.1. Encounter OOM when running simulation. #237