osrf / subt

This repository contains software for the virtual track of the DARPA SubT Challenge. Within this repository you will find Gazebo simulation assets, ROS interfaces, support scripts and plugins, and documentation needed to compete in the SubT Virtual Challenge.

Question: What is simulation speed bottleneck? #680

Open peci1 opened 3 years ago

peci1 commented 3 years ago

We tried to run the simulator on a beefy machine (40 cores, 4 GPUs) with a full team of robots (3 UGVs, 3 UAVs, approx. 30 cameras in total). Neither the CPUs nor any of the GPUs were anywhere near full utilization, yet the real-time factor was between 1 and 2 percent. Is there any clear performance bottleneck that could be worked on? E.g. aren't the sensors rendered serially (i.e. first one camera, then a second one, and so on)? Or is there something else? I'm pretty sure the physics computations shouldn't be that costly (and the performance doesn't drop linearly with the number of robots).

AravindaDP commented 3 years ago

I'm also interested in the same question. Is it possible to achieve a similar level of performance to Cloudsim using Docker Compose, and if so, what kind of machine should we use? (I'm primarily targeting an AWS EC2 instance.)

My experience has been as follows (the numbers could be slightly off, as I'm quoting them from memory). For a single X2 UGV with 3D lidar + 4 RGBD cameras, or a UAV with 2D lidar + RGBD + 2 point lidars, on a local PC with an i7 7th gen (4C/8T @ 2.8 GHz) + GTX 1060, the real-time factor is around 30%; on Cloudsim it is 40~50%.

For 2x UGV (same as above) + 2x UAV (same as above) + Teambase: local PC 2~3%, Cloudsim 10%.

These local numbers are even without solution containers, just using ign launch from the catkin workspace without any solution nodes. I also haven't tried headless mode yet.

I still haven't tested using Docker Compose on an Amazon EC2 instance. I understand that even then it won't be an apples-to-apples comparison, since on Cloudsim the simulation container and the solution containers potentially run on different EC2 instances.

In general I'm looking for the following information, if it's possible to know:

  1. What instance type of the host EC2 (p3.8xlarge etc.) is used to run the simulation Docker container, and how much vCPU/GPU/RAM does that container get?

  2. What instance type of the host EC2 is used to run the bridge containers and solution containers, and how much vCPU/GPU/RAM does each container get? (Or how many Docker containers are run on each EC2 instance?)

It's probably not a straightforward relationship between EC2 instances and container count, since it's a dynamically managed Kubernetes cluster spanning multiple nodes. I'm just looking for a rough figure so that I can try to recreate it using Docker Compose on AWS (probably with an even beefier EC2 instance whose power equals the combined EC2 instances required for a Cloudsim simulation, assuming such a single EC2 instance type exists).

peci1 commented 3 years ago

@AravindaDP Most of your questions have answers in https://github.com/osrf/subt/wiki/Cloudsim%20Architecture . This issue, however, is about finding the bottlenecks on systems that have enough resources. Your local PC tests very probably suffer from resource exhaustion...

AravindaDP commented 3 years ago

@peci1 Thanks for pointing out the resources.

pauljurczak commented 3 years ago

Here are the CPU and memory specs for the EC2 instance:

Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz: 
#1-16  2660 MHz
99.72Gi free RAM
119.99Gi total RAM

AravindaDP commented 3 years ago

Not sure if it has any effect, but it seems some 3D lidars have a much higher horizontal resolution than the real sensors they are based on.

E.g. for X1 Config 8 and EXPLORER_X1 Config 2 it is 10000 horizontal points per ring: https://github.com/osrf/subt/blob/master/submitted_models/explorer_x1_sensor_config_2/model.sdf#L580

Whereas the VLP-16 (on which I believe these were modeled) would only have about 1200 horizontal points per ring at 15 Hz: http://www.mapix.com/wp-content/uploads/2018/07/63-9229_Rev-H_Puck-_Datasheet_Web-1.pdf

This is probably not the actual bottleneck, but I guess it still needs correction. A quick way to check the configured values is sketched below.
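A minimal sketch, assuming a local clone of osrf/subt with the layout linked above:

```bash
# Print every <samples> line (horizontal and vertical sample counts) that the
# lidar in this model is configured with.
grep -n '<samples>' submitted_models/explorer_x1_sensor_config_2/model.sdf
```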

Maybe the culprit is the camera sensor? I do see a difference between COSTAR_HUSKY and EXPLORER_X1 (single RGBD vs. 4x RGBD). Is it CPU-based rendering? Could it be made to use the GPU?

zwn commented 3 years ago

According to my tests, the simulation does not use more than 4 CPU cores. As for the GPU, most of the usage is in the GUI. I do all local runs headless; if I don't, the GUI takes all available GPU memory (8 GB in my case) and nothing else works on the computer running the simulation.

Not knowing anything about the actual implementation, I am also surprised by the drop in performance when simulating multiple robots. From my (possibly naive) point of view (considering current games), the resolution of the cameras is small and the requirements on quality are not that high either. We are using mostly 640x480 cameras, which is 0.3 MP. 1920x1080 (2.1 MP) seems to be the minimum current games use, and their FPS starts at 60 Hz, so pixel-wise the ratio is 6.75x and fps-wise 3x (at minimum). Given this comparison it should be possible to get about 20 cameras at 480p and 20 Hz in real time, while in reality we get only maybe 3% of that.
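For concreteness, the back-of-the-envelope arithmetic behind that estimate (using only the numbers above):

```bash
# Pixel throughput of one 1080p @ 60 Hz game render vs. one 640x480 camera @ 20 Hz.
echo "scale=2; (1920*1080*60) / (640*480*20)" | bc   # prints 20.25
# So the pixel budget of a single 1080p/60Hz render corresponds to roughly
# 20 cameras at 480p/20Hz, which is where the "about 20 cameras" figure comes from.
```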

So yes, I'd also like to know where the bottleneck is. It is really difficult to get anything done at 3% of real time.

peci1 commented 3 years ago

> the GUI takes all available GPU memory (8GB in my case) and nothing else works on the computer running the simulation

How many robots are you talking about? I get like 800 MB GPU memory for the GUI with a single robot, and it seems to scale more or less linearly with more robots.

We've actually found out that the Absolem robot is quite a greedy-guts regarding cameras - the main 6-lens omnicamera sums up to something like 4K... So I wouldn't be surprised that it takes some time to simulate, but I wonder why the GPU isn't fully used. Or maybe it's just because of the way nvidia-smi computes GPU usage? I know there are many different computation/rendering pipelines in the GPU...

zwn commented 3 years ago

I have re-run the test. Currently, when running a headless simulation with a single X2 robot, there is one ruby process taking about 2 GB of memory and the GPU utilization stays around 5%. When running the same setup with the GUI, there are two ruby processes each taking 2 GB, but the GPU utilization jumps to 100% and even mouse movement is slowed down (even when the window is not visible). So in my book the GUI is still broken for me and I'll continue running headless.

I have an Ubuntu 18.04 system with nvidia driver 450 and a GeForce GTX 1050 with 8 GB.

peci1 commented 3 years ago

Was this test performed via Docker or in a direct catkin install?

zwn commented 3 years ago

Docker

peci1 commented 3 years ago

Could you re-do the test with a direct install? I'd like to clearly separate the performance loss brought in by Docker from the performance of the simulator itself. When I run the simulator directly and headless, there is no noticeable slowdown on my 8th gen Core i7 ultrabook with an external GPU (as long as there is a single robot without that many cameras).

zwn commented 3 years ago

> Could you re-do the test with a direct install?

Actually, sorry, no. I am not going to risk messing up the whole computer by installing all the ROS and ign stuff directly into a system I depend on. However, rviz works just fine from inside Docker, taking a hardly noticeable hit on GPU utilization when displaying an image from the front camera and the depth cloud at the same time, and using only 18 MB of GPU memory - that aligns more with my expectations.

zwn commented 3 years ago

We might get some improvement in speed by building the plugins in this repository with optimizations enabled. See #688

dan-riley commented 3 years ago

The simulation speed is clearly related to the RGBD cameras. I have modified our models to run without the cameras, with only the LIDAR, and two robots can run at about 60-80% real time. If the same models are used but with the cameras enabled, the same two robots run at about 20% real time. Our models use a 64-beam LIDAR versus the 16-beam present on most systems, so a higher-resolution LIDAR does not seem to impact performance much.

peci1 commented 3 years ago

I agree the speed goes down a lot with cameras. However, I wonder why the computer doesn't utilize more resources in order to keep things running as fast as possible.

GPU lidar is basically just a depth camera in Gazebo - its resolution would be something like 2048x64. So the performance impact would be hard to notice (even more so with the 16-ray ones).

AravindaDP commented 3 years ago

My knowledge of Ignition Gazebo is very limited, so take the following observations/hunches with a pinch of salt.

I believe Ignition Gazebo renders the cameras serially, so that might explain why we don't see an increase in resource utilization with respect to the number of robots/cameras.

I'm also curious about the use of a manual scene update in RenderingSensor (the base class of all cameras as I understand it, but also of the GPU lidar): https://github.com/ignitionrobotics/ign-sensors/blob/main/src/RenderingSensor.cc#L89 Probably something that can be optimized? But if it has any effect, it should affect the GPU lidar as well.

I think the best way to find the bottleneck is to use the profiler and see what's taking time: https://ignitionrobotics.org/api/common/3.6/profiler.html
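A rough sketch of how that could be enabled, assuming a from-source build of the Ignition libraries (the profiler is compiled out by default, so the binary packages most likely don't include it; the paths below are illustrative):

```bash
# Rebuild an Ignition library (e.g. ign-gazebo) with the profiler compiled in,
# then run the simulation as usual and inspect it with the Remotery-based web
# viewer that ships with ign-common.
cd ~/ign-ws/src/ign-gazebo/build   # hypothetical workspace layout
cmake .. -DENABLE_PROFILER=1
make -j"$(nproc)" && sudo make install
```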

AravindaDP commented 3 years ago

I can confirm that using a release build of subt_ws as suggested in https://github.com/osrf/subt/pull/688 makes a noticeable improvement in performance (in my case, approx. a 2x speed-up for a single X2C6 in GUI mode).

pauljurczak commented 3 years ago

@AravindaDP How did you pass compiler flags and build parameters to catkin?

AravindaDP commented 3 years ago

@pauljurczak I just used catkin_make -DCMAKE_BUILD_TYPE=Release install as the last command in step 4 here: https://github.com/osrf/subt/wiki/Catkin%20System%20Setup
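For completeness, the workflow I would expect (the workspace path is just an example; if the workspace was previously built without the flag, clearing the build/devel/install spaces forces a full recompile with the new build type):

```bash
cd ~/subt_ws                   # assumed workspace location
rm -rf build devel install     # optional: force a clean Release rebuild
catkin_make -DCMAKE_BUILD_TYPE=Release install
```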

zwn commented 3 years ago

@pauljurczak See #688.

pauljurczak commented 3 years ago

Thank you. I rediscovered that cmake-gui works with this project if launched from the command line. It makes editing configuration options much easier.

peci1 commented 3 years ago

This might help a lot: https://github.com/ignitionrobotics/ign-sensors/pull/95 . I created a PR that adds a set_rate service to each sensor in the simulation. By calling this service, you can selectively decrease the update rates of the sensors. E.g. the RealSenses on EXPLORER_X1 run at 30 Hz, but we process them at 6 Hz. That's 80% of the images we just throw away. So let's not even render them!

Thanks for the idea @tpet !

peci1 commented 3 years ago

The ign-sensors PR has been merged and a new version was released in the binary distribution. Now #791 contains the required SubT part, through which teams will be able to control the rendering rate of sensors via ROS services. Even before #791 is merged, you can already set the rate in locally running simulations by directly calling the Ignition services (ign service -l | grep set_rate to get the list of services, then call each of them with the desired rate).
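For example, such a call could look roughly like this (the service path below is illustrative, and the request/response message types are my assumption; ign service -i -s <name> should show the actual ones):

```bash
# List the per-sensor rate services exposed by the running simulation.
ign service -l | grep set_rate

# Lower one camera's rendering rate to 6 Hz (service name is illustrative).
ign service -s /world/simple_cave_01/model/X1/link/base_link/sensor/camera_front/set_rate \
  --reqtype ignition.msgs.Double --reptype ignition.msgs.Empty \
  --timeout 1000 --req 'data: 6.0'
```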

peci1 commented 3 years ago

With #791, I achieve 50-70% RTF with EXPLORER_X1 (3D lidar + 4x RealSense).

Space-Swarm commented 7 months ago

Hi all, just bumping this issue as I'm using the simulator and encountering similar problems with the speed bottleneck. Is there a summary somewhere of the options for speeding up the simulation?

I have access to a supercomputer, but it requires a lot of specialist software to set up, and I've learnt from @peci1 that a supercomputer may not resolve the bottlenecks. The old version of Gazebo with Ignition Dome does not work well for parallel processing, so I'm interested in finding out whether Gazebo Fortress or any other fixes were used by the competing SubT teams. I'm looking for a way to use multiple cores to speed up the simulation.