osrf / subt

This repository contains software for the virtual track of the DARPA SubT Challenge. Within this repository you will find Gazebo simulation assets, ROS interfaces, support scripts and plugins, and documentation needed to compete in the SubT Virtual Challenge.

Cloudsim performance insufficient to run stable control loops #206

Closed osrf-migration closed 4 years ago

osrf-migration commented 4 years ago

Original report (archived issue) by GoRobotGo (Bitbucket: GoRobotGo).


Simple Tunnel 2 runs at about 2x realtime in the cloud. This would be fine if the control loop also ran 2x faster, but the control loop runs 10x slower in the cloud than on my local computer. With a differential this large, the control loop fails in cloudsim.
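The arithmetic behind this differential can be sketched as follows. This is only an illustration: the function name is hypothetical, and it assumes the local simulation runs at 1x realtime, which the report does not state.

```python
def sim_time_per_iteration_ratio(wall_slowdown, cloud_rtf, local_rtf=1.0):
    """How much more simulated time passes per control-loop iteration
    in the cloud than locally.

    wall_slowdown: how many times slower the loop runs in wall-clock time
    cloud_rtf:     real-time factor of the cloud simulation
    local_rtf:     real-time factor of the local simulation (assumed 1.0)
    """
    return wall_slowdown * cloud_rtf / local_rtf


# Figures from the report: loop 10x slower, simulation at 2x realtime.
ratio = sim_time_per_iteration_ratio(wall_slowdown=10.0, cloud_rtf=2.0)
print(ratio)  # 20.0
```

Under these assumptions, each control-loop iteration in the cloud spans about 20 times more simulated time than the same iteration does locally.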

I will run an experiment where I try to slow down the cloudsim as much as possible (with lots of vehicles) and speed up my control loop to see if the differential can be reduced.

My current situation is that I can score locally using run_docker_compose.sh, but running successfully using cloudsim through the portal is not possible due to the performance issues.

osrf-migration commented 4 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


osrf-migration commented 4 years ago

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


I was expecting the simulation to run at maximum 1x realtime, possibly slower, and that the solution docker would get the largest AWS instance all for itself. If this is not the case, I am not sure what is expected from the teams participating.

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


If you run more than one robot, it looks like it will be at or below 1x realtime even with Simple Tunnel 2. I did an experiment with Simple Tunnel 2 and it ran at ~1x realtime with 2 vehicles. I removed all of my object identification processes / nodes to try to reduce the load and got my code running ~3x slower than on my local computer (vs ~10x above with the object identification).

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


I did an experiment with 5 vehicles and Tunnel Practice 1. It ran at ~1/7 of real time. The control loop seemed to be running fast enough, but the vehicle did not take off. (#208)

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


I can confirm that the performance of the solution container is not good. A function that takes about 0.03 seconds of simulation time on my local computer takes about 0.6 seconds of simulation time in cloudsim.
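A comparison like this can be reproduced with a small timing helper. The sketch below is an assumption about how such a measurement might be taken, not the method used above; `timed_call` is a hypothetical name. The clock is injectable, so a ROS 1 node running with `use_sim_time` could pass `rospy.get_time` to measure in simulation time rather than wall-clock time.

```python
import time


def timed_call(func, clock=time.monotonic):
    """Run func() and return (result, elapsed), where elapsed is measured
    with `clock`. Pass a simulation-time clock to measure in sim seconds."""
    start = clock()
    result = func()
    return result, clock() - start


# Example with the default wall clock and a stand-in workload.
result, elapsed = timed_call(lambda: sum(range(100_000)))
print(f"took {elapsed:.6f} s")

# The two durations quoted above differ by a factor of about 20.
print(round(0.6 / 0.03))  # 20
```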

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


We are planning to deploy a change that will limit the maximum performance to 1x realtime. Deployment will take place when no one is actively running a simulation.

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


When I know it is deployed I will try some runs. I have been limiting the realtime performance to slower than 1x by adding multiple robots and that has not resolved the issues. What type of instance is the solution docker running on?

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


We are using AWS g3s.xlarge. The specs are listed here.

osrf-migration commented 4 years ago

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


How many of these instances are involved in one simulated run? Do I understand correctly from https://osrf-migration.github.io/subt-gh-pages/#!/osrf/subt/wiki/cloudsim_architecture that each robot controller gets its own g3s.xlarge instance, and that only the solution container and the bridge container run there? Is the performance impact of the bridge significant, or is more or less the whole instance available for the controller? Do you know what a vCPU is? They only say it is “based on Intel Xeon 2.7GHz”. Is the GPU shared in any way with any other code/customer or is it dedicated full time to the controller?

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


This instance type explains quite a bit of the performance problems. The CPU is listed as a custom Xeon E5-2686 v4. Two 16 or 18 core CPUs are present for a total of 32 cores / 64 threads running at 2.3-2.7GHz. The g3s.xlarge instance gets 2 cores / 4 threads. The other 14 cores / 28 threads are running with other users.

Locally most people probably have 1.5 to 2x faster performance per core and 2x to 4x more cores/threads.

An easy solution to a good part of the performance issues would be to switch to the next tier up of g3s. With the current tier most solutions are thread/CPU starved.
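One way to make thread starvation visible is to log the CPU budget when the solution container starts. The following is a minimal sketch; the function names and the idea of reserving one CPU for the control loop are illustrative assumptions, not part of the SubT tooling.

```python
import os


def usable_cpu_count():
    """Logical CPUs this process may run on.

    Uses the scheduler affinity mask where available (Linux) and falls
    back to os.cpu_count() elsewhere.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def worker_budget(reserved_for_control=1):
    """Worker threads to start after reserving CPUs for the control loop
    (hypothetical policy: always allow at least one worker)."""
    return max(1, usable_cpu_count() - reserved_for_control)


print(f"usable logical CPUs: {usable_cpu_count()}")
print(f"worker threads to start: {worker_budget()}")
```

On a 2 core / 4 thread instance such as the one described above, this would be expected to report 4 logical CPUs, far fewer than on a typical development machine.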

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


The bridge is lightweight. It only converts messages between ROS and Ignition.

Each robot gets its own g3 instance.

All AWS machines are virtualized; they indicate this with the vCPU name.

We chose the g3s.xlarge because it's similar to the NVIDIA TX2, which has 2 cores. I'll check to see if an upgrade is possible.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


  1. There is the possibility of increasing instance size for future competitions.
  2. There is also the possibility of using different instance sizes for each robot configuration. For example, a large UGV could potentially have a larger AWS instance.

Keep in mind that these are possibilities for future events. We are one week from the close of Tunnel Circuit, which means significant changes are unlikely in order to limit risk and promote fairness.

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


My expectation was that if it ran locally in cloudsim it would run in the cloud in cloudsim. It is disappointing to find out a week before the contest that we were supposed to be targeting a slow 2 core processor.

osrf-migration commented 4 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


GoRobotGo (GoRobotGo), the CloudSim Architecture documentation was posted back in July and indicates that AWS G3 EC2 instances were going to be used.

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


I will admit I missed that in the documentation. A G3 instance ranges from 2 to 32 cores (4 to 64 threads). So, this documentation would not have been enough to know that we needed to target a slow 2 core processor. As I mentioned before, any of the other G3 instances would resolve most of the performance problems.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


We are in the process of bumping the cloud machine instance type from the current g3s.xlarge to a g3.4xlarge. We are waiting on AWS to increase our EC2 availability limit. I'll follow up on this thread when the limit has been increased and we have updated cloudsim.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


The Portal has been updated to use g3.4xlarge machines.

Please re-open this issue if you encounter additional problems related to performance and AWS machine types.

osrf-migration commented 4 years ago

Original comment by Chris Fotache (Bitbucket: chrisfotache).


After your email that the upgrade is complete, we uploaded a new simulation (for simple_tunnel_02). It’s Simple02 under CYNET-ai, and it’s been in the “Pending With no errors” state for almost 3 hours. When we submitted another one this morning, it was terminated within an hour.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


I started some simulations last night and NONE(!) has finished, so I cannot check the results :disappointed: (we are talking about 6+ hours here): 1 is “Deleting Pods”, 1 is Launching, and 3 are still pending.

osrf-migration commented 4 years ago

Original comment by Chris Fotache (Bitbucket: chrisfotache).


Looks like a new issue was created to discuss this:

https://osrf-migration.github.io/subt-gh-pages/#!/osrf/subt/issues/244/cloudsim-not-functioning (#244)

osrf-migration commented 4 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).