osrf / subt

This repository contains software for the virtual track of the DARPA SubT Challenge. Within this repository you will find Gazebo simulation assets, ROS interfaces, support scripts and plugins, and documentation needed to compete in the SubT Virtual Challenge.

Problems with Cloudsim #339

Closed osrf-migration closed 4 years ago

osrf-migration commented 4 years ago

Original report (archived issue) by Sophisticated Engineering (Bitbucket: sopheng).


We have problems with cloudsim on the portal.

Tests from last night are running very long and have the status “DeletingPods”, and some additionally show “Error: AdminReview”.

I’ve sent an email to subt-help.

Is this problem also seen by other teams?

osrf-migration commented 4 years ago

Original comment by Chris Fotache (Bitbucket: chrisfotache).


Yes. Reminds me of last time.

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


It seems CloudSim was under a very heavy load last night. Please try resubmitting a couple of your runs.

osrf-migration commented 4 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


I just uploaded my images about 30 minutes ago and got the same “Error: AdminReview”. Now it has changed to “Error: InitializationFailed”. My images run fine with docker-compose; is this due to the load in the cloud?

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


I see the same behavior here.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


The same problem (Terminated, Error: InitializationFailed) for all of robotika’s latest simulations (ver55). I would almost change the priority from “major” to “blocker”. We are trying to create a workaround for the missing messages, and there is no way to test it now…

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Does anyone see any update on this? My recent runs seem to have been restarted. Most are Pending, one is LaunchingPods. That one only has real-time logs (nice feature, btw) for one robot, which show:

ROS_MASTER_URI=http://10.46.56.2:11311
]2;/home/developer/subt_ws/install/share/subt_ros/launch/x2_description.launch http://10.46.56.2:11311
No processes to monitor
shutting down processing monitor...
... shutting down processing monitor complete

Not sure that’s a good sign…

[Update] it failed, admin review

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


I believe part of the initialization failures/errors comes from submitting runs in rapid succession. Please resubmit more slowly and space your simulation submissions apart. I don't have a good answer for how long to wait between submissions, but waiting until the previous submission is Running before submitting the next should be sufficient. If you're using the CLI and submitting a handful of runs at once, either add an arbitrarily lengthy sleep (5 minutes is a good spacing) between submissions as a temporary hands-off fix, or monitor your submissions on the web portal and submit each subsequent simulation after the prior one reaches the Running status.
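
For illustration, a minimal sketch of the hands-off spacing approach. The `cloudsim_cli submit` command and the config file names below are placeholders, not the real CloudSim CLI syntax; substitute your team's actual submission command.

```python
# Placeholder sketch: space out CloudSim submissions instead of firing them
# all at once. "cloudsim_cli submit <config>" stands in for whatever CLI
# invocation your team actually uses.
import subprocess
import time

SUBMISSIONS = ["run_a.yaml", "run_b.yaml", "run_c.yaml"]  # hypothetical configs
SPACING_SECONDS = 5 * 60  # the ~5 minute gap suggested above

for config in SUBMISSIONS:
    subprocess.run(["cloudsim_cli", "submit", config], check=True)  # placeholder command
    time.sleep(SPACING_SECONDS)  # wait before launching the next run
```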

This is related to CloudSim hitting an AWS limit when a large number of simulation requests are loaded at a single time. We're working on a more permanent fix that does not require competitors to alter their submission workflow.

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


Two of our tests are now “Running”. The others are “Terminated, GazeboError”.

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


GazeboError is a different issue; please re-run those experiments.

osrf-migration commented 4 years ago

Original comment by Sarah Kitchen (Bitbucket: snkitche).


Arthur Schang (Arthur Schang): It appears that there is an auto-restart process in some cases. Specifically, I observe that when I receive an error but the state is not given as Terminated, the run will relaunch after a few hours. This is really nice to have, but could it also be affecting how often we have been seeing the initialization failures in the last 24 hours?

osrf-migration commented 4 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Mine restarted but went into AdminReview again.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


robotika
1/30/20, 5:01 PM
ver55p3
f0da93a2-1f41-4398-be24-28d6cdbaa975-r-1
DeletingPods
Error: AdminReview

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


Sarah, I am not aware of the logic behind the restart process; I will defer to someone else to answer that question. If a batch of runs is restarted all at the same time, it will almost certainly result in another initialization error/failure at the moment.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


I can confirm I had 6 runs restart (all at the same time, though 5 were initially “pending”) and all 6 failed with initialization error/admin review. Now that those are all done I’m going to carefully try just one, hoping for the best…

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


The portal is now displaying “Unknown Error” for me and is not displaying any simulation results. Is anyone else seeing that? I tried logging out and in again.

I was just going to start a new simulation but maybe I’d better wait a bit first.

[Edit] Looks ok now, maybe just a temporary server issue

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


I see that too!!! No way to submit anything :disappointed:

osrf-migration commented 4 years ago

Original comment by Chris Fotache (Bitbucket: chrisfotache).


Same here. Don’t worry, it’s still gonna be Jan 30 somewhere on Earth for the next 12 hours, so keep an eye on it, load up on Red Bulls and don’t plan any sleep.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Good plan :slight_smile: gotta love the caffeine

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


For future circuits, would a submission process that allows multiple submissions for the final circuit solution ease tensions around the deadline? That would allow a competitor to submit intermediate solutions before submitting their final solution; if something drastic did happen, your submission would fall back on the intermediate solution.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Arthur Schang (Arthur Schang) Yes, I believe that would be a good change

osrf-migration commented 4 years ago

Original comment by Chris Fotache (Bitbucket: chrisfotache).


Yes, definitely

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Yes, please!

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


I had the unknown error yesterday; see issue #340. It disappeared after about an hour.

Allowing multiple submissions for the final solution would be a really good feature!

osrf-migration commented 4 years ago

Original comment by Chris Fotache (Bitbucket: chrisfotache).


My final submission shows:

DeletingPods
Error: AdminReview

Is this expected behavior?

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


A submission may go through many different states, and you should not worry. If we hit an AWS limit, then we'll retry. The same goes for a Gazebo crash. Please wait for your email summary.

osrf-migration commented 4 years ago

Original comment by Sarah Kitchen (Bitbucket: snkitche).


Should we refrain from doing tests with practice runs in CloudSim while scoring for the Urban Circuit is underway? I.e. are we at risk of overloading the system and affecting scores?

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


You are clear to continue testing. If issues do arise, we will communicate with teams and take steps to ensure all runs are consistently scored.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Arthur Schang (Arthur Schang), are you sure this is a good idea? Note that all solutions degrade under heavy load (see https://osrf-migration.github.io/subt-gh-pages/#!/osrf/subt/issues/261/cloudsim-stops-sending-some-topics (#261)#comment-55929641), and I believe this was also an issue during the Tunnel Circuit finals. It would be fairer to close testing for the next two-week (?) period and run only the contest solutions, sequentially. Thanks.

P.S. I sent a question similar to Sarah's to subt-help@ yesterday…

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


I am aware of the situation and results raised in issue #261. If CloudSim practice runs are to be temporarily discontinued during UC evaluation, a formal announcement or infrastructural block on continued submission of practice runs will be issued.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


OK, I checked the results from SubT Virtual Urban (BTW, thanks for the new robot_paths.svg feature :slight_smile: ), and they looked fine for the first 4 runs, but then 20 runs were bad (the robot did not leave the base station area; the 5th run was “on the edge”). I posted some pictures at:

https://robotika.cz/competitions/subtchallenge/virtual-urban-circuit/en#200305

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Martin Dlouhy (robotikacz) Thanks for sharing this. I actually see the same thing in 21 out of my 24 runs, along with the error in issue #348.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


@Malcolm Stagg Yesterday I tried the Urban worlds available on CloudSim (I picked worlds 3 and 8 because our X2 does not like the railway very much) and scored some points. I tried the old submitted version as well as a new “identical” one with the tweaks needed for the System Track (we use the same code base), to check that we did not break Virtual during the last crazy month. It worked, so I wanted to propose that you test it there too, but I see “SODIUM – 24 Robotics” on all the leaderboards, so you already tried that…

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Thanks Martin, I’ve been running my submitted version to find an unofficial score for how I would’ve done without those runs getting nuked, and also to get some decent log files. It’s still running (just doing 2 at a time to be safe), but so far it looks like it’ll be somewhere around 60-70 points. I’ve been in touch with the SubT team and they are investigating the simulator irregularities, though they contend that simulator load is not a cause because the CloudSim design keeps runs isolated, which should make that impossible. I agree with them in theory, but the evidence seems to show otherwise.

Congrats to you and your team for doing so well despite so many of your runs getting nuked too!

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


We too had the problem that in many runs the robots did not move. So far I have analyzed one run in more detail: two of the robots did not move at all although the controller was running fine. It looks strange.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


@sopheng That sounds very similar. In most of my cases, the robots were either not moving at all or were “dancing” in circles outside the entrance. Most of the time all 4 robots were affected, but in one or two cases I believe 3 robots were seriously affected while one robot was able to enter a little before experiencing the same problem.

osrf-migration commented 4 years ago

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


Could this also have the same root cause as #354? On a busy network I can see how some requests easily take longer than 100 ms (we are not the only users of the network). It would also support the hypothesis that it gets worse as the load goes up. The symptoms also line up: some robots start OK and some do not (where the subscription took longer than 100 ms).

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Zbyněk Winkler (Zbyněk Winkler (robotika)) That was my first thought too when I saw the 100ms issue.

osrf-migration commented 4 years ago

Original comment by Brett Fotheringham (Bitbucket: Bfotheri).


Hey Malcolm Stagg (malcolmst7), did you ever figure out a root cause for that dancing behavior? We experienced the same thing on most of our runs.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Brett Fotheringham (Bfotheri) I believe the SubT team is still investigating this to try to figure out what happened. A lot of new logs have been added now, which should help if it happens again, but afaik no one has been able to reproduce this behavior since the Urban Circuit.

Interestingly, though, the new logging work is actually causing something like this to happen again now (at least for me); not sure if that might help find the root cause.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


If I had to guess, I’d say it was somehow related to network behavior in AWS under heavy load, but so far nothing conclusive. In my limited logs, the only things I saw were that a lot of TF data was getting lost and that the /clock time occasionally jumped forward.
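
A minimal sketch of one way to watch for such /clock jumps, assuming a standard ROS 1 Python environment; the node name and threshold are illustrative choices, not part of the original report.

```python
#!/usr/bin/env python
# Hedged diagnostic sketch: warn whenever simulated time on /clock jumps
# forward by more than a threshold. Node name and threshold are arbitrary.
import rospy
from rosgraph_msgs.msg import Clock

JUMP_THRESHOLD = 1.0   # seconds of sim time; illustrative value
last_time = [None]     # previous sim time, boxed so the callback can update it

def on_clock(msg):
    t = msg.clock.to_sec()
    if last_time[0] is not None and t - last_time[0] > JUMP_THRESHOLD:
        rospy.logwarn("/clock jumped forward by %.3f s (%.3f -> %.3f)",
                      t - last_time[0], last_time[0], t)
    last_time[0] = t

if __name__ == "__main__":
    rospy.init_node("clock_jump_monitor")
    rospy.Subscriber("/clock", Clock, on_clock)
    rospy.spin()
```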

malcolmst commented 4 years ago

@bfotheri FYI, I'm fairly sure I finally did discover the root cause of this issue last week and shared it with the SubT team.

CloudSim uses WeaveNet to provide networking and network isolation for the containers. This works as expected for unicast packets, but WeaveNet NPC has some counter-intuitive behavior: it does not block multicast packets from being sent to all hosts (even other competitors'), as detailed in this issue: https://github.com/weaveworks/weave/issues/3272. Ignition Transport uses multicast UDP packets for topic advertisements, among other things, so all topic advertisements from any running robot were unexpectedly being sent to all hosts, where most of the topics would get filtered out on the receiving end. I was able to confirm this by running two simulations simultaneously and logging all the topics before they were filtered out, and I found the isolation between those two simulations was indeed broken.
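
To make the mechanism concrete, here is a minimal, self-contained sketch of plain multicast UDP (not Ignition Transport itself; the group and port are illustrative placeholders): a single datagram sent to the group is delivered to every host that has joined it, which is why a policy that only filters unicast does not provide isolation.

```python
# Standalone illustration of the multicast behavior described above.
# The group/port are placeholders for illustration only, not necessarily the
# addresses Ignition Transport discovery uses.
import socket
import struct
import sys

GROUP, PORT = "239.255.0.7", 11319  # illustrative multicast group and port

def receiver():
    """Join the group and print every datagram that arrives (run on any host)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        data, addr = sock.recvfrom(4096)
        print("received from %s: %r" % (addr, data))

def sender():
    """Send one fake 'topic advertisement' to the group; all joined hosts get it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.sendto(b"fake topic advertisement", (GROUP, PORT))

if __name__ == "__main__":
    receiver() if "--recv" in sys.argv else sender()
```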

I also wrote some test code: https://github.com/sodium24/cloudsim_net_test, which confirmed that when enough topic advertisements are received (on the order of several thousand), even though most are eventually filtered out, the infrastructure containers stop working correctly and topic data, especially tf data, begins to be lost. This can be seen in the ign_saturation test cases. I was able to reproduce the issue of my robots either not moving or spinning in circles outside the entrance on my local computer. Anyone should feel free to use/update this test code, btw. There are also some test cases there which might be helpful for ensuring things are working all right despite some common network impairments.

Based on this strong evidence, I believe the Urban Circuit issues were caused by the excessive rate of multicast Ignition Transport packets unexpectedly being delivered to all hosts, resulting from the WeaveNet issue combined with the large number of simulations and robots running at the same time.

The SubT team let me know that they are investigating this on their end. I haven't heard any update since then, but I see from the CloudSim Web repo that an update was made last week which will allow multicast to be disabled and replaced with unicast, which should fix this issue, so I expect we'll hear some update before too long.

zwn commented 4 years ago

@malcolmst Kudos.

Now we only need to know which version is running on CloudSim, which is actually impossible for us to find out :worried:. I very much doubt it is the release announced today, since the PR it links to is not merged yet. However, the images at https://hub.docker.com/r/osrf/subt-virtual-testbed/tags seem to have been updated recently. Too bad that #377 is not addressed yet, or even better #256. Or maybe #193, anyone?

bfotheri commented 4 years ago

@malcolmst additional kudos. Thank you for your investigation; you absolutely nailed it. I feel like there are a number of workarounds/fixes, but we'll see what OSRF says. Thanks again!!

angelacmaio commented 4 years ago

@malcolmst Thank you for the investigation. We are now blocking multicast and using IGN_RELAY with this PR. Ignition Transport topic advertisements will no longer be available to all instances. This fix has been deployed to Cloudsim.
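
As a rough illustration of what the unicast-relay approach looks like from a container's point of view (the relay address and launch command below are hypothetical, and CloudSim configures the real values for competitors; this is only a sketch of the mechanism, not the deployed implementation):

```python
# Placeholder sketch of the unicast-relay idea: point Ignition Transport
# discovery at a relay address via IGN_RELAY instead of relying on multicast.
# The relay IP and the launched command are hypothetical; CloudSim sets the
# real values up automatically.
import os
import subprocess

env = dict(os.environ)
env["IGN_RELAY"] = "10.0.0.5"  # assumed: unicast address of the discovery relay

# Launch the solution/bridge process with relayed discovery (command is a placeholder).
subprocess.run(["./run_my_solution.sh"], env=env, check=True)
```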

The bridge can be stressed by simultaneous advertisement of many Ignition Transport topics. ~10K topic advertisements is an order of magnitude larger than what occurs on Cloudsim even during the largest usage spikes.

With your test case, we can reproduce the circling robots but have not seen indications of data loss (as recorded by the new bridge logger). Note that the subt_seed solution contains a blocking service call that is affected by stress on Ignition Transport, but the solution may be altered to respond well despite longer service calls (see the fix here). We are also testing this PR to reduce topic bandwidth, which may improve the timing consistency.

We are continuing to stress test the infrastructure and the sample subt_hello_world solution and hope to have additional recommendations soon.

malcolmst commented 4 years ago

Thanks @zwn and @bfotheri, appreciate it! It sure wasn't obvious to find; I searched everything else I could think of for possible network issues, etc., then finally just found it hiding there in plain sight. And thank you for your update, @angelacmaio; glad to hear the IGN_RELAY fix has been deployed.

nkoenig commented 4 years ago

Cloudsim has seen many improvements. I'm closing this issue in order to triage the issue tracker. Please create a new issue if you see more problems with Cloudsim.