osrf / subt

This repostory contains software for the virtual track of the DARPA SubT Challenge. Within this repository you will find Gazebo simulation assets, ROS interfaces, support scripts and plugins, and documentation needed to compete in the SubT Virtual Challenge.
Other
309 stars 97 forks source link

AWS "Initialization failed" #268

Closed osrf-migration closed 4 years ago

osrf-migration commented 5 years ago

Original report (archived issue) by Martin Dlouhy (Bitbucket: robotikacz).


Hello,

I tried to run new “ver26urbanS1” on “Simple Urban 1” world (5x X2 robots, robotika, 5d85f584-2f9e-444e-afda-30361336a26f) and it ended with “Terminated Error: InitializationFailed”.

What is recommended procedure in such case?

I tried locally single X2 robot and it works fine.

Would it be possible to add “Re-try” button? Something like in Jenkins “Rebuild” where you can edit parameters would be even better.:slight_smile:

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


I re-tried with Urban Qualification world with the same result :disappointed:

(robotika, ver26urbanQb, 049839f8-8e47-44c6-8511-689cbae1ffad, Terminated, Error: InitializationFailed)

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


p.s. what changed on our side is that now we are subscribing to /clock

osrf-migration commented 5 years ago

Original comment by Hector Escobar (Bitbucket: hector_escobar).


Make sure you have any visualization off. Like no plots from Rviz. We saw that happening in the past when an image had plots enabled.

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


thanks for hint - but I am not using ROS except subscribing to topics and newly /clock

can this be related to cloudsim update? I will check old ver25 later on, thanks

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


it looks like it is somehow related to new solution docker image (ver26) because experiment with old (ver25) just passed on “Simple Urban 02” … hmmm

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Can we have access to AWS log files? This feedback is very slow :disappointed: … as I mentioned what changed is usage of /clock and in other topic (#261) there are discussions that maybe clock was the problem during finals of Tunnel Circuit. The modified code is here:

https://github.com/robotika/osgar/commit/99ad3e16204075f58d1355d0bf0dc036e9ca8f22

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Note, that I can download the ROS stdout logs (both CloudSim as well as both robots) and I see that for example clock was subscribed:

[rosmaster.master][INFO] 2019-11-09 08:21:52,976: publisherUpdate[/clock] -> http://10.40.160.2:40507/ ['http://10.40.160.2:46267/']
[rosmaster.master][INFO] 2019-11-09 08:21:52,977: publisherUpdate[/clock] -> http://10.40.160.2:45579/ ['http://10.40.160.2:46267/']
[rosmaster.master][INFO] 2019-11-09 08:21:52,977: publisherUpdate[/clock] -> http://10.40.160.2:40507/ ['http://10.40.160.2:46267/']: sec=0.00, result=[1, '', 0]
[rosmaster.master][INFO] 2019-11-09 08:21:52,978: publisherUpdate[/clock] -> http://10.40.160.2:46267/ ['http://10.40.160.2:46267/']
[rosmaster.master][INFO] 2019-11-09 08:21:53,000: publisherUpdate[/clock] -> http://10.40.160.2:46267/ ['http://10.40.160.2:46267/']: sec=0.02, result=[1, '', 0]
[rosmaster.master][INFO] 2019-11-09 08:21:53,006: publisherUpdate[/clock] -> http://10.40.160.2:45579/ ['http://10.40.160.2:46267/']: sec=0.03, result=[1, '', 0] 

but the ROS stdout is almost empty (example robot A):

0.a527ae8c41762a101acbf2474382f82acd977df2  Node Startup
0.a527ae8c41762a101acbf2474382f82acd977df2 INFO /rosbag_record_console [/tmp/binarydeb/ros-melodic-rosbag-1.14.3/src/recorder.cpp:210(shared_ptr<ros::Subscriber> rosbag::Recorder::subscribe)] [topics: /rosout] Subscribing to record
54.960000000 INFO /rosbag_record_console [/tmp/binarydeb/ros-melodic-rosbag-1.14.3/src/recorder.cpp:394(Recorder::startWriting)] [topics: /rosout] Recording to /home/developer/.ros/rosout_agg.bag.
0.a527ae8c41762a101acbf2474382f82acd977df2 INFO /tf_world_static [/tmp/binarydeb/ros-melodic-tf2-ros-0.6.5/src/static_transform_broadcaster_program.cpp:142(main)] [topics: /rosout, /tf_static] Spinning until killed publishing world to urban_qual 

osrf-migration commented 5 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Does your solution work using docker-compose?

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Locally I use 3 docker solution - suggested by Alfredo when I moved to Docker Ignition, i.e. CloudSim, Bridge and Solution. What is the difference when docker-compose is used?

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Martin,

Just for clarification. Running CloudSim + Bridge + Solution, it is just one of the suggested scenarios for development.

In preparation for a Cloudsim submission, we encourage teams to test/run their solution using Docker-Compose as the final step before they submit their solution(s) into the Cloudsim portal. It provides you with the closest `local` scenario to the real Cloudsim environment and it allow you to visually spot any last minute issues since it generate scores and logs locally.

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


OK, I will give it a try. For development I work inside the Solution docker (typically running git pull, upload OSGAR logs, etc.) and in other two consoles I have CloudSim and Bridge which I periodically restart. Only before the final docker image completion and submission I run it also “externally”.

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


OK, I test it and it worked fine, as with 3 dockers …

...
solution1_1  | [ INFO] [1573748402.389528967, 44.892000000]: Python3: stdout 976000 b'\x19\xcd\x18\x18\xcd\x18/\xcd\x1f7\xcd\x1f7\xcd\x19\xd0\x
cd\x19\xc2\xcd\x19\x1f\xcd\x18\xb8\xcd\x18\xab\xcd\x18\xd3\xcd\x18\xd1\xcd\x18\xe7\xcd\x19\x14\xcd\x19\x16\xcd\x19\x1f\xcd\x199\xcd\x19S\xcd\x1
9l\xcd\x19\x88\xcd\x19\x8f\xcd\x19\xbb\xcd\x19\xb9\xcd\x19\xe9\xcd\x19\xfe\xcd\x1a\t\xcd\x1a&\xcd\x1aE\xcd\x1a\\\xcd\x1a{\xcd\x1a\xa9\xcd\x1a\x
ad\xcd\x1a\xda\xcd\x1b\x04'                                                                                                                    
solution1_1  | [ INFO] [1573748402.389555794, 44.892000000]: Python3: stdout 976100 b'\xcd\x1b\x06\xcd\x1b(\xcd\x1bG\xcd\x1bt\xcd\x1b\x85\xcd\x
1b\xa8\xcd\x1b\xd1\xcd\x1b\xfb\xcd\x1c\x08\xcd\x1c5\xcd\x1cc\xcd\x1ct\xcd\x1c\xa2\xcd\x1c\xd8\xcd\x1c\xed\xcd\x1d\x1f\xcd\x1dD\xcd\x1di\xcd\x1d
\xa0\xcd\x1d\xb2\xcd\x1d\xeb\xcd\x1e\x1e\xcd\x1e.\xcd\x1e\x7f\xcd\x1e\xb8\xcd\x1e\xd9\xcd\x1f\x0c\xcd\x1f;\xcd\x1fn\xcd\x1f\x9c\xcd\x1f\xe3\xca
X\xb7\x03\r\x00\x01'                                                                                                                           
^CGracefully stopping... (press Ctrl+C again to force)                                                                                         
Stopping docker_solution1_1 ... done
Stopping docker_bridge1_1   ... done
Stopping docker_sim_1       ... done

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


(there is still dump of OSGAR log file into stdout, so that was reason why I killed it by Ctrl+C)

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


I added simple “unittest” for subscription of topis including /clock:

https://github.com/robotika/osgar/tree/feature/virtual-ignition/subt/docker/unittest
It is based on subt-seed example/docker image. It only counts incoming messages and every 100 it prints the count (except clock which is every 1000). Now it is running with single robot (the command is (0, 0), i.e. not moving.

We will see.

p.s. “robotika, unittest1, 9067dcb9-ec62-42c5-b271-cb035c58d6b7”

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Terminated with no errors … hmm, looks good … hmm

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


0.a527ae8c41762a101acbf2474382f82acd977df2  Node Startup
0.a527ae8c41762a101acbf2474382f82acd977df2 INFO /tf_world_static [/tmp/binarydeb/ros-melodic-tf2-ros-0.6.5/src/static_transform_broadcaster_program.cpp:142(main)] [topics: /rosout, /tf_static] Spinning until killed publishing world to urban_qual
0.a527ae8c41762a101acbf2474382f82acd977df2 WARN /rosbag_robot_data [/tmp/binarydeb/ros-melodic-rosbag-1.14.3/src/recorder.cpp:165(Recorder::run)] [topics: /rosout] /use_sim_time set to true and no clock published.  Still waiting for valid time...
0.a527ae8c41762a101acbf2474382f82acd977df2 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:167(Controller)] [topics: /rosout] Waiting for /clock, /subt/start, and /subt/pose_from_artifact 

so it was “running” but did not receive any data??

was_started: 0
sim_time_duration_sec: 0
real_time_duration_sec: 0
model_count: 1 

hmm

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


I run docker compose with unittest above and

bridge1_1    | [ WARN] [1573848592.055461422]: /use_sim_time set to true and no clock published.  Still waiting for valid time...

is this message OK? Is it expected?

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


but then it works locally (again I forgot about that extra 30s waiting in the original run:

solution1_1  |
solution1_1  | [ INFO] [1573848755.458464420]: Waiting for /clock, /subt/start,
and /subt/pose_from_artifact
solution1_1  | [ INFO] [1573848755.751396921, 5.708000000]: Using robot name[X2L]
solution1_1  |
solution1_1  | [ INFO] [1573848755.855716860, 5.708000000]: Sent start signal.
solution1_1  | [ INFO] [1573848756.116897510, 5.776000000]: received Clock 0
solution1_1  | [ INFO] [1573848756.305169740, 5.856000000]: received Odom 0
solution1_1  | [ INFO] [1573848756.316850391, 5.860000000]: received Imu 0
solution1_1  | [ INFO] [1573848756.333251669, 5.864000000]: received Image 0
solution1_1  | [ INFO] [1573848756.348732267, 5.872000000]: received Scan 0
solution1_1  | [ INFO] [1573848761.730888492, 7.856000000]: received Odom 100
solution1_1  | [ INFO] [1573848761.742405871, 7.860000000]: received Imu 100
solution1_1  | [ INFO] [1573848766.898960025, 9.768000000]: received Clock 1000
solution1_1  | [ INFO] [1573848767.147004835, 9.856000000]: received Odom 200
solution1_1  | [ INFO] [1573848767.153366674, 9.860000000]: received Imu 200
solution1_1  | [ INFO] [1573848769.453284897, 10.680000000]: received Image 100
solution1_1  | [ INFO] [1573848769.497763395, 10.696000000]: received Scan 100
solution1_1  | [ INFO] [1573848772.675202569, 11.856000000]: received Odom 300
solution1_1  | [ INFO] [1573848772.683042708, 11.860000000]: received Imu 300
solution1_1  | [ INFO] [1573848777.947787878, 13.768000000]: received Clock 2000

(locally)

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Alfredo Bencomo (bencomo) is it OK, that subt_solution_latest on https://hub.docker.com/r/osrf/subt-virtual-testbed/tags is 2 months old? Our image is based on this example solution, so I am just checking. Also can you please confirm that for you the unittest works locally fine?

p.s. I am going to comment out subscription of /clock topic and check it on AWS as unittest2

osrf-migration commented 5 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Martin,

I just tested the subt_solution_latest locally and still works fine with the latest Docker images for the Urban Circuit, as shown below.

osrf-migration commented 5 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Alfredo,

can you please test also the “unittest solution docker” mentioned above? Ideally locally and on AWS. I commented out subscription of /clock (https://github.com/robotika/osgar/commit/8e37ea0ea4dbd30488c5f192ae45de475b2a99c2) but on AWS is the result still the same - no stdout:

0.a527ae8c41762a101acbf2474382f82acd977df2  Node Startup
0.a527ae8c41762a101acbf2474382f82acd977df2 INFO /tf_world_static [/tmp/binarydeb/ros-melodic-tf2-ros-0.6.5/src/static_transform_broadcaster_program.cpp:142(main)] [topics: /rosout, /tf_static] Spinning until killed publishing world to urban_qual
0.a527ae8c41762a101acbf2474382f82acd977df2 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:167(Controller)] [topics: /rosout] Waiting for /clock, /subt/start, and /subt/pose_from_artifact 

and no simulation running:

was_started: 0
sim_time_duration_sec: 0
real_time_duration_sec: 0
model_count: 1 

The docker images are: 200670743174.dkr.ecr.us-east-1.amazonaws.com/subt/robotika:unittest1 and 200670743174.dkr.ecr.us-east-1.amazonaws.com/subt/robotika:unittest2

thanks

Martin

p.s. I tested it on Urban Qualification world with robot X2 config #4, 9067dcb9-ec62-42c5-b271-cb035c58d6b7 and 596715e8-e8bb-4881-8884-5dbe7c6c3294

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Martin,

We added a sidecar container to the cloudsim bridge computer on Nov 13th. This sidecar container served to capture log files in the event of an unexpected termination of Gazebo or a team solution. There is an unexplained problem with this sidecar that caused some, but not all, topics to fail. You encountered this problem with your unit test. We have since taken an alternate approach, which was deployed on November 19th. Can you please try your unit test again.

This was brief error, due to a new feature that was deployed after the Tunnel Circuit. Thanks for making your unit test. We plan to keep using it along with our other tests.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


OK, thanks for info. unittest1b started as cbb0594b-2a7d-4904-b945-0d467a6f4820 now …

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


p.s. much better :wink:

208.632000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:210(Update)] [topics: ] Sent start signal.
208.660000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:50(clockCallback)] [topics: ] received Clock 0 
208.880000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:58(imuCallback)] [topics: ] received Imu 0 
208.876000000 INFO /X0F120R/ros1_ign_bridge_twist [/tmp/ros1-ign-bridge-release/include/ros1_ign_bridge/factory.hpp:137(Factory<ROS1_T, IGN_T>::ros1_callback)] [topics: /rosout] Passing message from ROS1 geometry_msgs/Twist to Ignition ignition.msgs.Twist (showing msg only once per type
208.884000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:82(odomCallback)] [topics: ] received Odom 0
208.888000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:66(scanCallback)] [topics: ] received Scan 0
208.908000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:74(imageCallback)] [topics: ] received Image 0
...
3858.880000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:58(imuCallback)] [topics: ] received Imu 182500 
3858.884000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:82(odomCallback)] [topics: ] received Odom 182500
3858.892000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:66(scanCallback)] [topics: ] received Scan 73000
3858.904000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:74(imageCallback)] [topics: ] received Image 73000
3860.652000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:50(clockCallback)] [topics: ] received Clock 913000 
3860.880000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:58(imuCallback)] [topics: ] received Imu 182600 
3860.888000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:82(odomCallback)] [topics: ] received Odom 182600 

thanks m.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Nate Koenig (Nathan Koenig) - I also re-run ver26 (ver26urbanQe, aefa9509-28d4-4b15-bfb0-7f454daff6f3), but there is still the old problem Terminated Error: InitializationFailed

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


If I remember correctly, your version 25 worked. What had changed in your code between version 25 and version 26?

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Nate Koenig (Nathan Koenig) there was only a small change related to subscription of /clock … but I release newer one ver27 which worked fine … until I found that there was error during build and it was actually the same as ver26:

200670743174.dkr.ecr.us-east-1.amazonaws.com/subt/robotika   ver26                         6add9f8c520f        2 weeks ago         3.59GB
200670743174.dkr.ecr.us-east-1.amazonaws.com/subt/robotika   ver27                         6add9f8c520f        2 weeks ago         3.59GB

so my current guess is that it is related to wrong sleep in subt seed example:

https://github.com/robotika/osgar/pull/260#discussion_r349928330

or directly:

#!/osrf/subt_seed/src/default/docker/subt_solution/run_solution.bash

All new releases (now we are testing ver28) already use --wait instead of wrong sleep. We will see if it helps.

As a side note - the dockers are still downloading external models, so I see errors like:

0.a527ae8c41762a101acbf2474382f82acd977df2 ERROR /X0F300R/ros_ign_bridge_gas [/tmp/ros-ign-bridge-release/src/parameter_bridge.cpp:180(main)] [topics: /rosout] Failed to create a bridge for topic [/model/X0F300R/gas_detected] with ROS type [std_msgs/Bool] and Ignition Transport type [ignition.msgs.Boolean] 

and the orientation is not updated on AWS … but I expect that you plan a new release soon, maybe even today?

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Sorry for the delay. I've seen more of your submissions. Are you still experiencing these problems?

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Nate Koenig (Nathan Koenig) - it is fine now (with --wait) … but what I wanted to point out is that the example with sleep is wrong/misleading.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Thanks for the clarification.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Problem has been addressed.