osrf / subt

This repostory contains software for the virtual track of the DARPA SubT Challenge. Within this repository you will find Gazebo simulation assets, ROS interfaces, support scripts and plugins, and documentation needed to compete in the SubT Virtual Challenge.
Other
305 stars 98 forks source link

CloudSim stops sending some topics #261

Closed osrf-migration closed 4 years ago

osrf-migration commented 4 years ago

Original report (archived issue) by Martin Dlouhy (Bitbucket: robotikacz).


The CloudSim or the bridge sometimes stop sending some topics. This issue was reported two months ago(?) in context of two running ROS nodes. Do you have some reference counters on bridge to check?

Here is the complete Python3 dump for robot A final run 10:

0:00:01.874250 (0.0 0.0 0.0) []
Using times [10, 2400]
Artifacts before start: []
0:01:00.113899 (15.2 -3.6 0.0) [('acc', 1031), ('origin', 178), ('pose2d', 1029), ('rot', 1031), ('scan', 206), ('sim_time_sec', 21)]
0:02:00.041882 (37.3 -3.7 0.0) [('acc', 1109), ('pose2d', 1109), ('rot', 1109), ('scan', 222), ('sim_time_sec', 22)]
0:03:00.025751 (59.8 -3.6 0.0) [('acc', 1123), ('pose2d', 1123), ('rot', 1123), ('scan', 224), ('sim_time_sec', 23)]
0:04:00.039811 (74.7 3.7 0.0) [('acc', 1152), ('pose2d', 1153), ('rot', 1152), ('scan', 231), ('sim_time_sec', 23)]
0:05:00.087357 (74.7 27.0 0.0) [('acc', 1165), ('pose2d', 1164), ('rot', 1165), ('scan', 233), ('sim_time_sec', 23)]
0:06:00.032591 (58.9 33.6 0.0) [('acc', 1160), ('pose2d', 1161), ('rot', 1160), ('scan', 232), ('sim_time_sec', 23)]
0:07:00.091972 (37.8 28.3 0.0) [('acc', 1178), ('pose2d', 1177), ('rot', 1178), ('scan', 235), ('sim_time_sec', 24)]
0:08:00.037849 (36.8 9.7 -0.0) [('acc', 1167), ('pose2d', 1168), ('rot', 1167), ('scan', 234), ('sim_time_sec', 23)]
0:09:00.124006 (34.8 26.2 -0.0) [('acc', 1163), ('pose2d', 1162), ('rot', 1163), ('scan', 232), ('sim_time_sec', 24)]
0:10:00.099786 (51.8 36.4 -0.0) [('acc', 1143), ('pose2d', 1143), ('rot', 1143), ('scan', 229), ('sim_time_sec', 22)]
0:11:00.064612 (74.2 36.4 -0.0) [('acc', 1120), ('pose2d', 1120), ('rot', 1120), ('scan', 224), ('sim_time_sec', 23)]
0:12:00.037038 (74.8 58.0 -0.0) [('acc', 1147), ('pose2d', 1147), ('rot', 1147), ('scan', 229), ('sim_time_sec', 23)]
0:13:00.078897 (74.8 81.3 -0.0) [('acc', 1165), ('pose2d', 1165), ('rot', 1165), ('scan', 233), ('sim_time_sec', 23)]
0:14:00.084369 (77.4 67.2 -0.0) [('acc', 1170), ('pose2d', 1170), ('rot', 1170), ('scan', 234), ('sim_time_sec', 23)]
0:15:00.036357 (77.4 43.9 -0.0) [('acc', 1162), ('pose2d', 1163), ('rot', 1162), ('scan', 233), ('sim_time_sec', 24)]
0:16:00.101717 (92.1 36.2 -0.0) [('acc', 1157), ('pose2d', 1157), ('rot', 1157), ('scan', 231), ('sim_time_sec', 23)]
0:17:00.174657 (115.5 36.3 -0.0) [('acc', 1175), ('pose2d', 1174), ('rot', 1175), ('scan', 235), ('sim_time_sec', 23)]
0:18:00.071696 (139.0 36.3 -0.0) [('acc', 1174), ('pose2d', 1173), ('rot', 1174), ('scan', 235), ('sim_time_sec', 24)]
0:19:00.167036 (154.7 47.1 -0.0) [('acc', 1166), ('pose2d', 1167), ('rot', 1166), ('scan', 233), ('sim_time_sec', 23)]
0:20:00.112116 (154.7 70.5 -0.0) [('acc', 1153), ('pose2d', 1171), ('rot', 1153), ('scan', 234), ('sim_time_sec', 24)]
0:21:00.103524 (155.1 92.8 -0.0) [('pose2d', 1164), ('scan', 233), ('sim_time_sec', 23)]
0:22:00.040778 (155.6 115.8 -0.0) [('pose2d', 1152), ('scan', 230), ('sim_time_sec', 23)]
0:23:00.225396 (156.0 139.0 -0.0) [('pose2d', 1159), ('scan', 232), ('sim_time_sec', 23)]
0:24:00.096273 (156.5 162.0 -0.0) [('pose2d', 1151), ('scan', 231), ('sim_time_sec', 23)]
0:25:00.087319 (156.9 185.3 -0.0) [('pose2d', 1165), ('scan', 232), ('sim_time_sec', 23)]
0:26:00.017386 (157.4 208.3 -0.0) [('pose2d', 1147), ('scan', 230), ('sim_time_sec', 23)]
0:27:00.039352 (157.8 231.5 -0.0) [('pose2d', 1163), ('scan', 233), ('sim_time_sec', 24)]
0:28:00.071255 (158.3 255.2 -0.0) [('pose2d', 1185), ('scan', 236), ('sim_time_sec', 23)]
0:29:00.015956 (158.7 278.4 -0.0) [('pose2d', 1157), ('scan', 232), ('sim_time_sec', 23)]
0:30:00.025776 (159.2 301.8 -0.0) [('pose2d', 1173), ('scan', 234), ('sim_time_sec', 24)]
0:31:00.053610 (159.6 324.9 -0.0) [('pose2d', 1152), ('scan', 231), ('sim_time_sec', 23)]
0:32:00.156292 (160.1 347.5 -0.0) [('pose2d', 1130), ('scan', 226), ('sim_time_sec', 22)]
0:33:00.095004 (160.5 369.7 -0.0) [('pose2d', 1110), ('scan', 222), ('sim_time_sec', 23)]
0:34:00.020327 (160.9 391.6 -0.0) [('pose2d', 1096), ('scan', 219), ('sim_time_sec', 22)]
0:35:00.034189 (161.3 413.5 -0.0) [('pose2d', 1097), ('scan', 220), ('sim_time_sec', 22)]
0:36:00.085905 (161.8 435.2 -0.0) [('pose2d', 1085), ('scan', 216), ('sim_time_sec', 21)]
0:37:00.052622 (162.2 457.0 -0.0) [('pose2d', 1090), ('scan', 218), ('sim_time_sec', 22)]
0:38:00.038468 (162.6 478.4 -0.0) [('pose2d', 1070), ('scan', 215), ('sim_time_sec', 21)]
0:39:00.048765 (163.0 499.6 -0.0) [('pose2d', 1061), ('scan', 212), ('sim_time_sec', 22)]
0:40:00.166637 (163.4 520.7 -0.0) [('pose2d', 1055), ('scan', 211), ('sim_time_sec', 21)]
0:41:00.116588 (163.8 541.6 -0.0) [('pose2d', 1043), ('scan', 208), ('sim_time_sec', 21)]
0:42:00.005091 (164.2 562.5 -0.0) [('pose2d', 1047), ('scan', 210), ('sim_time_sec', 21)]
0:43:00.057229 (164.6 582.8 -0.0) [('pose2d', 1013), ('scan', 202), ('sim_time_sec', 20)]
0:44:00.086354 (165.0 603.6 -0.0) [('pose2d', 1040), ('scan', 208), ('sim_time_sec', 21)]
0:45:00.079366 (165.4 624.1 -0.0) [('pose2d', 1023), ('scan', 205), ('sim_time_sec', 20)]
0:46:00.094935 (165.8 645.1 -0.0) [('pose2d', 1053), ('scan', 210), ('sim_time_sec', 21)]
0:47:00.031867 (166.2 666.1 -0.0) [('pose2d', 1049), ('scan', 210), ('sim_time_sec', 21)]
0:48:00.119254 (166.6 687.0 -0.0) [('pose2d', 1046), ('scan', 209), ('sim_time_sec', 21)]
0:49:00.007596 (167.0 707.8 -0.0) [('pose2d', 1037), ('scan', 208), ('sim_time_sec', 21)]
0:50:00.054670 (167.5 728.9 -0.0) [('pose2d', 1058), ('scan', 211), ('sim_time_sec', 21)]
0:51:00.166188 (167.9 749.9 -0.0) [('pose2d', 1047), ('scan', 210), ('sim_time_sec', 21)]
0:52:00.006919 (168.3 770.5 -0.0) [('pose2d', 1030), ('scan', 206), ('sim_time_sec', 20)]
0:53:00.057601 (168.7 791.2 -0.0) [('pose2d', 1038), ('scan', 207), ('sim_time_sec', 21)]
0:54:00.026260 (169.1 811.5 -0.0) [('pose2d', 1049), ('scan', 210), ('sim_time_sec', 21)]
0:55:00.071333 (169.5 832.3 -0.0) [('pose2d', 1040), ('scan', 208), ('sim_time_sec', 21)]
0:56:00.042111 (169.9 853.2 -0.0) [('pose2d', 1046), ('scan', 209), ('sim_time_sec', 21)]
0:57:00.013416 (170.3 874.3 -0.0) [('pose2d', 1054), ('scan', 211), ('sim_time_sec', 21)]
0:58:00.119442 (170.7 895.7 -0.0) [('pose2d', 1070), ('scan', 214), ('sim_time_sec', 21)]
0:59:00.086211 (170.8 900.1 -0.0) [('pose2d', 220), ('scan', 216), ('sim_time_sec', 5)]
1:00:00.056517 (170.8 900.1 -0.0) [('scan', 214)]
1:01:00.139475 (170.8 900.1 -0.0) [('scan', 214)]
1:02:00.042254 (170.8 900.1 -0.0) [('scan', 213)]
1:03:00.064216 (170.8 900.1 -0.0) [('scan', 218)]
1:04:00.009180 (170.8 900.1 -0.0) [('scan', 219)]
1:05:00.098884 (170.8 900.1 -0.0) [('scan', 220)]
1:06:00.197615 (170.8 900.1 -0.0) [('scan', 225)]
1:07:00.028404 (170.8 900.1 -0.0) [('scan', 222)]
1:08:00.147307 (170.8 900.1 -0.0) [('scan', 220)]
1:09:00.194171 (170.8 900.1 -0.0) [('scan', 220)]
1:10:00.149403 (170.8 900.1 -0.0) [('scan', 219)]
1:11:00.247164 (170.8 900.1 -0.0) [('scan', 224)]
1:12:00.074530 (170.8 900.1 -0.0) [('scan', 224)]
1:13:00.038614 (170.8 900.1 -0.0) [('scan', 227)]
1:14:00.014625 (170.8 900.1 -0.0) [('scan', 223)]
1:15:00.061669 (170.8 900.1 -0.0) [('scan', 222)]
1:16:00.107241 (170.8 900.1 -0.0) [('scan', 219)]
1:17:00.246864 (170.8 900.1 -0.0) [('scan', 217)]
1:18:00.093432 (170.8 900.1 -0.0) [('scan', 217)]
1:19:00.180346 (170.8 900.1 -0.0) [('scan', 214)]
1:20:00.196161 (170.8 900.1 -0.0) [('scan', 215)]
1:21:00.276935 (170.8 900.1 -0.0) [('scan', 214)]
1:22:00.145624 (170.8 900.1 -0.0) [('scan', 213)]
1:23:00.139866 (170.8 900.1 -0.0) [('scan', 216)]
1:24:00.086690 (170.8 900.1 -0.0) [('scan', 215)]
1:25:00.112151 (170.8 900.1 -0.0) [('scan', 216)]
1:26:00.242472 (170.8 900.1 -0.0) [('scan', 217)]
1:27:00.093402 (170.8 900.1 -0.0) [('scan', 216)]
1:28:00.044074 (170.8 900.1 -0.0) [('scan', 215)]
1:29:00.065824 (170.8 900.1 -0.0) [('scan', 218)]
1:30:00.198382 (170.8 900.1 -0.0) [('scan', 218)]
1:31:00.064302 (170.8 900.1 -0.0) [('scan', 215)]
1:32:00.156794 (170.8 900.1 -0.0) [('scan', 213)]
1:33:00.116688 (170.8 900.1 -0.0) [('scan', 218)]
1:34:00.245220 (170.8 900.1 -0.0) [('scan', 216)]
1:35:00.120979 (170.8 900.1 -0.0) [('scan', 216)]
1:36:00.112766 (170.8 900.1 -0.0) [('scan', 216)]
1:37:00.024872 (170.8 900.1 -0.0) [('scan', 214)]
1:38:00.079595 (170.8 900.1 -0.0) [('scan', 215)]
1:39:00.269596 (170.8 900.1 -0.0) [('scan', 216)]
1:40:00.148613 (170.8 900.1 -0.0) [('scan', 211)]
1:41:00.064843 (170.8 900.1 -0.0) [('scan', 210)]
1:42:00.268545 (170.8 900.1 -0.0) [('scan', 216)]
1:43:00.125946 (170.8 900.1 -0.0) [('scan', 215)]
1:44:00.016775 (170.8 900.1 -0.0) [('scan', 208)]
1:45:00.108588 (170.8 900.1 -0.0) [('scan', 213)]
1:46:00.123156 (170.8 900.1 -0.0) [('scan', 213)]
1:47:00.198297 (170.8 900.1 -0.0) [('scan', 206)]
1:48:00.057000 (170.8 900.1 -0.0) [('scan', 207)]
1:49:00.021796 (170.8 900.1 -0.0) [('scan', 211)]
1:50:00.067160 (170.8 900.1 -0.0) [('scan', 212)]
1:51:00.106854 (170.8 900.1 -0.0) [('scan', 208)]
1:52:00.312889 (170.8 900.1 -0.0) [('scan', 205)]
1:53:00.120459 (170.8 900.1 -0.0) [('scan', 208)]
1:54:00.220170 (170.8 900.1 -0.0) [('scan', 207)]
1:55:00.211964 (170.8 900.1 -0.0) [('scan', 212)]
1:56:00.013743 (170.8 900.1 -0.0) [('scan', 206)]
1:57:00.140487 (170.8 900.1 -0.0) [('scan', 197)]
1:58:00.078270 (170.8 900.1 -0.0) [('scan', 203)]
1:59:00.022021 (170.8 900.1 -0.0) [('scan', 209)]
2:00:00.118374 (170.8 900.1 -0.0) [('scan', 212)]
2:01:00.015736 (170.8 900.1 -0.0) [('scan', 212)]
2:02:00.103999 (170.8 900.1 -0.0) [('scan', 210)]
2:03:00.195317 (170.8 900.1 -0.0) [('scan', 214)]
2:04:00.069948 (170.8 900.1 -0.0) [('scan', 212)]
2:05:00.144081 (170.8 900.1 -0.0) [('scan', 213)]
2:06:00.255350 (170.8 900.1 -0.0) [('scan', 210)]
2:07:00.038883 (170.8 900.1 -0.0) [('scan', 211)]
2:08:00.244349 (170.8 900.1 -0.0) [('scan', 208)]
2:09:00.220553 (170.8 900.1 -0.0) [('scan', 208)]
2:10:00.012854 (170.8 900.1 -0.0) [('scan', 210)]
2:11:00.182522 (170.8 900.1 -0.0) [('scan', 205)]
2:12:00.317423 (170.8 900.1 -0.0) [('scan', 211)]
2:13:00.329937 (170.8 900.1 -0.0) [('scan', 204)]
2:14:00.237473 (170.8 900.1 -0.0) [('scan', 211)]
2:15:00.102629 (170.8 900.1 -0.0) [('scan', 208)]
2:16:00.311819 (170.8 900.1 -0.0) [('scan', 202)]
2:17:00.064137 (170.8 900.1 -0.0) [('scan', 210)]
2:18:00.297227 (170.8 900.1 -0.0) [('scan', 211)]
2:19:00.011858 (170.8 900.1 -0.0) [('scan', 205)]
2:20:00.041207 (170.8 900.1 -0.0) [('scan', 208)]
2:21:00.306937 (170.8 900.1 -0.0) [('scan', 209)]
2:22:00.014030 (170.8 900.1 -0.0) [('scan', 207)]
2:23:00.057761 (170.8 900.1 -0.0) [('scan', 211)]
2:24:00.078889 (170.8 900.1 -0.0) [('scan', 211)]
2:25:00.271348 (170.8 900.1 -0.0) [('scan', 210)]
2:26:00.023680 (170.8 900.1 -0.0) [('scan', 207)]
2:27:00.206565 (170.8 900.1 -0.0) [('scan', 209)]
2:28:00.196093 (170.8 900.1 -0.0) [('scan', 214)]
2:29:00.190260 (170.8 900.1 -0.0) [('scan', 212)]
2:30:00.132740 (170.8 900.1 -0.0) [('scan', 213)]
2:31:00.065569 (170.8 900.1 -0.0) [('scan', 214)]
2:32:00.212321 (170.8 900.1 -0.0) [('scan', 213)]
2:33:00.196690 (170.8 900.1 -0.0) [('scan', 215)]
2:34:00.186527 (170.8 900.1 -0.0) [('scan', 215)]
2:35:00.349303 (170.8 900.1 -0.0) [('scan', 215)]
2:36:00.063184 (170.8 900.1 -0.0) [('scan', 210)]
2:37:00.077993 (170.8 900.1 -0.0) [('scan', 210)]
2:38:00.253989 (170.8 900.1 -0.0) [('scan', 213)]
2:39:00.205328 (170.8 900.1 -0.0) [('scan', 206)]
2:40:00.139102 (170.8 900.1 -0.0) [('scan', 204)]
2:41:00.172922 (170.8 900.1 -0.0) [('scan', 209)]
2:42:00.261287 (170.8 900.1 -0.0) [('scan', 210)]
2:43:00.193495 (170.8 900.1 -0.0) [('scan', 212)]
2:44:00.104197 (170.8 900.1 -0.0) [('scan', 209)]
2:45:00.166760 (170.8 900.1 -0.0) [('scan', 213)]
2:46:00.088262 (170.8 900.1 -0.0) [('scan', 208)]
2:47:00.194601 (170.8 900.1 -0.0) [('scan', 209)]
2:48:00.158934 (170.8 900.1 -0.0) [('scan', 211)]
2:49:00.034882 (170.8 900.1 -0.0) [('scan', 208)]
2:50:00.161464 (170.8 900.1 -0.0) [('scan', 202)]
2:51:00.113854 (170.8 900.1 -0.0) [('scan', 211)]
2:52:00.143288 (170.8 900.1 -0.0) [('scan', 211)]
2:53:00.111603 (170.8 900.1 -0.0) [('scan', 201)]
2:54:00.109043 (170.8 900.1 -0.0) [('scan', 208)]

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


p.s. note that there should be also image which was never received

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


p.s.2 if Python3 would be suspicious then here are also some outputs form C++ code talking to the bridge:

...
87.228000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:313(Update)] [topics: ] MD robot pose -0.962074 0.291239 dist=1.01041
87.828000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:321(Update)] [topics: ] Arrived at entrance!
89.128000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:76(imuCallback)] [topics: ] received Imu 500 
89.128000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 200
89.176000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 500
91.128000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:76(imuCallback)] [topics: ] received Imu 600 
91.176000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 600
93.128000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:76(imuCallback)] [topics: ] received Imu 700 
93.176000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 700
94.128000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 300
95.128000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:76(imuCallback)] [topics: ] received Imu 800 
95.176000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 800
97.148000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:76(imuCallback)] [topics: ] received Imu 900 
97.148000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 900
99.148000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:76(imuCallback)] [topics: ] received Imu 1000 
99.148000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 400
99.148000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 1000
99.800000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:403(getSpeedCmd)] [topics: ] Python3: stdout 0:01:00.113899 (15.2 -3.6 0.0) [('acc', 1031), ('origin', 178), ('pose2d', 1029), ('rot'
...
539.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:403(getSpeedCmd)] [topics: ] Python3: stdout 0:20:00.112116 (154.7 70.5 -0.0) [('acc', 1153), ('pose2d', 1171), ('rot', 1153), ('scan', 234), ('sim_time_sec', 24)]
539.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 9200
539.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23000
541.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23100
543.188000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23200
544.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 9300
545.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23300
547.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23400
549.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 9400
549.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23500
551.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23600
553.144000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23700
554.140000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 9500
555.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23800
557.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 23900
559.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 9600
559.136000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 24000
561.184000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 24100
562.436000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:403(getSpeedCmd)] [topics: ] Python3: stdout 0:21:00.103524 (155.1 92.8 -0.0) [('pose2d', 1164), ('scan', 233), ('sim_time_sec', 23)] 
...
1366.052000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:403(getSpeedCmd)] [topics: ] Python3: stdout 0:58:00.119442 (170.7 895.7 -0.0) [('pose2d', 1070), ('scan', 214), ('sim_time_sec', 21)]
1367.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 64400
1369.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 25800
1369.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:103(odomCallback)] [topics: ] received Odom 64500
1374.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 25900
1379.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 26000
1384.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 26100
1387.652000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:403(getSpeedCmd)] [topics: ] Python3: stdout 0:59:00.086211 (170.8 900.1 -0.0) [('pose2d', 220), ('scan', 216), ('sim_time_sec', 5)]
1389.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 26200
1394.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 26300
1399.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 26400
1404.152000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:85(scanCallback)] [topics: ] received Scan 26500
1409.052000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:403(getSpeedCmd)] [topics: ] Python3: stdout 1:00:00.056517 (170.8 900.1 -0.0) [('scan', 214)] 
...

osrf-migration commented 4 years ago

Original comment by Alfredo Bencomo (Bitbucket: bencomo).


Nate Koenig (Nathan Koenig) can you please triage this?

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Martin,

Can you please provide a complete reproducible example of the problem you are experiencing with step-by-step instructions that demonstrates in the behavior you are seeing?

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Hello Nate,

I am afraid that it is not easy to reproduce. The first time I observed something similar was #202 when I run two nodes (original subt_seed code + wait_for_sensors.py). Here it was only one node and it failed after longer simulation. We could try to reproduce another situation which looked to be repeatable (12/12) for World B/Tree of Life, when it is available. Can you add it to existing worlds on AWS?

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


We are investigating a problem in the logs from the tunnel circuit. We are but sure whether this fits tot he Problem reported here, but I seems that we did not get any feedback from a node for a long time. Would such an example help?

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Yes, and example would help, thanks.

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


In Tunnel Circuit run 04: In the log rosout.log of robot X2n3 there are no messages between 677.440 and 1314.008. At minimum messages from one topic could not be received between 642 and 1437. We can see that because we do not get new position messages (“aktuelle Position”) but we get other messages from the same node into the log for the time from 642 to 644 and from 1314 to 1437. From 1437 on everything seems to have worked again. So the nodes did not crash.

So it seems that there was not only a logging issue but also a problem with topics or that nodes did pause.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


In the logs for X2n3 I see the following:

677.412000000 INFO /X2n3_robot_controller [/home/developer/subt_solution/src/subt_seed/src/robot_Controller.cpp:6312(main)] [topics: /rosout, /X2n3/cmd_vel, /X2n3/wand_visualization_marker] Ende SpinOnce
1314.008000000 INFO /X2n3_robot_controller [/home/developer/subt_solution/src/subt_seed/src/robot_Controller.cpp:6314(main)] [topics: /rosout, /X2n3/cmd_vel, /X2n3/wand_visualization_marker] Ende sleep
1314.008000000 WARN /X2n3/robot_localization [/tmp/binarydeb/ros-melodic-robot-localization-2.6.5/src/ros_filter.cpp:1819(RosFilter<T>::periodicUpdate)] [topics: /rosout, /diagnostics, /tf, /X2n3/odometry/filtered] Failed to meet update rate! Took 636.57600000000002183

At 677.412 there is a "Ende SpinOnce". This is followed by an "Ende sleep" at 1314.008. Without access to your code, I would assume that you have a 637 second sleep somewhere in your code.

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


I don‘t think so. The code between these two ROS_INFO commands is only loop_rate.sleep();

And loop_eate is only initialized once: loop_rate(10);

Another Point is, that not only this node was sleeping. There where at minimum 3 nodes frozen until they woke up later.

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


P.S.: one message at was repeated at 677.44 for 1475 times. And another messages was later at 1620.331 repeated for about 20000 times!

Another idea: Could it be possible that the clock node did not work correctly?

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Let me know if you come up with a example of messages not transmitting. So far I have been unable to reproduce this problem.

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


I don’t think that it is easily reproduceable. In Tunnel Circuit I found it in one of the 15 runs with the same configuration and the same programs.

Could you see anything unnormal in the logs of the ignition server? Would it make sense to add additional logging to the server to have more infos when we have another case of this kind?

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


There is nothing abnormal in Ignition, and nothing in the bridge. The state logs and server logs all look fine. We have been running our own solutions on Cloudsim without any problems as well.

At this point, I really need a docker image that can reliably reproduce the situation you are describing.

osrf-migration commented 4 years ago

Original comment by Reid Sawtell (Bitbucket: rsawtell).


What Sophisticated Engineering is describing sounds similar to what I encountered shortly before the competition in our catkin workspace (issue #221 (#221)). The clock would stop posting so all topics would delay for an indeterminable time until the clock started again, at which point they would resume but with a massive jump in time that usually mucks everything up. I noticed the problem more frequently in smaller testing tunnels (possibly related to real time factor? the problem started after the cap at 1 was implemented), but it did happen in the larger tunnels as well.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Nate Koenig (Nathan Koenig) - can you please direct us, where is the testing solution docker and how we could extend current test suite? I.e. also how to report success and failure. Thanks

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


What I'm looking for is a docker image and source code that reproduces the problem described in this issue.

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


Martin Dlouhy (robotikacz) can you explain what these fields are in this message?

0:01:00.113899 (15.2 -3.6 0.0) [('acc', 1031), ('origin', 178), ('pose2d', 1029), ('rot', 1031), ('scan', 206), ('sim_time_sec', 21)]

for example:

0:01:00.113899 is wall time

('sim_time_sec', 21) is the simulation time seconds and ???

('scan', 206) is the scan topic and sequence number???

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Arthur Schang (Arthur Schang) these are our internal reports for every minute of real time, so there where 21 simulated seconds and we received 206 messages with lidar scans.

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


So in the case of this message:

1:02:00.042254 (170.8 900.1 -0.0) [('scan', 213)]

You received ~200 scan messages and nothing else for that minute of wall time?

What is ('sim_time_sec', 20) indicate? Is that 20 messages on the /clock topic or something else?

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


These data were from Tunnel Circuit and at that time I used nav_msgs/Odometry timestamp, see

https://github.com/robotika/osgar/blob/SUBT_VIRTUAL_VER25/osgar/drivers/rosmsg.py#L311

And yes, all we received was only lidar scan. What we should receive is IMU (acc, rot), odometry (pose2d), and camera data (image).

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


The logs you’ve provided are in an unsupported form. What would be required is a rosbag log tracking the dropout observed through remapping of topics. Investigative attempts on CloudSim have provided no concrete evidence indicating a timing issue whereas there is plenty of anecdotal evidence of timing issues. Until we are able to reproduce these issues consistently from a standardized logging procedure, it will be incredibly difficult to solve them.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


:slight_smile: sigh, yes - that was the tunnel circuit and there I was happy that from 5*3*4 robot runs I extracted few osgar logs, mostly just text as it is above. I also wrote proposal how to re-check it via Tree of Life world, where the solution did not work at all (https://osrf-migration.github.io/subt-gh-pages/#!/osrf/subt/issues/261/cloudsim-stops-sending-some-topics (#261)#comment-54727694)

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


Arthur Schang (Arthur Schang) my colleague is testing Urban Qualification with with remote control (from console, ROS keyboard control), visualizing lidar scan and images via RViz. It is sometimes working and sometimes not and in RViz you can see that data are sometimes not coming … maybe it is the same issue?

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


p.s. he is using teleop_twist_keyboard and it typically fails before you reach the rails

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


I do not have any issues when testing locally.

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


I have been trying to create a repeatable test case, but have not succeeded yet. The results of the five runs so far:

8fb4e416-b125-4628-bca2-07d6bbc8e9f1 - failed to run - admin review

c02f8fa8-6c59-46bc-8087-98865670f608 - all data stopped after a minute - not exactly the same, but similar

78f46732-d285-4235-be35-4ffb49834fc9 - no issues seen

9410d118-8241-4ff1-9e3c-70630b0c6888 - no issues seen

eac91df0-fbbc-4b55-a775-e76e963a133e - no issues seen

The approach was to modify subt_seed to log the number of messages received each second and to add some processing so that it used more of the CPU.

osrf-migration commented 4 years ago

Original comment by Sarah Kitchen (Bitbucket: snkitche).


GoRobotGo (GoRobotGo) , we have added something similar to our solution with 5 second wall clock intervals and counting /clock messages. Of the 6 runs we submitted the last two days (3 in Urban Simple 1 and 3 in Urban Practice 1), I am seeing evidence of issues in several runs - I have not quantified the severity or effect of the issue yet, just comparing # messages per 5 second interval vs. RTF to check for consistency.

I am preparing information to send to the subt-help email. I have already sent the image URL I am using.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


I believe the underlying issue can be traced to TCP buffer overruns. Basically, a TCP buffer can overflow if messages are sent too fast and/or not read fast enough on the other side. We have reduced the likelihood of this in https://bitbucket.org/ignitionrobotics/ign-transport/pull-requests/416/subt-flood-testing/diff. This fix has been deployed to cloudsim.

However, it's also important to not block transport callbacks in your solution. Doing so will cause you to miss messages. Take a look at this test code for some guidance.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


This problem is back again (robotika, Ver45p1, 3765be69-bb4b-46f7-b63f-715b861b0a26). I checked the camera of robot B, which can see start of robot A and stops receiving data shortly after A crash (yes, this robotika release was not great), and then B no longer has camera data (note, that B did not move yet).

0:01:00.103055 (0.0 0.0 0.0) [('acc', 672), ('artf', 3), ('image', 268), ('origin', 269), ('pose2d', 667), ('rot', 672), ('scan', 93), ('sim_time_sec', 14)]
0:02:00.148753 (0.0 0.0 0.0) [('acc', 683), ('image', 273), ('origin', 273), ('pose2d', 682), ('rot', 683), ('sim_time_sec', 14)]
0:03:00.148457 (0.0 0.0 0.0) [('acc', 667), ('image', 267), ('origin', 267), ('pose2d', 668), ('rot', 667), ('sim_time_sec', 13)]
0:04:00.037466 (0.0 0.0 0.0) [('acc', 661), ('image', 264), ('origin', 265), ('pose2d', 661), ('rot', 661), ('sim_time_sec', 13)]
0:05:00.223220 (0.0 0.0 0.0) [('acc', 667), ('image', 108), ('origin', 266), ('pose2d', 667), ('rot', 667), ('sim_time_sec', 14)]
0:06:00.098936 (0.0 0.0 0.0) [('acc', 669), ('origin', 266), ('pose2d', 669), ('rot', 669), ('sim_time_sec', 13)]
0:07:00.142890 (0.0 0.0 0.0) [('acc', 673), ('origin', 268), ('pose2d', 673), ('rot', 673), ('sim_time_sec', 14)]
0:08:00.103073 (0.0 0.0 0.0) [('acc', 671), ('origin', 267), ('pose2d', 482), ('rot', 671), ('sim_time_sec', 13)]
0:09:00.211639 (0.0 0.0 0.0) [('acc', 682), ('origin', 273), ('rot', 682), ('sim_time_sec', 14)]
0:10:00.137003 (0.0 0.0 0.0) [('acc', 681), ('origin', 271), ('rot', 681), ('sim_time_sec', 13)]

Note, that for example robot E runs fine (all 5 robots have the same docker image).

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


the last run seems to be partially related to our bug (scan data) - they are counted on input (C++ code), but not at the very end (Python3, output of my previous post). But the issue is still valid for “Image” (last report):

157.076000000 INFO /subt_solution [/home/developer/subt_solution/src/subt_seed/src/subt_seed_node.cc:116(imageCallback)] [topics: ] received Image 2100 

and the reason could be that it was not read & processed in time.

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


I would just update note, which I also sent to subt-help@, that this problem is back again … or it is observable under heavy load? Some runs are totally unusable while in others the robots score 5 points (ver51p1 vs. ver51p1b - the same robotika docker image). I am just curious if other teams observe similar issue? Is it possible to check on CloudSim side (or better the bridge side), that the sensor data are no longer re-transmitted? Also it would be nice to know what the control program could do in such a fatal case? thanks m.

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


I can confirm that this bug is back in force. All four runs that I attempted yesterday were taken out by this bug or a variant.

2c05f538-2b55-4f91-b4a2-717537db0898 - image data stopped after about 80 seconds

41a9398c-2816-450c-9a15-a1dc95a9b505 - IMU data stopped after 500 seconds

a68ab8f4-cc4f-459b-9290-f0bcc6f2ccdf - no data appears to have arrived

28be546f-d924-4e56-957d-c17f76e3df8b - no image data

osrf-migration commented 4 years ago

Original comment by Martin Dlouhy (Bitbucket: robotikacz).


OK, Kevin thanks for info - now I feel slightly better :wink:

osrf-migration commented 4 years ago

Original comment by Reid Sawtell (Bitbucket: rsawtell).


We had this problem yesterday as well. We have a node who’s only job is to monitor the clock topic and compare it to system time and there were long gaps of no clock messages received at all on several of our runs. For example, in one run we stopped getting clock messages at ROS time 2556.108, they did not resume until ROS time 3829.104.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


GoRobotGo: When you say "image data stopped" or "imu data stopped", do you mean that at a specific time those topics never sent another message while other topics continued sending data?

osrf-migration commented 4 years ago

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


That is what is happening to us - some topics keep being received while others stop at certain point and maybe reappear later. You can check in our rosout log. We print info line for each 100 messages received for a topic. See for example http://osgar.robotika.cz/subt/virtual/ver53p1/B-rosout.log:

210.332000000 INFO [subt_seed_node.cc:125(odomCallback)] [] received Odom 600
211.388000000 INFO [subt_seed_node.cc:107(scanCallback)] [] received Scan 200
220.864000000 INFO [subt_seed_node.cc:125(odomCallback)] [] received Odom 700
220.872000000 INFO [subt_seed_node.cc:125(odomCallback)] [] received Odom 800
220.872000000 INFO [subt_seed_node.cc:125(odomCallback)] [] received Odom 900
220.872000000 INFO [subt_seed_node.cc:107(scanCallback)] [] received Scan 300
220.876000000 INFO [subt_seed_node.cc:125(odomCallback)] [] received Odom 1000
221.072000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 400 
221.072000000 INFO [subt_seed_node.cc:116(imageCallback)] [] received Image 0
222.628000000 INFO [subt_seed_node.cc:134(gasCallback)] [] received Gas 200
224.232000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 500 
225.068000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 600 
225.176000000 INFO [subt_seed_node.cc:89(clockCallback)] [] received Clock 3000 
227.084000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 700 
229.100000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 800 
229.852000000 INFO [subt_seed_node.cc:89(clockCallback)] [] received Clock 4000 
231.200000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 900 
233.084000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 1000 
234.960000000 INFO [subt_seed_node.cc:89(clockCallback)] [] received Clock 5000 
235.112000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 1100 
237.112000000 INFO [subt_seed_node.cc:98(imuCallback)] [] received Imu 1200 

As you can see from this excerpt, the topics come and go at random - note for example how many other messages we have received before Clock appeared again, Odom and Scan disappeared etc.

osrf-migration commented 4 years ago

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


The number at the end of the lines is the total number of messages received from a given topic.

osrf-migration commented 4 years ago

Original comment by GoRobotGo (Bitbucket: GoRobotGo).


For 2c05f538-2b55-4f91-b4a2-717537db0898 - yes image data stopped, but IMU data continued

For 41a9398c-2816-450c-9a15-a1dc95a9b505 - maybe - I looked at this more closely image data dropped out early once, then again at 419 and then imu at 504. After IMU dropped it was hard to tell.

For the other 2 my logs didn’t have enough data to tell.

osrf-migration commented 4 years ago

Original comment by Sophisticated Engineering (Bitbucket: sopheng).


We also had 3 of 4 runs witch might have the same problem. Only one showed results as expected. But we did not have time to check in more detail.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


In case it helps support the hypothesis that this occurs under heavy load, I had 6 simulations last night all auto-restart at the same time, and I believe all 6 hit this issue. An earlier simulation, started on its own, appeared to be fine. I haven’t investigated very much yet though to figure out exactly what was going on, but whatever it was caused move_base to fail. I will update if I find out anything else that’s useful.

osrf-migration commented 4 years ago

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


I am trying to confirm the hypothesis that the problem happens under heavy load. TL DR: the load must be really really heavy but I was able to reproduce it locally.

Our solution contains 2 processes - first is a ros cpp node that subscribes to all topics we need and forwards all received data to the second process which is python implementation of our controller (which is completely ROS independent). On my notebook the ros node takes around 40% CPU and the python process takes around 300% CPU (I have total of 8 logical cores at 3.8Gz).

In order to simulate the problem I used taskset to move both processes (with all its threads) to a single CPU core. The python part contains some adaptation to load so the result was 40% for ros proxy and 60% for python controller. However, the controller was driving the robot around just fine.

Then I reniced the ros proxy to 19 and only then the problems appeared. In order to better see what is going on, it is beneficial to enable ROS DEBUG logging (see https://github.com/robotika/osgar/pull/303/commits). Then I get logs of messages being dropped due to queues being full. The fastest topics go down first, like ie. clock, then imu etc.

So my hypothesis is that when cloudsim runs many simulations in parallel, the controllers get significantly less CPU then before when almost nothing has been running. It would be nice to check if this is indeed what is happening.

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Issue #262 was marked as a duplicate of this issue.

osrf-migration commented 4 years ago

Original comment by Arthur Schang (Bitbucket: Arthur Schang).


I was unable to reproduce message dropout locally when using a docker-compose solution with 10 X1_SENSOR_CONFIG_4 vehicles, 10 bridge containers, and 10 simplistic solution containers. The simulation runs at tortoise speed and messages get published simultaneously (but in the right order) without any dropout. The sporadic dumping of messages under a heavy load is certainly something I'll continue to look at.

osrf-migration commented 4 years ago

Original comment by Steven Gray (Bitbucket: stgray).


Could #338 be contributing to this issue? Seeing ~2500 msgs per second per vehicle sent over the bridge, when more than half of these continuously relayed transforms are static.

osrf-migration commented 4 years ago

Original comment by Malcolm Stagg (Bitbucket: malcolmst7).


Won’t help for a local repro, but I just saw some good info here BTW on AWS PPS limits: https://www.bluematador.com/blog/ec2-packets-per-second-guaranteed-throughput-vs-best-effort

TL;DR it sounds like the limits are very high (~1M PPS) for a lot of instance types, but there is a lot of variation and very little is guaranteed. Apparently there is a “guaranteed” (though not really) limit as well as a higher best-effort limit. Seems possible that could be a factor if AWS is throttling the limits more when they are closer to capacity.

osrf-migration commented 4 years ago

Original comment by Zbyněk Winkler (Bitbucket: Zbyněk Winkler (robotika)).


Latest experience suggest that to get closer to reproducing the problem, the simulation needs to run as fast as possible and the bridge and solution containers need to be stressed by the lack of CPU power (using some taskset and renice magic).

Another hypothesis we are currently exploring is working with disk within the hotpath. Friends more experienced with working in the cloud suggest that the disk performance of the cloud machines is really nondeterministic (=depending on load from other customers) and to get any kind of reliable performance all disk access needs to be async.

But frankly, ROS being “helpful” by dropping messages under load and not telling anyone unless you turn on the DEBUG (!!!) messages was a real surprise to me (I guess it is obvious that was not a pleasant surprise). I would expect that to fall under WARNING at least (if not ERROR).

osrf-migration commented 4 years ago

Original comment by Nate Koenig (Bitbucket: Nathan Koenig).


Cloudsim has been updated to include three new logging capabilities:

  1. Network traffic. The simulation container and bridge containers log network traffic using bwm-ng to a network_traffic.csv file. The CSV file containers headers that explains the data in each column. For reference, the EC2 machines used by cloudsim have a maximum bandwidth of 10 Gbps.

  2. ROS topic debug. The ROS console logs will now contain information about full queues. Using the following command to check for the occurrence of full queues: grep -ri "discarding oldest message" *

  3. Sensor data sequence numbers. Robot logs will contain a number of log files that look like bridge_logger--<topic_name>.log. These logs are white space delimited. The first column contains the sensor topic sequence number, second column is simulation time, and third is the real-time as measured by the Bridge. gnuplot can be used to plot the data. The following is an example.

plot 'bridge_logger--X2N1-front_scan.log' using 2:xtic(1) with lines title 'Sim Time', 'bridge_logger--X2N1-front_scan.log' using 3:xtic(1) with line title 'Real Time'

Example output: example_plot.png

The bridge_loggger files will also contain the text "Error: Missed message(s)" if messages were dropped or arrived out of order.

nkoenig commented 4 years ago

Re-opening because it was automatically closed in #402.

nkoenig commented 4 years ago

The release on May 15, which included #402 potentially fixes this issue.