v-kiniv / rws

WebSocket gateway for ROS2 topics and services
Apache License 2.0
8 stars 4 forks source link

Odd crashes #20

Closed MoffKalast closed 5 months ago

MoffKalast commented 6 months ago

While I was testing the vizanti occupancy grid issues I had two odd crashes that might be a result of the recent patches. There's not much to go on, but anyway here are the logs in case it helps:

[rws_server-1] [INFO] [1709824461.473167570] [rws::translate]: Field 'sec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824461.473193070] [rws::translate]: Field 'nanosec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824489.367046726] [rws::translate]: Field 'sec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824489.367082067] [rws::translate]: Field 'nanosec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824500.429303172] [rws::translate]: Field 'sec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824500.429338840] [rws::translate]: Field 'nanosec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824544.056496212] [rws::translate]: Field 'sec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824544.056526996] [rws::translate]: Field 'nanosec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824558.377610503] [rws::translate]: Field 'sec' is not in json, default: (nil)
[rws_server-1] [INFO] [1709824558.377641409] [rws::translate]: Field 'nanosec' is not in json, default: (nil)
[ERROR] [rws_server-1]: process has died [pid 135962, exit code -11, cmd '/home/vid/colcon_ws/install/rws/lib/rws/rws_server --ros-args -r __node:=vizanti_rws_server --params-file /tmp/launch_params_ng8p7y2d --params-file /tmp/launch_params_t1imf5c3 --params-file /tmp/launch_params_hlq1qkk5'].
[rws_server-1] terminate called after throwing an instance of 'std::runtime_error'
[rws_server-1]   what():  string data is not null-terminated
[ERROR] [rws_server-1]: process has died [pid 9874, exit code -6, cmd '/home/osboxes/colcon_ws/install/rws/lib/rws/rws_server --ros-args -r __node:=vizanti_rws_server --params-file /tmp/launch_params_l861ncw7 --params-file /tmp/launch_params_jdx9glro --params-file /tmp/launch_params_vthzcqqw'].

I'm not sure what exactly I was doing while it happened, but they were both on cyclonedds in case that's relevant.

v-kiniv commented 6 months ago

How often that happens? Can you try to reproduce it? I noticed that Field 'sec' is not in json, default: (nil) caused by 2D Nav Goal tool in Vizanti, but can't reproduce crash with it. what(): string data is not null-terminated - can you point me what part of Vizanti is sending messages with string? Maybe it could help narrow down the culprit.

MoffKalast commented 6 months ago

It doesn't happen super often and I haven't been able to reliably reproduce it yet, but it did happen again today:

[rws_server-1] [2024-03-09 05:33:22] [error] handle_read_frame error: websocketpp.transport:7 (End of File)
[rws_server-1] [2024-03-09 05:33:22] [info] asio async_write error: asio.system:32 (Broken pipe)
[rws_server-1] [2024-03-09 05:33:22] [fatal] handle_write_frame error: websocketpp.transport:2 (Underlying Transport Error)
[rws_server-1] [INFO] [1709980402.070906949] [vizanti_rws_server]: Closing connection with client_id 1
[rws_server-1] [2024-03-09 05:33:22] [disconnect] Disconnect close local:[1006,End of File] remote:[1001]
[rws_server-1] [INFO] [1709980402.070970871] [client_handler_1]: Destroying client 1(16726299949654779313)
[ERROR] [rws_server-1]: process has died [pid 8522, exit code -11, cmd '/home/osboxes/colcon_ws/install/rws/lib/rws/rws_server --ros-args -r __node:=vizanti_rws_server --params-file /tmp/launch_params_8c6jimp0 --params-file /tmp/launch_params_x59tkrq7 --params-file /tmp/launch_params_ww4v8gu1'].

I'm not sure if the watchdog printouts were related at all, but it did seem to start happening after I turned it on recently. Could just be a correlation but I'll disable it again and see if I still get any.

caused by 2D Nav Goal tool in Vizanti, but can't reproduce crash with it.

I think I've also seen that happen when the nav_msgs/Path is sent from the route planner, but that may be unrelated and I need to take some time to look into it properly if it's a genuine issue.

can you point me what part of Vizanti is sending messages with string? Maybe it could help narrow down the culprit.

Afaik there aren't any topic publishers that send strings, it might be one of the services. Practically all of them are some kind of string or string array. Could also be something completely internal.

v-kiniv commented 6 months ago

This may also be specific to Humble, I tested mostly on Galactic. Need to test more on Humble. Can you share your setup, OS, arch, network, does it crash in VM only, real hardware only, both?

If you are comfortable with C++ tools, it would be great to attach GDB to RWS node and wait till it crash, stack trace would clear things up.

MoffKalast commented 6 months ago

Could be Humble specific, I think all of these happened on x86_64 arch, Kubuntu 22.04 with Humble, one in VirtualBox, one native install. I've also been testing on aarch64 on a Pi 4, but I don't think I seen any there yet.

it would be great to attach GDB to RWS node and wait till it crash, stack trace would clear things up.

I'll see if I can set that up.

v-kiniv commented 6 months ago

I created a js script to stress test the server, by creating a lot of unresponsive clients that don't always respond to ping and terminate without closing TCP connection gracefully. I was able to reproduce crash with watchdog enabled. While these may not be all of the cases you've encountered, this is definitely one of them. Basically the server was trying to close TCP connection and dispose client that didn't respond to ping in time, but for which TCP connection was closed just before 'pong timeout' event.

Please check https://github.com/v-kiniv/rws/pull/23

MoffKalast commented 5 months ago

I haven't done much more testing with ROS 2 these recent weeks, but so far I haven't seen any more crashes. It might have genuinely been just the watchdog and fully fixed in #23. I'll reopen if I manage to reproduce it consistently again.