Closed MoffKalast closed 5 months ago
How often that happens? Can you try to reproduce it?
I noticed that Field 'sec' is not in json, default: (nil)
caused by 2D Nav Goal
tool in Vizanti, but can't reproduce crash with it.
what(): string data is not null-terminated
- can you point me what part of Vizanti is sending messages with string? Maybe it could help narrow down the culprit.
It doesn't happen super often and I haven't been able to reliably reproduce it yet, but it did happen again today:
[rws_server-1] [2024-03-09 05:33:22] [error] handle_read_frame error: websocketpp.transport:7 (End of File)
[rws_server-1] [2024-03-09 05:33:22] [info] asio async_write error: asio.system:32 (Broken pipe)
[rws_server-1] [2024-03-09 05:33:22] [fatal] handle_write_frame error: websocketpp.transport:2 (Underlying Transport Error)
[rws_server-1] [INFO] [1709980402.070906949] [vizanti_rws_server]: Closing connection with client_id 1
[rws_server-1] [2024-03-09 05:33:22] [disconnect] Disconnect close local:[1006,End of File] remote:[1001]
[rws_server-1] [INFO] [1709980402.070970871] [client_handler_1]: Destroying client 1(16726299949654779313)
[ERROR] [rws_server-1]: process has died [pid 8522, exit code -11, cmd '/home/osboxes/colcon_ws/install/rws/lib/rws/rws_server --ros-args -r __node:=vizanti_rws_server --params-file /tmp/launch_params_8c6jimp0 --params-file /tmp/launch_params_x59tkrq7 --params-file /tmp/launch_params_ww4v8gu1'].
I'm not sure if the watchdog printouts were related at all, but it did seem to start happening after I turned it on recently. Could just be a correlation but I'll disable it again and see if I still get any.
caused by 2D Nav Goal tool in Vizanti, but can't reproduce crash with it.
I think I've also seen that happen when the nav_msgs/Path is sent from the route planner, but that may be unrelated and I need to take some time to look into it properly if it's a genuine issue.
can you point me what part of Vizanti is sending messages with string? Maybe it could help narrow down the culprit.
Afaik there aren't any topic publishers that send strings, it might be one of the services. Practically all of them are some kind of string or string array. Could also be something completely internal.
This may also be specific to Humble, I tested mostly on Galactic. Need to test more on Humble. Can you share your setup, OS, arch, network, does it crash in VM only, real hardware only, both?
If you are comfortable with C++ tools, it would be great to attach GDB to RWS node and wait till it crash, stack trace would clear things up.
Could be Humble specific, I think all of these happened on x86_64 arch, Kubuntu 22.04 with Humble, one in VirtualBox, one native install. I've also been testing on aarch64 on a Pi 4, but I don't think I seen any there yet.
it would be great to attach GDB to RWS node and wait till it crash, stack trace would clear things up.
I'll see if I can set that up.
I created a js script to stress test the server, by creating a lot of unresponsive clients that don't always respond to ping and terminate without closing TCP connection gracefully. I was able to reproduce crash with watchdog enabled. While these may not be all of the cases you've encountered, this is definitely one of them. Basically the server was trying to close TCP connection and dispose client that didn't respond to ping in time, but for which TCP connection was closed just before 'pong timeout' event.
Please check https://github.com/v-kiniv/rws/pull/23
I haven't done much more testing with ROS 2 these recent weeks, but so far I haven't seen any more crashes. It might have genuinely been just the watchdog and fully fixed in #23. I'll reopen if I manage to reproduce it consistently again.
While I was testing the vizanti occupancy grid issues I had two odd crashes that might be a result of the recent patches. There's not much to go on, but anyway here are the logs in case it helps:
I'm not sure what exactly I was doing while it happened, but they were both on cyclonedds in case that's relevant.