Closed MoffKalast closed 8 months ago
Sorry for the late reply, I had other things going on. I tested this by running a full-stack project with lidar, camera feed, nav2, etc., connected from 2 Foxglove clients (both subscribed to the same set of topics), and everything seems to work fine. Can you please test with Foxglove? Maybe it's a roslibjs thing related to https://github.com/v-kiniv/rws/issues/6, although I'm not sure because you mention separate browser windows.
> Besides, the watchdog should take care of those instances to some extent
Yes, the watchdog uses an active ping-pong mechanism to drop clients that don't respond in time.
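For anyone following along, here is a rough sketch of what a ping-pong watchdog like that could look like. It is not taken from the RWS source: the `Watchdog` class, the `timeout_s` parameter and the explicit `now` timestamps are all hypothetical, chosen only to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class Watchdog:
    """Tracks the last pong time per client and reports which ones timed out."""

    timeout_s: float  # hypothetical parameter: longest allowed silence
    last_pong: dict = field(default_factory=dict)

    def register(self, client_id: int, now: float) -> None:
        # A freshly connected client counts as alive from the moment it joins.
        self.last_pong[client_id] = now

    def on_pong(self, client_id: int, now: float) -> None:
        # Only refresh clients that are still tracked.
        if client_id in self.last_pong:
            self.last_pong[client_id] = now

    def expire(self, now: float) -> list:
        """Drop and return every client silent for longer than timeout_s."""
        dead = [cid for cid, t in self.last_pong.items() if now - t > self.timeout_s]
        for cid in dead:
            del self.last_pong[cid]
        return dead


wd = Watchdog(timeout_s=5.0)
wd.register(49, now=0.0)
wd.register(50, now=0.0)
wd.on_pong(50, now=4.0)       # client 50 answers the ping, client 49 stays silent
dropped = wd.expire(now=6.0)  # 49 has been silent for 6 s, 50 for only 2 s
```

In a real server the timestamps would come from a monotonic clock, and `expire` would run on a timer after each round of pings.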
Hey, no worries.
Well I've tested a bit more and I gotta say I'm more confused than when I started. Sometimes I see it, other times I can even run two foxglove clients and one roslibjs client in parallel and it's all fine. But two roslibjs instances will always cause problems it seems.
Here's how it looks with two Foxglove clients glitching out, without any roslibjs clients connected:
Only one seems to get the /tf topic, the other just gets /tf_static.
I usually run without the watchdog (in theory so I don't get disconnected over bad wifi, and I usually only run one instance anyway), but when I enabled it for testing I did get some interesting logs. It seems to continuously disconnect and reconnect:
[rws_server-1] [INFO] [1709492493.800750880] [client_handler_49]: Constructing client 49(2672446178919644801)
[rws_server-1] [2024-03-03 14:01:33] [error] handle_read_frame error: websocketpp.transport:7 (End of File)
[rws_server-1] [INFO] [1709492493.801005361] [vizanti_rws_server]: Closing connection with client_id 49
[rws_server-1] [INFO] [1709492493.801118947] [client_handler_49]: Destroying client 49(2672446178919644801)
[rws_server-1] [2024-03-03 14:01:33] [disconnect] Disconnect close local:[1006,End of File] remote:[1006]
[rws_server-1] [2024-03-03 14:01:37] [connect] WebSocket Connection [::ffff:192.168.1.4]:52987 v13 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) FoxgloveStudio/1.84.0 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36" / 101
[rws_server-1] [INFO] [1709492497.092450425] [client_handler_50]: Constructing client 50(2672446178919644801)
[rws_server-1] [2024-03-03 14:01:37] [error] handle_read_frame error: websocketpp.transport:7 (End of File)
[rws_server-1] [INFO] [1709492497.093433828] [vizanti_rws_server]: Closing connection with client_id 50
[rws_server-1] [INFO] [1709492497.093503274] [client_handler_50]: Destroying client 50(2672446178919644801)
[rws_server-1] [2024-03-03 14:01:37] [disconnect] Disconnect close local:[1006,End of File] remote:[1006]
[rws_server-1] [2024-03-03 14:01:42] [connect] WebSocket Connection [::ffff:192.168.1.4]:52996 v13 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) FoxgloveStudio/1.84.0 Chrome/120.0.6099.56 Electron/28.0.0 Safari/537.36" / 101
I tested more, also with Vizanti instead of Foxglove, and I noticed that the robot model gets stuck (unlike with Rosbridge) after reconnecting from the second client, which is consistent with what you found regarding /tf and /tf_static. So I looked at the transform topics and I think I found the source of the issue: it's related to QoS and caching (again). RWS caches topic subscriptions, and this causes problems if the topic QoS is set to Transient Local, because new clients do not receive the persisted messages from such topics. I fixed it by checking the QoS before making the cache/do-not-cache decision. Please check https://github.com/v-kiniv/rws/pull/18.
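To illustrate the idea, here is a minimal sketch of the cache/do-not-cache decision. It is not the actual code from the PR: `SubscriptionPool`, `Durability` and the `created` counter are hypothetical names, and the ROS subscription is modelled as a plain object. It relies on the assumption that a newly created Transient Local subscription gets the persisted messages replayed, while joining an existing shared one does not.

```python
from dataclasses import dataclass, field
from enum import Enum


class Durability(Enum):
    VOLATILE = "volatile"
    TRANSIENT_LOCAL = "transient_local"


@dataclass
class Subscription:
    """Stand-in for a ROS-side subscription shared by one or more web clients."""

    topic: str
    clients: list = field(default_factory=list)


class SubscriptionPool:
    def __init__(self):
        self._shared = {}  # topic -> cached Subscription (volatile topics only)
        self.created = 0   # number of ROS-side subscriptions created so far

    def subscribe(self, topic, durability, client_id):
        if durability is Durability.TRANSIENT_LOCAL:
            # Do not cache: a fresh subscription lets the late-joining client
            # receive the persisted (latched) messages.
            sub = self._create(topic)
        else:
            # Cache: all clients of a volatile topic share one subscription.
            sub = self._shared.get(topic)
            if sub is None:
                sub = self._shared[topic] = self._create(topic)
        sub.clients.append(client_id)
        return sub

    def _create(self, topic):
        self.created += 1
        return Subscription(topic)


pool = SubscriptionPool()
tf_a = pool.subscribe("/tf", Durability.VOLATILE, client_id=1)
tf_b = pool.subscribe("/tf", Durability.VOLATILE, client_id=2)
static_a = pool.subscribe("/tf_static", Durability.TRANSIENT_LOCAL, client_id=1)
static_b = pool.subscribe("/tf_static", Durability.TRANSIENT_LOCAL, client_id=2)
```

Here both clients share a single `/tf` subscription, while each gets its own `/tf_static` subscription.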
Regarding the watchdog, let's discuss it in a separate issue if the problem persists.
As a side note, I noticed that the "Costmap Colouring" option for the Map widget in Vizanti does not work with Rosbridge but works fine with RWS. Did you notice something similar?
Top image RWS, bottom image Rosbridge:
I've tried out #18, it fixes it perfectly on my end. 👍 I've tested up to 8 concurrent Vizanti instances and a Foxglove client all at the same time and it seems to work very well.
> As a side note, I noticed that the "Costmap Colouring" option for the Map widget in Vizanti does not work with Rosbridge but works fine with RWS. Did you notice something similar?
Hmm, well I've fixed some issues with latched maps not loading at all recently, since Firefox has a bug with rendering offscreen canvases in workers, but that should be all fixed and on Chromium it was never a problem.
Does the regular colouring work on your end? There really shouldn't be any difference between the two, just that it takes an additional message to re-render the new colour scheme if it's swapped, which is sometimes an issue for latched maps. What I think happens is that when the web client unregisters and subscribes, rosbridge caches the topic, so it doesn't get and forward the latched message again, because it's already received it once on the ROS side. But I don't think I've seen it get into this state consistently.
In my current nav2 demo setup in a VM it seems to be working fine for both costmaps and regular maps, on rosbridge, rws, firefox and chromium and all combinations of those.
> Regarding the watchdog, let's discuss it in a separate issue if the problem persists.
Well I'm no longer seeing those repeated drops and reconnects now that all clients run fine, so I figure it was related. I'll have to do some field tests with it on and off to see what happens at extreme wifi range and whether there's any benefit to having it disabled, or the opposite. It might be beneficial to leave it on and reduce TCP congestion when there's extreme packet loss, I guess.
> I've tried out https://github.com/v-kiniv/rws/pull/18, it fixes it perfectly on my end. 👍 I've tested up to 8 concurrent Vizanti instances and a Foxglove client all at the same time and it seems to work very well.
Great, I'll merge the PR then.
> Does the regular colouring work on your end?
Here is the regular colouring. It looks like the same situation, but the other way around. Top image - RWS, bottom - Rosbridge:
And here's coloured again but in Firefox:
> Great, I'll merge the PR then.
Cool, we can close then.
> Here is the regular colouring. It looks like the same situation, but the other way around.
Ok that's genuinely weird. Are you using the same config and topics for the first two (global settings, export layout, then import)? The inflated one should be set to translucent + costmap rendering. But the lack of gray areas is puzzling regardless and I have seen it before a while back, live on a robot doing slam. At the time I figured it's something to do with the slam_toolbox config since I was seeing the same weird thing in rviz as well, but didn't check what happens over rosbridge at the time.
I've tested all four combinations + rviz with the two setups I have on hand and I don't see any difference with the same config, although rviz does render costmaps slightly differently.
Nav2 demo with turtlebot3 and premade map published by the map_server:
Slam toolbox on some bag data:
That last rws on firefox looks like there's some tf missing, but it's just the bag ending and it looked the same before that.
Any chance you could send over a short bag of that test setup you've got if it's consistently repeatable? It would be a great help in figuring this out. Maybe we can move the discussion over to https://github.com/MoffKalast/vizanti/issues since it's a bit off topic for this thread.
> Any chance you could send over a short bag of that test setup you've got if it's consistently repeatable?
Sure.
> Maybe we can move the discussion over to https://github.com/MoffKalast/vizanti/issues since it's a bit off topic for this thread.
Agree, I'll create an issue.
Thanks again for your help in fixing yet another bug!
So this is more of a suggestion than a bug report: I've noticed that there can seemingly only be one client connected to one topic at a time.
If I e.g. open two browser windows that both subscribe to the same long list of topics at roughly the same time, they'll seemingly split the list in two, one receiving some and the other what's left depending on which connection gets there first. I suspect some of the bugs I've seen in the past were made a bit more confusing due to this.
There is of course the issue that suspended mobile browser clients or background windows will pull resources when not actually being used, which serving only the latest one solves, but it also prevents multiple instances of the same app from running concurrently. Besides, the watchdog should take care of those instances to some extent, right?
The rosbridge default behaviour is to send all data to all clients, so it would be nice to have parity there.
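Just to spell out the behaviour I mean, here is a tiny sketch of that kind of fan-out. It is not how rosbridge or RWS is implemented: `Broadcaster` and its `outbox` are hypothetical names, and each client's websocket is modelled as a plain list.

```python
from collections import defaultdict


class Broadcaster:
    """Delivers every message on a topic to all clients subscribed to it."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # topic -> set of client ids
        self.outbox = defaultdict(list)       # client id -> delivered messages

    def subscribe(self, client_id, topic):
        # A second subscriber is added next to the first, it does not replace it.
        self._subscribers[topic].add(client_id)

    def publish(self, topic, message):
        # Look up without creating an entry for topics nobody subscribed to.
        for client_id in self._subscribers.get(topic, ()):
            self.outbox[client_id].append((topic, message))


hub = Broadcaster()
for client in ("window_a", "window_b"):
    hub.subscribe(client, "/tf")
    hub.subscribe(client, "/tf_static")

hub.publish("/tf", "transform 1")
hub.publish("/tf_static", "static transform")
```

With this, both windows end up with both the `/tf` and the `/tf_static` message instead of splitting the topic list between them.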