Closed: liborw closed this issue 11 years ago
'looked at the packets using Wireshark': did you use packet-simplemessage?
Could you attach / upload a Wireshark capture somewhere?
Yes, I have. The file can be downloaded here, but it will expire in one day.
I have an update on this: adding the MSG_WAITALL flag to the recv() call in simple_message/src/socket/tcp_socket.cpp:84 solved the problem. Is there a reason not to have this flag there?
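For reference, MSG_WAITALL tells recv() to block until the full requested length is available, instead of returning whatever partial data has already arrived. A minimal Python sketch of that behaviour over a loopback TCP pair (illustrative only; the actual fix is in the C++ recv() call in tcp_socket.cpp):

```python
import socket
import threading
import time

def recv_waitall_demo(total=64):
    # Loopback TCP connection standing in for the robot link.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    cli = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()

    def sender():
        # Deliver the message in two fragments, like a split TCP frame.
        cli.sendall(b"x" * 10)
        time.sleep(0.05)
        cli.sendall(b"x" * (total - 10))

    t = threading.Thread(target=sender)
    t.start()
    # MSG_WAITALL: block until all `total` bytes have arrived,
    # rather than returning the first 10-byte fragment.
    data = conn.recv(total, socket.MSG_WAITALL)
    t.join()
    for s in (cli, conn, srv):
        s.close()
    return data
```

Without the flag, the first recv() would likely return only the 10-byte fragment, which matches the short ByteArray reported here.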
Ah, this might then be related to swri-ros-pkg/issues/23.
Yes, the first few simple_message packets arrive over 2 TCP frames (packet 786 in your capture). It seems the deserialisation step is confused by this.
I think this assessment is probably correct.
While Libor's fix probably solves this issue in the short term, I think we should hold off on pushing this change to the public repository. This change causes the socket code to wait forever until the requested bytes are read; in the event of datastream errors, this might lead to long waits with no errors. A better solution would be to properly handle both split and merged simple_message packets in a mechanism that also provides timeout-capable waits, so that we could wait for the expected # of bytes, possibly across multiple packets, but return after a specified timeout period if the expected # of bytes is not received. This is indeed the issue logged in bug #23.
Please close this bug if your issue is resolved.
One idea might be to utilise the following pattern:
my_buffer += recv(..)
if len(my_buffer) >= nr_of_bytes_needed:
    handle(my_buffer[:nr_of_bytes_needed])
    my_buffer = my_buffer[nr_of_bytes_needed:]
Fairly standard. If there aren't enough bytes, wait for them to come in the next recv(..)
(but with a configurable timeout).
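The buffering pattern described above can be sketched as a runnable Python function, assuming fixed-length messages (the names feed, my_buffer, and nr_of_bytes_needed are just illustrative):

```python
def feed(my_buffer, chunk, nr_of_bytes_needed):
    """Accumulate received bytes; return (complete_messages, leftover)."""
    my_buffer += chunk
    messages = []
    # A single chunk may contain several messages (merged TCP frames)
    # or only part of one (split frames) -- handle both cases.
    while len(my_buffer) >= nr_of_bytes_needed:
        messages.append(my_buffer[:nr_of_bytes_needed])
        my_buffer = my_buffer[nr_of_bytes_needed:]
    return messages, my_buffer

# Split delivery: an 8-byte message arrives as a 3-byte and a 5-byte chunk.
buf = b""
msgs, buf = feed(buf, b"abc", 8)     # no complete message yet
msgs2, buf = feed(buf, b"defgh", 8)  # one complete message
```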
That's mostly what I was thinking. Here are two possible variations: 1) only read up to the number of expected bytes (not the full receive buffer). This eliminates the need to carry leftover data in my_buffer, but potentially means slightly more reads of the socket buffers (not very expensive).
2) use the SO_RCVTIMEO option on the socket, to automatically enforce the desired recv timeout. This may not be well-supported cross-platform (e.g. DX100 C++ server code), but could be #ifdef-controlled to the Linux side only. This would be a relatively painless fix, I think.
We should probably log this as an enhancement, or continue discussion on the original bug, since I don't know that the current discussion is directly relevant to Libor's error messages.
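A sketch of option 2, in Python for brevity (the C++ version would pass a struct timeval to setsockopt() the same way). The "ll" struct layout assumes a 64-bit Linux timeval, and the socketpair merely stands in for the robot connection:

```python
import socket
import struct
import time

def recv_with_rcvtimeo(timeout_s=0.2):
    a, b = socket.socketpair()  # stand-in for the robot connection
    # SO_RCVTIMEO takes a struct timeval: (seconds, microseconds).
    # "ll" == two native longs, matching timeval on 64-bit Linux.
    tv = struct.pack("ll", 0, int(timeout_s * 1e6))
    b.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, tv)
    start = time.monotonic()
    try:
        b.recv(4096)   # nothing is ever sent...
        timed_out = False
    except OSError:    # ...so the kernel timeout fires (EAGAIN on Linux)
        timed_out = True
    elapsed = time.monotonic() - start
    a.close()
    b.close()
    return timed_out, elapsed
```

The appeal is exactly what's described above: the existing blocking recv() call is unchanged, and the kernel enforces the timeout.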
This is probably a good solution, but how would we test it? Has anybody seen this error before? Is there anything special about Libor's setup that would cause this? Perhaps network topology? We've always communicated with the robot on a small (single switch) network.
I think testing would be fairly easy. We could "hand-craft" TCP message packets that are either the combination of two SimpleMessage packets, or split a single message into multiple packets. These packets could be sent (with optional time-delays) to test the merged/split packet handling. I think that with TCP_NODELAY set, all packets should be sent "immediately".
I do remember this being an issue before. I can't remember the details, but I think it might have been Ed Venator using a wireless link to his ABB robot. The wireless delays caused split packets.
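As a sketch of the hand-crafted test described above, assuming for illustration a 4-byte little-endian length prefix (not necessarily the actual simple_message framing; make_packet and parse_stream are hypothetical names):

```python
import struct

def make_packet(payload):
    # Illustrative framing: 4-byte little-endian length prefix + payload.
    return struct.pack("<I", len(payload)) + payload

def parse_stream(data):
    """Parse zero or more length-prefixed packets; return (payloads, leftover)."""
    payloads = []
    while len(data) >= 4:
        (length,) = struct.unpack_from("<I", data)
        if len(data) < 4 + length:
            break  # split packet: wait for more bytes to arrive
        payloads.append(data[4:4 + length])
        data = data[4 + length:]
    return payloads, data

# Merged case: two packets delivered in one "TCP frame".
merged = make_packet(b"ping") + make_packet(b"pong")
# Split case: one packet cut mid-header.
whole = make_packet(b"hello")
first, second = whole[:3], whole[3:]
```

Feeding buffers like merged, first, and second to the deserialisation routines would exercise both the merged- and split-packet paths without any network at all.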
If you go the receive-timeout way, why not just use a select(..) with O_NONBLOCK on the socket? That is the 'normal' way AFAIK. Again, #ifdef-ed, as I don't think many of the controllers -- if they support C/C++ -- will support select.
As for the multiple-packets-in-single-TCP-frame or single-packet-over-multiple-TCP-frames cases: I've been able to reproduce those fairly easily (at least for the Fanuc). Either set a very high update rate on the controller side (so it sends out a lot of packets/sec), or configure the socket in 'interactive' mode. The latter is probably mapped onto TCP_NODELAY: every write results in a packet. These two result in both extremes.
As for testing: if you mean manually, there are tools that can replay a Wireshark capture, and I think I can get some fairly complicated (de)serialisation situations using the technique described above. If you mean automatically: just create some hand-crafted packets (byte buffers) and feed them to the deserialisation routines by hand. That should be doable. (Right, JeremyZoss already said that.)
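The select-based variant could look like this sketch (Python; the C++ select() call has the same shape). recv_exact is a hypothetical name, and the timeout here applies per wait rather than as one overall deadline:

```python
import select
import socket

def recv_exact(sock, nbytes, timeout_s):
    """Read exactly nbytes, waiting at most timeout_s for each chunk.

    Returns the bytes read, or None on timeout / peer close.
    """
    sock.setblocking(False)  # recv() itself never blocks
    buf = b""
    while len(buf) < nbytes:
        # Block here until data is readable, or the timeout expires --
        # no busy-wait polling loop needed.
        readable, _, _ = select.select([sock], [], [], timeout_s)
        if not readable:
            return None      # timed out waiting for more data
        chunk = sock.recv(nbytes - len(buf))
        if not chunk:
            return None      # peer closed the connection
        buf += chunk
    return buf
```

Because each recv() only asks for the bytes still missing, this also handles messages split across multiple TCP frames.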
select-based timeouts would work, and are probably more "standard", as you suggest. But I think the SO_RCVTIMEO method is more elegant (when supported), because it can be dropped into the current implementation with minimal changes. It also allows you to block until the desired # of bytes are read, while select() only considers whether any data is available or not.
In my opinion, select is a cleaner way to do the implementation you originally suggested (loop over recv(NOWAIT) until timeout or all data received). It eliminates the need for a "polling" or "busy-wait" loop while waiting for data to arrive; the select() call will block until data arrives (or the timeout expires).
But I still think the SO_RCVTIMEO option is more elegant, if it works. That's just my vote, though; I'm happy to leave it up to whoever feels motivated to do the implementation. =)
Another related issue (google code): https://code.google.com/p/swri-ros-pkg/issues/detail?id=55
Fixed
When testing the motoman driver, I got this error:
When started with the DEBUG logging level, the output:
So, the problem is that the size of the ByteArray returned by receiveBytes is shorter than the expected message length (which should be checked). I have looked at the packets using Wireshark and they seem to be OK.