Number of received bytes differ from message length (SmplMsgConnection::receiveMsg).

liborw commented 11 years ago

When testing the motoman driver, I got this error:

$ rosrun fs100 robot_state
[ INFO] [1374677672.437211315]: Added message handler for message type: 0
[ INFO] [1374677672.440014448]: Robot state connecting to IP address: '192.168.0.101:50241'
[ WARN] [1374677672.440910442]: Unable to find user-specified joint names in 'controller_joint_names'
[ WARN] [1374677672.441766703]: Unable to find URDF joint names in 'robot_description'
[ INFO] [1374677672.441864812]: Using standard 6-DOF joint names: [joint_1, joint_2, joint_3, joint_4, joint_5, joint_6]
[ INFO] [1374677672.442550563]: Connected to server
[ INFO] [1374677672.442607446]: Initializing message manager with default comms fault handler
[ INFO] [1374677672.442657083]: Default communications fault handler successfully initialized
[ INFO] [1374677672.442685145]: Initializing message manager
[ INFO] [1374677672.442726762]: Added message handler for message type: 1
[ INFO] [1374677672.446381376]: Added message handler for message type: 15
[ INFO] [1374677672.447653822]: Entering message manager spin loop
[ERROR] [1374677672.495079875]: Get unload pointer failed, buffer size: 3, smaller than byte size: 4
[ERROR] [1374677672.495149280]: Unload pointer returned NULL
[ERROR] [1374677672.495206461]: Failed to unload message joint: 5 from data[3]
[ERROR] [1374677672.495241827]: Failed to unload joint feedback positions
[ERROR] [1374677672.495278751]: Failed to unload joint feedback message data
[ERROR] [1374677672.495314845]: Failed to initialize joint feedback message
[ERROR] [1374677672.495351389]: Failed to convert SimpleMessage
[ WARN] [1374677672.495409474]: Recieved zero bytes: 0
[ERROR] [1374677672.495445331]: Failed to initialize message
[ERROR] [1374677672.495485843]: Failed to receive incoming message
[ WARN] [1374677672.495522284]: Send failure, no callback support
^C[ INFO] [1374677674.162045284]: Connection failed, attempting reconnect
[ERROR] [1374677674.162110342]: Failed to connect to server, rc: -1, errno: 106
[ WARN] [1374677674.162132291]: Not connected, bytes not sent
[ERROR] [1374677674.162152122]: Failed to receive message length
[ERROR] [1374677674.162164475]: Failed to receive incoming message
[ WARN] [1374677674.162174342]: Send failure, no callback support

When started with DEBUG logging level the ouput.

So, the problem is that the size of the ByteArray returned by receiveBytes is shorter that the expected message length (that should be checked). I have looked on the packets using Wireshark and they seems to be ok.

gavanderhoorn commented 11 years ago

'looked at the packets using Wireshark': did you use packet-simplemessage?

Could you attach / upload a Wireshark capture somewhere?

liborw commented 11 years ago

Yes I have, the file can be downloaded here, but it will expire in one day.

liborw commented 11 years ago

I have an update on this, adding a MSG_WAITALL flag to the RECV command in simple_message/src/socket/tcp_socket.cpp:84 solved the problem. Is there a reason not to have this flag there?

gavanderhoorn commented 11 years ago

Ah, this might then be related to swri-ros-pkg/issues/23.

Yes, the first few simple_message packets arrive over 2 TCP frames (packet 786 in your capture). Seems the deserialisation step is confused by this.

JeremyZoss commented 11 years ago

I think this assessment is probably correct.

While Libor's fix probably solves this issue in the short-term, I think we should hold off on pushing this change to the public repository. This change causes the socket code to wait forever until the requested bytes are read. In the event of datastream errors, this might lead to long waits with no errors. A better solution would be to properly handle both split and merged simple_message packets in a mechanism that also provides timeout-capable waits. So that we could wait for the expected # of bytes, possibly across multiple packets, but return after a specified timeout period if the expected # of bytes is not received. This is indeed the issue logged in bug #23.

Please close this bug if your issue is resolved.

gavanderhoorn commented 11 years ago

One idea might be to utilise the following pattern:

read all bytes currently in the receive buffer (a single recv(..))
copy those bytes to a internal buffer
if len(my_buffer) >= nr_of_bytes_needed: handle(my_buffer[:nr_of_bytes_needed])
remove bytes from my_buffer
goto 'start'

Fairly standard. If there aren't enough bytes, wait for them to come in the next recv(..) (but with a configurable timeout).

JeremyZoss commented 11 years ago

That's mostly what I was thinking. Here are two possible variations: 1) only read up to the number of expected bytes (not full receive buffer)

this prevents the need to shift bytes in my_buffer
but potentially means slightly more reads of the socket buffers (not very expensive)

2) use SO_RCVTIMEO option on the socket, to automatically enforce the desired recv timeout. This may not be well-supported cross-platform (e.g. DX100 C++ server code), but could be #ifdef controlled to the linux side only. This would be a relatively painless fix, I think.

We should probably log this as an enhancement, or continue discussion on the original bug, since I don't know that the current discussion is directly relevant to Libor's error messages.

shaun-edwards commented 11 years ago

This is probably a good solution, but how would we test it? Has anybody seen this error before? Is there anything special about Libor's setup that would cause this? Perhaps network topology? We've always communicated with the robot on a small (single switch) network.

JeremyZoss commented 11 years ago

I think testing would be fairly easy. We could "hand-craft" TCP message packets that are either the combination of two SimpleMessage packets, or split a single message into multiple packets. These packets could be sent (with optional time-delays) to test the merged/split packet handling. I think that with TCP_NODELAY set, all packets should be sent "immediately".

I do remember this being an issue before. I can't remember the details, but I think it might have been Ed Venator using a wireless link to his ABB robot. The wireless delays caused split packets.

gavanderhoorn commented 11 years ago

If you go the receive timeout way, why not just use a select(..) with O_NONBLOCK on the socket. That is the 'normal' way AFAIK. Again, #ifdefed, as I don't think many of the controllers -- if they support C/C++ -- will support select.

As for the multiple-packets-in-single-TCP-frames or single-packet-over-multiple-TCP-frames: I've been able to reproduce that fairly easily (at least for the Fanuc). Either set a very high update rate on the controller side (so send out a lot of packets/sec), or configure the socket in 'interactive' mode. The latter is probably mapped onto TCP_NODELAY: every write results in a packet. These two result in both extremes.

gavanderhoorn commented 11 years ago

As for testing: if you mean manually: there are tools that can replay a Wireshark capture, I think I can get some fairly complicated (de)serialisation situations using the technique described above. ~~If you mean automatically: just create some hand-crafted packets (byte buffers) and feed them to the deserialisation routines by hand. That should be doable.~~ (right, JeremyZoss already said that)

JeremyZoss commented 11 years ago

select-based timeouts would work, and are probably more "standard", as you suggest. But I think the SO_RCVTIMEO method is more elegant (when supported), because it can be dropped into the current implementation with minimal changes. It also allows you to block until the desired # of bytes are read, while select() only considers whether any data is available or not.

In my opinion, select is a cleaner way to do the implementation you originally suggested (loop over recv(NOWAIT) until timeout or all data received). This eliminates the need to do a "polling" or "busy-wait" loop waiting for data to arrive. The select() call will block until data (or timeout).

But I still think the SO_RCVTIMEO option is more elegant, if it works. That's just my vote, though. I'm probably happy to leave it up to whoever feels motivated to do the implementation. =)

shaun-edwards commented 11 years ago

Another related issue (google code): https://code.google.com/p/swri-ros-pkg/issues/detail?id=55

shaun-edwards commented 11 years ago

Fixed

ros-industrial / industrial_core

Number of received bytes differ from message length (SmplMsgConnection::receiveMsg). #25