Point streaming queue has reached max allowed elements

rr-tom-noble commented 3 years ago

Hi,

I'm currently attempting to use the point streamer through the joint command topic to publish commands to the driver, however, I am getting the following error messages:

[ERROR] [1629453571.361495217] [/yaskawa/motion_streaming_interface]: Point streaming queue has reached max allowed elements
[ERROR] [1629453571.385527740] [/yaskawa/motion_streaming_interface]: Point streaming queue has reached max allowed elements
[ERROR] [1629453571.452335748] [/yaskawa/motion_streaming_interface]: Aborting point stream operation.  Failed to send point (#9): Invalid message (3) : Trajectory start position doesn't match current robot position (3011)
[ERROR] [1629453571.518157040] [/yaskawa/motion_streaming_interface]: Aborting point stream operation.  Failed to send point (#9): Invalid message (3) : Trajectory start position doesn't match current robot position (3011)
[ERROR] [1629453571.585306160] [/yaskawa/motion_streaming_interface]: Aborting point stream operation.  Failed to send point (#9): Invalid message (3) : Trajectory start position doesn't match current robot position (3011)

The second error only ever appears after the first, so I think the max queue error is the root of the problem. I'm publishing commands at a rate of 100 Hz, so I don't think I should be pushing the driver to any kind of limit. I've tried searching through the code but have been unable to find where abouts the error is raised.

Any help would be much appreciated!

gavanderhoorn commented 3 years ago

I would suggest to configure your ROSCONSOLE_FORMAT env var to include file and line nr information. That would make it a lot easier to find where this is being printed.

See Console Output Formatting on the ROS wiki.

Some expectation management: the code you're using was never merged into the mainline driver and has therefore not seen the same level of use and debugging. It's possible there are issues which have not been addressed.

As to the error: the queue on the MotoROS side is 'emptied' based on whether segments have been executed. If your dX is larger than can be covered in the fixed dT (given current state of the robot), I could imagine overrunning the queue at some point. But this is just a guess.

edit:

Some expectation management: the code you're using was never merged into the mainline driver and has therefore not seen the same level of use and debugging. It's possible there are issues which have not been addressed.

and if you've merged it with current HEAD, there are even more possibilities for unknown issues, as motoman_driver (and MotoROS) have changed quite a bit since 2015/16, when that code was written.

rr-tom-noble commented 3 years ago

Thanks @gavanderhoorn. Do you know if there's equivalent point streaming functionality in the mainline driver?

@tdl-ua I believe you wrote this point streaming code? Any ideas on whether this issue was present in the branch originally?

@ted-miller @EricMarcil Do you know of any significant changes to the motoman driver which may not play nicely with this point streaming code?

gavanderhoorn commented 3 years ago

Do you know if there's equivalent point streaming functionality in the mainline driver?

no, there isn't.

@tdl-ua I believe you wrote this point streaming code?

no, that was @boneil (#37). @tdl-ua updated it in #215.

Edit: with an updated ROSCONSOLE_FORMAT: where is that error message printed?

Edit 2: seems to be here: https://github.com/ros-industrial/motoman/pull/215/files#diff-91823845753e4d8172e7046fdee7dda955aab677ff8965435d4820933a0f960eR269.

tdl-ua commented 3 years ago

@aa-tom, @gavanderhoorn is correct that I only updated the point streaming interface from two other previous commits that did the heavy lifting.

But I'm fairly certain (it's been over 3 years) that you have to send the trajectory points at 40Hz exactly because that is the rate at which they are processed and executed by the controller. I vaguely remember having a similar issue when using the wrong publishing rate.

See here for a working usage example of #215.

ted-miller commented 3 years ago

Do you know of any significant changes to the motoman driver which may not play nicely with this point streaming code?

I did a brief comparison of the MotoPlus code. Here are the top things that jumped out to me:

4 hour trajectory limit. After 4 hours of continuous streaming, you will need to stop and restart the trajectory stream. (Due to precision of timestamp data type.)
This should not be used with an HC (collaborative) style robot.
There have been significant improvements to the 'in-motion' flag, which is often used for determining the start of the next trajectory. This can lead to erroneous Trajectory start position doesn't match current robot position errors

rr-tom-noble commented 3 years ago

Edit 2: seems to be here: https://github.com/ros-industrial/motoman/pull/215/files#diff-91823845753e4d8172e7046fdee7dda955aab677ff8965435d4820933a0f960eR269.

Yeah, that's the place

But I'm fairly certain (it's been over 3 years) that you have to send the trajectory points at 40Hz exactly because that is the rate at which they are processed and executed by the controller. I vaguely remember having a similar issue when using the wrong publishing rate.

Thanks for the advice. I'll try setting the publishing rate to 40Hz and see if that fixes the issue, although I'd expect that the queue would fill up almost immediately if that were the case? What we're seeing is that the point streamer works fine at a rate of 150Hz for a few minutes before suddenly hitting the issue.

I've also added some extra info statements including queue size and thread timing info to support debugging, so I'll get back with those results when I get a chance to test

rr-david-alatorre commented 3 years ago

To add to what tom said about the publishing rate: We have changed the controller status update period to 4ms (from 25, or 40Hz) to allow us to go faster.

We then tested the point streamer at 200Hz (to match 4ms) but reached max queue elements within a minute. At 150 Hz it took several minutes but eventually filled the queue, so we lowered it all the way to 100 and are still having the problem at random intervals - sometimes after a few minutes, sometimes after a long time streaming.

If this was caused by mismatched rates then I would expect to see the queue fill almost immediately as a mismatch of 1Hz would fill the 20 element queue in 20 seconds; if it is the internal queue that is the problem then I would at least expect consistent times before overflows. Very confusing!

gavanderhoorn commented 3 years ago

To help in diagnosing this, please make a wireshark capture of the traffic between the controller and motoman_driver, then share that here.

Since the streaming extension is using regular simple messages, we can try to see whether something is not working correctly on that level.

gavanderhoorn commented 3 years ago

@aa-tom @DavidAlatorreAA: any luck with the wireshark trace?

riv-robot commented 3 years ago

@gavanderhoorn Normally we would have sent this Wireshark capture immediately, but a few unrelated issues cropped up, so apologies for the delay. Attached is the capture and some notes are:

This was ran on our highest performance PC, overclocked but with real-time scheduling off.
We set the rate to 200 Hz.
The queue size error occurred at UTC 11:00:17.973
In wireshark select View > Time Display Format > UTC Time of Day

motoman_pointstreamer_200hz_queue_110017973.zip

gavanderhoorn commented 3 years ago

Thanks for the capture.

It seems to start somewhere in the middle of an ongoing session though.

Could you capture from the beginning (ie: include drives enablement, etc)?

The queue size error occurred at UTC 11:00:17.973

a packet nr reference is easier to find in a capture :)

riv-robot commented 3 years ago

Here's the capture from the beginning. Packet nr 791435; true that is a lot easier than the UTC goose chase. motoman_pointstreamer_200hz_init.zip Note only the first 250,000 packets are included as the full file was too big. I can send all in separate files if needed.

gavanderhoorn commented 3 years ago

Even from this capture it's clear the ROS side is stopping the motion, not MotoROS.

Specifically, this line triggers motoman_driver to send a Stop Motion command:

ROS-Industrial SimpleMessage, Motoman Motion Ctrl (0x7d1), 68 bytes, little-endian
    Prefix
        Packet Length: 64
    Header
        Message Type: Motoman Motion Ctrl (0x000007d1)
        Communications Type: Service Request (2)
        Reply Code: Unused / Invalid (0)
    Body
        Robot ID: 0
        Sequence Number: 0
        Command: Stop Motion (200111)
        Data (reserved)
            J0:    0,000000000
            J1:    0,000000000
            J2:    0,000000000
            J3:    0,000000000
            J4:    0,000000000
            J5:    0,000000000
            J6:    0,000000000
            J7:    0,000000000
            J8:    0,000000000
            J9:    0,000000000

which the controller then acknowledges.

I'd suggest to add some debug messages which show how/when the ptstreaming_queue_ is popped, which sequence nrs get transmitted and which get acked.

That could perhaps give you some insight into the behaviour of the queue and when its size starts to approach max_ptstreaming_queue_elements.

The rest of the SimpleMessage traffic looks pretty normal to me (with the exception of course that the trajectory never ends).

gavanderhoorn commented 3 years ago

The rest of the SimpleMessage traffic looks pretty normal to me (with the exception of course that the trajectory never ends).

That's not entirely true: your capture shows some missing TCP segments. I'm not sure this is not a red herring, but fi packet 7390 and 7391 (in your full capture) imply something went wrong during transmission of a packet. That would indicate TCP packet loss.

It happens a few more times.

There are Dup ACKs which follow those missing segments, so in the end the bytes probably were received by the destination host, but it's not something I see often in captures of Simple Message traffic.

gavanderhoorn commented 3 years ago

@ted-miller: I just noticed that MotoROS does not seem to copy over the Sequence field from an incoming Joint Trajectory Point Full to the subsequent Motoman Motion Reply packet.

It's set to -1 (at least in these captures):

ROS-Industrial SimpleMessage, Motoman Motion Reply (0x7d2), 76 bytes, little-endian
    Prefix
        Packet Length: 72
    Header
        Message Type: Motoman Motion Reply (0x000007d2)
        Communications Type: Service Reply (3)
        Reply Code: Success (1)
    Body
        Robot ID: -1
        Sequence Number: -1
        Command: Joint_Traj_Pt_Full (14)
        Result: Success/True (0)
        Subcode: Unknown (0)
        Data (reserved)
            J0:    0,000000000
            J1:    0,000000000
            J2:    0,000000000
            J3:    0,000000000
            J4:    0,000000000
            J5:    0,000000000
            J6:    0,000000000
            J7:    0,000000000
            J8:    0,000000000
            J9:    0,000000000

If the sequence number would be copied into the outgoing ack, we could track packet transmission and acknowledgement more easily.

riv-robot commented 3 years ago

@gavanderhoorn We are trying to print INFO statements around the ptstreaming_queue_ pop, as implemented here. However, these statements are not appearing on the console. Do we need to set a particular logger to a different level than default?

That's not entirely true: your capture shows some missing TCP segments. I'm not sure this is not a red herring, but fi packet 7390 and 7391 (in your full capture) imply something went wrong during transmission of a packet. That would indicate TCP packet loss.

It happens a few more times.

There are Dup ACKs which follow those missing segments, so in the end the bytes probably were received by the destination host, but it's not something I see often in captures of Simple Message traffic.

I ran a few tests at 120 Hz (stable) and 200 Hz (unstable) and we get the same behaviour albeit at a lower rate. The queue we are exceeding has a size of 20 and we receive more than 20 TCP missing segments messages, without the queue size error occuring. Perhaps this is a result of running the controller at 4 Hz instead of the default 10 Hz or some of the other sleeps we've reduced in our branch of the MotoPlus code. A "red herring" may well be the correct terminology but I expect if we can get the INFO statements to print we'll have a better idea.

gavanderhoorn commented 3 years ago

We are trying to print INFO statements around the ptstreaming_queue_ pop, as implemented here. However, these statements are not appearing on the console. Do we need to set a particular logger to a different level than default?

I would expect you'd have to configure your .launch file to redirect output to screen.

Or just add --screen to your roslaunch invocation.

I would also suggest to use named loggers, instead of the anonymous versions. Would save you a few all-capital sentences, as you could just filter for that specific logger.

Perhaps this is a result of running the controller at 4 Hz instead of the default 10 Hz

I'm not sure which controller you are referring to here. I'm not aware of anything running "at 10 Hz" by default.

Edit: I'm not sure I understand the changes you linked to. How would you track the nr of elements in the queue over time?

riv-robot commented 3 years ago

I'm not sure which controller you are referring to here. I'm not aware of anything running "at 10 Hz" by default.

Typo I meant 4 ms and 10 ms. And I'm talking about the Yaskawa Motoman YRC1000micro controller. We have the MotoPlus SDK and modified the default CONTROLLER_STATUS_UPDATE_PERIOD from 10 ms to 4 ms.

Thanks for the other .launch screen tip, will test this morning.

gavanderhoorn commented 3 years ago

That would be the period then.

riv-robot commented 3 years ago

Still having some problems outputting those extra INFO statements. Modified the launch file as per below, but only get two statements on the rosconsole:

INFO] [1630665756.395857964] [/yaskawa/motion_streaming_interface]: Starting joint point received. Starting on-the-fly streaming.
[ INFO] [1630665957.986673026] [/yaskawa/motion_streaming_interface]: Point streaming complete,  setting state to IDLE

<node unless="$(arg use_bswap)" name="motion_streaming_interface"
        pkg="motoman_driver" type="motion_streaming_interface" output="screen"/>
</launch>

gavanderhoorn commented 3 years ago

I can only guess you're either:

not editing the file you believe you're editing
you're not rebuilding the driver (correctly) after making those changes

gavanderhoorn commented 2 years ago

@robertjbush @DavidAlatorreAA: what's the status here? Have you been able to make any progress?

riv-robot commented 2 years ago

@gavanderhoorn You beat me to it! I was just in the process of responding today.

We did some refactoring of our code base and we've been running the ROS master on our fastest PC at 150 Hz consistently without problems. At 200 Hz, although we do sometimes get the max queue error, the robot is able to continue moving. Setting real-time flags to the motion_streaming_interface does help.

I was not able to get the extra INFO statements to display. Any help from the community would be appreciated here.

gavanderhoorn commented 2 years ago

we've been running the ROS master on our fastest PC

I'm a bit confused: the roscore is not involved in node communication at all.

Setting real-time flags to the motion_streaming_interface does help.

You mean starting the process with SCHED_FIFO or something similar?

riv-robot commented 2 years ago

I'm a bit confused: the roscore is not involved in node communication at all.

We are also confused as to why this made a difference; but it did (note the motion_streaming_interface node runs on the same PC as the master in our configuration).

You mean starting the process with SCHED_FIFO or something similar?

Correct.

gavanderhoorn commented 2 years ago

For now I'll assume that's correlation and not causation then.

Unless you're re-creating your publishers every loop or something similar. Then I could see this having an influence.

gavanderhoorn commented 2 years ago

I'm going to close this.

We don't officially support point streaming in the current version of motoman_driver, and from the lack of activity here I assume the initially reported issue has been resolved.

gavanderhoorn commented 2 years ago

@rr-robert and @rr-david-alatorre:

Unless you're re-creating your publishers every loop or something similar. Then I could see this having an influence.

could you confirm you're not doing this?

I wouldn't understand what effect it would have on the queue overflowing, but would give some clue as to why roscore could be involved.

ros-industrial / motoman

Point streaming queue has reached max allowed elements #420