ros-industrial / abb

ROS-Industrial ABB support (http://wiki.ros.org/abb)
145 stars 152 forks source link

Trajectory takes a long time to begin execution #142

Closed yamokosk closed 6 years ago

yamokosk commented 6 years ago

I've seen #110. But I'm not interested in workarounds. I want to know what is the root cause of, what I feel, is an extraordinarily long time to execute even a small number of points in a trajectory. I did a really simple test in which I timed (yes, using a stopwatch) how long it took between publishing a ROS trajectory message and when the ABB pendant showed the "received X points" message. For a trajectory with 11 points, the delay was 2.2 sec. For 1000 points, the delay increased to 8.6 seconds.

I'm no Rapid expert but I've tried to trace the message from being received on the Rapid side through to execution and I don't see any glaring problems. I haven't looked as closely at the ROS ABB driver side, but most of that code looks pretty straight forward as well.

For our application, we plan once and then need to execute the same planned trajectory several times. But in our application, the transmission/beginning of execution of the trajectory is taking about as long as the planning does.

gavanderhoorn commented 6 years ago

Just making sure: we are talking about the current abb_driver package here, right?

I've seen #110. But I'm not interested in workarounds.

I'm not sure, but I don't think #110 is actually relevant here. That issue is about the MoveIt Trajectory Execution Monitor cancelling execution of the trajectory as the robot is just too slow (even after accepting the job and having started executing it).

The main issue with the driver is that it uses a download approach: all traj pts have to be uploaded to the controller before it will start executing the trajectory. That will inevitably incur a delay. The reason this was implemented that way was that at the time it was believed that would result in the smoothest motion on ABB controllers (other drivers use the streaming approach, which only requires a single, or a few, points before starting the motion).

Streaming vs Downloading is always a trade-off: the former has no limit to the length of the trajectory, but motion smoothness can suffer (as there is little opportunity for the controller to look ahead), while the latter has a maximum nr of points it can accept and incurs some additional latency, but typically results in smoother motion.

Whether that is still the case on recent ABB RWs is something to check. If you are interested in investigating this we can probably assist you, but all development will be on the RAPID side.

The abb_driver package will most likely see less development in the future, as work is underway to standardise on an EGM based system. abb_driver will be kept around for controllers with older RWs or without EGM, but external motion control with pure RAPID will always suffer from some delays and non-smooth motion.

yamokosk commented 6 years ago

Right I get the trade-off between streaming and downloading but I find it highly suspect to expect a 2+ second delay for a 10-point trajectory. Doing some napkin math, assuming we only had a 10Mb/s network (only because I am not 100% confident its not 100Mb/s), I calculate that, theoretically, it should take somewhere less than a 1 ms in transmission delay for a 10 point message for a 6-DOF robot.

I did some testing and didn't realize that the trajectory is sent just one point at a time! I thought the whole trajectory message was sent at once. I can see why for robots that support streaming this should be the case, but for our needs, that might be where we are incurring a lot of overhead.

I also get that this driver is not where the active development is. But the alternative (EGM) is not an option for us at the moment either (see my issue on that repository).

gavanderhoorn commented 6 years ago

but for our needs, that might be where we are incurring a lot of overhead

and I believe that is partly the crux here: the driver has to solve the general problem of making it possible to command motion from ROS. It is not necessarily the best / optimal way to upload off-line planned trajectories to an ABB controller.

I'm not saying it cannot be improved, but implementing some specially tailored solution will always allow for optimisations not possible when having to cater for generic usage.

I find it highly suspect to expect a 2+ second delay for a 10-point trajectory. Doing some napkin math, assuming we only had a 10Mb/s network (only because I am not 100% confident its not 100Mb/s), I calculate that, theoretically, it should take somewhere less than a 1 ms in transmission delay for a 10 point message for a 6-DOF robot.

network throughput is not important here, it's the RAPID side that is slow.

And the VMs that run these robot-specific programming languages are most of the times incapable of fast/efficient networking. Typical round-trip latencies with RAPID are in the order of 10ms or more.

I did some testing and didn't realize that the trajectory is sent just one point at a time! I thought the whole trajectory message was sent at once. [..] that might be where we are incurring a lot of overhead.

if you are thinking of deserialising msgs and reconstructing the trajectory on the RAPID side, then yes, that has slightly more overhead than sending over a single msg. If you were thinking of network throughput that would again not be the issue.


But it would be good to see whether we can get to the bottom of this. If you could install Wireshark, add the packet-simplemessage plugin and make some captures while sending trajectories, that would give some insight into how long it takes to send a trajectory.

Next to that you could try and use the RAPID Profiler to see where most of the execution time is spent. That would give us actual nrs upon which we could potentially act.

shaun-edwards commented 6 years ago

I don't remember things being this slow. However, because the tcp messages are so small, especially the reply, it's possible that the delay ( TCP_NO_DELAY ) is slowing things down. Any thoughts on this?

gavanderhoorn commented 6 years ago

I don't remember things being as slow as @yamokosk reports either, but perhaps something has changed (in RW fi) or there is some configuration issue.

Either way gathering some data is probably the fastest way to pinpoint any bottlenecks - if there are any.

it's possible that the delay ( TCP_NO_DELAY ) is slowing things down

Are you saying it isn't enabled, and should be, or it is enabled and shouldn't be?

shaun-edwards commented 6 years ago

I don't believe TCP_NO_DELAY is enabled on either side. It's easy enough to add to the ROS side (if it's not there already). Robot controllers don't typically expose this option. One way to "force" the option is to make the reply messages really big.

yamokosk commented 6 years ago

I'm back with some data. As anticipated the delay is not in the transmission. First the packet capture data:

A long trajectory (~130 pts) was commanded 5 times. Packets were captured and the data was analyzed. For each trajectory, the following events were extracted from the captured packet data (attached): trajectory send start, trajectory send finish, and start of motion:

Event, packet sequence number, timestamp
Start sending trajectory, 3275, 33.7526
Finish sending trajectory, 3527, 33.7613
Motion initiated, 3623, 34.6976

Start sending trajectory, 5075, 50.1515
Finish sending trajectory, 5246, 50.1559
Motion initiated, 6272, 61.0706

Start sending trajectory, 7651, 75.6488
Finish sending trajectory, 7876, 75.6515
Motion initiated, 8939, 86.5598

Start sending trajectory, 11524, 113.45131
Finish sending trajectory, 11650, 113.45597
Motion initiated, 12677, 124.3170

Start sending trajectory, 14753, 146.14915
Finish sending trajectory, 14987, 146.1525
Motion initiated, 16085, 157.06171

The results from this revealed that,

Trajectory transmission delay (min/avg/max): 2.7/4.8/8.7 ms
Delay between end of transmission and beginning of motion (min/avg/max): 936.3/8905.9/10914.7 ms

Several other captures were performed but these results are pretty representative. The trajectory transmission delay is essentially negligible. And the delay between end of trajectory transmission is either small (~1s) or quite large (~10s). Out of the 5 trajectories executed, 4 had delays over 10s and one was less than a second.

I also captured data from the Rapid Profiler. As far as I can tell, that was much less useful. I don't really see any red flags in time spent in any of the functions. But maybe someone more knowledgeable will. The profiler data was from the execution of 3 of those same long trajectories.

TASK Procedure Time incl subs (ms) Time excl subs (ms)
ROS_StateServer ROS_send_msg_joint_data 0.0 % [0 ms] 0.0 % [0 ms]
ROS_StateServer send_joints 0.3 % [1 ms] 0.3 % [1 ms]
ROS_StateServer main 100.0 % [353 ms] 99.7 % [352 ms]
T_ROB1 clear_path 0.4 % [289 ms] 0.4 % [289 ms]
T_ROB1 init_trajectory 0.2 % [148 ms] 0.0 % [2 ms]
T_ROB1 abort_trajectory 0.5 % [362 ms] 0.1 % [71 ms]
T_ROB1 main 100.0 % [73765 ms] 99.5 % [73403 ms]
ROS_MotionServer ROS_receive_msg 100.0 % [9 ms] 100.0 % [9 ms]

test_data.zip

shaun-edwards commented 6 years ago

@yamokosk, thanks for providing the data. I couldn't have been more wrong. One thing to consider is that somewhere (I don't know where), I believe the robot executes a very, very, slow move to the first point, if the first point is not within a certain range of the robot's current position. Perhaps the variability of motion execution is related to that.

gavanderhoorn commented 6 years ago

The first long trajectory (127 pts, can't use the pkt nrs as you've only exported simple_message traffic) has quite some difference between the start pt and the curent state of the robot (at least when looking at the immediately preceeding JOINT_POSITION msg). If that is out of the bounds for the check, it'll take up tp 10sec for the robot to get to that point.

Default tolerance is 0.1 degrees (here), so that could be enough to trigger the 'move-slowly-to-start-point' behaviour in this case.

How are you generating these trajectories @yamokosk?

Btw: 122 points is longer than the default 100 (MAX_TRAJ_LENGTH), did you change the buffer length @yamokosk?

gavanderhoorn commented 6 years ago

@shaun-edwards wrote:

One thing to consider is that somewhere (I don't know where), I believe the robot executes a very, very, slow move to the first point

The check is here, and it's more an implicit move-slowly-to-first-point than a deliberate slow motion (at least on the rapid side). The first trajectory point gets a duration of 10sec from the industrial_robot_client nodes, and the logic on the rapid side just check whether a point could be skipped. If the first point is not within current_state-threshold, that will result in moving a (typically) short distance in multiple seconds.

yamokosk commented 6 years ago

Just last night I riddled the code with print statements to see if I could narrow it to a functional call. I can share the data if needed, but it was pretty clear that when the ~10 sec delay did happen, it was always with the first two MoveAbsJ commands (they took ~5s and ~4s respectively to return).

So I instantly went back to the packet capture data and saw the 10s duration in the first trajectory point... I guess I should have started there.

These trajectories are being generated by MoveIt (with plans to move to Decartes for some of the plans) with several post-processing filters, including AddTimeParameterization, UniformSampler, and the nPoint filter. But most importantly, we are not planning from our start state as our application doesn't allow it. Two quick solutions I can think of,

1) tighten up the zonedata in the Rapid code: Would this not make the robot end its movement closer to the pre-planned state and not trigger the 10s slow move? 2) loosen the slow move to first point check in the ros industrial code

yamokosk commented 6 years ago

The only place I could find that adds a 10s duration is not related to a check on the start state and trajectory point, but rather has to do with the timing of the trajectory point:

https://github.com/ros-industrial/industrial_core/blob/d14029bdf0b403ef4412ef8a7b16bbb43a0a7627/industrial_robot_client/src/joint_trajectory_interface.cpp#L288

I don't think the time parameterization is giving me points in time that are out of order, but I can check that. @gavanderhoorn can you point me to where ROS industrial is modifying the duration based on being out of tolerance of the start state?

gavanderhoorn commented 6 years ago

@yamokosk wrote:

@gavanderhoorn can you point me to where ROS industrial is modifying the duration based on being out of tolerance of the start state?

No, I can't, because it isn't checking anything :)

The first point will always be assigned a 10 second value for duration. The rest is implicit on the rapid side: if there is a large enough deviation, the motion will not be skipped.

But that is how all trajectory points are handled, there's nothing special about the first one.

Edit: industrial_robot_client is also not modifying timing of your trajectory. It's just encoding the start state as a motion to the first traj pt with a duration of 10sec.

gavanderhoorn commented 6 years ago

Btw: you haven't confirmed whether you've changed some of the limits on the rapid side.

gavanderhoorn commented 6 years ago

@yamokosk wrote:

But most importantly, we are not planning from our start state as our application doesn't allow it. Two quick solutions I can think of,

  • tighten up the zonedata in the Rapid code: Would this not make the robot end its movement closer to the pre-planned state and not trigger the 10s slow move?

it probably would, but I'm not sure that is really important. The issue (or: cause really) here is that you're not planning/starting execution from the current robot state.

Tightening the zoning will most likely lead to jerkier motion.

  • loosen the slow move to first point check in the ros industrial code

As I wrote above: there is no check on the ROS side. It's all on the rapid side.

You could completely remove the is_near(..) check. That would open you up to the risk of the robot jumping to the first point of your trajectory in a relatively unchecked way though (as it essentially will have very little time to get there).

You could also treat the first point differently: it has a special sequence nr, so you can detect it occuring in your data stream. What would be a good course of action then is something to perhaps discuss.


PS: I'm not claiming the current implementation is perfect btw, not at all.

yamokosk commented 6 years ago

Ahh, got it. So the first point is always assigned 10s and its up to Rapid whether or not to skip it. Ok, makes sense now. Hmm.

There is nothing I can do about the cause. Based on our application requirements, we can not spend time planning where we want the trajectory to start.

Can someone point me to the code where the first trajectory point gets a 10s duration? I understand why it was done, but it feels a little dirty to be modifying a trajectory like that and then relying on a skip_move check to filter it out. Would it not be better to leave the trajectory alone and simply have the Rapid driver report an error back if the current state of the robot and first trajectory point differed by X amount?

I think MoveIt already does something similar? when it is ask to monitor the trajectory execution.

gavanderhoorn commented 6 years ago

@yamokosk wrote:

Can someone point me to the code where the first trajectory point gets a 10s duration

You already found it: ros-industrial/industrial_core/industrial_robot_client/src/joint_trajectory_interface.cpp, L288.

but it feels a little dirty to be modifying a trajectory like that and then relying on a skip_move check to filter it out.

pedantic perhaps, but as I noted earlier, the trajectory itself is not being modified. Note that even if you remove this, your robot will still not do what the trajectory encodes: because the start state deviates from the state at the time the trajectory starts executing, the first motion of the robot will be from the current state, and it will use the duration specified for the first traj pt to move to that pose. As I wrote above, depending on the magnitude of the deviation, that can lead to a significant jerk, as the robot has to potentially cover (quite) some distance in (very) little time.

Note also btw that the skip works for all trajectory points: there is no point in moving to a traj pt that is too close to where the robot currently is. Due to zoning and other effects, it would just be filtered out by the motion engine of the robot controller anyway.

Would it not be better to leave the trajectory alone and simply have the Rapid driver report an error back if the current state of the robot and first trajectory point differed by X amount?

That would make it more explicit, yes, and is the approach taken by motoman_driver and ur_modern_driver. But at the time this code was written (quite some time ago already), that infrastructure was not available. Note that if such a system was in place, it would probably not execute any of the trajectories you've captured with wireshark, as they all deviate, so you'd have to completely disable the system, or loosen up the threshold.

ros-industrial/industrial_core#118 is related here. Note that currently, queueing and execution of motions are two different things. Perhaps for a downloading driver queueing should fail if the trajectory doesn't start at the current state of the robot. Alternatively queueing could succeed, but at the start of execution the robot should go into some sort of error mode. That would avoid having downloading and streaming drivers behave differently.

yamokosk commented 6 years ago

Interesting, I will digest that issue when I have some time. And yes, there are two issues here. 1) ROS-industrial code quietly modifying the trajectory and the is_near check in Rapid being used to filter this out. And 2) Our state state does not equal the trajectory start state. This one is on us. We have identified the issue and are fixing it.

I will submit a PR in a few days with what I think is a more explicit way to handle this.

gavanderhoorn commented 6 years ago

@yamokosk wrote:

1) ROS-industrial code quietly modifying the trajectory

this should be made more explicit, I agree (but again: it's not modifying the trajectory per se)

the is_near check in Rapid being used to filter this out.

I believe that check is actually quite useful, as, as I wrote above, submitting points to the motion engine that already fall within the tolerance does not make much sense (it will actually just slow things down).

I will submit a PR in a few days with what I think is a more explicit way to handle this.

Great, much appreciated.

Just a suggestion: it might be good to open an issue to discuss your approach before investing too much time in it. That way we can avoid potential frustration on both sides, but mostly yours.