jointTrajectoryCB can quietly fail without alerting higher level components

pbeeson commented 6 years ago

I have a situation when using the motoman driver where the driver may on a rare occassion see the current joints as too far from the first joint values in a trajectory. Thus, the is_valid() call from industrial_core/joint_trajectory_interface/trajectory_to_msgs() is false. This makes trajectory_to_msgs() return false, and then JointTrajectoryInterface::jointTrajectoryCB() simplt returns without sending anything to the robot. But, given that this is typically run with the joint_trajectory_action server calling this, my custom actionlib client s has been told by the JointTrajectoryAction::goalCB() actionlib server that the goal was accepted and the status is RUNNING, depite the fact that the actual control is never passed to the robot. So, my client will sit forever waiting on completion (or in practice will timeout first) without ever getting any kind of useful feedback from the system as to why the accepted goal was never achieved. There needs to be some way in JointTrajectoryAction to cancel the goal when send_to_robot() in the JointTrajectoryInterface is not called.

Levi-Armstrong commented 6 years ago

I am not aware of a start state check within industrial_core and I checked the is_valid() function and I don't see where it is checking the start state. I do know that the moveit trajectory_execution_manager does this before sending to the JointTrajectoryAction client. You can configure the allow start state tolerance by adding the line to your moveit_config packages trajectory_execution.launch.xml file. The default is 0.01.

 <param name="trajectory_execution/allowed_start_tolerance" value="0.01"/>

pbeeson commented 6 years ago

1) I don't want this to use MoveIt. 2) I'm not looking for a start state check, there already is one for Motoman 2) The is_valid() that checks the start state may be specific to the Motoman driver (it happens in that driver's joint_state_streamer), but regardless -- this is a design flaw in the industrial core data flow. The joint trajectory action lib server gets a goal and accepts it then publishes a topic. The joint_trajectory_interface sees the incoming trajectory on that topic and at some point between there and sending the command to the robot, it may not send. Right now there is no feedback to the actionlib client that a goal has been canceled or aborted when this happens.

Levi-Armstrong commented 6 years ago

Well it looks like this issue should be created on the Motoman repository since it maintains a modified version industrial_robot_client. Here is where the check is happening. I am looking into the action lib server not getting notification now.

pbeeson commented 6 years ago

By default the Motoman driver launch files don’t use the modified industrial robot client. I’m not using that.

On Wed, Dec 20, 2017 at 1:00 PM Levi Armstrong notifications@github.com wrote:

Well it looks like this issue should be created on the Motoman repository since it maintains a modified version industrial_robot_client. Here is where the check https://github.com/ros-industrial/motoman/blob/kinetic-devel/motoman_driver/src/joint_trajectory_streamer.cpp#L504 is happening. I am looking into the action lib server not getting notification now.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ros-industrial/industrial_core/issues/187#issuecomment-353152460, or mute the thread https://github.com/notifications/unsubscribe-auth/ADrfxmRGuhIEzPXxOPbB-hR_HyOHaITMks5tCVlPgaJpZM4RHNmS .

Levi-Armstrong commented 6 years ago

Which launch file are you using and what version (Kinetic, Indigo. etc.)?

By default the Motoman driver launch files don’t use the modified industrial robot client. I’m not using that.

If you are using the robot_interface_streaming_xxx.launch then yes the default is as you say, but if you are using robot_multigroup_interface_streaming_xxx.launch the custom version is used.

Regardless I don't believe the issue is with the joint_trajectory_action node, I believe the issue is in the custom joint_trajectory_streamer.cpp. When the joint_trajectory_action node receive a goal it publish the trajectory on a topic and at that point there is no way to cancel the goal. It appears that the joint_trajectory_action node then monitors a feedback topic for updating the joint_trajectory_action state. I believe here and here may be the issue. If you notice above, when the received trajectory is empty it execute trajectoryStop(); before returning which I believe makes it way back to the joint_trajectory_action node and update the state. I think if you add this command before returning at the identified lines of code above, this will make its way back to the joint_trajectory_action node. Now that being said I have not tested this and this is purely based on digging through the code. Also if this does work, I am not sure this the right fix but it will help identify the issue and provide the necessary information for implementing an appropriate fix. I may be able to test this tomorrow.

pbeeson commented 6 years ago

I'm using the robot_interface_streaming_xxx.launch. I'm using a branch in a forked repo that incorporates I/O signals to the Motoman controller and has other fixes we needed to get a Motoman working in production.

But as you notice, the Motoman is orthogonal to the issue. In my initial comment, I alluded to the fact that I thought JointTrajectoryInterface::jointTrajectoryCB() is the culprit, but if it is the job of the streamer, not the interface, to provide feeedback, then yes, the motoman driver probably needs to provide feedback to inform the actionlib server that it needs to abort. My hack right now actually publishes an empty message when this specific event occurs (starting joints too far) so that my execution client can see it, preempt the actionlib server, tack on a new first waypoint using current joint angles, and resend the trajectory goal. It's not a general solution.

Levi-Armstrong commented 6 years ago

Ah, ok now we are getting somewhere. So the trajectoryStop(); does what your current fix does which publishes an empty message, which based on your comment does make its way back to the actionlib server.

@shaun-edwards @gavanderhoorn Do you have any thought on the best way to address this issue? We could add the trajectoryStop(); when trajectory_to_msgs call fails because as JointTrajectoryInterface::jointTrajectoryCB() is now as @pbeeson identified it returns and the actionlib server is never notified.

shaun-edwards commented 6 years ago

@pbeeson is right, this is definitely a design issue. If I understand correctly, the action eventually reports a timeout, but the desired functionality is for the action to fail immediately. The work around has always been to make sure this situation doesn't happen. This begs the question of why you are sending a trajectory with an improper start point (I'm guessing you aren't doing this on purpose).

What if we created a service that replicated the functionality of the topic, but because it was a service, could return failure if the trajectory failed to start. Note, the service would only initiate the motion. It would not block until the motion was complete (that's what the action is for).

Thoughts?

pbeeson commented 6 years ago

@shaun-edwards In the 200ms from when I start planning to when the trajectory gets checked deep inside the motoman driver, a single joint has move > 5e-5 radians. This is happening after 2 months of running in production. So WHY this is happening (mechanical, etc.) is still unknown, but in debugging, it highlighted this design flaw.

pbeeson commented 6 years ago

I like the service idea, because then we could parameterize the type of failure and provide that in the actionlib feedback. Because, of this type of error, I'd like to update my first trajectory point and retry, but on any other failure, I want to bomb out of my entire state machine.

shaun-edwards commented 6 years ago

I have seen joints "float" before. My quick fix has been to increase that tolerance on the robot side. After all, 0.00286478898 degrees has always seemed a bit small to me :)

pbeeson commented 6 years ago

I wasn't sure if that 5-e5 was something that Motoplus (on the controller) also assumed, so I didn't want to muck with that constant.

We check this range on our side too, but there are cases where it passes our test then shortly afterwards fails the test below.

I have a hacky fix for now for this particular application.

shaun-edwards commented 6 years ago

5e-5 is in the motoplus application....it needs to be changed on both sides.

shaun-edwards commented 6 years ago

As I see it there are workarounds to the problem, but the real fix is one of two options:

Mirror ROS topic capability to a ROS service (as discussed above)
Combine the joint trajectory action with the industrial robot client, or simply support that action interface directly in the industrial robot client. This is actually my preferred solution. It's less ROS-like, but it allows for better support of the action interface (better feedback, canceling, etc...).

Levi-Armstrong commented 6 years ago

@shaun-edwards So it looks like the action interface monitors feedback from the robot on a callback and the following trajectory callback for when the trajectory is finished but there is no way for subsequent ros node to notify the action of a failure. Also we noticed that when the e-stop is pushed the action is not handling this either. On Friday we tested adding a subscriber to the action and a publisher to JointTrajectoryStreamer so it has a way to notify the action that it was rejected. After the holiday we will clean it up and create a pull request. We also discussed incorporating everything within the action but we did not have time to look into the affect it would have on the greater ros community. Do you know if this could be done in kinetic or would it need to be done in lunar?

pbeeson commented 4 years ago

I'm going to close this as it might have already been fixed in a PR, but if not, it's pretty outdated and unlikely to be fixed prior to everyone just moving to ROS2

ros-industrial / industrial_core

jointTrajectoryCB can quietly fail without alerting higher level components #187