Jerky moves - proposed improvements of the 500Hz control loop for position control

ros-industrial / ur_modern_driver

(deprecated) ROS 1 driver for CB1 and CB2 controllers with UR5 or UR10 robots from Universal Robots

Apache License 2.0

299 stars 341 forks source link

Jerky moves - proposed improvements of the 500Hz control loop for position control #153

Closed potiuk closed 5 years ago

potiuk commented 6 years ago

Here is an outline of proposal and request for comments for the - apparently common - "jerky" moves problem that we have when trying to control UR robot with position control via ur_modern_driver.

Like many others, when we try to use direct joint position control of the ur_modern_driver with the recent firmware of UR (3.4.5) we have a problem with "jerky" moves of the arm. Those seem to be related to the way 500Hz (4*125Hz) control loop is implemented in the driver, and while there are some ways this can be optimized (TCP/IP socket options, low-latency kernel on host), it seems that the problem still persists and whenever the host PC starts being busy doing other things (for example reading camera streams) the problem is magnified. Also using low-latency kernel on host might not always be feasible/development friendly. We are using refactored version of the driver with TCP_QUICKACK changes pulled (https://github.com/Zagitta/ur_modern_driver/pull/4) and I asked about the problem in ROS answers - https://answers.ros.org/question/276854/jerky-movements-of-ur10-robot-with-ur_modern_driver-and-moveit/ and got some comments pointing to the refactored branch/TCP problems, Despite applying the solutions, we still experience some jerkyness (much less than before) .

So I took a close look of the underlying code and I think the current design for the position tracking is a bit flawed and it can probably be improved with relatively small effort. The company I work for (still stealth mode - http://nomagic.ai) would love to invest some of our engineering time to improve it, test it with our UR10 and contribute the solution back, but before we spend more time on it we would like to verify some of our hypothesis with the people involved so that we know the ideas we have are plausible. We do not have a full context of the original implementation and maybe we misinterprete what's going on, so maybe you can help. @ThomasTimm @Zagitta @gavanderhoorn - I'd really love if you can comment on this. I just need a help to understand of limitations/constraints - we will take it from here and do implementation/testing ourselves.

There are few hypothiesis/ideas that we have - please let me know If I got it wrong/right :).

1) 4 x Oversampling Currently the ur_modern_driver interpolation loop performs cubic interpolation based on tracejectory positions and velocities (basically Bezier curve calculation) for 0.008/4 = 0.002 s intervals from the start of the move and sends those over reverse TCP/IP connection (port 50001) opened by the UR Script program. There is no feedback information, so ur_modern_driver does not account for the actual robot joint positions, it assumes that the robot is following the trajectory closely. Those updates are then applied using servoj() command by the URScript. That's the main reason why there is the fast "catching up" if some of those updates get delayed.

Question: do we really need oversampling here and why it was needed in the first place? The URScript documentation mentions that we should control the robot with 125 Hz frequency (so tell robot what to do every 0.008 s). Seems that we are trying to do it every 0.002 s - and the documentation from UR does not say what happens if we do this. In fact if I look at the docs, it looks like the servoj command will actually block for those 0.008 s and will take the latest value that was written to the cmd_servo_q variable. If I understand it correctly, the host part of the driver will calculate the positions as fast as possible, and the ur_script will use the latest value set every 0.008 s (or as often as it wakes up). Do I understand correctly? What wrong could happen if we decrease oversampling ? I am going to try it out, but I would like to understand rationale behind the hardcoded "4" oversampling ratio.

2) Why do we use TCP/IP communication and host interpolation? I looked at the code of both - Host part and URScript part and it struck me that the host part does very little work (just very tight loop of interpolation calculation based on moveit-provided trajectory). But the communication overhead between those two is huge (500Hz). If I understand it correctly (see point 1), most of the positions calculated are discarded and overwritted by subsequent ones. The question is then - why do we use the host <> UR tight 500Hz loop at all? It seems perfectly reasonable (and fairly straightfoward) to implement the whole loop fully in URScript rather than rely on host and TCP/IP. That should completely eliminate the packet delay problem.

There are few reasons I can think of why it might be difficult/impossible:

lack of time functions in UR Script (I could not find any)
lack of math functions in UR Script (I doubt it - it seems to have all we need)
too low speed of the UR Script to make the calculations (I seriously doubt it)
threading model of the UR Script making it too complex (what happens when the time slice of 0.008 se runs out? - it's not clear in the docs)
the need to keep alive the URScript program by TCP messaging (? would be strange).
difficulty of debugging/testing the URScript solution

If none of the above is blocking, I think I'd love to implement the improved URScript with complete control loop inside a thread in UR Script. Do you think there is something from the list (or outside of it) that blocks us from doing it this way?

3) Different commands (than servoj) to control robot There are a number of other than servoj in UR Script that we could use to control the robot. Notably movej/movel/movep/movec (with blend). They cannot be controlled between trajectory points of course and there are potentially non-smooth joint movements at joint points, but maybe if we use them, the overal impression will be smooth. Unlike the interpolation we use now, they do not take velocity into account, but maybe it will be "good enough".

Can you think of other drawbacks of using move series of commands instead of servoj? Do you think move commands are feasible?

Looking forward to your comments! It would be great if we could contribute to improve the driver.

potiuk commented 6 years ago

Update: here are some videos that show the behaviour we observe:

original driver: https://youtu.be/qCQEsYSB4cs
refactored driver with TCP/IP optimisations: https://youtu.be/Gh-3foD9-6c

I have also tried to disable oversampling (so the interpolation was calculated every 8ms) and the results were not nice. The move got totally out of control (it was supposed to finish just above the table and it shoot up extremely fast ending with protective stop): https://youtu.be/58iITB4IzNk . So I understand that oversampling is needed, now I would love to understand what's going on.

gavanderhoorn commented 6 years ago

Some background info on the various motion primitives in URScript: UR10 Performance Analysis (tech report by @ThomasTimm and his group).

ThomasTimm commented 6 years ago

First off, I don't have access to a UR with the newest firmware, so I can't test this issue - if anybody feels like sponsoring one and thus help me maintain the driver, please reach out to me.

In general, to answer you questions: 1) You are right that no feedback is provided from the robot when using the action interface. The UR controller should track the trajectory fairly close with the use of servoj. You have slightly misunderstood how the driver works in one crucial aspect - I think it would ease understanding in the following if you print out the urscript code that is running on the robot (generated in line 148 - 223 in ur_driver.cpp). In driverProg a function called set_servo_setpoint(q) is defined. This function updates the global variable cmd_servo_q with a new setpoint whenever called. It also defines a thread servoThread(). This thread repeats reading the global cmd_servo_q and calls servoj with that setpoint. This thread runs independently of network communication, which is handled in the main thread. Here, data is read from the socket and fed to the set_servo_setpoint function. Thus, we update the target setpoint every 2 ms, but only reads it every 8 ms. This is done to minimize latency, as the setpoint for servoj is at most 2 ms old. In lack of a better word I call it inverse oversampling, as you are not reading faster than data is changing (oversamling), but rather changing the data faster than it is read. Ideally, you would do this much faster, but that would increase network load linearly. If your computer/network can't keep up with the 2ms update rate you could (besides investing in better hardware) update the setpoint at a lower rate, with the increased latency that follows. Note that you should still update the setpoint more often than servojis called, otherwise you will really start to see jerky movement (as you noticed). Nyquist recommends samling faster than 2 times your data speed (so faster than (not exactly) every 4 ms), but if you want to do slower, you can experiment with increased servoj time - although UR likes to change how that affects the robot from firmware version to firmware version. Your computer and the controller most likely doesn't agree on how long 4 ms is, so I wouldn't recommend going any slower than 3.5ms

2 and 3) The reason for doing the interpolation on the host instead of on the robot (in which case you could, as you argue, just as well use movej, as that probably does the same interpolation) is to be able to handle an abort action or new trajectory action gracefully. As we found in the UR10 performance analysis that @gavanderhoorn linked to, none of the move* commands handles a stop command very neatly (see page 27). These test were done on firmware v 1.8.14035, so this might have changed. I can only encourage you to do the same tests on a newer robot. Also, remember that a ROS action interface is not a fire-and-forget interface. As the interpolation happens on the host, it is much easier to manage (update/abort) the goal there as well. Last, but definitely not least, it is a matter of safety. If you send the entire trajectory (which might be several seconds/minutes long) to the robot, and then something happens so you lose connection to the robot, it is not possible to stop it (except with the physical E-stop). With the current implementation, if the stream of new positions stops, the robot will also stop.

It is also a matter of reusing code for maintenance. The current servoj loop is also used for the ros_control-based position controller and, as mentioned above, is the only way to stop the robot neatly on a abort_action with firmware v 1.8 (and maybe into the 3.x). Thus the current code would still have to be included and maintained.

Why so many experience problems with jerky movement on 3.4, I don't know (as I said, I don't have access to a newer robot). My recommendation would be to analyse the network traffic both on the ROS computer and on the controller side (it's an ubuntu machine, just log in with root/easybot and install wireshark) and verify that new poses are sent from the ROS computer and arrive at the controller at a steady 500 Hz. If your ROS controller doesn't transmit at a steady 500 Hz, then that is your problem and you should look into buying some more powerful hardware (or just use a RPi as an intermediate ROS node that only handles robot communication). If your ROS controller transmit at a steady 500 Hz, but the reception on the controller is lumpy, then you probably have some issues with packet losses or buffering in your network. Try a direct cable from the ROS computer to the robot (This should actually always be the preferred way to streamcontrol hardware via network).

gavanderhoorn commented 6 years ago

@ThomasTimm wrote:

Thus, we update the target setpoint every 2 ms, but only reads it every 8 ms. This is done to minimize latency, [..]

just to clarify: which latency are you referring to here? (potential) network latency, controller/motion execution latency or a combination?

ThomasTimm commented 6 years ago

It is the delay from the position is sent from the computer until the robot is at that position. in http://ieeexplore.ieee.org/abstract/document/7424304/ I call this the actuation delay.

potiuk commented 6 years ago

Thanks a lot for the explanation @ThomasTimm and the analysis doc link @gavanderhoorn. I already found out the performance report before and It's been very useful to understand some of the limitiations. I also understand that there were a number of changes to the UR firmware since it was written (including some changes to the API - includng servoj_gain, and lookahead parameters in servoj for example) and I read it with some reservations that it might be out of date.

I indeed initially misunderstood how the oversampling works, but since then I realised that it works exactly as you explained (I like "inverse oversampling" name :D ).. Thanks for that explanation! It makes me more assured that I better understand how it works and can reason about potential solution better. That's exactly what I was hoping for when I opened the ticket. I also did experiments with different oversampling rates, and indeed around 4ms mark the moves become smoother (and above 4ms it gets really bad) - so your explanation is perfectly matching the observations.

It's great to know about the reasoning behind those design decisions - the need to handle stopping and updating the goal mid-flight is really important! Many thanks for that!

Regarding the setup of ours. We are controlling the robot from quite powerful laptops (8 core Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz), lots of RAM, fast SSDs. I also used direct cable connection and static IP configuration to the UR10 PC during my tests. Also - unlike for others - unfortunately the low-latency kernel did not help at all. The important part - we run all of the ROS code (including the driver) in Docker container, which might be another layer of problem, although we use --network=host which in essence means that we are using the network on a native speed/latency without any containerisation penalties. I know you exclude running driver in virtual machine, but Docker contenairisation is quite a bit different beast) And we have friends reporting the very same problems without Docker, so it is unlikely that Docker introduces the problem (but I am going to check it soon).

I have a list of things to test next: runnig it without Docker for sure, running the driver on separate machine (actually we even thought about running it on the UR PC rather than on separate machine to get it even closer to the "metal"). Then checking the packet /stability of TCP/IP frequency is a good idea and I am going to take a close look at that. Another check will be indeed to try to see if we can use other commands including changing the goal and stopping in the middle). I am also planning to plot actual/target positions and see how they follow each other (with slight modification of the driver as seen in one of the older issues). It might take some time before we can conclude alll the tests though (I will continue updating this issue with my findings).

However i start to think that the problem is simply too much relying on TCP/IP low-latency in this case. The main problem I see that there is no synchronisation of time between host and UR PC.Calculating positions on host is done in "well known" time from the start of the move, but the actual servoj command for the particular position might execute with pretty much arbitrary delay after it was calculated (up to 20ms it seems from your performance analysis). Inverse oversampling of course helps, but If you imagine for example that there are just 4 positions queued somewhere on the way every now and then - it's easy to understand where the jerky moves could originate from in this architecture. I think that even some smart UDP communication might give better results (sending position + timestamp and only saving the value if timestamp>current). Unfortunately the way TCP/IP works for small packets/messages - even if you have one packet to retransmit and you already sent several following packets, the whole undelivered set of packets will be buffered and there is no guarantees on how fast the lost packet is retransmitted, so it's really easy to get a number of subsequent messages delayed by single collision on the wire on Ethernet level. Using UDP you can avoid this queuing problem easily. But we have no UDP in URScript :(.

When you look in detail how TCP/IP works- you have to realise how many layers you have to pass on both Host and UR PC and physical link (from high level protocol, to low level drivers, and even potential collisions/retransmitions in the ethernet layer). Other user space/kernel/hardware interruptions might get in the way at any time and there is no guarantee on low latency for single packets/messages nor good control over it. There might be a number of components/layers (both in host and the UR PC/firmware) that might influence it. We have no way to control options when we open reverse connection in the URScript, and we do not know what kind of TCP/IP options are used. There is even note in the official URscript docs: "Note: The used network setup influences the performance of client/server communication. For instance, TCP/IP communication is buffered by the underlying network interfaces"

We do not even know what kind of delays might be introduced on the UR PC side of TCP/IP protocol even if we use low-latency kernel on the host. We don't have Gb Ethernet (it's 100Mb!) on UR. They continue adding new features that might break some old behaviours and asumptions. I can for example imagine that RTDE interface introduced in 3.3 has much higher priority than other connections (especially customly opened from URScript). And we have to remember that the very same TCP/IP stack and wire is at the same time used to continuously send back joint states/robot information @125Hz. I can easily imagine occasional netwotrk hiccups.

The random hiccups make it rather unsuitable for the "semi-production" setting we plan to use the UR for. We need a bit more predictability.

I am happy to hear that there are no fundamental llimitations of URScript that made you choose the architecture over running it all in the URScript. And I understand why you chose the way when you did. I really think however, that in our case the stability of moves and predictability might benefit from interpolating the position (or even using movej if we can make it stop nicely) in UR Script. I can also imagine very easily how we could provide protection you talk about. We could simply send an occasional (much less frequent) heartbeat from the host and check periodically in the URScript control loop if it was send in one of the last few iterations) - one of my favourite leaky-bucket algorithm would nicely fit here. That would make much smaller frequency of interpolation calculation (always and only when needed - just after previous servoj command finished). If we have a good source of time in the URScript, it would be pretty much always accurate at every loop with virtually no delays, also TCP/IP communication overhead would be much, much lighter. I'd imagine using a new URScript for it, and keeping the old one for ROS control - it would be fairly straightforward to have two different implementations of TrajectoryFollower.cpp (building on top and following similar patterns of the nice refactor by @Zagitta). It is around 200 lines of code in total, so I think that's rather doable and maintainable to keep both of them.

Do you think it might make sense overall to try it out ? Any other thoughts on feasibility of it ?

potiuk commented 6 years ago

I slept over it and as often happens I have an idea how I could simplify it, dropping assumption about having good source of time in URSim and not having to move interpolation to URSim. I think we can change the code just slighlty and get much better behaviour and much less TCP/IP overhead and jerkyness.

The rough idea:

We continue calculating the interpolation with 0.008s interval. We can also oversample it if we want and do the calculation with say 0.004s interval. However we should do it upfront and assume fixed intervals rather than check how much of real time passed since start. So as result we will have an array of positions in [0, 0.0008, 0.016, ... ] time scale.
We send the whole calculated position trajectory in one big message to URScript. Comment: - we could actually even bake it in the URScript we send - I am not sure if there are limitations of the URScript size - the trajectory might be quite big and the array is an array of 6 joints. Having it baked in the script migt make it much faster to start and (barring the heartbeat/protection) we won't need the reverse connection (maybe i could find a way to make the heartbeat implemented over on of the standard URScript interfaces)
Then in the UR Script my control would look like (pseudo-code):

Control Thread:
   for I  in range(len(ALL_INTERPOLATED_POSITIONS)):
       SERVOJ(INTERVAL_TIME, INTERPOLATED_POSITION[i])
       if (HEARTBEAT_NOT_RECEIVED_FOR_SOME_TIME):
           DO_BRAKE
Heartbeat Thread:
    while not FINISHED:
        READ MESSAGE via TCP.
        UPDATE HEARTBEAT STATUS

Consequences

I am not 100% sure how threading behaves in the URSim, but from what I understand the control thread and servoj command will get high (near-realtime) priority and each loop is almost guaranteed to execute in exactly INTERVAL_TIME (checking the heartbeat will be a little more complex than the pseudo code, but it will be simple +/- calculation of leaky-bucket on heartbeats - I can fine tune it to be checked every 10th iteration or so. There might be occasional delays but they should not be frequent. Even if there will be an interrupt of any sort/other thread takes over for a while between servoj commands, it will mean simply slowing down this particular segment of the move a bit, but it will never lead to a dangerous over-shooting - we will never tell the robot to move the positions too far from the last position - they will always be in pretty much expected position from previous servoj command and we will always incrementally do one INTERVAL_TIME segment at the time.

In case of any delays, the end result will be that trajectory execution might take a little longer to execute than originally planned and the robot will occasionally slow down, but It will always go through all the interpolated positions (which we want) and it will never overshoot (it is guaranteed in this design that there is no sudden catch-ups no matter what delays we have - those we want to avoid). I think it makes it a very good characteristics for what we want to achieve.

What do you think @ThomasTimm ? Any watchouts/concerns for this design that you can see?

Zagitta commented 6 years ago

@potiuk if you're planning to embed the trajectory directly in the uploaded urscript I believe that's one of the approaches Thomas has already tried and if I remember correctly it became unfeasible rather fast as the number of positions increased due to some startup and parsing penalty. Regarding my refactoring I agree TrajectoryFollower is a good place to inject your new "control scheme".

potiuk commented 6 years ago

Thanks @Zagitta - that's the kind of comments that I love and make my life much easier - It will help me to avoid some dead-ends when I implement it :).

BTW. I really like the refactor. It's super easy to read it and modify it. One of the other pull requests that I will make shortly will be the capability of controlling the RG2 Gripper (which nicely integrate with UR Robots) via the ur_modern_driver. I based it on https://github.com/sharathrjtr/ur10_rg2_ros (they used an old version of the original ur_modern_driver as base) - I added depth-compensation option tht we need. I found that modifying your refactored code will be so much nicer and easier to maintain, so we are going to port it following the patterns you introduced.

ThomasTimm commented 6 years ago

Thee is no doubt that your computer is powerful enough to run the driver (as stated previously, it can run just fine on a RPi). But apparently it lacks power to do all the other stuff you want, seeing the jerky-ness gets worse when you start doing all the other things.

I have no experience with docker, so I can't say how that does or doesn't influence network performance.

You are right in that the core problem is relying too much on TCP/IP low-latency; that is a fundamental drawback of this approach. Control should always be done as close to the hardware as possible, and not via network. Unfortunately, a lot of users won't install ROS (or anything else) on their brand new robot in fear of bricking it, we learned that with the C-API approach. So from a ROS-Industrial point of view, we have to have a solution that doesn't require installation on the robot. If you, in your controlled environment has the option of installing ROS on the controller, and supply that to your customers, I would certainly recommend that. What you must remember is that the "official" ROS driver, targeted at a wide and varied audience, has to be very general and be able to handle all likely and unlikely use cases if possible.

If, for your application, using movej would be sufficient, you might want to consider just sending the urscript via the implemented topic for doing so. That way you would have a workable and stable solution running by the end of today. But where's the fun in that? ;P

As Zagitta says, sending the entire trajectory pre-interpolated is just not feasible for the general use case. You (or at least I) just knows that at some time, somewhere, somebody will try to send a trajectory of several minutes of length. Each minute worth of trajectory is 125 60 6 = 45000 joint values in clear text, plus overhead like parenthesis and commas. That would take quite some time to transfer and for the robot to process (I urge you to try it out; make a small python script that generates the urscript for a 5 minutes long trajectory and uploads it to the controller), making the user think the robot doesn't respond.

I guess the stability could be increased by implementing some sort of intelligent buffer in the urscript code and then transmit say 40ms of trajectory every 2ms ( or maybe 8ms?), that continuously updates the buffer. But until you have scrutinized every corner of your network setup and made sure data is flowing steadily, any such attempts will only treat the symptoms and minimize the jerkiness, but it wont remove the problem - you will still experience jerkiness when the delay spike every once in a while is 50ms. And then we are back at the core problem of relying on low latency TCP/IP. But as that is the only way of controlling the robot (except if you can install ROS on the controller), the ~~best~~ only way to get stable control of the robot is by having your network run smoothly.

We have previously discussed the idea of including control of the RG2 gripper (or any other gripper for that matter), but as it doesn't have anything to do with the robot, except having a matching plug, such functionality should be kept in a separate package.

gavanderhoorn commented 6 years ago

First: 👍 for (re)starting the discussion about driver design decisions @potiuk. Much appreciated.

@ThomasTimm: thanks for the insights.

We have previously discussed the idea of including control of the RG2 gripper (or any other gripper for that matter), but as it doesn't have anything to do with the robot, except having a matching plug, such functionality should be kept in a separate package.

This definitely gets a 👍 from me: let's maintain separation of concerns as much as possible.

And then we are back at the core problem of relying on low latency TCP/IP. But as that is the only way of controlling the robot (except if you can install ROS on the controller) [..]

It's a nice idea to get closer to the controller, but all 'external' interfaces that exist use TCP/IP. Even if a process runs on the controller itself, it's limited to TCP/IP. Afaik, everything communicates with the URControl daemon (and everything is: Polyscope, RTDE and the Ethernet/IP daemon that was added) using sockets.

As Zagitta says, sending the entire trajectory pre-interpolated is just not feasible for the general use case.

You could argue that if the choice is made to essentially upload the entire trajectory as a URScript that we could let the robot interpolate, which would mean back to move*(..). That would allow for much sparser trajectories, lowering the amount of script lines transferred.

However, we would be giving up control over the execution of the trajectory in a way which would probably make it deviate in ways that would be undesirable.

ThomasTimm commented 6 years ago

Another reason why I didn't use the move* command is because I couldn't see how to handle chained trajectories/trajectories with intermediate poses, where the velocity at those poses are non-zero. The only way (as far as i know) to avoid having the robot stopping at each pose, when using move*, is to specify a blend radius, but that is certainly not desirable in all circumstances (I would hate to have a welding robot cut corners like that) - also what should that blend radius be? So while move* might be applicable in @potiuk use case, I don't see it being used in a general purpose driver.

potiuk commented 6 years ago

Thanks for the comments @gavanderhoorn @ThomasTimm -> I think I am getting a full picture now.

I understand the problem with long moves and sending full trajectories. You as package maintainers must take into account all the different scenarios.

Sending full trajectory

As Zagitta says, sending the entire trajectory pre-interpolated is just not feasible for the general use case. You (or at least I) just knows that at some time, somewhere, somebody will try to send a trajectory of several minutes of length. Each minute worth of trajectory is 125 60 6 = 45000

Yeah. I see it being a problem. Our case is a bit simpler. We split the moves into few (4) separate shorter moves and no single move should take more than 6 seconds. We are planning to do a number of optimisations like planning the next move in parallell to executing the previously planned one (in order to avoid robot to pause between move. This means that we are talking about 125 6 6 = 4.5K positions - much more feasible to send in one go. Of course - I wil first work on a solution that will be good for us, but I would love to be able to contribute it back via pull request eventually, so that others having similar cases could benefit. I think what I could do is have the "pre-planned" trajectory exectution implemented as experimental feature enabled with a flag AND only enabled for short moves (limited by planned execution time). If the planned execution time is longer than limit, I would fall back to the original method.

As a next step (but that would be follow-up) I could work on splititng longer move into sub-segments of up to 6 seconds long for example. If we can make like 6 seconds long segments of trajectory, I don't think we will be back to the low-latency problem. It would be much better optimisable actually - because we could send the second segment during execution of the first one and swap between the current/next buffers stored in URScript. I think we could then entirely avoid low-latency problem this way. Sending bigger block of data of TCP/IP rather than small messages is exactly what TCP/IP was designed for and I am 100% sure we can receive the whole next segment during the time the previous segment is being executed. It is quite a bit more complex to implement, but the implementation can be staged like I described - first simple short moves only and then segmenetation when we got it proven and tested (and maybe merged :D). I will make some tests on how big data we can transmit etc.

2) RG2_Gripper:

We have previously discussed the idea of including control of the RG2 gripper (or any other gripper for that matter), but as it doesn't have anything to do with the robot, except having a matching plug, such functionality should be kept in a separate package.

I perfectly understand your point. And now when I think about it, it indeed makes little sense to make it part of the driver. Closing and opening the gripper is essentially executing a custom RG2 URScript command, so instead of adding RG2_Gripper I would rather rea-dd the /ur_driver/URScript topic that allows to execute arbitrary script and our own node will simply execute the RG2 open/close script via this topic. Was there a specific reason this topic was removed during the refactor @Zagitta ? Note that it is still present in README even when it is removed from the code during refactor ;). I am happy to re-add it during my task if there are no fundamental problems with it.

3) Using MoveJ

If, for your application, using movej would be sufficient, you might want to consider just sending the urscript via the implemented topic for doing so. That way you would have a workable and stable solution running by the end of today. But where's the fun in that? ;P

Yeah. Where would be the fun :P, indeed. I don't think the changes I am proposing will take a long time to implement (especially the first part for short moves only). I will definitely try simpler move* solution along the way and see where it can get us, but similarly to what @gavanderhoorn wrote - I think the limitation of move commands is that you cannot simultaneously control position and velocity changes (you can just specify single speed of leading axis and acceleration of tool but no start/end velocities). I really like the servoj approach mainly because by doing interpolation and servoj on our own, we are taking into account both position and velocities calculated by MoveIt, as well as we make sure that the transitions between each step is really incremental. I am afraid just basing on the moves planned using internal move commands will be far less than ideal - possibly much worse what we observe now. So we could really benefit from all the calculations done by any of the MoveIt planners we choose to use (including our own planners in the future). And we are not taking into account accelerations calculated by MoveIt planners, but in the future if we continue doing the interpolation in the host, and we got the stable mechanism of trajectory segmentation, there is nothing to prevent us to implementing a bit more complex interpolation algorithm to also include acceleration.

Zagitta commented 6 years ago

I think the primary reason the UrScript topic isn't in my refactored version is because I forgot, at least it should be trivial to add :-)

potiuk commented 6 years ago

Thanks @Zagitta - i will add the urscript pull request shortly.

Update: I made tests with and without docker (withe the TCP optimised driver) and there is no noticeable difference as I suspected. Some random moves are still slowing down in the middle regardless if docker is used or not. Seems that there is slight difference (improvement) with realtime kernel and no docker but it's hard to quantify and even with that the moves are quite far from the smoothness we would expect.

I will post further findings here.

gavanderhoorn commented 6 years ago

@potiuk wrote:

Some random moves are still slowing down in the middle regardless if docker is used or not.

I didn't expect Docker to make any difference, but perhaps you're being bitten by ros-planning/moveit#416 (check also the OMPL issue on BitBucket).

ThomasTimm commented 6 years ago

Well, that is interesting Gijs. @potiuk Have you actually checked that the trajectory that is sent to the driver is jerk free? Take the poses sent to the driver's goal topic and interpolate between them with the qubic interpolator used in the driver.

gavanderhoorn commented 6 years ago

@ThomasTimm wrote:

Take the poses sent to the driver's goal topic and interpolate between them with the qubic interpolator used in the driver.

or use an Indigo version of MoveIt and see whether that improves things.

The change that seems to impact this the most is only 5 lines or so, but an Indigo MoveIt is probably easiest.

@potiuk: there are Docker images for MoveIt available which should make this not too hard. See Using Docker Containers with MoveIt!.

potiuk commented 6 years ago

I didn't expect Docker to make any difference, but perhaps you're being bitten by ros-planning/moveit#416 (check also the OMPL issue on BitBucket).

Interesting threads indeed @gavanderhoorn and @ThomasTimm . Thanks! I will check it before attempting to rewrite it. Indeed trying on indigo and our moveit configuration might be simplest. - I was also planning to take a look at the plots and make sure those trajectories are OK. I will do it before I attempt to do the almost-no-TCP/IP rewrite.

potiuk commented 6 years ago

Short update on the planned improvement work (after i will check if moveit trajectory is good). I got quite good hands on with the URScript/robot interface and I understand limitations of the URscript better. It seems that due to the urscript API limitations (send/receive methods) it will be indeed difficult (maybe impossible) to send even the 4.5 K data to the URScript. @ThomasTimm - now I understand what you meant :D. I will be in contact with UR engineers shortly and will raise some questions about it, maybe there will be some ways around. I have quite some experience in doing similar work from the telecommunication world (I used to program telecom switches that had even more limitations on messages sent/data stored internally).

This means that I will have to fall-back to the earlier idea of sending the coarse trajectory from MoveIt one-by-one to URScript instead of interpolating whole trajectory upfront. I will keep three trajectory points in URscript - previous target, current target and next target. This way I will avoid lags in communication. I will send subsequent point when the urscript will be busy with iterating through interpolations of the previous step. Then the interpolation will be done in the URScript. This should make it possible to have even very long trajectories working smoothly and without lags. We have some 40 points on average for 4 second moves generated by MoveIT, so seems that we can get back to around 10Hz communicaoin frequency - this should be more than enough to get rid of all the communication lags.

I got quite familiar with URScript now and I am able to do efficient 2-way communication urscript <-> ROS node (I got working ROS actionlib interface to control RG2 gripper using such approach already), so I don't think it will be too complex.

It should then allow for much less frequent communication and better control while trajectory is executed (no need for separate heartbeat).

gavanderhoorn commented 6 years ago

@potiuk wrote:

Short update on the planned improvement work (after i will check if moveit trajectory is good)

just a quick comment: I would really recommend you check the output of MoveIt before starting to change the driver.

You'll need to do it anyway and it shouldn't take that much time.

potiuk commented 6 years ago

I certainly plan to do it (and in this sequence).

This week I am busy with other parts (like achieving a milestone we planned - for this the refactored driver + TCP/IP options set + my RG2 Gripper node should be quite enough for the milestone).

And the speed improvements will start to be important after we achieve this milestone - then I will start with plotting and analysing the problematic trajectories and making sure that the problems are not caused by MoveIt planning. This will be base for our decision where to focus our effort first.

AndyZe commented 6 years ago

There are some fundamental limitations with the response time of position control. You can crank the control frequency up as high as you please but I think you will still see a faster response with velocity control at a lower control frequency. See http://proceedings.asmedigitalcollection.asme.org/proceeding.aspx?articleid=2482045

I know it sounds crazy, but that's what a Bode plot shows. (at least for the compliant robot I was working on, but I would guess for other cases as well)

AndyZe commented 6 years ago

“Although they are mathematically equivalent, velocity based and position based impedance control produced very different experimental results.” -Duchaine, 2007, "General model of human-robot cooperation using a novel velocity based variable impedance control"

Not to say, there couldn't be another issue causing this...

AndyZe commented 6 years ago

Other benefits of velocity control, i.e. sending speed commands to the joints: more robust to control signal delay and probably more energy efficient.

AndyZe commented 6 years ago

Further corroborated on pg. 12 of this pdf that @gavanderhoorn linked earlier. Notice, speedj has no downsides while the other command types all have one issue or another.

potiuk commented 6 years ago

Ok. Good ideas. Short update on where I am with the issue:

We achieved what we wanted in the last week and now our focus is on improving the speed and smoothness of the robot so I am mostly working on it this week and likely after holidays
Before diving deep and reimplementing I decided to build some tooling that will allow me to repeatably run and compare MoveIt generated plans with interpolated trajectories and actual joint states during execution on single set of charts This will allow me to quantify before/after results and make a verifiable proof of improvements other than feeling (I will be able to run and generate it multiple times for the same repeatable moves and see the results immediately. I will run it with/without additional load on the PC and see the impact. Plus it will capture robot video as well so that I can see the robot in action. I am close to finish that one.
I plan to switch to pre-kinetic moveit and see generated trajectories with it - using the same toolset. But I might not need to, if I see that the same trajectories generate different behaviour.
I already implemented and tested an important component of the solution - i.e. hand-shaking between the driver and ursim. I already implemented a control loop where my (python) client sends subsequent trajectories to urscript and the script consumes it one by one - in a 'slow' fashion - i.e. it will ask for the next trajectory point only after it started to interpolate positions between the two previous pointa. It works nicely and with moveit trajectories of 2/3 points per second - it should be really stable. The interpolation is not yet there but I am working on it.

potiuk commented 6 years ago

Clicked close by mistake. :). Continuing ...

With the tooling and script and hand-shaking I will be able to try different methods easily. I am certainly going to try different variations (including speedj)
Last but not least - we had a very good call with Jacob - the heading UR+ support and we had very deep technical discussion with him. He was extremely helpful in answering our questions and he confirmed that our suspicions about TCP interface opened by the urscript are likely valid. On the robot side UR implemented additional driver that handles TCP/IP connection (that's the one that limits the maximum number of numbers sent in one message) and it does introduce latency and delay - especially if the urcontrol is busy with controlling the robot. Urcontrol has highest priority and it will preempt other threads at times for short periods of time - including the TCP/IP driver.

What Jacob suggested instead (and that was actually great idea) is to use the - relatively new - RTDE interface for communication with the driver. It looks like the RTDE client has possibility of setting/updating up to 64 ints, 64 bools and 24 floats - that can be read by the urscript (and urscript can similar data back). This interface has much higher priority than TCP connection opened by the urscript and is pretty much guaranteed to be real time. Moreover the urscript can read /write those much faster (they are accessible via fast registries rather than having to be parsed via receive_ commands). That sounds very promising and I might also be able to try this one with my experiments (especially that in the future we will need much finer grade control of the robot and nearly realtime feedback loop).

AndyZe commented 6 years ago

Cool. Well the new jog_arm package is a pretty good example of sending speedj commands to a UR robot, and it seems buttery smooth at 100Hz to my eye. It won't be up to the task if you're doing high-precision machining or something like that, though. It's mostly intended for human teleoperation.

potiuk commented 6 years ago

Yep. Thanks for the link! That's the doubt we have with velocity control - that at the end we will have to anyway use position control (at least at the very end of the move) to move the arm to desired precise position anyway. And switching speedj -> servoj commands mid-flight is not going to be smooth at all. We discussed this with Jacob today as well and he confirmed position is probably the best way to go for us. Big part of our discussion was about calibration and how can we bring the individual calibration of each robot to reflect it in our URDF/kinematic model that we use outside of the robot (MoveIT). Precision (both accuracy and repeatability) is a must for us.

potiuk commented 6 years ago

I got some interesting charts to share. I actually manage to record both - pretty good and smooth move with real robot as well as one that had exactly the dangerous catch-up that is the problem of ours. The "catch-up" was so abrupt that it ended up with the robot in protective mode. It was even so abrupt that vibrations caused our light camera stand to shake a bit (you can see it in the movie that I recorded).

It occured when the ur_driver laptop was under quite a stress (reading and visualising in RVIZ the point cloud from RGBD camera + recording the video itself). But regardless of that I think it shows quite clearly the mechanism of what's going on.

You need to zoom-in the images quite a lot to see it better (they are big), but they should be quite self explanatory. I also re-wrote (well mostly copy-pasted) the interpolation code to python and interpolation is also visualised there (green) - this way I am sure that interpolation works as expected (and makes it easy to port to URScript). The problem looks indeed suspiciously like catching up when some of the trajectory points are queued somewhere on the way and delivered almost instantly to the URScript.

The "bad" move details are in

Corresponding video is here: https://youtu.be/qc_NM32XLvI (see the "abrupt catch-up" around 00:13s)

The "quite ok" move is here:

Corresponding video: https://youtu.be/-FqekvoSe7M - it still has some slow-downs but we are less concerned at this stage (the catch-up is what bothers us most). They indeed look like induced by moveit plan rather than by the driver/communication.

There is one suspicious thing about those charts - it looks like all the joint positions/velocities are planned in almost perfect "sync" - different ranges but for different joints shapes are pretty much the same (though you can see small differences). I will look a bit more, whether it's a bug in visualisation code, but we found it pretty OK that in an empty space with no collisions MoveIt planner will do exactly this and move the joints in sync using very similarly looking paths. I saw similar patterns when I was looking at rqt_graph charts so I am quite certain it is ok.

We will analyse it further, but maybe @gavanderhoorn @ThomasTimm you can take a look and have some additional observations?

potiuk commented 6 years ago

I have some good results with reimplemented logic.

Just in time to leave for Xmas :). This is the first time those moves are controlled by my modified control loop on the robot itself (URScript). We only send MoveIt trajectory to the UR control (with few Hz) and then the URScript does all the interpolation internally (with 125 Hz).

Here is the video: https://youtu.be/PPeJRgBuPSs

It seems to work very smoothly (still moveit generated trajectories might be improved). I still will need to run some full load tests to see how load on the laptop impacts it.

I checked that it even works over WiFI (!) - it behaves as I intended. Right now when there is a network delay over WIFI, it will stop pretty much immediately rather than "catch-up". This will still be improved in the near future and I will make it slow down instead - and continue when the messages arrive. That might require a bit more fine-tuning, but it should be fairly possible.

What's more - it even uses python client that I use for testing, rather than modified C++ modern driver, so I am pretty sure I am on a good track.

Overall - it looks really promising that we can solve the jerky behaviour (so that in the worst case the robot slows down rather than dangerously catch-up). I will continue with it after Xmas and hopefully you will see another pull request coming to the refactored version of the modern driver.

potiuk commented 6 years ago

BTW. Merry Xmas and Happy New year to everyone reading it ;)

potiuk commented 6 years ago

Some further progress:

I now have a rock-solid implementation of it working and tested. What I have so far:

no dangerous catch-ups ever (it's built-in in the algorithm). The robot stops - sometimes quite suddenly at the end of the move or when client is stopped but it will never, ever over-shoot the move. And I did not have a single "protective mode" case yet even with 3x speedup.
it follows the trajectory very well (see the charts).
It can run it with very fast speeds (i tried it up to 3x the speeds we dared to run it so far). It is perfectly safe now even if robot moves very fast - see the movie at the end. I can easily control the speedup by changing the ratio of "Robot time" vs. "Real time" - so I can even speed up already generated MoveIt trajectories (for tests now - later I will simply get MoveIt configured to allow faster moves).
no observed degradation even if the laptop is fully loaded (3D Camera, publishing depth-cloud and displaying it in RVIZ at the same time + recording video)
no observed degradation even when I run it over WiFi (!). It is really very resilient to pretty much arbitrary delays in the communication
it stops when client is stopped. I am also implementing "cancel" functionality for the action in ROS.
the client part is still in Python (!). I have to rewrite it in C++, but the URScript side will remain pretty much the same.
I have a very nice debugging built-in. It introduces some inacurracies in following the trajectory when I enable it, but it allows to see what's going on easily.

Pull request follows shortly.

The movie is here: (pauses between moves are simply when I generate the nice charts): https://youtu.be/8tWmKgFU3_g. This movie is recorded with laptop connected over WiFi.
Recorded behaviour for normal speed:
Recorded behaviour for 3x speedup over WiFi (robot time flows 3x faster than real time - the charts are right aligned):

potiuk commented 6 years ago

I just created the promised pull request (to the refactored version)- with very comprehensive description. results of my testing, movies, charts, and so on - so I won't repeat it here. Simply head on to https://github.com/Zagitta/ur_modern_driver/pull/9

I'd love if people who had similar problems check it out, build and see if it solves it for them as well. Note that you need to enable it via "use_safe_trajectory_follower" and without "use_ros_control". You can check which driver is used by looking at the logs in Polyscope.

potiuk commented 6 years ago

@gavanderhoorn. One more comment. I had a chance to test planning with Indigo. Quick tests shown us that the robot seems to generate a bit better trajectories for some moves (without full quantitative measurements). But for some other moves, the trajectories are not very smooth and we got an example of it very quickly. Example here:

jarekpotiuk_20180104153404_move_02 movetype moveit

rohitmenon86 commented 6 years ago

@potiuk To introduce myself, I am a researcher at DFKI, Bremen using universal robots for manipulation research. We use the RoCK framework (https://www.rock-robotics.org) for controlling our robots. For our recently purchased UR5 (FW3.4), we have adapted the ur_modern-driver for controlling the robot via an Orocos component.

We have been facing the same jerky motion problem. Though we did not face it with the earlier UR10 with FW 3.2. Hence I suspect this is the same problem. We are using an interpolator at 4ms to send trajectories to the UR driver and then we use the speedj command to send the command to the robot. I am attaching the command and feedback joint positions of the first 5 joints. There was no motion in the wrist3 joint.

I saw that you implemented changes including using ros trajectory follower. However since we are not using ROS, can you please help understand the changes you have made so that I can reimplement them in RoCK?

figure_1_shoulder_pan_joint figure_2_shoulder_lift_joint figure_3_elbow_joint figure_4_wrist1_joint figure_5_wrist2_joint

potiuk commented 6 years ago

@rohitmenon86 -> In essence what i've done was I moved the interpolation loop to URScript and lowered (2 orders of magnitude) communication frequency ur driver <> urscript.. Rather than sending TCP/IP message with 4 ms interval I am sending only coarse trajectory points (as calculated by MoveIt) - which happens to be max. few per second rather than 500/second with 4ms interval.

I implemented a communication scheme where you have separate threads sending and receiving data. Sending thread notifies the connecting driver when it should send the next trajectory point and receiving thread receives those points when the driver sends it. The communication happens using reverse TCP connection opened back to driver by URScript.

The main "work" is done in third thread - controlling thread - which is then not interrupted by sending/receiving, it exchangest the data with sending/receiving threads via global variables and performs interpolation (by default every 8ms /125Hz) calculating positions based on positions and velocities of the coarse trajectory points calculated by MoveIt. An important part of the solution is that I do not use the "real time" from the start of the move. I assume that every interpolation step takes exactly 0.008 and I assume that's the time that passed - so we make the interpolation calculation independent from any interruptions/delays. This make the move potentially lasting longer, but without the "jerky" catch-ups.

One more comment - you can simplify the threading model by using RTDE data exchange (using built-in registries). We did not want to change ur_modern_driver to use RTDE, but we are working on RTDE simplified driver (to be open-sourced) for our internal purposes.

rohitmenon86 commented 6 years ago

@potiuk Thank you for the quick reply. One more question- We have our own Whole Body control with collision avoidance controllers which uses velocity interpolator. Hence we prefer to send finely grained commands at 4/8 ms interval rather than coarse trajectory points. Will the refactored driver with its 3 thread scheme perform better for streaming speedj commands? Have you done any such experiments? (Asking this to get an idea before I develop an Orocos component for this)

potiuk commented 6 years ago

I do not know details of your solution, but the 'jerky move' problem was not caused by using some specific commands, but because of unreliable TCP stack and the fact that requrement of reliabilty of this connection at 500Hz frequency of communication was built in the original design. I only looked very closely at position based interface built-in the ur_modern_driver - I have not checked the other types of control.

So if you are sending something to robot over TCP stack very frequently (like 125Hz+) and rely on good timing and low latency you might have similar problems. But there could be a number of factors influencing it - the detailed design of communication between the driver and the urscript. Do you have any design doc describing the communication interface ? I think you'd have to assess yourselves (following the line of thoughts explained above) if the problems you have could have the same causes.

gavanderhoorn commented 6 years ago

@rohitmenon86 wrote:

Will the refactored driver with its 3 thread scheme perform better for streaming speedj commands?

there is almost no streaming any more in the variant implemented by @potiuk. Or: the time resolution (and thus space, via velocity) of the stream is very low, as sparse trajectories are interpolated on the controller itself.

I don't have any nrs, but I would expect that if you (@rohitmenon86) are trying to close a position control loop over a velocity control interface from outside the controller, sending sparse trajectories and having the controller interpolate them is not what you'd want.

gavanderhoorn commented 5 years ago

With the merge of #120 (which included the work by @potiuk) both approaches (ie: interpolation on the ROS side as well as the low-bandwidth trajectory follower) have been integrated in the driver.

Thanks for the analysis, suggested changes and the contribution @potiuk :+1:

And thanks to all the commenters on this thread for providing input.