rotational axis deceleration error

lllars commented 9 years ago

I'm having problems decelerating from certain rotational axis moves. The problem manifests as the motor stalling out and missing steps as soon as the move starts to decelerate. This only occurs for moves that are both fairly large and that end at a large rotational axis position value. Some examples:

Move from A-2500 to A2500 --> no problem.
Move from A0 to A5000 --> motor stalls at end of move.
Move from A0 to A4000 --> no problem.
Move from A5000 to A6000 --> no problem.
Move from A5000 to A7000 --> stall.

Note these values are degrees, so these aren't ridiculously huge values. 5000 degrees is approx 14 turns. The problem gets worse for larger position values:

Move from A50000 to A50250 --> no problem.
Move from A50000 to A50300 --> stall.

Reducing the value of $ajm does not help, nor does reducing the feed rate of the move. It appears that moves too short to plateau do not have this problem. The move status messages seem to show normal deceleration. Here is an example. Note that the stall was observed at exactly the same time as the first deceleration status message (posa:51347.098,vel:14400.00).

g0 a52000
tinyg [mm] ok>
vel:13.17,stat:5
posa:50000.203,vel:99.33
posa:50001.031,vel:312.25
posa:50003.102,vel:685.79
posa:50007.043,vel:1223.71
posa:50013.590,vel:1932.73
posa:50023.434,vel:2801.76
posa:50037.191,vel:3809.67
posa:50055.812,vel:4950.59
posa:50079.469,vel:6168.45
posa:50108.379,vel:7421.90
posa:50141.887,vel:8667.69
posa:50181.184,vel:9840.27
posa:50225.406,vel:10947.02
posa:50274.074,vel:11928.49
posa:50326.578,vel:12756.53
posa:50381.113,vel:13401.26
posa:50437.977,vel:13873.76
posa:50496.441,vel:14180.18
posa:50557.062,vel:14343.59
posa:50618.133,vel:14396.85
posa:50679.281,vel:14400.00
posa:50739.242
posa:50800.402
posa:50861.562
posa:50922.723
posa:50983.883
posa:51043.844
posa:51103.805
posa:51163.766
posa:51224.926
posa:51286.086
posa:51347.098,vel:14400.00
posa:51407.059,vel:14384.41
posa:51467.988,vel:14294.98
posa:51528.273,vel:14076.29
posa:51587.289,vel:13696.00
posa:51644.309,vel:13138.51
posa:51697.539,vel:12419.47
posa:51747.434,vel:11541.42
posa:51793.371,vel:10526.04
posa:51835.617,vel:9379.52
posa:51872.812,vel:8158.38
posa:51904.746,vel:6904.18
posa:51930.891,vel:5660.19
posa:51952.457,vel:4491.76
posa:51969.102,vel:3391.47
posa:51981.352,vel:2418.32
posa:51989.797,vel:1600.00
posa:51995.055,vel:965.52
posa:51998.031,vel:503.30
posa:51999.441,vel:206.33
posa:51999.930,vel:50.72
posa:52000.000,vel:2.24
vel:0.00,stat:3

lllars commented 9 years ago

Update: I tried to fool the system by scaling down my value of $4tr by 100 and adjusting max velocity and jerk values proportionately. Now it goes 360 degrees and thinks it only went 3.6 degrees. However, that didn't make the problem any better. The motor now just stalls at lower position values (like 50 instead of 5000).

For reference, I do have a fair amount of gear reduction on this axis. The motor turns 77 times per revolution of the axis. 4 microsteps, 1.8 deg per step.

lllars commented 9 years ago

Note that issue occurs even when there should be no deceleration after a move. For example:

G90
G1 F10 X-6 A-24
G1 F600 A-25

Here the intent is that the x-axis moves while the a-axis rotates, then the x-axis stops but the a-axis keeps rotating at the same speed. Note that the feedrate is specified differently for moves involving only the rotational axis. On my machine these two values result in the same rotational speed.

The result of this code is a motor stall at the end of the G1 F10 move.

lllars commented 9 years ago

Update: Added some printf()'s to st_prep_line() to see what was happening. It looks like it's not a deceleration problem, but rather a single move segment with a large difference in the amount of travel_steps towards the end of the plateau section of the move. The deceleration part of the move actually looks very smooth. Here is an excerpt from a G0 A50000 to A50300 move:

s_time: 0.000083        f_error: 1.0000         t_steps: 205.0000
s_time: 0.000083        f_error: 1.0000         t_steps: 204.0000
s_time: 0.000083        f_error: 1.0000         t_steps: 205.0000
s_time: 0.000083        f_error: 1.0000         t_steps: 204.0000
s_time: 0.000083        f_error: 1.0000         t_steps: 177.0000
s_time: 0.000083        f_error: 1.0000         t_steps: 204.0000
s_time: 0.000083        f_error: 1.0000         t_steps: 205.0000
s_time: 0.000083        f_error: 0.0000         t_steps: 204.0000

Note the odd single segment with only 177 travel_steps. The deceleration part of the move begins about 15 move segments later, and is very smooth, decreasing in length by 1 travel_step for each move segment.

giseburt commented 9 years ago

@lllars Great digging! Thank you!

That is a very odd thing you're seeing. Could you add to that output the value of mr.section, so we could see the exact segment that is the last of the body and which is the start of the tail?

I can't see a place in the code where the body segments would be of different size (more than the dithering you see, which is caused by each segment being an integer number of steps) in the middle of the move. This leads me to believe that that segment is the last of the body or the first of the tail. (The head and tail sections are an S-curve, so you might not see the velocity change immediately after they start.)

Knowing which (or neither!) of those cases it is will help greatly to diagnose this.

Also: which branch and commit are you using code from?

lllars commented 9 years ago

Ok, added the line:

printf("\tmr.sec %d", mr.section);

to st_prep_line with my other debug lines. It looks like that odd move segment is the last one in the body. mr.section is '1' for that line (and lines previous), and '2' for all lines afterwards.

I am running my fork of edge 82.11 (which has backlash compensation).

giseburt commented 9 years ago

Oh hold on. This is the branch with the backlash compensation? That's likely the issue.

Let me explain (roughly) why I believe so:

_exec_aline_body() divides the body segment into same-length same-velocity segments.
_exec_aline_segment() then calculates the length of each segment like this:
```
float segment_length = mr.segment_velocity * mr.segment_time;
```
Then the "target" for this segment is calculated based on the current position and that segment_length
Repeat for each segment on the body. (In fact, for every segment, with the exception that head, body, and tail segments are prepared differently.)

When you're manipulating the steps taken with the backlash compensation, if you don't reconcile the position the planner expects with the position the machine is actually at, then at the end of a section (head, body, or tail) it will "erase" that difference, causing the discontinuity that you're experiencing.

I'm not sure what to suggest as a solution. Your tests with the encoders are pretty far off from what we expected them to be used for, and I'm not exactly sure how you're manipulating them.

You're also using them for much larger compensation than we have worked with. You might need to have the _exec_aline_*() functions recompute the segments more often (or with fewer assumptions, at least) than we currently do.

-Rob

PS: Please mention if your results are on your own experimental branch, and test against the branch you forked from as well. We really appreciate the experimentation and don't mind helping as we can, but it would have been nice to know that this isn't in the code that we were looking at.

lllars commented 9 years ago

Sorry for the confusion. In the future I will make sure to mention what branch I am working on up front.

Your explanation is interesting and helpful. However, this bug is definitely not related to backlash compensation. I don't have backlash compensation enabled for the A axis. Or, rather it is enabled, but set to zero, which means it doesn't do anything. This bug does occur in moves in which only the A axis is involved.

If you are interested in exactly how I've set up backlash compensation, it is really very simple. Just look at lines 1107 through 1129 in stepper.cpp at this link. That is the only part that actually does anything: it offsets the following error by the specified backlash amount when the axis is moving in one direction. All the rest of the changes are just variable declarations.

lllars commented 9 years ago

Following along with your explanation and looking at the code, I'm starting to understand what is happening. I started plugging numbers into the formulae in _exec_aline_body() and _exec_aline_segment() using the G0 A50000 to A50300 move I mentioned above.

The problem happens when mr.gm.target is converted to mr.target_steps using ik_kinematics() in _exec_aline_segment(). My machine has 171.111... microsteps per degree in the A axis. For this move, mr.gm.target is approx 50000. Converting that to (micro)steps gives 50000 * 171.111... = 8,555,555.555...

mr.target_steps is a 32-bit float; it can only handle 7 significant digits. So, the decimal part of the number of steps is getting lost. Each segment length is thus off by that little bit, and all those errors add up until the last step of the body is reached. The length of this last step is set equal to the distance to the next waypoint by copy_vector(mr.gm.target, mr.waypoint[mr.section]). The difference in length of this step is equal to the accumulation of all the previous errors.

lllars commented 9 years ago

Thinking about how to solve this problem....

1.) The easiest solution would be to switch to 64-bit floats, but I think they are not available to us, right?

2.) One approach is to calculate travel_steps[] directly from the segment_length rather than from the difference between mr.target_steps[] and mr.position_steps[]. Something like this:

mr.travel[i] = (mr.unit[i] * segment_length);
ik_kinematics(mr.travel, mr.travel_steps);
mr.target_steps[i] = mr.position_steps[i] + mr.travel_steps[i];

That would allow the travel_steps to be calculated without loss of precision, however mr.target_steps[] would still accumulate error. This error would make its way into mr.following_error[] which may or may not be a problem depending on whether mr.encoder_steps[] also accumulates the same error. Thus, this doesn't seem like the best approach.

3.) Another option would be to declare these variables as fixed point rather than floats. Reading up on it, it looks like long _Accum would be appropriate, as it would provide 32 bits of precision on both the left and right sides of the decimal point. Yes, the fixed point type lacks the amazing 10^39 dynamic range of the 32-bit float, but the float's lack of significant digits really limits its usefulness when counting steps beyond approx 10^6. long _Accum is good to over 10^9. Furthermore, it would be fairly simple to extend the range further if deemed necessary by using the extra precision available to the right of the decimal place. One could, for instance, divide all step counting variables by 1000.

In the case of my particular machine, and this particular bug, the switch to fixed point variables would eliminate any problems until I try to exceed 70000 revolutions. That is without the "divide by 1000" trickery. Remember that I am currently experiencing problems at 14 revolutions. So, I think this is a good approach to solving this bug.

giseburt commented 9 years ago

We've had success in the past with Kahan summation (http://en.m.wikipedia.org/wiki/Kahan_summation_algorithm). In fact, I believe we've even had it in this part of the code in the past and determined (apparently wrongly) that it was overkill.

It's fairly simple to implement and is designed to solve just that exact kind of round off that you're seeing.

-Rob


lllars commented 9 years ago

I can see how that would help keep the error from accumulating. But, as I see it, the problem is really inherent to the 32-bit float.

Consider that this issue arose because I was trying to move an axis to a position that equated to ~1 million microsteps. What happens when someone tries to go to 10 million? No matter how carefully we do the math, a float32 can only handle 7 digits, so the ones digit (the first digit to the left of the decimal) has to get rounded off. We are left only knowing the machine's position to within 10 steps.

So, what variable type should we switch to? Is fixed point tricky for some reason? I've never used it so I have no idea. If so, maybe integer math is a better plan? 64-bit ints would be easy. 32-bit ints would work, especially if we use Kahan summation to keep track of error.

aldenhart commented 8 years ago

Still open

diverseg commented 8 years ago

It appears the issue "Long moves ending in bad deccel?" (#123) is related.

diverseg commented 8 years ago

@lllars This issue may be fixed in branch edge dev-123-bad-decel. It would be good to know if this fixes the issue.

giseburt commented 7 years ago

Closing this, as I believe it is fixed in #123.

synthetos / g2

rotational axis deceleration error #75