Race condition accessing sys_position

bdurbrow commented 5 years ago

If I'm reading this right, I think there's a race condition accessing the sys_position array.

sys_position is updated at interrupt time by the ISRs in stepper.c; but is accessed in multiple locations throughout the codebase without disabling interrupts during the access.

This also affects the mega2560 port of Grbl.

terjeio commented 5 years ago

AFAICT the only place it may be an issue is in report.c since the other locations accesses the array when there is no motion. A copy is made in report.c, surely to minimize the risk for inconsistent data beeing used for the report. As a side note the risk insignificant when running on a 32bit processor - most likely the position reported is already out of date when reaching the sender.

chamnit commented 5 years ago

Thanks. I'm aware of this potential issue. I've kept an eye on this int32 read of the position variables over the last several years. There seems to be no ill effect, because the majority of times when the variable is accessed is controlled. It's either at a stop or inside the interrupt that alters it. As @terjeio stated the only time it's not is when generating a status report, which is non-critical.

That said, I don't exactly remember why I ended up omitting the atomic read. Maybe laziness or not wanting to turn off the global interrupts to reduce jitter or potentially missing some incoming data (don't think I ever tested this).

bdurbrow commented 5 years ago

when generating a status report, which is non-critical.

I'm going to politely disagree on that, as garbage data reported back for position could cause any number of issues with any applications depending on that data. In any case, Grbl's output should always be correct.

I don't think that the latency introduced by wrapping the memcpy with a cli is going to be a significant issue, as almost all drivers used with Grbl are micro stepping (or, rarely, actual servos)... perhaps on the order of 150 clock cycles? If I did the math right, perhaps 10 microseconds?

I'm going to give it a try, and see what the results are on my o-scope, with regards to timing & jitter impacts.

chamnit commented 5 years ago

@bdurbrow : There hasn't been a reported problem with report generation related to the potential race condition. But, don't let that stop you from testing it. I'm curious myself.

As for non-critical reports, there are other means to sync position, like from the comm protocol itself. Report position was intended to used for DRO purposes, not for critical processes, mainly because it's pseudo-realtime. Messages are sent back when Grbl has time to do so and can generate one at a rate of 10Hz or so max. The comm protocol is much faster.

terjeio commented 5 years ago

@bdurbrow

I'm going to give it a try, and see what the results are on my o-scope, with regards to timing & jitter impacts.

IMO it is a waste of time to try to isolate timing issues/jitter caused by making the memcpy atomic - to do that you need to find where any timing interrupt is delayed by the relatively rare status report. It surely will be drowned out by other interrupts occuring far more often.

I sometimes look at the assembly code generated by the compiler, eg. for ARM the memcpy is implemented like this:

    .dwpsn  file "../GRBL/report.c",line 625,column 5,is_stmt,isa 1
        LDR       A1, $C$CON84          ; [DPU_3_PIPE] |625|  - get address of sys_position
        LDMIA     A1, {A3, A2, A1}      ; [DPU_3_PIPE] |625|  - load sys position into registers
        STMIA     SP, {A3, A2, A1}      ; [DPU_3_PIPE] |625|  - store it to current_position (on the stack)

Here it is not even a function call, however the LDMIA instruction is interruptable - but I believe not in the middle of a single register load - possibly resulting in the report beeing off by maximum one step (that is likely to get rounded away). This is for a 32 bit processor though, the 8 bit Atmel code will be quite different.

So, if you can make the compile generate an assembler listing you will be able to asses the worst case jitter incurred by making it atomic, this by adding up the clock cycles needed for the memcpy plus interrupt disable/enable.

langwadt commented 5 years ago

with 32 bit variables on a 32 bit processor it isn't an issue, on an 8 it rather than disable interrupts I'd think it make more sense to make a "safe" read that check if the msb has changed after the read and does a retry

bdurbrow commented 5 years ago

IMO it is a waste of time to try to isolate timing issues/jitter caused by making the memcpy atomic

Eh... I needed to do some other related timing measurements anyway, so checking on this wasn't going far out of my way...

to do that you need to find where any timing interrupt is delayed by the relatively rare status report

Actually, no... you just need to induce comparable behavior.

I'm still playing with it, but what I've done this evening is put an atomic copy of sys_position into protocol_execute_realtime(), and put some port pin flips around both the atomic copy and into the top of the main stepper ISR. So, what I'm seeing on the o-scope is a) when the atomic copy starts (rising edge of the pin output pulse on channel 1 of my o-scope) and ends (falling edge); and when the ISR fires (a pulse on channel 2 of my o-scope).

The atomic copy is firing at about 25khz or so when it's doing a long rapid move (basically, I told it to seek somewhere out in the asteroid belt :-) and the atomic copy is taking about 6 microseconds to do (a bit better than my earlier guestamate of 10 microseconds). The apparent jitter that I'm seeing tonight is about 10 microseconds; and there are apparent collisions between the timing of the atomic copy and the firing of the stepper ISR.

terjeio commented 5 years ago

@bdurbrow : Do not forget that processor cycles are a finite resource.

Actually, no... you just need to induce comparable behavior.

Running the memcpy at 10Hz vs 25Khz is quite different, the latter is eating up a significantly larger amount of cycles available (increasing the CPU load). At 10Hz the risk of increasing jitter by atomic copy is small - eg. compared to the jitter introduced by shifting out the report string to the UART. But then your test is not triggering a report so that may affect the result too, by comparatively reducing the CPU load...

If you are worried about jitter perhaps it is worth considering switching to a 32bit platform, the memcpy I listed above runs in 15 cycles if I have added them up correctly - less than 200nS @ 80MHz. Perhaps the ESP32 will be the ultimate processor for generating jitter free step pulses if that is a priority - if the RMT peripheral can be used for pulse generation.

chamnit commented 5 years ago

@terjeio : As a point of reference, I do hear a significant audible improvement on the SAMD21 port with the step generation. Just moving to something a little bit faster and handles interrupt transitions better does make a big difference.

@bdurbrow : Keep in mind that there are two other interrupt processes that introduce jitter with Grbl. The step reset and serial receive interrupts. I've found that the AVR takes about 2-3 microseconds to process these. The one that gives the most jitter is the serial receive, when it occurs right before either step reset or step generator interrupts.

bdurbrow commented 5 years ago

10hz vs 25khz IS quite different - but I was just looking for a worst case scenario; and trying to provoke the latency, so it would show up on the scope to get a measurement. I’m guessing, but I figure that real-world results might be an additional 10 microsecond jitter event once every second or two... if that. I don’t think it’s going to be a meaningful factor in the kinds of machines that Grbl is used for...

@chamnit - yes, and there’s an additional one that the device drivers use (1khz timer - this was tested on my modified version of the Mega2560 port; so it’s got drivers for a lcd, SD card, and matrix keypad integrated). FWIW, however, I measured the apparent jitter last night with no serial data incoming, so the serial interrupt wasn’t a factor in that test...

chamnit commented 5 years ago

@bdurbrow : Forgot there is also the serial write interrupt. It's small, but the global interrupt overhead is fixed. If you had serial output, it could have been a contributing factor.

cri-s commented 5 years ago

@Britt, if you worry about wrong coordinates, why not fix it instead running simulations? I provide a fix at the end of this email. The potential bug exists but the possiblity that it get hit is really remote. Other interrupts don't have any significative importance for that fact that this bug could be triggered. Here the promised fix: Inside report_realtime_status there is a memcopy

memcpy(current_position,sys.position,sizeof(sys.position));

change this line to this

redo_memcpy: memcpy(current_position,sys.position,sizeof(sys.position));
for (idx=0; idx< N_AXIS; idx++)
   if(((uint8_t*)&sys.position[idx])[1]!=((uint8_t*)&current_position[idx])[1])
       goto redo_memcpy;

Chris.

2018-10-16 19:47 GMT+02:00, Sonny Jeon notifications@github.com:

@bdurbrow : Forgot there is also the serial write interrupt. It's small, but the global interrupt overhead is fixed. If you had serial output, it could have been a contributing factor.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/gnea/grbl/issues/541#issuecomment-430331743

britt commented 5 years ago

I think you meant to tag @bdurbrow 😃

terjeio commented 5 years ago

@cri-s : shouldn't the uint8_t array index be 3 in your code snippet? (for a little endian 32bit variable).

cri-s commented 5 years ago

No, it need to be 1 for little endian and 2 for big endian systems.

If you have union { uint32_t variable; uint8_t var[4]; };

little endian is when 32bit variable 0xabcde8ff is read as follow: // var[0]==0xff // var[1]==0xe8 // var[2]==0xcd // var[3]==0xab

The code checks the second byte if it is the same as the copied or if it have changed in the meantime. This don't necessary means that the copied position data is wrong, it just mean that there exist the minmal possibility of having corrupted data. Specifically in the above example, if the stepping interrupt code during memcpy increases the value, it became 0xabcde900 and in that case the memcopy is repeated. Instead the stepper code simple decreases the variable , the new value is 0xabcde8fe and in this case there is no possibility of getting a wrong value.

@Britt , thanks for the correction and sorry for the inconvenience.

2018-10-17 3:07 GMT+02:00, Terje Io notifications@github.com:

@cri-s : shouldn't the uint8_t array index be 3 in your code snippet? (for a little endian 32bit variable).

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/gnea/grbl/issues/541#issuecomment-430453781

shooter64738 commented 5 years ago

Or... Im not as good at this as you guys are so I did something totally different. I have a separate 2560 between my host and grbl. When you send a ?, it doesn't go to grbl, it gets picked up by the middle man controller first and it has been monitoring the step and direction pulses sent by gbl. It knows where the machine is all the time in real time. Grbl never does anything more than receive g code. And maybe the occasional setting change that gets passed through to grbl.

langwadt commented 5 years ago

@shooter64738 how can that work with homing, offsets and work coordinate systems?

shooter64738 commented 5 years ago

If you set a work coordinate offset it just subtracts it out from its known machine position and then sends the G54, G55, Etc. on to grbl. From that point on the work position could be different than the machine position. As for homing I don't currently use homing on my machine. But any commands could be passed through, even homing if you wanted. I over simplified what it does because I didn't want to hijack this thread too badly. :) Its primary function was to add canned cycles, CRC, look ahead, immediate feedback of position. It was intended to be a replacement for a host computer

gnea / grbl

Race condition accessing sys_position #541