terjeio / grblHAL

This repo has moved to a new home https://github.com/grblHAL
232 stars 90 forks source link

> Code for handling GPIO pins resulting in a huge latency for setting step outputs (500 ns per pin) #77

Closed HuubBuis closed 2 years ago

HuubBuis commented 4 years ago

Big mistake by me when creating this driver, I have used Arduino code for handling GPIO pins resulting in a huge latency for setting step outputs (500 ns per pin), and it will be a while before I am going to correct this. Or maybe you want to correct that as a challenge? When the SAMd21 arrives and I can test it on the bench, I will give it a try. Originally posted by @HuubBuis in https://github.com/terjeio/grblHAL/issues/73#issuecomment-661994913 @terjeio I have checked the SAMD21 driver and it runs at 48MHz. I measured the time it took to set a pin on 3 different ways. The pinOut() results are the same as the PORT results. That was to expected. The assembly code shows 3 lines to set the pin, that is OK for this processor. The time to set a pin using the current grbl code is 88ns when running at 48Mhz (measured 502 on 8MHz).

I will check the actual time it takes to set the step ouputs later this week to see if I can match the 500ns latency you found.

Code and results:

  #define pinOut(p, e) { if(e) PORT->Group[g_APinDescription[p].ulPort].OUTSET.reg = (1 << g_APinDescription[p].ulPin); else  PORT->Group[g_APinDescription[p].ulPort].OUTCLR.reg = (1 << g_APinDescription[p].ulPin); }
  #define TestPin PORT_PA18   // D10

    pinOut(12u,1); // it takes 88 ns to set the port using pinOut(), current grbl code!
    pinOut(12u,0);

    PORT->Group[0].OUTSET.reg = TestPin; // it takes 88 ns to set the port using PORT!
    PORT->Group[0].OUTCLR.reg = TestPin;

    PORT_IOBUS->Group[0].OUTSET.reg = TestPin; / / it takes 22 ns to set the port using PORT_IOBUS!
    PORT_IOBUS->Group[0].OUTCLR.reg = TestPin;
HuubBuis commented 4 years ago

I have analyzed the code a bit more. The stepper interrupt service ( static void STEPPULSE_IRQHandler (void)) generates the stepper pulse by setting and clearing the steppers pins. Every set/clear action is done on all 3 axis, requiring 3 calls to set/clear a pin.

The signal on PORT_PA19 (Coolant) is used for measuring.

// Step pulse handler
static void STEPPULSE_IRQHandler (void)
{   
    if(STEP_TIMER->COUNT16.INTFLAG.bit.MC1) {
        STEP_TIMER->COUNT16.INTFLAG.bit.MC1 = 1;
        set_step_outputs(next_step_outbits); // Begin step pulse.
    } else {
        STEP_TIMER->COUNT16.INTFLAG.bit.MC0 = 1;
        set_step_outputs((axes_signals_t){0}); // End step pulse.
    }
}

// Set stepper pulse output pins
inline static void set_step_outputs (axes_signals_t step_outbits)
{
    PORT_IOBUS->Group[0].OUTSET.reg=PORT_PA19; // start measuring
    step_outbits.value ^= settings.steppers.step_invert.mask;
    pinOut_IOBUS(X_STEP_PIN, step_outbits.x);
    pinOut_IOBUS(Y_STEP_PIN, step_outbits.y);
    pinOut_IOBUS(Z_STEP_PIN, step_outbits.z);
    PORT_IOBUS->Group[0].OUTCLR.reg=PORT_PA19; // end measuring
}
terjeio commented 4 years ago

Interesting obeservations, thanks for that.

For some drivers I have added the option to choose from three different output modes, this is how it is done in the MSP432 driver.

https://github.com/terjeio/grblHAL/blob/86c1d5b71f37b7ceed1e40cb5fdd793a110fe3cd/drivers/MSP432/driver.c#L240-L256

I wonder what speed can be achieved by using bitbanding (if available), this should be the fastest when the pins are on different ports.

Another test would be to split the set_step_outputs function into two different versions, one to set bits and one to clear bits.

But then I do not understand where/why the clock cycles are wasted, the reduced overhead by combining calls does not square with the raw overhead from your initial test. A peek at the assembler code generated by the compiler may be revealing?

terjeio commented 4 years ago

As a reference here is the bitband version assembly from the MSP432 driver, with no optimizations and pins on the same port:

set_step_outputs:
;* --------------------------------------------------------------------------*
    .dwcfi  cfa_offset, 0
        SUB       SP, SP, #8            ; [DPU_V7M3_PIPE] 
    .dwcfi  cfa_offset, 8
$C$DW$167   .dwtag  DW_TAG_variable
    .dwattr $C$DW$167, DW_AT_name("step_outbits")
    .dwattr $C$DW$167, DW_AT_TI_symbol_name("step_outbits")
    .dwattr $C$DW$167, DW_AT_type(*$C$DW$T$58)
    .dwattr $C$DW$167, DW_AT_location[DW_OP_breg13 0]

        STRB      A1, [SP, #0]          ; [DPU_V7M3_PIPE] |245| 
    .dwpsn  file "../driver.c",line 247,column 5,is_stmt,isa 1
        LDR       A1, $C$CON10          ; [DPU_V7M3_PIPE] |247| 
        LDRB      A2, [SP, #0]          ; [DPU_V7M3_PIPE] |247| 
        LDRB      A1, [A1, #0]          ; [DPU_V7M3_PIPE] |247| 
        EORS      A1, A1, A2            ; [DPU_V7M3_PIPE] |247| 
        STRB      A1, [SP, #0]          ; [DPU_V7M3_PIPE] |247| 
    .dwpsn  file "../driver.c",line 248,column 5,is_stmt,isa 1
        LDR       A2, $C$CON11          ; [DPU_V7M3_PIPE] |248| 
        LDRB      A1, [SP, #0]          ; [DPU_V7M3_PIPE] |248| 
        AND       A1, A1, #1            ; [DPU_V7M3_PIPE] |248| 
        STRB      A1, [A2, #0]          ; [DPU_V7M3_PIPE] |248| 
    .dwpsn  file "../driver.c",line 249,column 5,is_stmt,isa 1
        LDR       A2, $C$CON12          ; [DPU_V7M3_PIPE] |249| 
        LDRB      A1, [SP, #0]          ; [DPU_V7M3_PIPE] |249| 
        UBFX      A1, A1, #1, #1        ; [DPU_V7M3_PIPE] |249| 
        STRB      A1, [A2, #0]          ; [DPU_V7M3_PIPE] |249| 
    .dwpsn  file "../driver.c",line 250,column 5,is_stmt,isa 1
        LDR       A2, $C$CON13          ; [DPU_V7M3_PIPE] |250| 
        LDRB      A1, [SP, #0]          ; [DPU_V7M3_PIPE] |250| 
        UBFX      A1, A1, #2, #1        ; [DPU_V7M3_PIPE] |250| 
        STRB      A1, [A2, #0]          ; [DPU_V7M3_PIPE] |250| 
    .dwpsn  file "../driver.c",line 256,column 1,is_stmt,isa 1
        ADD       SP, SP, #8            ; [DPU_V7M3_PIPE] 
    .dwcfi  cfa_offset, 0
$C$DW$168   .dwtag  DW_TAG_TI_branch
    .dwattr $C$DW$168, DW_AT_low_pc(0x00)
    .dwattr $C$DW$168, DW_AT_TI_return

        BX        LR                    ; [DPU_V7M3_PIPE] 
        ; BRANCH OCCURS                  ; [] 
    .dwattr $C$DW$165, DW_AT_TI_end_file("../driver.c")
    .dwattr $C$DW$165, DW_AT_TI_end_line(0x100)
    .dwattr $C$DW$165, DW_AT_TI_end_column(0x01)
    .dwendentry
    .dwendtag $C$DW$165

    .sect   ".text"
    .clink
    .thumbfunc set_dir_outputs
    .thumb
HuubBuis commented 4 years ago

The requirement for all step axis on the same PORT is not a big deal. The person who changes the port mappings, can also change the code for setting the step outputs.

Another test would be to split the set_step_outputs function into two different versions, one to set bits and one to clear bits. When on the same port , the mask to clear the bits can be calculated at startup (and after changing settings), that would help.

I have tested using this optimize option, that reduced the processing time to 1.56 us.

inline static void set_step_outputs (axes_signals_t step_outbits) __attribute__((optimize("-O3"))) ; //Optimize this code to get the fastest processing resulting in fastest speed, may reduce the speed of other code

I have looked at the MSP432 assembly output. There are 21 lines of code. I wonder how many clock tics it takes to execute. I will have a look at the SAMD21 assembly output to compare the amount of code that has to be executed. After that, I will modify the code for the requirement of all axis on the same port and no step invert. That will give an indication of what is feasible.

Next week I have an overloaded agenda. After that, there will be time to experiment.

terjeio commented 4 years ago

I'll do some timing tests with the MSP432 later, maybe sunday, and post the results here. This could be useful as a baseline reference for comparing output modes and interrupt latency across processors.

terjeio commented 4 years ago

Outputting a bit on the 48 MHz MSP432 takes a whopping 350ns, it uses far more clock cycles than is obvious from the assembler listing. The reason behind this is how the instruction pipeline works - and perhaps also the peripheral bus speed. Optimization settings has no effect whatsoever, except for the interrupt latency involved.

For the iMXRT1062 driver I have used structs to store GPIO data for improved speed and to be able to use the Arduino pin numbers for mappings. This approach could possibly be used for the SAMD21 as well? Be aware that the iMXRT1062 is a M7 processor and does not, AFAICT, support bit-banding.

A side note: iMXRT1062 is in a class of its own with only 200 nS subtracted from the pulse timer period to achieve step pulses pretty close to the $0 setting value over the 2 - 10 µs range I have tested.

HuubBuis commented 4 years ago

I have done some test (bit toggling) on the SAMD21 and STM32 using an Arduino sketch for checking the pin maps. On the SAMD21 it results in a 12 MHz pulse. On my SAMD21 mini (10$ ebay) pin D2 and D4 are swapped!

I guess (not checked yet) the code before the actual bit setting takes far more CPU time than the actual bit setting itself. A solution would be to use a timer output for this task. That makes it also possible to stop the stepper.

I am not sure about the impact of the latency. Most of my external stepper drivers are limited to 100 kHz. If the latency is stable, I do not see a problem

IOtest.zip

terjeio commented 4 years ago

I have just commited an update to iMXRT1062 style digital output to the test branch and I cannot discern any significant overhead per bit anymore. I hope you don't mind I did so.

Old code: #define pinOut(p, e) { if(e) PORT->Group[g_APinDescription[p].ulPort].OUTSET.reg = (1 << g_APinDescription[p].ulPin); else PORT->Group[g_APinDescription[p].ulPort].OUTCLR.reg = (1 << g_APinDescription[p].ulPin); }

New code: #define DIGITAL_OUT(gpio, on) { if(on) gpio.port->OUTSET.reg = gpio.bit; else gpio.port->OUTCLR.reg = gpio.bit; }

No more array lookups that I suspect triggered pipeline overhead.

What I do not like is that there is some random long pulses that I have not noticed before on the scope but can see when I connect leds to the step outputs. Could it be the USB driver? Perhaps I should try UART serial to see it that helps.

A solution would be to use a timer output for this task.

A timer per step pin then and use compare outputs to set/clear the pins? Interesting idea.

HuubBuis commented 4 years ago

What I do not like is that there is some random long pulses that I have not noticed before on the scope but can see when I connect leds to the step outputs. Could it be the USB driver? Perhaps I should try UART serial to see it that helps. That long Jitter can maybe reduced by giving the stepper irq the highest priority and all other irq's a lower. Synchronizing the peripheral bus speed to the CPU bus speed, would also help a bit. I hope it is not caused by missing a pin clear request!

I have to do some test to get the IO speed when uploaded using Arduino the same as when uploaded using Atmel studio. It probably has something to do with the optimize settings. At the moment, the Arduino environment is the fastest!

A timer per step pin then and use compare outputs to set/clear the pins? Interesting idea.

The requirement for a timer ouput for each stepper, will make it difficult to add more steppers.

terjeio commented 4 years ago

The step timer already has the highest priority (0), even with UART serial there is random jitter - some quite long. This also when issuing motion commands via a terminal so no status requests going on.

I am using the Arduino framework a bit more for this driver compared to the other Arduino based ones. Perhaps there is something going on in that? Something disabling IRQs for extended periods of time? The step timer period is set on boot, and the counter reset to 0 on start of a step, and the period is not changed later. So IMO this randomness is a bit weird, not seen anything like it before.

The requirement for a timer ouput for each stepper, will make it difficult to add more steppers.

Yes, but boards like the MKRZERO does not have many pins to start with. I have added an I2C expander on my CNC BoosterPack to get some extra GPIO pins.

terjeio commented 4 years ago

I did wrong... I had step delay set to 5 µs - it seems that is the reason for the random jitter. Time to set a pin is comparable to MSP430 - down to around 350 ns. What is not so good is that latency is around 3.5 µs, MSP432 is around 1.6 µs. This is not only the interrupt latency itself but also the time to set two pins, reset and start the timer. It could be that waiting for the SYNCBUSY flag when resetting/starting the timer is quite a bit of this latency time?

I have limited the minimum step time to a few clock cycles, resulting in pulses of around 3.7 µs. Good enough.

Sorry for bothering you with the previous comments. I'll commit an update to the test branch in a few minutes.

HuubBuis commented 4 years ago

waiting for the SYNCBUSY flag

For the SAMD21G according to the datasheet: If I understood it right , you don't need to wait for the sync bus if you write to an IO_BUS_PORT register. Chapter 22.5.3, 22.5.10, 22.6.1, 22.6.2.1 For peripherals, i am not sure, will have to read that chapter again (tomorrow)

Sorry for bothering you with the previous comments. I'll commit an update to the test branch in a few minutes.

This is not a bother, just a learning curve on both sides.

terjeio commented 4 years ago

waiting for the SYNCBUSY flag

The timer registers TC COUNT and TC CTRLBSET that are used when starting a step pulse are both flagged as Write-Synchronized. I am not 100% sure this means one should always wait on the SYNCBUSY flag before continuing...

By reorganizing the code a bit I am now down to 2.3 µs latency when delay is not involved. I have forced the set_step_outputs() function to always inline and added a simpler interrupt handler that does not check for the MC1 timer interrupt for that.

Enabling delay brings out the gremlins...

If an interrupt somewhere causes the delay interrupt (MC1) to arrive late the step pulse reset interrupt (MC0) may already be set. To avoid this the step pulse count should be added to the current count after setting the output pins. More overhead...

Since the delay is really only needed on direction changes a better solution could be to add the delay on changes and switch the interrupt handler dynamically. But as a delay is really not needed when AMASS is enabled, this since there is no step pulse generated after a direction change, perhaps a better solution is to get rid of the delay setting altogether?

HuubBuis commented 4 years ago

The timer registers TC COUNT and TC CTRLBSET that are used when starting a step pulse are both flagged as Write-Synchronized. I am not 100% sure this means one should always wait on the SYNCBUSY flag before continuing...

I started rereading the datasheet but have not finished yet. Preliminary conclusions:

Since the delay is really only needed on direction changes

The stepper pulse has to be long enough so that it is detected by the driver

But as a delay is really not needed when AMASS is enabled

Haven't a clue what you mean by this?

will be continued

terjeio commented 4 years ago

It is safe to write the stepper registers without checking after the initial setup.

After "inital setup" is that after a cold start of the MCU only?

This MCU is an oddball in many respects - I have to admit I do not like it much. It shares the same documentation problems as some of the others I am working with, hard to understand and not clear at all. But maybe this is because I am not young anymore...

AMASS is "overclocking" the stepper ISR and is enabled by default. There are three levels.

// Adaptive Multi-Axis Step Smoothing (AMASS) is an advanced feature that does what its name implies, // smoothing the stepping of multi-axis motions. This feature smooths motion particularly at low step // frequencies below 10kHz, where the aliasing between axes of multi-axis motions can cause audible // noise and shake your machine. At even lower step frequencies, AMASS adapts and provides even better // step smoothing. See stepper.c for more details on the AMASS system works.

When this is enabled, as it is by default, I have never seen a step pulse been output when a new block is started and hal.stepper_pulse_start() is called to output step pulses. Direction changes will never happen within a block so from that one can deduct that when AMASS is enabled there will always be a delay between a direction change and a new step pulse beeing output. Adding a delay for every step pulse output is then a waste of processor cycles. If AMASS is not enabled a delay is, IMO, only required on a direction change and a pulse is to be output at the moment direction changes - not before every pulse after that.

Below is the code that I have decide to use, hal.stepper_pulse_start -> stepperPulseStartDelayed when step delay is set > 0. On a new block I check if there are step pulses to be output, if so I call stepperPulseAddDelay() which sets up the timer for delayed steps and redirects its IRQs to a handler which outputs the steps after the delay has passed. In this handler I reset operation back to "normal". I should have added a check for a direction change on a new block but see no point in that for now...

I hope this helps.

My latest code wil be committed to the test branch soon.

// Sets stepper direction and pulse pins and starts a step pulse with an initial delay
static void stepperPulseAddDelay (stepper_t *stepper)
{
    if(stepper->new_block) {
        stepper->new_block = false;
        set_dir_outputs(stepper->dir_outbits);

        IRQRegister(STEP_TIMER_IRQn, STEPPULSE_Delayed_IRQHandler);

        STEP_TIMER->COUNT16.CC[0].reg = 50000;
        while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);
        STEP_TIMER->COUNT16.INTENSET.bit.MC1 = 1; // Enable CC1 interrupt
    }

    if(stepper->step_outbits.value) {

        STEP_TIMER->COUNT16.COUNT.reg = 0;
        while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);

        next_step_outbits = stepper->step_outbits; // Store out_bits

        STEP_TIMER->COUNT16.CTRLBSET.reg = TC_CTRLBCLR_CMD_RETRIGGER;
        while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);
    }
}

// Sets stepper direction and pulse pins and starts a step pulse with
// an initial delay if new bock and step is to be output
static void stepperPulseStartDelayed (stepper_t *stepper)
{
    if(stepper->new_block) {
        if(stepper->step_outbits.value) {
            stepperPulseAddDelay(stepper);
            return;
        }
        stepper->new_block = false;
        set_dir_outputs(stepper->dir_outbits);
    }

    if(stepper->step_outbits.value) {

        STEP_TIMER->COUNT16.COUNT.reg = 0;
        while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);

        set_step_outputs(stepper->step_outbits);

        STEP_TIMER->COUNT16.CTRLBSET.reg = TC_CTRLBCLR_CMD_RETRIGGER;
        while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);
    }
}
HuubBuis commented 4 years ago

It is safe to write the stepper registers without checking after the initial setup. After "inital setup" is that after a cold start of the MCU only?

Setting the timer count register is safe, enabling the timer interrupt is safe, but the next combination, I am not sure, maybe the CC1 interrupt is fired immediately, depending on the actual value of count. Better to sync to be sure!!!

STEP_TIMER->COUNT16.CC[0].reg = 50000;
while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);
STEP_TIMER->COUNT16.INTENSET.bit.MC1 = 1; // Enable CC1 interrupt

In the next code cync can be omittted because set_step_outputs doesn't depend on setting the count register

STEP_TIMER->COUNT16.COUNT.reg = 0;
while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);
set_step_outputs(stepper->step_outbits);

In the next code, sync can be omitted. The return from the call will sure be more than a few clock tics. Before the next access to this same register, it will be in sync

       set_step_outputs(stepper->step_outbits);
        STEP_TIMER->COUNT16.CTRLBSET.reg = TC_CTRLBCLR_CMD_RETRIGGER;
        while(STEP_TIMER->COUNT16.STATUS.bit.SYNCBUSY);

AMASS is "overclocking" the stepper ISR and is enabled by default. There are three levels.

Now I remember, I am getting old to.

I have found the sync error on page 453 serial com description

If an operation that requires synchronization is executed while the corresponding SYNCBUSY bit is one, a peripheral bus error is generated.

I didn't find a flag to check, have read the manual twice now.

HuubBuis commented 4 years ago

I have found the sync error on page 453 serial com description

If an operation that requires synchronization is executed while the corresponding SYNCBUSY bit is one, a peripheral bus
error is generated.

I didn't find a flag to check, have read the manual twice now.

I have read it for the third time, no "error flag found". Maybe it is shown in one of the cpu registers during debugging! Will look into it this again...