rsta2 / circle

A C++ bare metal environment for Raspberry Pi with USB (32 and 64 bit)
https://circle-rpi.readthedocs.io
GNU General Public License v3.0

Output jitter/latency after interrupt (caused by the network components?) #86

Closed: fmcoelho8 closed this issue 6 years ago

fmcoelho8 commented 6 years ago

Hello,

My program needs to receive some bit patterns through TCP/IP and then write them to an output after a synchronization signal. To make it easier to understand: it would be used to print a layer in a micro-scale printer, so the bit patterns being sent are the lines of the layer.

My program is working as expected: the TCP/IP communication is working perfectly, and a (fast) interrupt, listening to the sync signal, is used to start the writing.

The problem I found is that there is a random latency between the synchronization signal and the time the bit pattern starts to be written (varying from 20 to 50 ns). As you can imagine, this causes misalignment between the lines, since the lines are not written with a fixed latency after the sync signal. A jitter of 5 ns would be acceptable. Note that the latency itself is not a problem, as long as it has a fixed value.

Another aspect that I noticed: if I hard-code the bit pattern in the Pi code and remove all the network code (removing CDWHCIDevice, CScheduler and CNetSubSystem), there is no jitter in the output; it works smoothly and precisely. However, I really need the TCP/IP communication.

Therefore, my questions are: 1) why does adding the network (TCP/IP communication) cause random latency in reacting to the interrupt, and how can I give all the core's processing power to the writing of the bit pattern, in order to remove the jitter/random latency? 2) is there any way to halt all the communication layers when I want to start the writing process? This might be enough to correct the problem.


Looking forward to hearing from you

MaPePeR commented 6 years ago

If your bit pattern is only one or two channels and you don't need audio, I would suggest using the PWM hardware in serial mode to write your pattern.

Then you can use DMA and write your pattern into a ring buffer instead of setting a pin bit by bit (if I understood your problem correctly).
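
A minimal sketch of the serialiser idea, with the register layout taken from the "BCM2835 ARM Peripherals" datasheet and the RPi 2/3 peripheral base 0x3F000000; the PWM clock setup and the GPIO alternate-function selection (e.g. GPIO18/ALT5) are omitted, and write32()/read32() are Circle's helpers from <circle/memio.h>:

#include <circle/memio.h>
#include <circle/types.h>

// Illustrative sketch only: BCM2835 PWM channel 1 in serialiser mode shifts
// 32-bit words out of its FIFO at a fixed bit rate, without any CPU timing.
#define PWM_BASE  0x3F20C000        // RPi 2/3; 0x2020C000 on the original Pi
#define PWM_CTL   (PWM_BASE + 0x00) // control register
#define PWM_STA   (PWM_BASE + 0x04) // status register
#define PWM_RNG1  (PWM_BASE + 0x10) // bits per word on channel 1
#define PWM_FIF1  (PWM_BASE + 0x18) // FIFO input

#define CTL_PWEN1 (1 << 0)          // enable channel 1
#define CTL_MODE1 (1 << 1)          // serialiser mode, MSB first
#define CTL_USEF1 (1 << 5)          // take channel 1 data from the FIFO
#define CTL_CLRF  (1 << 6)          // clear the FIFO
#define STA_FULL1 (1 << 0)          // FIFO full flag

void SerializePattern (const u32 *pPattern, unsigned nWords)
{
    write32 (PWM_RNG1, 32);         // serialise all 32 bits of each word
    write32 (PWM_CTL, CTL_CLRF);    // start with an empty FIFO
    write32 (PWM_CTL, CTL_PWEN1 | CTL_MODE1 | CTL_USEF1);

    for (unsigned i = 0; i < nWords; i++)
    {
        while (read32 (PWM_STA) & STA_FULL1)
        {
            // busy-wait; a DMA channel feeding the FIFO would remove
            // the CPU from the timing path entirely
        }

        write32 (PWM_FIF1, pPattern[i]);
    }
}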

fmcoelho8 commented 6 years ago

> If your bit pattern is only one or two channels and you don't need audio, I would suggest using the PWM hardware in serial mode to write your pattern.
>
> Then you can use DMA and write your pattern into a ring buffer instead of setting a pin bit by bit (if I understood your problem correctly).

Hey MaPePeR,

Thanks for your answer. In my case I really need to write the pattern by setting a pin bit by bit; here I have no choice. This pin is the input of another hardware component, which needs to receive the bit pattern bit by bit as a digital value. What I really need to know is how to remove the jitter, because apart from that it is working fine. I also want to keep the TCP/IP communication, because it is fast to send all the data through it. I'm just wondering why the network is causing these problems, since without the network I have the expected result, as I described.

Regards

MaPePeR commented 6 years ago

> In my case I really need to write the pattern by setting a pin bit by bit; here I have no choice. This pin is the input of another hardware component, which needs to receive the bit pattern bit by bit as a digital value.

This is exactly what the PWM serial mode lets you do, I think. It can read the bits from memory and write them out as high/low on a PWM-capable pin, without CPU interaction, so without jitter.

Edit: or is your pattern spread across multiple pins? I understood the pattern as a series of bits in time, not in space.

fmcoelho8 commented 6 years ago

> > In my case I really need to write the pattern by setting a pin bit by bit; here I have no choice. This pin is the input of another hardware component, which needs to receive the bit pattern bit by bit as a digital value.
>
> This is exactly what the PWM serial mode lets you do, I think. It can read the bits from memory and write them out as high/low on a PWM-capable pin, without CPU interaction, so without jitter.
>
> Edit: or is your pattern spread across multiple pins? I understood the pattern as a series of bits in time, not in space.

OK, sounds interesting, although I have no problems reading the bits from memory and outputting them. The only problem is that the writing does not start with a fixed latency after the sync signal when I enable the network (TCP/IP). So I'm not convinced that the PWM serial mode could help here.

Answering your question: the pattern is output on a single pin.

MaPePeR commented 6 years ago

> OK, sounds interesting, although I have no problems reading the bits from memory and outputting them. The only problem is that the writing does not start with a fixed latency after the sync signal when I enable the network (TCP/IP). So I'm not convinced that the PWM serial mode could help here.

It's not easier but harder to read the bits from memory and send them with PWM, but it comes with the big benefit that you can set a fixed bit rate and don't have to do the timing in the CPU, so the CPU can do other stuff in the meantime.

The "phase" of the signal is not so easy to control, though. Only in relation to the second pwm channel.

jsanchezv commented 6 years ago

Hi fmcoelho8,

How many CPU cores are you using? I suggest handling the TCP/IP communication on CPU#0 and the bit output on CPU#1.

Anyway, the LAN device is a USB 2.0 device, so it can have timing limitations. I suggest creating a receive buffer on CPU#0 and consuming it on CPU#1. CPU#0 handles all timers and USB transactions, and its scheduler is fairly simple.
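
A rough sketch of that split, using Circle's CMultiCoreSupport (this requires ARM_ALLOW_MULTI_CORE in include/circle/sysconfig.h); WaitForSyncSignal() and WriteBitPattern() are placeholders for the application code:

#include <circle/multicore.h>
#include <circle/memory.h>

void WaitForSyncSignal (void);  // placeholder: e.g. poll the GPIO level register
void WriteBitPattern (void);    // placeholder: consume the buffer filled by core 0

// Core 0 keeps running CKernel::Run() with the network stack; core 1 does
// nothing but the time-critical output. Run() is entered on the secondary
// cores after Initialize() has been called on core 0.
class COutputCore : public CMultiCoreSupport
{
public:
    COutputCore (CMemorySystem *pMemorySystem)
    :   CMultiCoreSupport (pMemorySystem)
    {
    }

    void Run (unsigned nCore) override
    {
        if (nCore != 1)
        {
            return;             // leave cores 2 and 3 idle
        }

        while (true)
        {
            WaitForSyncSignal ();
            WriteBitPattern ();
        }
    }
};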

rsta2 commented 6 years ago

The latency is probably caused by the Circle scheduler, which continuously reads the system timer register. Unfortunately, reading this register takes some nanoseconds, and the CPU cannot be interrupted while this takes place. For the moment I can suggest two things:

  1. You can try the system option USE_PHYSICAL_COUNTER in include/circle/sysconfig.h. Circle then uses a CPU-internal physical counter instead. I guess the latency should drop, but I haven't made measurements on this so far. So if you can try this and measure the latency, that would be very interesting, even for me.

  2. If 1. does not help, you can "switch off" the network code simply by not calling CScheduler::Yield() or CScheduler::Sleep() for some time (see the sketch below). Of course network connections will not be maintained then, but for some seconds, or if there is no active connection at the moment, this should be possible. When you call CScheduler::Yield() or CScheduler::Sleep() again later, the network should continue to be serviced.
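
A minimal sketch of 2. in CKernel::Run(), assuming a hypothetical flag m_bWritePending that is set when a line is ready to be written, and a hypothetical DoTimeCriticalOutput():

while (true)
{
    if (m_bWritePending)
    {
        DoTimeCriticalOutput ();    // the network is not serviced meanwhile
        m_bWritePending = false;
    }
    else
    {
        m_Scheduler.Yield ();       // lets the TCP/IP stack run again
    }
}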

fmcoelho8 commented 6 years ago

Hi jsanchezv, I'm not using multi-core, because I need the full CPU to output my bit pattern. It has to be as fast as possible; the goal is to reach around 60 MHz. If I do what you are suggesting, I'm afraid I will not reach the speed I desire.

Hi rsta2, first of all, thank you for the Circle code!

I've tried the USE_PHYSICAL_COUNTER system option and I didn't see a significant difference in the latency; it is basically the same. There is an aspect that I forgot to mention: the writing of each line is controlled through a TCP/IP command, which the device should acknowledge after finishing writing a line. That said, I would like to avoid shutting down the network during the program execution.

I will share some code to make it easy to understand my implementation:

kernel.cpp

TShutdownMode CKernel::Run (void)
{
    CGPIOPin indicator(4, GPIOModeOutput);
    indicator.Write(HIGH); // LED to indicate that the Pi is running
    currentLayer = nullptr;
    writeOnInterrupt = false;

#ifdef ENABLE_NETWORK
    CString IPString;
    m_Net.GetConfig ()->GetIPAddress ()->Format (&IPString);

    new DataReceiver (&m_Net);
#endif

    inpin.ConnectInterrupt(&handleFIQ, this);
    inpin.EnableInterrupt(GPIOInterruptOnRisingEdge);

    while (true) {
        m_Scheduler.Yield();
    }

    return ShutdownReboot; // never reached, the loop above does not exit
}

output.cpp -> handles the fast interrupt (based on example 30-gpiofiq) and outputs the bit pattern, writing 4 bytes every cycle

void handleFIQ(void* param) { 
    if (writeOnInterrupt) {
        writeOnInterrupt = false;
        uint32_t* currentData;
        for (size_t i = 0; i < lineLength; i++) {
            currentData = &writeBuffer[i];
            write(currentData);
        }
        endWritingFlag = true;
    }
}

inline void write(uint32_t *data) {
    outpin.WriteFast(BIT(*data, 7));
    outpin.WriteFast(BIT(*data, 6));
    outpin.WriteFast(BIT(*data, 5));
    outpin.WriteFast(BIT(*data, 4));
    outpin.WriteFast(BIT(*data, 3));
    outpin.WriteFast(BIT(*data, 2));
    outpin.WriteFast(BIT(*data, 1));
    outpin.WriteFast(BIT(*data, 0));
    // (... the same for the rest of the uint32)
}

datareceiver.cpp -> handles the socket and decodes the received data (based on the example 20-tcpsimple)

// Function that processes a completed received packet
void DataReceiver::processPacket(uint8_t* dataBuffer, uint8_t* reply) {
    CKernel* kernel = CKernel::Get();
    uint8_t packetType = dataBuffer[0];
    if (packetType == 0) { // packet that contains the bit patterns 
        (... implementation)
    } else if (packetType == 1) { // command to write a line
        if (actionType == 0) { // write line
            if (kernel->currentLayer != nullptr) {
                if (lineIndex < kernel->currentLayer->numberOfLines) {
                    kernel->currentLayer->currentLine = lineIndex;
                    memcpy(writeBuffer, kernel->currentLayer->buffer[lineIndex], kernel->currentLayer->lengthOfLine * sizeof(uint32_t));
                    lineLength = kernel->currentLayer->lengthOfLine;

                    SyncDataAndInstructionCache();
                    EnterCritical(IRQ_LEVEL);

                    endWritingFlag = false;

                    // weirdly, adding this log here reduces the latency
                    CLogger::Get()->Write(FromDataReceiver, LogNotice, "Before set write flag");

                    writeOnInterrupt = true;
                    while(!endWritingFlag) {

                    }
                    LeaveCritical();
                    createReply(reply, REPLY_SUCCESS);
                    (....)

Do you have any suggestion to reduce the latency jitter? The lowest latency that I could reach so far oscillates between 302 and 324 ns; I would like to reach a jitter of around 4 ns maximum.

Plus, do you have any idea why the log before changing the writeOnInterrupt flag is reducing the jitter?

Looking forward to hearing your suggestions

MaPePeR commented 6 years ago

What do you mean by latency? 302 ns latency = length of 1 bit using your output method, so ~3.311 MHz?

fmcoelho8 commented 6 years ago

> What do you mean by latency? 302 ns latency = length of 1 bit using your output method, so ~3.311 MHz?

Hey MaPePeR,

What I mean by latency is the time between the sync signal and the first bit being output. Keep in mind that I'm expecting to have latency: when the sync signal rises there will be some delay, because the Pi needs to fire the interrupt and so on. I'm just trying to have a fixed latency, in order to start writing the pattern at the same time after the sync signal, to ensure the lines' vertical alignment.

Thanks

rsta2 commented 6 years ago

I'm sorry, but your requirement (4 ns maximum latency in reaction to a rising edge on a GPIO pin) cannot be fulfilled by a Raspberry Pi 3B+, not with Circle and not with any other software. The reason for this is the speed of the I/O block of the RPi 3B+, which is far behind the speed of the Cortex-A53 core. You can reach a maximum read rate from GPIO of about 10 million 32-bit words per second at the bare metal CPU clock of 600 MHz. The absolute maximum read speed with the CPU running at 1400 MHz is about 18 million 32-bit words per second.

You can verify this using sample/11-gpioclock, which reads the GPIO level register at the fastest possible speed in a tight assembler loop. The sample program shows the sample rate that has been reached. This program runs at the normal bare metal CPU clock of 600 MHz.

If you want to check this at the maximum CPU clock of 1400 MHz, you can try this sample program and create a file cmdline.txt with the contents fast=true on the uSD card. Please see the README file for more info. Please add a value of 20000 to this line before building, so that the sample recorder tries to reach this sample rate (in kHz):

unsigned CRecorder::s_nWantedRateKHz[] = {100, 1000, 2000, 5000, 10000, 15000, 20000};

You will only get about 18 MHz sample rate. More is not possible.

Your application uses the FIQ to trigger the output process, but this will not be faster than reading the GPIOs in a tight assembler loop. If you are interested in how the edge detection works, please have a look at this document, page 97, and page 89 for a block diagram of the GPIO block. The GPIO block samples three bits in a row to detect (synchronous) rising edges. That means the theoretical minimum latency is 3 × 1/18 MHz ≈ 167 ns (at 1400 MHz CPU clock). At 600 MHz CPU clock you get 3 × 1/10 MHz = 300 ns. That's the latency you have measured.

You could try to use GPIOInterruptOnAsyncRisingEdge for edge detection to reduce the latency by a factor of 3 (see page 99 in the document above), but that would still be 1/18 MHz ≈ 55 ns latency. I haven't tried this yet, though, so this is not certain.
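
In the CKernel::Run() code posted above, this would be a one-line change (untested, per the caveat):

inpin.ConnectInterrupt(&handleFIQ, this);
inpin.EnableInterrupt(GPIOInterruptOnAsyncRisingEdge); // was: GPIOInterruptOnRisingEdge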

MaPePeR commented 6 years ago

Ah. I'm stupid. I kind of forgot about the fact that you are waiting for a GPIO signal/interrupt to start your send process.

If I understood your problem correctly, your problem is not the size of the latency, but the jitter in the latency? I don't see a way to prevent that, though. It would be a lot easier if the Raspberry Pi created the sync signal instead of reading it, I guess.

fmcoelho8 commented 6 years ago

Hey MaPePeR,

> If I understood your problem correctly, your problem is not the size of the latency, but the jitter in the latency?

Yes, that is indeed my problem.

Hey rsta2,

Maybe I misled you with my explanation, but with MaPePeR's question it should be easier to understand my problem. I really understand that there will be latency and I'm OK with that; I'm just trying to reach a fixed latency every time I react to the input signal (interrupt). So I can have 300 ns of latency, no problem, but I want that same 300 ns latency every time I write a new bit pattern after waiting for the sync signal. Instead, right now I sometimes have 300 ns of latency, and for the next bit pattern, following exactly the same procedure, I get 322 ns of latency. This is what I want to avoid, or at least limit to a maximum of 4 ns of difference between the current and the previous latency values (so if the latency varies in the range of 300 to 304 ns, it would be OK).

I think with this last comment it is easier to understand what my problem is.

Thank you

rsta2 commented 6 years ago

OK, it's the jitter. I think I understand your problem.

Normally I would suggest implementing hard real-time code like yours without the FIQ, like this:

while (!(read32 (ARM_GPIO_GPLEV0) & (1 << GPIO_PIN)))
{
    // wait for the trigger signal
}

// do the output now
// ...

But I think even with that you will be far from a jitter of 4 ns, because reading the GPIO level takes 55 ns (even at 1400 MHz CPU clock), and the input can go high at any time while it is being read.

What I wanted to explain in my previous comment is that you are very near (or maybe beyond) the capabilities of the GPIO block of the RPi 3B+.

fmcoelho8 commented 6 years ago

Hey rsta2,

Thanks for your suggestion. I've tried it and I still get unacceptable jitter. With the fast interrupt solution I can get a more stable latency.

I know that I'm very near the capabilities of the RPi 3B+ GPIO, but bear with me: if I remove the network components I can reach the desired result, having latency variations of +/- 4 ns (jitter). So something inside the network components is messing with the CPU.

So, in order to make it work with the network, I was thinking of changing the scheduler to lock the current task, by adding a control variable that prevents the scheduler from swapping to the next task:

void CScheduler::BlockYield (bool block)
{
    blockYield = block;
}

void CScheduler::Yield (void)
{
    if (blockYield)
        return;
    (...)
}

void CScheduler::usSleep (unsigned nMicroSeconds)
{
    if (blockYield)
        return;
    (...)
}
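
For this to compile, the flag would also have to be declared in the class declaration (not shown in the snippet above), e.g.:

class CScheduler
{
    // ... existing interface ...

public:
    void BlockYield (bool block);

private:
    volatile bool blockYield = false;   // volatile: checked inside Yield()/usSleep()
};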

Then, when I want to have the full CPU, i.e. when I want to process a completed packet, I would do something like this in datareceiver.cpp:

// code to receive data
(...)
    CScheduler::Get()->BlockYield(true);
    SyncDataAndInstructionCache();
    EnterCritical(FIQ_LEVEL);
    processPacket(dataBuffer, replyPacket); // method below
    LeaveCritical();
    CScheduler::Get()->BlockYield(false);

void DataReceiver::processPacket(uint8_t* dataBuffer, uint8_t* reply) {
    CKernel* kernel = CKernel::Get();
    // verification and initialization
    (...)
    if (packetType == 1) { // check if layer exists
        // check if index is in layer
        uint8_t actionType = dataBuffer[5];
        uint32_t lineIndex = dataBuffer[6] << 24 | dataBuffer[7] << 16 | dataBuffer[8] << 8 | dataBuffer[9];

        if (actionType == 0) { // write line
            if (kernel->currentLayer != nullptr) {
                if (lineIndex < kernel->currentLayer->numberOfLines) {
                    kernel->currentLayer->currentLine = lineIndex;
                    memcpy(writeBuffer, kernel->currentLayer->buffer[lineIndex], kernel->currentLayer->lengthOfLine * sizeof(uint32_t));
                    lineLength = kernel->currentLayer->lengthOfLine;

                    endWritingFlag = false;

                    CLogger::Get()->Write(FromDataReceiver, LogNotice, "Before set write flag");

                    writeOnInterrupt = true;
                    while (!endWritingFlag) {
                        // spin until the FIQ handler has written the line
                    }

                    createReply(reply, REPLY_SUCCESS);
(....)
}

While running the program with the above changes I got better results, however not as good as without the network at all. Plus, I still don't understand why adding a log message before setting the interrupt flag (writeOnInterrupt) produces lower latency and lower jitter than without the log. I've also tried removing the whole logging mechanism; weirdly, I had worse results.

Do you have any suggestion on how to reach the same results that I got without the network? Shouldn't my Yield-blocking variable assure full CPU power?

Thanks!

rsta2 commented 6 years ago

I think this BlockYield() is not needed, because CScheduler::Yield() is not called inside processPacket(), so it does not need to be blocked. The main task in CKernel::Run() and the network code will not run while a task does not call CScheduler::Yield() for some time. The scheduler requires that all existing tasks call CScheduler::Yield() (or Sleep()) from time to time; on the other hand, if one task does not do this, it remains running until it calls one of them. So the network stack should already be stopped for the moment processPacket() is running (!).

But there is another side effect, which is caused by the cache. The cache can only hold a limited amount of code and data at a time, and the network code and the USB code (which is used by the network) are relatively large. It's likely that they do not fit into the cache at once. That's why, when the network code has been running, transfers between cache and SDRAM may have to be done, which takes time. Unfortunately this behaviour is stochastic and difficult to control.

The behaviour with the logging message may also be caused by such caching effects.
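
A general mitigation for such cache effects (not suggested in the thread, just an illustration of the point) would be to pre-read the write buffer immediately before arming the write flag, so the FIQ handler finds the data already in the cache; a sketch, reusing writeBuffer and lineLength from the snippets above:

// Hypothetical cache pre-warm, placed right before "writeOnInterrupt = true;":
// read one word per 64-byte cache line (the Cortex-A53 line size) to pull the
// whole buffer into the data cache, so the FIQ handler does not stall on SDRAM.
volatile uint32_t sink;
for (size_t i = 0; i < lineLength; i += 64 / sizeof(uint32_t))
{
    sink = writeBuffer[i];
}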

fmcoelho8 commented 6 years ago

Hey rsta2,

We couldn't manage to get a stable output using the solution described previously; it seems that the network influences the performance a lot.

To reach better results we split the implementation across two Raspberry Pis: one in charge of the TCP/IP communication, and the other in charge of loading the bit patterns from a USB memory stick and outputting the data. To communicate between the two Raspberry Pis (to send the write command and the write-done acknowledge) we are using the GPIOs.

Splitting the solution is showing better results; it seems that the network adds a lot of load on the Raspberry Pi.

Regards,

rsta2 commented 6 years ago

With the hard real-time constraints you have, the solution you have described is probably good. Yes, the Circle network stack influences the timing of the Raspberry Pi to some degree. I suppose the influence is less critical with microseconds-latency requirements, but you are in the nanoseconds. Thanks for the info.