trilliumeng / orion-sdk

Public SDK and ICD for Trillium Engineering's Orion Gimbals.
http://www.trilliumeng.com
MIT License

fix for 100% cpu usage from non-blocking sockets #1

Closed jasonbeach closed 6 years ago

jasonbeach commented 6 years ago

I noticed that using the SDK drove the application's CPU usage to 100%, caused by the non-blocking sockets in combination with a while loop. This fix makes the socket blocking, but only for 100ms at a time, so broadcast detection still works without locking up the application when no gimbal is attached.
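For reference, the blocking-with-timeout approach can be sketched with `SO_RCVTIMEO` (this is only a sketch of the idea, not the actual patch; the helper name is mine):

```c
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: make an already-open socket block for at most
   TimeoutMs milliseconds per recv() call, instead of returning
   immediately with EWOULDBLOCK. Returns 0 on success, -1 on error. */
static int SetRecvTimeout(int Socket, int TimeoutMs)
{
    struct timeval Timeout;

    Timeout.tv_sec  = TimeoutMs / 1000;
    Timeout.tv_usec = (TimeoutMs % 1000) * 1000;

    /* Note: a zero timeval means "block forever", so this only makes
       sense for TimeoutMs > 0 (the fix uses 100 ms) */
    return setsockopt(Socket, SOL_SOCKET, SO_RCVTIMEO,
                      &Timeout, sizeof(Timeout));
}
```

With a 100ms timeout, a receive loop wakes up at most ten times per second when no gimbal is attached, instead of spinning.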

jefffisher commented 6 years ago

Hi Jason,

I am trying to test this change but am unable to reproduce your CPU loading issues. What platform are you seeing this on? Also, I am running the UserData example application – can you build and run that and verify that you still see this issue?

Thanks!

jasonbeach commented 6 years ago

I have seen it both on my linux desktop and the TX1 we are using it on. I ran the UserData program and didn't see it, but it looks like in the ProcessKeyboard function of that program there is a blocking read (on stdin) that would prevent the 100% cpu. There is also a 250ms delay between reads from the gimbal.

In our application we want to pull packets off the socket as soon as they arrive (and publish the telemetry / GPS data they contain), so we are constantly reading from the socket. Since the socket is non-blocking, this drives the cpu to 100%. I think there are two ways to handle this: make the socket blocking (which I know works), or use the select(2) function to check whether there is any data to be read, or block until there is (I haven't used it, but I think that's how it works). Making the socket blocking (with a timeout) is certainly easier, but I guess it would make things tricky if you were trying to read from multiple gimbals at the same time.

Thanks!

jefffisher commented 6 years ago

Actually, the read from stdin is non-blocking courtesy of these settings that are passed to the TTY by tcsetattr:

151         New.c_cc[VMIN] = New.c_cc[VTIME] = 0;
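In other words (my reading of those settings, sketched out; the helper name and the surrounding tcgetattr/tcsetattr calls are assumptions about what the example does), with canonical mode off and VMIN = VTIME = 0, read() on the TTY returns immediately with whatever bytes are available, possibly none:

```c
#include <termios.h>
#include <unistd.h>

/* Configure a termios struct so that read() on the TTY never blocks:
   no line buffering, and no minimum byte count or inter-byte timer. */
static void ConfigureNonBlockingTty(struct termios *pTty)
{
    /* Turn off canonical (line-buffered) input */
    pTty->c_lflag &= ~ICANON;

    /* VMIN = 0, VTIME = 0: read() returns immediately, with or
       without data */
    pTty->c_cc[VMIN] = pTty->c_cc[VTIME] = 0;
}

/* Applied to stdin with something like:
 *   struct termios New;
 *   tcgetattr(STDIN_FILENO, &New);
 *   ConfigureNonBlockingTty(&New);
 *   tcsetattr(STDIN_FILENO, TCSANOW, &New);
 */
```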

So if you were to profile it, you'd find that the UserData application actually spends most of its time in the usleep(250000) call at line 79. In our example applications, we avoid hogging the whole CPU by adding a little sleep between polls of the TCP socket, the length of which varies greatly depending on the application.

I'd guess that your application's CPU load would drop to near zero with even just a 5ms sleep between calls to OrionCommReceive. Without the sleep, it'll enter that function many hundreds or thousands of times per second, until the CPU finally saturates.

If it works for you, I'd suggest trying out the non-blocking implementation in an architecture similar to the one we use in our examples. Here's a brief outline:

// Loop forever
while (1)
{
    // Empty the TCP buffer, looking for any packets that are available on the port
    while (OrionCommReceive(&Pkt))
    {
        // Process packets here
    }

    // Run the whole thing at ~50 Hz (Note: No packet is sent at a rate of >10 Hz)
    usleep(20000);
}

Give that a try and let me know what you think. If that's not going to work out, I'm more than happy to further discuss the prospect of a blocking read with you.

jefffisher commented 6 years ago

Just a quick follow-up: Running the LineOfSight example on my linux VM, here are some benchmarks with various sleep times:

 usleep (µs)  Calls/s    CPU %
===============================
 0            2,800,000  99%
 1000         800        3%
 5000         180        1%
 20000        49         1%

jasonbeach commented 6 years ago

That sums up what I was seeing: making the socket blocking with a timeout had the same effect. It works well, so we'll probably just go with it. I thought I'd share it in case you were interested; if not, that's totally ok.