tvkitchen / countertop

The entry point for developers who want to set up a TV Kitchen.
https://tv.kitchen
GNU Lesser General Public License v3.0

Optimize ingestion performance #77

Closed by chriszs 3 years ago

chriszs commented 4 years ago

Task

Description

After hooking up an HDHomeRun stream to the HTTP ingestion engine, I noticed the stream would cut off after a few minutes, but wouldn't emit an error. I tried debugging this, eventually pulling up the stream in Wireshark and noticing a TCP Window Full event directly before the packets stopped. So I disconnected the input stream from the ingestion engine and just echoed the chunks to the console. This time the stream didn't stop. I hooked up a rudimentary rate calculation and got about 215 chunks per second. When I hooked it back up to the rest of the engine I got rates in the 140 range with far slower console output, before plunging to 80 and then stopping.

So it seems like, to handle real-time TV streams, we need to profile the performance of the engine (including Kafka ingestion) and try to improve it. 215 chunks per second suggests our time budget is about 5 milliseconds per chunk, though it remains to be seen whether that's a naturally occurring rate and whether it's immutable.
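
For reference, the rudimentary rate calculation was along these lines (a minimal sketch, not the actual test code; STREAM_URL is a placeholder environment variable standing in for the HDHomeRun HTTP stream address):

const http = require('http')

// Count incoming chunks and report a rate once per second.
// STREAM_URL is a placeholder for the HDHomeRun HTTP stream address.
let chunkCount = 0

const request = http.get(process.env.STREAM_URL, (response) => {
    response.on('data', () => {
        chunkCount += 1
    })
})
request.on('error', (error) => console.error(error))

setInterval(() => {
    console.log(`${chunkCount} chunks/sec`)
    chunkCount = 0
}, 1000)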

Related Issues

#73

slifty commented 4 years ago

Interesting! In addition to trying to minimize kafka latency, we should consider dramatically increasing the size of the chunks written to kafka (right now we write a ton of very small messages; there's no good reason for that).

We could, for instance, have each CONTAINER payload ultimately be a frame's worth of data (right now I think they're like... a 10th of a frame, or possibly even smaller).
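
One sketch of what that coalescing could look like (purely illustrative, not how countertop currently structures payloads; ChunkCoalescer and the target size are made-up names standing in for "a frame's worth of data"):

const { Transform } = require('stream')

// Placeholder target: tune to roughly a frame's worth of data.
const TARGET_MESSAGE_SIZE = 256 * 1024

// Accumulate many small chunks and emit fewer, larger buffers downstream
// (e.g. toward the Kafka producer).
class ChunkCoalescer extends Transform {
    constructor() {
        super()
        this.pending = []
        this.pendingBytes = 0
    }

    _transform(chunk, encoding, callback) {
        this.pending.push(chunk)
        this.pendingBytes += chunk.length
        if (this.pendingBytes >= TARGET_MESSAGE_SIZE) {
            this.push(Buffer.concat(this.pending))
            this.pending = []
            this.pendingBytes = 0
        }
        callback()
    }

    _flush(callback) {
        if (this.pendingBytes > 0) {
            this.push(Buffer.concat(this.pending))
        }
        callback()
    }
}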

chriszs commented 4 years ago

Yeah, we should look at buffer size. It also occurs to me that JSON may be an inefficient message format for sending binary data. I wonder if we can reduce the size of what Kafka sends. Related to #69.

chriszs commented 4 years ago

Tested switching to sending binary data via Kafka just now. It did seem to improve performance, maybe enough that it won't stall. (Though performance seemed to vary a lot for reasons I couldn't fully determine. I want to understand that better.)

Significantly, though, it sends a lot fewer bytes. I listened to the Kafka output stream with kafka-console-consumer and received 465 JSON-serialized messages, which when redirected to a file amounted to 19.2 MB on disk. With binary data, 741 messages took 4.8 MB. That's about 6.4 times smaller per message. This makes sense when you consider that a single byte from a buffer, encoded in JSON, is typically represented by a two- to three-digit number plus a comma in a Unicode string (e.g. 138,), i.e. three to four characters at two bytes each.

Here's what I'm doing to test this:

// Produce the payload's raw buffer as the message value, with its metadata in headers,
// instead of JSON-serializing the whole payload object.
.send({
    topic: dataTypes.STREAM.CONTAINER,
    messages: [{
        value: payload.data,
        headers: {
            type: payload.type,
            position: JSON.stringify(payload.position),
            duration: JSON.stringify(payload.duration),
            createdAt: payload.createdAt,
        }
    }],
})
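
And for completeness, reading the binary value and headers back out on the consuming side would look roughly like this (a sketch assuming a kafkajs consumer whose client and group setup are omitted):

await consumer.subscribe({ topic: dataTypes.STREAM.CONTAINER })

await consumer.run({
    eachMessage: async ({ message }) => {
        // message.value is the raw Buffer; kafkajs hands header values back as Buffers.
        const data = message.value
        const position = JSON.parse(message.headers.position.toString())
        console.log(`received ${data.length} bytes`, position)
    },
})
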
slifty commented 4 years ago

Oh that's awesome and makes a ton of sense. I feel lighter just reading that update!

chriszs commented 4 years ago

It's becoming clear we need better tools for monitoring and benchmarking throughput in general. In particular, we probably want to monitor and log the ingestion rate, and possibly throw an error if it crawls to nothing for long enough.
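
Something in this direction is what I'm imagining (just a sketch; ThroughputMonitor is a hypothetical helper, the interval and stall threshold are arbitrary, and it isn't wired into the engine anywhere):

// Hypothetical helper: counts ingested bytes and complains if throughput stalls.
class ThroughputMonitor {
    constructor({ intervalMs = 1000, stallIntervals = 10, onStall } = {}) {
        this.bytesThisInterval = 0
        this.stalledIntervals = 0
        this.timer = setInterval(() => {
            console.log(`ingestion: ${this.bytesThisInterval} bytes in the last ${intervalMs}ms`)
            this.stalledIntervals = (this.bytesThisInterval === 0)
                ? this.stalledIntervals + 1
                : 0
            this.bytesThisInterval = 0
            if (this.stalledIntervals >= stallIntervals && onStall) {
                onStall(new Error('Ingestion throughput dropped to zero'))
            }
        }, intervalMs)
    }

    record(chunk) {
        this.bytesThisInterval += chunk.length
    }

    stop() {
        clearInterval(this.timer)
    }
}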

I've tried directly sending binary, Avro serialization and now compression, but my testing has been ad hoc and painstakingly manual and the results vary enough that I'm not sure how valid my observations are.
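
For reference, the compression test is just the per-send compression option (assuming the earlier snippet's producer is a kafkajs producer; GZIP is the codec that ships with kafkajs):

const { CompressionTypes } = require('kafkajs')

// Same send as the earlier snippet, but with the batch gzip-compressed.
await producer.send({
    topic: dataTypes.STREAM.CONTAINER,
    compression: CompressionTypes.GZIP,
    messages: [{ value: payload.data }],
})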

One anecdotal observation is that throughput seems to drop when I add a new consumer. Is that because Kafka, the consumer and producer are all competing for resources on the same machine? Or is there something else going on?

Dropping two research docs that I plan to come back to:

It seems like high throughput should be possible, and in fact I am sometimes seeing it, but sometimes throughput dips precariously.

slifty commented 3 years ago

@chriszs where are we with this issue, do you think? With AvroPayload, are we in reasonable shape?

I'm able to run countertops with a single station smoothly, though I don't know the actual throughput.

slifty commented 3 years ago

Update here:

  1. The stream was shutting off because ffmpeg's stderr was not being ignored, so its buffer filled up and the process stopped. This has been fixed (see the sketch after this list).

  2. AvroPayload is still more efficient so hooray!
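
For anyone who hits something similar, the ffmpeg fix boils down to making sure the child process's stderr is ignored or drained so its pipe buffer can never fill. A minimal illustration with Node's child_process (the ffmpeg arguments are placeholders, not countertop's actual invocation):

const { spawn } = require('child_process')

// If stderr is piped but never read, its pipe buffer eventually fills and ffmpeg blocks.
// Ignoring it (or consuming it) keeps the process running; the arguments here are placeholders.
const ffmpeg = spawn('ffmpeg', ['-i', 'pipe:0', '-f', 'mpegts', 'pipe:1'], {
    stdio: ['pipe', 'pipe', 'ignore'], // stdin, stdout, stderr
})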

There are certainly ways to make ingestion faster but we aren't hitting specific bottlenecks right now so I'm going to close this issue as complete.