Standardize test configuration

j0sh commented 5 years ago

It would help to have a single configuration with known performance characteristics that can be reused across a variety of test cases. This gives us a few benefits:

Better comparability across various test cases
Reduced variability and improved reproducibility
Minimizes guesswork when designing tests
Avoids the combinatorial explosion in testing variations
Better understanding of behavioral characteristics of the system

Here are some parameters to standardize on. Not all parameters will be applicable to all types of tests. This list is just a start, and not meant to be exhaustive:

Transcoding options
Number of segments to run the test over
Number of concurrent streams
Source stream and its characteristics
VM instance types and configurations (-maxSessions etc)
Number of nodes on the network and how much to provision per X streams
Characteristics of the load. How quickly is it ramped up? How long does one stream last?

Often we may actually need to test variations on these standard parameters. That is totally OK and expected! By starting with a known standard configuration, we can keep other variables steady and isolate changes to only one parameter at a time. This allows us to see the effect of just that one parameter change.

angyangie commented 5 years ago

After doing a lot of testing here and here, seems like the following could make sense as standard test configuration.

Transcoding options: P240p30fps16x9, P360p30fps16x9, P720p30fps16x9
Number of segments to run the tests over: 297 segments (the length of a pre-transcoded BigBuckBunny). Depending on what kind of output is needed, use the following.

To output an mp4:

ffmpeg -i BigBuckBunny.mp4 -c:a aac -ac 2 -ar 44100 -c:v libx264 -keyint_min 30 -g 30 official_test_source.mp4

To output .ts segments and an .m3u8 playlist:

ffmpeg -i BigBuckBunny.mp4 -c:a aac -c:v libx264 -keyint_min 30 -g 30 -f hls -hls_time 2 -hls_playlist_type vod hls/bbb.m3u8

Source stream: Big Buck Bunny (pre-transcoded to standardize segment length)
Vm instance types: Machine type: n1-highcpu-4, Manager MachineType: n1-highmem-4, Broadcaster MachineType: n1-highcpu-4, Streamer MachineType: n1-highcpu-4
Number of nodes on the network and how much to provision per X streams: if testing multi-O, 1 O + T (on single node) per stream. If testing multi-T, one O + 3 Ts per stream (all Os and Ts on separate nodes).
Characteristics of the load. Run using the test-harness stream simulator. The simulator runs the different streams as tasks within a Promise.all function, which internally is a foreach-like loop. This loop preserves the order of iterable tasks. The resolved promise values are returned in order regardless of when they get resolved. Promise.all(), therefore, executes the promises in series.

The purpose of these standards is to have a baseline from which to run tests from. Changing just one parameter each time a test is run, will allow for more equal comparisons of performance across tests run by different indvidiuals on the team. 99.9999% reliability can be achieved using very different configurations - it's important that we're able to define what those configurations are, and we all have visibility into that process.

What do people think?

ya7ya commented 5 years ago

Number of nodes on the network and how much to provision per X streams: 10 O's, 10 T's

does that mean 1 to 1 ??

angyangie commented 5 years ago

Yes, thanks for pointing that out. I meant to say 1 to 1! I've updated the comment

j0sh commented 5 years ago

the length of a pre-transcoded BigBuckBunny

Can you describe how this was pre-transcoded? And can the results be put somewhere for further testing?

Number of concurrent streams: one per orchestrator/transcoder pair 1 O and 1 T per stream.

(Are these saying the the same thing?) Does this mean separate nodes, with a T attached to each O ? Or a combined O+T node?

Characteristics of the load.

How is are the streams generated and how is the load ramped up?

angyangie commented 5 years ago

Can you describe how this was pre-transcoded? And can the results be put somewhere for further testing?

Yes! I'll add it to my initial note. Two ways:

To output an mp4: ffmpeg -i BigBuckBunny.mp4 -c:a aac -ac 2 -ar 44100 -c:v libx264 -keyint_min 30 -g 30 official_test_source.mp4
To output .ts segments and an .m3u8 playlist: ffmpeg -i BigBuckBunny.mp4 -c:a copy -c:v libx264 -keyint_min 30 -g 30 -f hls -hls_time 2 -hls_playlist_type vod hls/bbb.m3u8

(Are these saying the the same thing?) Does this mean separate nodes, with a T attached to each O ? Or a combined O+T node?

Good to clarify. Since we do have multi-T support, I thought we'd include the separate O and T as a baseline. One T per O. Are there other considerations to keep in mind here?

How is are the streams generated and how is the load ramped up?

If we're generating using the stream simulator, should we further define how the streams are being generated and how the load is ramped up? It seems to me like the Promise.all within the test-harness does all of that for us?

j0sh commented 5 years ago

Two ways:

Audio is copied in one but encoded in another; is there a reason for that?

I thought we'd include the separate O and T as a baseline. One T per O. Are there other considerations to keep in mind here?

We should definitely add test cases around multi-T, but we probably should distinguish between when we're testing B/O interaction, and when we're testing O/T . Right now it's mostly the former, so maybe keep O+T paired up to minimize the moving parts there.

Also, I'm still unsure of how standalone T works on the test harness. Do O and T each run on dedicated machines? In that case, having a single T per O seems strange given that T is fundamentally a scalability mechanism for O, rather than a reliability mechanism, so we're leaving a lot of capacity on the table with just one T per O. Maybe @ya7ya can elaborate.

If we're generating using the stream simulator, should we further define how the streams are being generated

Yes. I don't know what Promise.all is. The questions I have are:

How does the stream simulator work? Eg, What machines are they running off? What's the command line used to generate the streams? How many streams per machine or container?
How is the load generated. Is it bang, all at once? Trickled out? If so, at what rate?

This may just be "putting into words" the existing behavior of the test harness but that is OK; the intent is to document that.

ya7ya commented 5 years ago

Also, I'm still unsure of how standalone T works on the test harness. Do O and T each run on dedicated machines? In that case, having a single T per O seems strange given that T is fundamentally a scalability mechanism for O, rather than a reliability mechanism, so we're leaving a lot of capacity on the table with just one T per O. Maybe @ya7ya can elaborate.

good question, in the case of multi-T per O support, the resource allocation for T should be 1 per machine, But I don't think the same should be applied to O in this case given it's not actually doing the work.

How does the stream simulator work? Eg, What machines are they running off? What's the command line used to generate the streams? How many streams per machine or container?

stream simulator containers run on the same deployment, but they don't run on the same machines as standalone Os or Ts, and have less reserved resource compared to livepeer containers, their distribution is left to docker swarm to figure out.

How is the load generated. Is it bang, all at once? Trickled out? If so, at what rate?

There is a DELAY variable assigned to each stream simulator container , which defines delay in seconds before starting this stream, it's randomly chosen between 0 and 60 seconds.

angyangie commented 5 years ago

Good question, in the case of multi-T per O support, the resource allocation for T should be 1 per machine, But I don't think the same should be applied to O in this case given it's not actually doing the work.

Interesting. So @ya7ya, as the system works now, does each T run on its own single machine, and each O shares resources with others? How do we determine how many O's can run on a single machine? What if there aren't enough machines for all T's specified - how do we handle that now?

Stream simulator containers run on the same deployment, but they don't run on the same machines as standalone Os or Ts, and have less reserved resource compared to Livepeer containers, their distribution is left to docker swarm to figure out.

How does having less reserved resource affect stream play?

There is a DELAY variable assigned to each stream simulator container , which defines delay in seconds before starting this stream, it's randomly chosen between 0 and 60 seconds.

Does that mean that not all streams are started at the same time? If I'm understanding correctly, after 60 seconds, all streams should have started?

Audio is copied in one but encoded in another; is there a reason for that?

No reason for that. I didn't compare the commands closely enough here. I assume the standard should be to encode audio, and can change the latter command to do so.

We should definitely add test cases around multi-T, but we probably should distinguish between when we're testing B/O interaction, and when we're testing O/T . Right now it's mostly the former, so maybe keep O+T paired up to minimize the moving parts there.

Good point. I'll break those two test-cases up so we have a standard for both. One stream per O+T pair. When O and T are separate, for multi-T testing, perhaps the standard should be: 3 T's per O, 3 streams per O. It seems like that should allow us to test failover scenarios and O's ability to handle multiple streams.

darkdarkdragon commented 5 years ago

@angyangie

Promise.all(), therefore, executes the promises in series.

Promise.all doesn't execute anything, it just waits for all promises to complete. Execution of promise starts at the moment of promise creation. That means that promises sent to Promise.all already executing.

after 60 seconds, all streams should have started?

Yes

How does having less reserved resource affect stream play?

I think that letting docker to share any resources (scatter streamers across machines, share number of Os on save machine) hurts reproducibility of tests, so I plan to change that so no resources will be shared.

angyangie commented 5 years ago

Promise.all doesn't execute anything, it just waits for all promises to complete. Execution of promise starts at the moment of promise creation. That means that promises sent to Promise.all already executing.

Good to know!

I think that letting docker to share any resources (scatter streamers across machines, share number of Os on save machine) hurts reproducibility of tests, so I plan to change that so no resources will be shared.

Can you point me to the code that allows for the scattering of streams across machines? Curious to check it out.

livepeer / test-harness

Standardize test configuration #58