protocol / prodeng

Issues, discussions and documentation from the production engineering team

Thunderdome: Scale Up Experiments #18

Closed iand closed 2 years ago

iand commented 2 years ago

What Is It?

Enable the use of bigger virtual machine instances and replay significant fractions of live traffic to the gateways in an experiment.

Why Are We Doing It?

The previous phases were about building baseline functionality. This phase is about scaling Thunderdome so that experiments can be run under conditions closer to production. By hooking into the live gateway request logs, we are more likely to surface the kinds of bottlenecks and problems that gateway instances face in production. Because the IPFS network is dynamic, replaying live requests for current data is better than using a frozen corpus that goes stale over time. Also, because we will be sending production-level request rates to each gateway under test, we need to be able to give the gateways more resources to cope with the higher load.

Notes

This phase is all about enabling experiments whose performance reflects real-world load. That means better backend infrastructure and replaying significant fractions of the live gateway logs.

The log stream should scale to a large number of dealgood-driven experiments and deliver logs to them promptly, with enough detail for high-fidelity playback.

We can use the existing logs, but they don't tell us, for example, POST payloads (if any), ranges for range requests, or which request headers were set, so we should probably create a separate log for this purpose.

We should pick (or trial several) log streaming or messaging systems to see whether one meets our needs (bonus points if it's hosted).

Project overview is on Notion
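
To make the separate replay log from the notes above more concrete, here is a minimal sketch of what one record could look like and how a fixed fraction of traffic might be selected for replay. Everything in it is an assumption for discussion: `ReplayRecord`, its field names, the JSON-lines encoding and the `in_sample` helper are hypothetical and do not describe the existing gateway logs or dealgood's actual input format.

```python
import base64
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class ReplayRecord:
    """Hypothetical replay log entry; all field names are illustrative."""

    timestamp: str  # when the gateway received the request
    method: str  # e.g. "GET" or "POST"
    path: str  # request path including any query string
    # Request headers as sent by the client. A Range header, if present,
    # is kept here so range requests can be replayed faithfully.
    headers: Dict[str, str] = field(default_factory=dict)
    body: Optional[bytes] = None  # POST payload, if any

    def to_json(self) -> str:
        """Encode the record as one JSON line; the body is base64 encoded."""
        data = asdict(self)
        if self.body is not None:
            data["body"] = base64.b64encode(self.body).decode("ascii")
        return json.dumps(data, sort_keys=True)

    @classmethod
    def from_json(cls, line: str) -> "ReplayRecord":
        """Decode a record produced by to_json."""
        data = json.loads(line)
        if data["body"] is not None:
            data["body"] = base64.b64decode(data["body"])
        return cls(**data)


def in_sample(record: ReplayRecord, fraction: float) -> bool:
    """Decide deterministically whether a record is in the replayed fraction.

    The request is hashed into a bucket in [0, 1), so every consumer of the
    stream selects the same subset for a given fraction.
    """
    key = f"{record.timestamp} {record.method} {record.path}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction


record = ReplayRecord(
    timestamp="2022-08-01T12:00:00Z",
    method="GET",
    path="/ipfs/<cid>/file.bin",  # placeholder path, not a real CID
    headers={"Range": "bytes=0-1023"},
)
line = record.to_json()
restored = ReplayRecord.from_json(line)
```

Hashing the request instead of drawing a random number keeps the selection stable across experiments, which may matter if several gateways under test should see the same subset of requests.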

Tasks