openedx / openedx-aspects

Aspects - Analytics for Open edX

Performance testing and improvement #195

bmtcril commented 4 months ago

As we approach our v1 release, it is important that we develop and maintain methodologies for testing the performance of various parts of the system and periodically checking for regressions. This epic holds the high-level tasks for managing that work.

Event delivery

We currently have several ways of delivering xAPI events to ClickHouse. For each of these, we should document a methodology and a reference configuration for testing how much event throughput to ClickHouse the backend can sustain before issues emerge (queues filling up, task sizes growing, delivery times lagging, etc.).

We should be able to emulate traffic by replaying very large tracking log files with a batch size of 1, lowering the sleep setting until we find the maximum rate the backend can handle. If the backend can keep up with a 0-sleep loop, we should run additional processes until it breaks. A sketch of one way to drive this is below.
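
As a rough illustration, here is a minimal sketch of fanning out replay processes in parallel. It assumes the `transform_tracking_logs` management command from event-routing-backends and its `--batch_size` / `--sleep_between_batches_secs` options; the source/destination provider configs and paths are placeholders to adapt to the deployment, not a tested setup.

```python
# Hedged sketch: fan out N tracking log replay processes against the LMS.
# Assumes event-routing-backends' transform_tracking_logs command; the
# provider configs below are placeholders, not a verified configuration.
import subprocess

NUM_PROCESSES = 4  # increase until the backend falls behind

cmd = [
    "tutor", "local", "run", "lms",
    "./manage.py", "lms", "transform_tracking_logs",
    "--source_provider", "LOCAL",
    "--source_config", '{"key": "/openedx/data", "container": "logs", "prefix": "tracking.log"}',
    "--destination_provider", "LRS",
    "--batch_size", "1",
    "--sleep_between_batches_secs", "0",
]

# Launch the replay processes in parallel and wait for all of them.
procs = [subprocess.Popen(cmd) for _ in range(NUM_PROCESSES)]
for proc in procs:
    proc.wait()
```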

We should be careful to constrain the configurations to be roughly equivalent in resources / cost, to emulate a production environment for a mid-sized system, and to use the same version of Aspects for all tests.

Template for reporting results:

Test system configuration:
- Tutor version
- Aspects version
- Environment specifications (local / k8s, CPU / Memory / Disk resources allocated)

Load generation specifications:
- Tool
- Exact script
- Any custom settings for things like sleep time and # of processes

Data captured for results:
- Length of run
- Sleep time / batch size
- We should capture values for these every 10 seconds (see the sampling sketch after this list):
  - Latency of events in ClickHouse (now - most recent event's emission time)
  - Queue size (if applicable), e.g. pending tasks in Celery, pending stream size in Redis
  - Total events in ClickHouse
  - Query times for 2-3 ClickHouse reporting queries (as taken from Superset)
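
A minimal sketch of that sampling loop, assuming the `clickhouse-connect` Python client and the `xapi.xapi_events_all` table / `emission_time` column (adjust names to match the deployment); the queue-size probe is stubbed out because it depends on the backend under test.

```python
# Hedged sketch: sample delivery metrics from ClickHouse every 10 seconds
# and append them to a CSV. Table/column names are assumptions to verify.
import csv
import time

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

with open("load_test_stats.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["wall_time", "latency_secs", "total_events"])
    while True:
        # Latency: now minus the most recent event's emission time.
        latency = client.query(
            "SELECT now() - max(emission_time) FROM xapi.xapi_events_all"
        ).result_rows[0][0]
        # Total events delivered so far.
        total = client.query(
            "SELECT count() FROM xapi.xapi_events_all"
        ).result_rows[0][0]
        # TODO: also record queue size for the backend under test
        # (e.g. pending Celery tasks or Redis stream length).
        writer.writerow([time.time(), latency, total])
        outfile.flush()
        time.sleep(10)
```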

Query performance

On a load test dataset, run every reporting query we have (as captured from "show SQL" in Superset), with and without any applicable filters, to see how they perform. We should run each query 5x and capture the response times and number of rows returned (see the sketch below). It should also be possible to capture the queries by browsing each chart with different filters and then pulling the SQL from the ClickHouse query logs.
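
A minimal sketch of the 5x benchmark loop, again assuming `clickhouse-connect`; the entry in `queries` is a stand-in to be replaced with the real SQL pulled from Superset's "show SQL".

```python
# Hedged sketch: run each reporting query 5x and record duration and rows.
# The query below is a stand-in; paste the real Superset SQL in its place.
import time

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

queries = {
    # short name -> raw SQL captured from Superset's "show SQL"
    "enrollments_no_filter": "SELECT count() FROM xapi.xapi_events_all",
}

for name, sql in queries.items():
    for run in range(1, 6):
        start = time.monotonic()
        result = client.query(sql)
        duration = time.monotonic() - start
        print(f"{name} run {run}: {duration:.3f}s, {len(result.result_rows)} rows")
```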

We should be careful to capture the xapi-db-load configuration used to generate the data so we can regenerate it as necessary.

Template for reporting results:

Test ClickHouse configuration:
- Deployment type (local / k8s / ClickHouse Cloud / Altinity, ...)
- Hardware or configuration specs
- Total rows in ClickHouse

For each query:
- Query short name (e.g. "enrollments, no filter"; "enrollments, enrollment type filter")
- Raw query
- Duration
- Rows returned
bmtcril commented 3 months ago

All of the data collected for the first set of load tests is in these two files:

- load_test_stats_1.txt
- load_test_runs_1.txt

The Superset dashboard I'm using, along with its associated datasets, can be imported from this zip:

dashboard_export_20240404T203605.zip