openedx / openedx-aspects

Aspects - Analytics for Open edX

Performance testing and improvement #195

bmtcril commented 4 months ago

As we approach our v1 release, it is important that we develop and maintain methodologies for testing the performance of various parts of the system and periodically checking for regressions. This epic holds the high-level tasks for managing that work.

Event delivery

We currently have several ways of delivering xAPI events to ClickHouse. For each of these, we should document a methodology and a reference configuration for testing how much event throughput to ClickHouse the backend can sustain before issues emerge (queues filling up, task sizes growing, delivery times lagging, etc.).

We should be able to emulate traffic by replaying very large tracking log files with a batch size of 1, lowering the sleep setting until we find the maximum rate the backend can handle. If the backend can keep up with a 0-sleep loop, we should run additional processes until it breaks. A sketch of one way to drive this is below.
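
As a rough illustration, here is a minimal sketch of fanning out replay processes in parallel. It assumes the `transform_tracking_logs` management command from event-routing-backends and its `--batch_size` / `--sleep_between_batches_secs` options; the source/destination provider configs and paths are placeholders to adapt to the deployment, not a tested setup.

```python
# Hedged sketch: fan out N tracking log replay processes against the LMS.
# Assumes event-routing-backends' transform_tracking_logs command; the
# provider configs below are placeholders, not a verified configuration.
import subprocess

NUM_PROCESSES = 4  # increase until the backend falls behind

cmd = [
    "tutor", "local", "run", "lms",
    "./manage.py", "lms", "transform_tracking_logs",
    "--source_provider", "LOCAL",
    "--source_config", '{"key": "/openedx/data", "container": "logs", "prefix": "tracking.log"}',
    "--destination_provider", "LRS",
    "--batch_size", "1",
    "--sleep_between_batches_secs", "0",
]

# Launch the replay processes in parallel and wait for all of them.
procs = [subprocess.Popen(cmd) for _ in range(NUM_PROCESSES)]
for proc in procs:
    proc.wait()
```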

We should be careful to constrain the configurations to be roughly equivalent in resources / cost, to emulate a production environment for a mid-sized system, and to use the same version of Aspects for all tests.

Template for reporting results:

Test system configuration:
- Tutor version
- Aspects version
- Environment specifications (local / k8s, CPU / Memory / Disk resources allocated)

Load generation specifications:
- Tool
- Exact script
- Any custom settings for things like sleep time and # of processes

Data captured for results:
- Length of run
- Sleep time / batch size
- We should capture values for these every 10 seconds (see the sampling sketch after this list):
  - Latency of events in ClickHouse (now - most recent event's emission time)
  - Queue size (if applicable), e.g. pending tasks in Celery, pending stream size in Redis
  - Total events in ClickHouse
  - Query times for 2-3 ClickHouse reporting queries (as taken from Superset)
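
A minimal sketch of that sampling loop, assuming the `clickhouse-connect` Python client and the `xapi.xapi_events_all` table / `emission_time` column (adjust names to match the deployment); the queue-size probe is stubbed out because it depends on the backend under test.

```python
# Hedged sketch: sample delivery metrics from ClickHouse every 10 seconds
# and append them to a CSV. Table/column names are assumptions to verify.
import csv
import time

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

with open("load_test_stats.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["wall_time", "latency_secs", "total_events"])
    while True:
        # Latency: now minus the most recent event's emission time.
        latency = client.query(
            "SELECT now() - max(emission_time) FROM xapi.xapi_events_all"
        ).result_rows[0][0]
        # Total events delivered so far.
        total = client.query(
            "SELECT count() FROM xapi.xapi_events_all"
        ).result_rows[0][0]
        # TODO: also record queue size for the backend under test
        # (e.g. pending Celery tasks or Redis stream length).
        writer.writerow([time.time(), latency, total])
        outfile.flush()
        time.sleep(10)
```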

Query performance

On a load test dataset, run every reporting query we have (as captured from "show SQL" in Superset), with and without any applicable filters, to see how they perform. We should run each query 5x and capture the response times and number of rows returned (see the sketch below). It should also be possible to capture the queries by browsing each chart with different filters and then pulling the SQL from the ClickHouse query logs.
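
A minimal sketch of the 5x benchmark loop, again assuming `clickhouse-connect`; the entry in `queries` is a stand-in to be replaced with the real SQL pulled from Superset's "show SQL".

```python
# Hedged sketch: run each reporting query 5x and record duration and rows.
# The query below is a stand-in; paste the real Superset SQL in its place.
import time

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

queries = {
    # short name -> raw SQL captured from Superset's "show SQL"
    "enrollments_no_filter": "SELECT count() FROM xapi.xapi_events_all",
}

for name, sql in queries.items():
    for run in range(1, 6):
        start = time.monotonic()
        result = client.query(sql)
        duration = time.monotonic() - start
        print(f"{name} run {run}: {duration:.3f}s, {len(result.result_rows)} rows")
```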

We should be careful to capture the xapi-db-load configuration used to generate the data so we can regenerate it as necessary.

Template for reporting results:

Test ClickHouse configuration:
- Deployment type (local / k8s / ClickHouse Cloud / Altinity, ...)
- Hardware or configuration specs
- Total rows in ClickHouse

For each query:
- Query short name (e.g. "enrollments, no filter"; "enrollments, enrollment type filter")
- Raw query
- Duration
- Rows returned
bmtcril commented 3 months ago

All of the data collected for the first set of load tests is in these two files:

- load_test_stats_1.txt
- load_test_runs_1.txt

The Superset dashboard I'm using, along with its associated datasets, can be imported from this zip:

dashboard_export_20240404T203605.zip