servo / intermittent-tracker

A live database of intermittent test failures based on GitHub's webhook notifications.
https://build.servo.org/intermittent-tracker/query.py?name=
Mozilla Public License 2.0

GET /dashboard/attempts becomes slow and huge as the database grows #23

Open delan opened 1 week ago

delan commented 1 week ago

With ~71K rows in "attempt" (~29 days' worth), the response with no filters or "since" parameter is over 40 MB, which is very unwieldy, even though the dashboard only makes this request once per page load. The endpoint also takes 380 ms to start sending a response, >210 ms of which is spent in DashboardDB.select_attempts:

$ time curl -fsSIo /dev/null http://localhost:5001/dashboard/attempts
real    0m0.378s
Oct 31 05:53:14 ci0 intermittent-tracker-staging-start[3257485]: [2024-10-31 05:53:14,419] DEBUG in db: DashboardDB.select_attempts took 216544722 ns
Oct 31 05:53:14 ci0 intermittent-tracker-staging-start[3257485]: 127.0.0.1 - - [31/Oct/2024 05:53:14] "HEAD /dashboard/attempts HTTP/1.1" 200 -

These response times, and the server’s memory usage, also increase greatly under contention:

$ cat a
#!/bin/sh
curl -Io /dev/null https://staging.intermittent-tracker.servo.org/dashboard/attempts

$ chmod +x a
$ yes | xargs -P 16 ./a
Oct 31 06:01:26 ci0 intermittent-tracker-staging-start[3257485]: [2024-10-31 06:01:26,461] DEBUG in db: DashboardDB.select_attempts took 25573638908 ns
Oct 31 06:01:26 ci0 intermittent-tracker-staging-start[3257485]: 127.0.0.1 - - [31/Oct/2024 06:01:26] "HEAD /dashboard/attempts HTTP/1.1" 200 -
Oct 31 06:01:28 ci0 intermittent-tracker-staging-start[3257485]: [2024-10-31 06:01:28,810] DEBUG in db: DashboardDB.select_attempts took 27892397139 ns
Oct 31 06:01:29 ci0 intermittent-tracker-staging-start[3257485]: 127.0.0.1 - - [31/Oct/2024 06:01:29] "HEAD /dashboard/attempts HTTP/1.1" 200 -
Oct 31 06:01:29 ci0 intermittent-tracker-staging-start[3257485]: [2024-10-31 06:01:29,187] DEBUG in db: DashboardDB.select_attempts took 28279068053 ns
Oct 31 06:01:29 ci0 intermittent-tracker-staging-start[3257485]: 127.0.0.1 - - [31/Oct/2024 06:01:29] "HEAD /dashboard/attempts HTTP/1.1" 200 -
[…]
Oct 31 06:01:33 ci0 intermittent-tracker-staging-start[3257485]: [2024-10-31 06:01:33,668] DEBUG in db: DashboardDB.select_attempts took 32682824065 ns
Oct 31 06:01:33 ci0 intermittent-tracker-staging-start[3257485]: 127.0.0.1 - - [31/Oct/2024 06:01:33] "HEAD /dashboard/attempts HTTP/1.1" 200 -
Oct 31 06:01:34 ci0 intermittent-tracker-staging-start[3257485]: [2024-10-31 06:01:34,103] DEBUG in db: DashboardDB.select_attempts took 33074041413 ns
Oct 31 06:01:34 ci0 intermittent-tracker-staging-start[3257485]: 127.0.0.1 - - [31/Oct/2024 06:01:34] "HEAD /dashboard/attempts HTTP/1.1" 200 -
Oct 31 06:01:34 ci0 intermittent-tracker-staging-start[3257485]: [2024-10-31 06:01:34,555] DEBUG in db: DashboardDB.select_attempts took 33596522893 ns
Oct 31 06:01:34 ci0 intermittent-tracker-staging-start[3257485]: 127.0.0.1 - - [31/Oct/2024 06:01:34] "HEAD /dashboard/attempts HTTP/1.1" 200 -
delan commented 1 week ago

To fix this, we need to avoid sending the whole database to the client, even if there are no filters. This is tricky, because the dashboard frontend currently relies on being aware of all data matching the given filters, in order to do the “all results have the same” analysis correctly.

Why? After all, being able to see all historical data from the web UI is nice to have, but it was never a design goal. The dashboard was, however, designed to support all of the following at the same time:

These requirements may not be set in stone though, and that could affect the solution. For example, maybe we don’t always need live updates, or maybe we don’t need them at all. Either way, I think we need to move the “all results have the same” analysis to the server, and limit the number of results ever returned to the client. If you want to see really old data, dig into the database yourself.
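
For illustration, here is a minimal sketch of what a server-side version of that analysis could look like, assuming the SQLite "attempt" table and hypothetical function and column names (select_attempt_summary, path, actual, time); the real schema and DashboardDB internals may differ:

import sqlite3

def select_attempt_summary(con: sqlite3.Connection, since: float = 0.0, limit: int = 1000):
    # Aggregate per test instead of returning every attempt row: the
    # "all results have the same" check becomes MIN(actual) = MAX(actual),
    # and LIMIT caps how much data the client can ever receive.
    return con.execute(
        """
        SELECT path,
               COUNT(*) AS attempts,
               MIN(actual) = MAX(actual) AS all_same_result,
               MAX(time) AS last_seen
        FROM attempt
        WHERE time >= ?
        GROUP BY path
        ORDER BY last_seen DESC
        LIMIT ?
        """,
        (since, limit),
    ).fetchall()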

mrobinson commented 1 week ago

I think the default view should show either what most-flaky shows or what least-flaky shows, but clamp the number of results shown; a sketch of what such a query might look like follows the list below. These views are quite useful:

  1. most-flaky is useful to know if there is a test that is flaking a lot -- to identify the highest-value flaky tests to fix.
  2. least-flaky is useful to know if a test has stopped flaking, so we can close the intermittent bug for it.
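
A rough sketch of a clamped most-flaky/least-flaky ranking, again assuming the SQLite "attempt" table and hypothetical names (flaky_ranking, path, time), not the dashboard's actual queries:

import sqlite3
import time

def flaky_ranking(con: sqlite3.Connection, days: int = 30, limit: int = 50, least: bool = False):
    # Rank tests by how many attempts were recorded in the window,
    # most (or least) flaky first, and clamp the number of rows returned.
    since = time.time() - days * 86400
    order = "ASC" if least else "DESC"
    return con.execute(
        f"""
        SELECT path, COUNT(*) AS flakes
        FROM attempt
        WHERE time >= ?
        GROUP BY path
        ORDER BY flakes {order}
        LIMIT ?
        """,
        (since, limit),
    ).fetchall()

One caveat with a sketch like this: a least-flaky ordering over the attempt table alone only ranks tests that still have recent attempts, so a test that has stopped flaking entirely would simply drop out of the results rather than appear at the bottom.
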
mrobinson commented 1 week ago

Regarding live updates, I think we almost never need them. The flakiness results are incredibly noisy, so the delta of data from any single run isn't very useful.