repology / repology-rs

Repology rewrite in Rust
GNU General Public License v3.0
16 stars 0 forks source link

Optimize repository history #120

Open AMDmi3 opened 4 hours ago

AMDmi3 commented 4 hours ago

Repositories history (the one used for graphs) is currenty stored as timestamped json of all per-repository counters, like this:

{
  "t2": {
    "num_problems": 1799,
    "num_maintainers": 74,
    "num_metapackages": 5398,
    "num_metapackages_newest": 2861,
    "num_metapackages_unique": 473,
    "num_metapackages_outdated": 1761,
    "num_metapackages_comparable": 4668,
    "num_metapackages_vulnerable": 194,
    "num_metapackages_problematic": 269
  },
  "aur": {
    "num_problems": 7578,
    "num_maintainers": 14430,
    "num_metapackages": 75715,
    "num_metapackages_newest": 23697,
    "num_metapackages_unique": 36915,
    "num_metapackages_outdated": 8363,
    "num_metapackages_comparable": 32114,
    "num_metapackages_vulnerable": 382,
    "num_metapackages_problematic": 276
  },
  "mpr": {
    "num_problems": 36,
    "num_maintainers": 110,
    "num_metapackages": 713,
    "num_metapackages_newest": 173,
    "num_metapackages_unique": 125,
    "num_metapackages_outdated": 309,
    "num_metapackages_comparable": 491,
    "num_metapackages_vulnerable": 31,
    "num_metapackages_problematic": 12
  },
  "pld": {
    "num_problems": 0,
    "num_maintainers": 0,
    "num_metapackages": 12567,
    "num_metapackages_newest": 5258,
    "num_metapackages_unique": 1595,
    ...

At a first glance, it looks like JSON overhead here is tremendous and converting the counters into classic SQL columns, also splitting these into per-repository entries would be much more optimal. In practice that's not the case, and doing so produces table of more than 2x size (988 MB jsonb, 2316 MB columns), not counting index sizes. I assume that jsonb variant benefits from TOAST compression and from lack of postgresql row overhead (with 381 active repositories, that's around 9kb per history point).

However, the conversion allows to handle each repository independently, which makes lookups faster (no need to fetch counters for 400 repositories for only one of them), and also allows to drop history points which did not have any counter changes. The latter allows to shrink the table to 387 MB (+145 MB index), at the cost of need for periodic cleanup, or more complex history update logic (if previous history point is the same, update its timestamp instead of adding a new one), which I'd like to avoid to implement with current sql-based updater.

Some more tests are needed, but it looks like the columns variant is more viable. At the very least, we should pick one as currently both histories are stored which takes precious disk space.

Migration query:

TRUNCATE repositories_history_new; 

WITH translated AS (
    SELECT
        (SELECT id FROM repositories WHERE name = key) AS repository_id,
        ts,
        (value->>'num_problems')::integer AS num_problems,
        (value->>'num_maintainers')::integer AS num_maintainers,
        (value->>'num_metapackages')::integer AS num_projects,
        (value->>'num_metapackages_unique')::integer AS num_projects_unique,
        (value->>'num_metapackages_newest')::integer AS num_projects_newest,
        (value->>'num_metapackages_outdated')::integer AS num_projects_outdated,
        (value->>'num_metapackages_comparable')::integer AS num_projects_comparable,
        (value->>'num_metapackages_problematic')::integer AS num_projects_problematic,
        (value->>'num_metapackages_vulnerable')::integer AS num_projects_vulnerable
    FROM repositories_history, jsonb_each(snapshot) as j(key, value)
), extended AS (
    select
        *,
        lag(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_problems,
        lag(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_maintainers,
        lag(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects,
        lag(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_unique,
        lag(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_newest,
        lag(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_outdated,
        lag(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_comparable,
        lag(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_problematic,
        lag(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_vulnerable,

        lead(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_problems,
        lead(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_maintainers,
        lead(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects,
        lead(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_unique,
        lead(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_newest,
        lead(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_outdated,
        lead(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_comparable,
        lead(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_problematic,
        lead(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_vulnerable
    FROM translated
    WHERE repository_id IS NOT NULL
)
INSERT INTO repositories_history_new
SELECT repository_id, ts, num_problems, num_maintainers, num_projects, num_projects_unique, num_projects_newest, num_projects_outdated, num_projects_comparable, num_projects_problematic, num_projects_vulnerable
FROM extended
WHERE num_problems IS DISTINCT FROM prev_num_problems
    OR num_maintainers IS DISTINCT FROM prev_num_maintainers
    OR num_projects IS DISTINCT FROM prev_num_projects
    OR num_projects_unique IS DISTINCT FROM prev_num_projects_unique
    OR num_projects_newest IS DISTINCT FROM prev_num_projects_newest
    OR num_projects_outdated IS DISTINCT FROM prev_num_projects_outdated
    OR num_projects_comparable IS DISTINCT FROM prev_num_projects_comparable
    OR num_projects_problematic IS DISTINCT FROM prev_num_projects_problematic
    OR num_projects_vulnerable IS DISTINCT FROM prev_num_projects_vulnerable

    OR num_problems IS DISTINCT FROM next_num_problems
    OR num_maintainers IS DISTINCT FROM next_num_maintainers
    OR num_projects IS DISTINCT FROM next_num_projects
    OR num_projects_unique IS DISTINCT FROM next_num_projects_unique
    OR num_projects_newest IS DISTINCT FROM next_num_projects_newest
    OR num_projects_outdated IS DISTINCT FROM next_num_projects_outdated
    OR num_projects_comparable IS DISTINCT FROM next_num_projects_comparable
    OR num_projects_problematic IS DISTINCT FROM next_num_projects_problematic
    OR num_projects_vulnerable IS DISTINCT FROM next_num_projects_vulnerable
;

Cleanup query:

WITH extended AS (
    SELECT *,
        lag(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_problems,
        lag(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_maintainers,
        lag(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects,
        lag(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_unique,
        lag(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_newest,
        lag(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_outdated,
        lag(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_comparable,
        lag(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_problematic,
        lag(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS prev_num_projects_vulnerable,

        lead(num_problems) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_problems,
        lead(num_maintainers) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_maintainers,
        lead(num_projects) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects,
        lead(num_projects_unique) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_unique,
        lead(num_projects_newest) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_newest,
        lead(num_projects_outdated) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_outdated,
        lead(num_projects_comparable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_comparable,
        lead(num_projects_problematic) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_problematic,
        lead(num_projects_vulnerable) OVER (PARTITION BY repository_id ORDER BY ts) AS next_num_projects_vulnerable
    FROM repositories_history_new
    --WHERE repository_id = 113
    --WHERE ts > now() - interval '1 day'
    ORDER BY ts
), duplicates AS (
    SELECT repository_id, ts
    FROM extended
    WHERE num_problems IS NOT DISTINCT FROM prev_num_problems
        AND num_maintainers IS NOT DISTINCT FROM prev_num_maintainers
        AND num_projects IS NOT DISTINCT FROM prev_num_projects
        AND num_projects_unique IS NOT DISTINCT FROM prev_num_projects_unique
        AND num_projects_newest IS NOT DISTINCT FROM prev_num_projects_newest
        AND num_projects_outdated IS NOT DISTINCT FROM prev_num_projects_outdated
        AND num_projects_comparable IS NOT DISTINCT FROM prev_num_projects_comparable
        AND num_projects_problematic IS NOT DISTINCT FROM prev_num_projects_problematic
        AND num_projects_vulnerable IS NOT DISTINCT FROM prev_num_projects_vulnerable

        AND num_problems IS NOT DISTINCT FROM next_num_problems
        AND num_maintainers IS NOT DISTINCT FROM next_num_maintainers
        AND num_projects IS NOT DISTINCT FROM next_num_projects
        AND num_projects_unique IS NOT DISTINCT FROM next_num_projects_unique
        AND num_projects_newest IS NOT DISTINCT FROM next_num_projects_newest
        AND num_projects_outdated IS NOT DISTINCT FROM next_num_projects_outdated
        AND num_projects_comparable IS NOT DISTINCT FROM next_num_projects_comparable
        AND num_projects_problematic IS NOT DISTINCT FROM next_num_projects_problematic
        AND num_projects_vulnerable IS NOT DISTINCT FROM next_num_projects_vulnerable
)
DELETE FROM repositories_history_new
USING duplicates
WHERE repositories_history_new.repository_id = duplicates.repository_id
    AND repositories_history_new.ts = duplicates.ts;

Migration without deduplication then cleanup produces the same result as deduplicating migration. However cleanup on 2GB table takes very long time, so it may be faster if it's limited with repository. Cleanup on mostly deduplicated table is fast, and is even faster if limited with age, so it's safe to run periodically or after the update.

Additional thing: for this to work, SQL queries for /graph/repo/* endpoints should be tweaked to fetch additional row just before desired range of 21 day, as the history is now sparse and the graph may now only contain one recent data point, with which it's not possible to plot any lines, so we need an extra point (in fact, we need it always not to have a gap at the beginning of the graph).

AMDmi3 commented 2 hours ago

Note: it turned out we already have deduplication of fresh entries in repositories_history_new.

AMDmi3 commented 2 hours ago

We also don't really want to store historical data on per-update granularity - it also makes sense to leave ~one point per day for history older than a year.

AMDmi3 commented 2 hours ago

And we can do the same for total statistics. Not that it would conserve much space, but for consistency at the very least.