isomerpages / isomercms-backend

A static website builder and host for the Singapore Government
5 stars 1 forks source link

feat: optimise local diff #1334

Closed jacobkwan closed 5 months ago

jacobkwan commented 5 months ago

Problem

Closes https://linear.app/ogp/issue/ISOM-833/benchmark-impact-of-local-git-diff-on-cpu-usage Local diff was implemented a while back but kept behind a feature flag. Feature flag has not been enabled in production due to concerns about performance especially for larger diffs.

Solution

Understanding bottlenecks via Datadog

Everything below is in the context of motac2 with ~444 files changed in one RR Trace for unoptimised implementation Significant bottlenecks (in order of priority):

  1. getFilesChanged 27s / 40s or ~66%
  2. calling getLatestCommitOfPath for every single file changed
    • called in parallel but still each call is about 10s / 40s or 25%
  3. querying Users for every commit to get email (@timotheeg had already fixed this for the GitHub diff flow), actually doesn't add much to latency

Optimisations

  1. Refactor getFilesChanged to call git diff-tree -r --name-only master..staging
    • this change alone cuts the 27s to ~70ms
  2. Refactor the entire flow. Like @timotheeg mentioned, writing "clean" and "elegant" code in the original implementation - mapping over the filenames and performing the operations on that individual level - is not always the best when it comes to efficiency. Additionally, there is extra network cost from interacting with EFS many times.
    • instead getCommitsBetweenMasterAndStaging once
    • loop over these commits, constructing a filenameToLogMap. At the same time, get unique userIds from commit messages to dedupe queries
    • query DB for the unique users and construct a userIdToUserMap
    • finally use the two maps to construct the full response for each filename - minimal changes here just that it no longer needs to query DB and get latest commit from EFS for every single filename, as the two data structures are sufficient.

Final flame graph after optimisations:

Bottleneck is now the many parallel reads on EFS. Out of scope for this PR and its also current behaviour for GitHub flow. TLDR: reads are needed to get the permalink from frontmatter, which is then used to construct the stagingUrl

https://github.com/isomerpages/isomercms-backend/assets/77665182/44d530aa-da0c-4a94-88a1-b24978d13cdf

Breaking Changes

Tests

Copied over tests from https://github.com/isomerpages/isomercms-backend/pull/1172

linear[bot] commented 5 months ago

ISOM-833 Benchmark impact of local Git diff on CPU usage