History tab times out at scale

mollylogue commented 2 years ago

On a very large repo with a lot of history (see the gigarepo w/ nearly 2 million commits) the history tab for a file times out and returns a 504.

TODO:

[x] Determine which portion of the request is taking the most time
[ ] Identify ways to optimize this to avoid timeouts.

/cc @jplahn @ryphil

indradhanush commented 2 years ago

@mollylogue and I have been deep diving this issue this week and here are our findings so far:

The top level root cause is the graphql?FetchCommits API call
The request takes most of the time in the git log command that fetches 100 commits
- The exact git command is similar to git log --format=format:%H%x00%aN%x00%aE%x00%at%x00%cN%x00%cE%x00%ct%x00%B%x00%P%x00 -n 101 481e785f932059def1500224b6c2384baedf9842 -- git/fsck.c
The exact GQL query being run is the following and can be tried in the API console:

Query

query FetchCommits($repo: ID!, $revision: String!, $first: Int, $currentPath: String, $query: String) {
  node(id: $repo) {
    __typename
    ... on Repository {
      commit(rev: $revision) {
        ancestors(first: $first, query: $query, path: $currentPath) {
          ...CommitAncestorsConnectionFields
        }
      }
    }
  }
}

fragment GitCommitFields on GitCommit {
  id
  oid
  abbreviatedOID
  message
  subject
  body
  author {
    ...SignatureFields
  }
  committer {
    ...SignatureFields
  }
  parents {
    oid
    abbreviatedOID
    url
  }
  url
  canonicalURL
  externalURLs {
    ...ExternalLinkFields
  }
  tree(path: "") {
    canonicalURL
  }
}

fragment SignatureFields on Signature {
  person {
    avatarURL
    name
    email
    displayName
    user {
      id
      username
      url
      displayName
    }
  }
  date
}

fragment ExternalLinkFields on ExternalLink {
  url
  serviceKind
}

fragment CommitAncestorsConnectionFields on GitCommitConnection {
  nodes {
    ...GitCommitFields
  }
  pageInfo {
    hasNextPage
  }
}

Variables

{
  "currentPath": "git/fsck.c",
  "first": 100,
  "query": "",
  "repo": "UmVwb3NpdG9yeToyNjExMg==",
  "revision": ""
}

Note: The value of repo depends on the repo ID, and in this case is the value of the gigarepo on the scaletesting instance.

The request with "first": 100 times out. But using ""first": 1" takes about ~42 seconds. And a value of 2 takes a similar amount of time. A value of 50 takes about 47 seconds while using 70 breaches the 50 second mark and 75 eventually leads to a timeout (> 60 seconds).

API console with query on scaletesting

Dogfood

What's interesting is that the gigarepo on dogfood does not seem to suffer from the same problem. The history tab loads although it takes a while (expectedly).

gigarepo on Dogfood with history tab

Tracing this leads to the findings noted above about the slowest part being the git log command: Trace URL

Blockers

At the moment tracing is not available on scaletesting, and I've attempted to configure it but didn't get far. If we can get that working, we can trace the request from the API console with a low value of first to verify if the trace on dogfood is similar to that on scaletesting or not.

Dogfood vs Scaletesting

At the moment, the question begs, what's different between the two repos on each of the instances? And what's different between the infrastructure setup of the two instances?

Repo

On dogfood:

bash-5.1$ git rev-list main --count
2025373

On scaletesting:

bash-5.1$ git rev-list master --count
1889056

Surprisingly, the dogfood replica has more commits than the one on scaletesting. Also the default branch is different. Is that enough for further investigation?

CPU

Both the instancs have the following CPU and memory limits on their respectie pods (gitserver-3):

    Limits:
      cpu:     8
      memory:  16G
    Requests:
      cpu:     8
      memory:  16G

Suggested debugging with the affected customer

Now that we can reliably reproduce a slow request from the API console instead of it timing out, we can trace the API request with a lower value of first with the customer to verify if their issue is also with the slow git command. Although this is already highly probable.

Other findings

As I ran the raw git command directly on the gitserver-3 pod of the dogfood instance, I noticed that the same command took ~8 seconds to completed while the request from the API console had taked ~16 seconds. When I went back to the UI to verify this, I noticed that the request from the browser now also took ~8 seconds. I tried this with another path and had the same result.

Tried it after a while and the request now took ~21 seconds. I would imagine there's some form of caching in play here but my git-fu is not that strong to be able to state this assertively.

Suggested short term fix

Request 50 commits in the History tab instead of 100
Make a second request to fetch the next 50 if the user scrolls past a certain point
Repeat for loading history all the way back to the start of the big bang
Optionally, ignore the above and configure this as a flag for specific repos to avoid a change for everyone

Suggested long term fix

Deep dive git internals and find ways to optimize this for monorepos (but git wasn't built for monorepos right?)
Talk to other git masters in the team

mrnugget commented 2 years ago

I noticed that the same command took ~8 seconds to completed while the request from the API console had taked ~16 seconds. When I went back to the UI to verify this, I noticed that the request from the browser now also took ~8 seconds. I tried this with another path and had the same result.

This makes me wonder whether it isn't filesystem-level caching behaviour you're seeing here. I'd suggest to try to dig into this and find out what caching behaviour it is that speeds up the command (can strace, blktrace, ftrace, iostat, iotop, etc. help here? see the diagrams on this page for more ideas)

Request 50 commits in the History tab instead of 100

Does the runtime scale linearly with number of commits, for different files? Assuming it is caching behaviour that we're seeing, it'd be interesting to see whether 50 commits is consistently slower than 100 commits, when looking at, say, 100 random files in the repository (get a full list of files, shuffle them a bunch of times, take out 100, run the test for each file)

(Edit: just to make it clear: happy to help with all of that! :))

indradhanush commented 1 year ago

We've recently added a setting to set a custom page size for the history tab which helps as an intermittent workaround.

And I am currently exploring a way to add pagination on the GitCommitsConnection.

mrnugget commented 1 year ago

Follow-up ticket here: https://github.com/sourcegraph/sourcegraph/issues/44993

sourcegraph / sourcegraph-public-snapshot