Open mollylogue opened 2 years ago
@mollylogue and I have been deep diving this issue this week and here are our findings so far:
graphql?FetchCommits
API callgit log
command that fetches 100
commits
git
command is similar to git log --format=format:%H%x00%aN%x00%aE%x00%at%x00%cN%x00%cE%x00%ct%x00%B%x00%P%x00 -n 101 481e785f932059def1500224b6c2384baedf9842 -- git/fsck.c
query FetchCommits($repo: ID!, $revision: String!, $first: Int, $currentPath: String, $query: String) {
node(id: $repo) {
__typename
... on Repository {
commit(rev: $revision) {
ancestors(first: $first, query: $query, path: $currentPath) {
...CommitAncestorsConnectionFields
}
}
}
}
}
fragment GitCommitFields on GitCommit {
id
oid
abbreviatedOID
message
subject
body
author {
...SignatureFields
}
committer {
...SignatureFields
}
parents {
oid
abbreviatedOID
url
}
url
canonicalURL
externalURLs {
...ExternalLinkFields
}
tree(path: "") {
canonicalURL
}
}
fragment SignatureFields on Signature {
person {
avatarURL
name
email
displayName
user {
id
username
url
displayName
}
}
date
}
fragment ExternalLinkFields on ExternalLink {
url
serviceKind
}
fragment CommitAncestorsConnectionFields on GitCommitConnection {
nodes {
...GitCommitFields
}
pageInfo {
hasNextPage
}
}
{
"currentPath": "git/fsck.c",
"first": 100,
"query": "",
"repo": "UmVwb3NpdG9yeToyNjExMg==",
"revision": ""
}
Note: The value of repo
depends on the repo ID, and in this case is the value of the gigarepo
on the scaletesting
instance.
The request with "first": 100
times out. But using ""first": 1"
takes about ~42 seconds. And a value of 2
takes a similar amount of time. A value of 50
takes about 47 seconds while using 70
breaches the 50 second mark and 75
eventually leads to a timeout (> 60 seconds).
API console with query on scaletesting
What's interesting is that the gigarepo
on dogfood does not seem to suffer from the same problem. The history tab loads although it takes a while (expectedly).
gigarepo on Dogfood with history tab
Tracing this leads to the findings noted above about the slowest part being the git log
command: Trace URL
At the moment tracing is not available on scaletesting, and I've attempted to configure it but didn't get far. If we can get that working, we can trace the request from the API console with a low value of first
to verify if the trace on dogfood is similar to that on scaletesting or not.
At the moment, the question begs, what's different between the two repos on each of the instances? And what's different between the infrastructure setup of the two instances?
On dogfood:
bash-5.1$ git rev-list main --count
2025373
On scaletesting:
bash-5.1$ git rev-list master --count
1889056
Surprisingly, the dogfood replica has more commits than the one on scaletesting. Also the default branch is different. Is that enough for further investigation?
Both the instancs have the following CPU and memory limits on their respectie pods (gitserver-3
):
Limits:
cpu: 8
memory: 16G
Requests:
cpu: 8
memory: 16G
Now that we can reliably reproduce a slow request from the API console instead of it timing out, we can trace the API request with a lower value of first
with the customer to verify if their issue is also with the slow git command. Although this is already highly probable.
As I ran the raw git command directly on the gitserver-3
pod of the dogfood instance, I noticed that the same command took ~8 seconds to completed while the request from the API console had taked ~16 seconds. When I went back to the UI to verify this, I noticed that the request from the browser now also took ~8 seconds. I tried this with another path and had the same result.
Tried it after a while and the request now took ~21 seconds. I would imagine there's some form of caching in play here but my git-fu is not that strong to be able to state this assertively.
I noticed that the same command took ~8 seconds to completed while the request from the API console had taked ~16 seconds. When I went back to the UI to verify this, I noticed that the request from the browser now also took ~8 seconds. I tried this with another path and had the same result.
This makes me wonder whether it isn't filesystem-level caching behaviour you're seeing here. I'd suggest to try to dig into this and find out what caching behaviour it is that speeds up the command (can strace
, blktrace
, ftrace
, iostat
, iotop
, etc. help here? see the diagrams on this page for more ideas)
Request 50 commits in the History tab instead of 100
Does the runtime scale linearly with number of commits, for different files? Assuming it is caching behaviour that we're seeing, it'd be interesting to see whether 50 commits is consistently slower than 100 commits, when looking at, say, 100 random files in the repository (get a full list of files, shuffle them a bunch of times, take out 100, run the test for each file)
(Edit: just to make it clear: happy to help with all of that! :))
We've recently added a setting to set a custom page size for the history tab which helps as an intermittent workaround.
And I am currently exploring a way to add pagination on the GitCommitsConnection
.
Follow-up ticket here: https://github.com/sourcegraph/sourcegraph/issues/44993
On a very large repo with a lot of history (see the gigarepo w/ nearly 2 million commits) the history tab for a file times out and returns a 504.
TODO:
/cc @jplahn @ryphil