go-gitea / gitea

Git with a cup of tea! Painless self-hosted all-in-one software development service, including Git hosting, code review, team collaboration, package registry and CI/CD
https://gitea.com
MIT License

Slow repository browsing in 1.14.x #15707

Closed tsowa closed 3 years ago

tsowa commented 3 years ago

Description

I saw a similar thread, but it has "windows" in the title, so I created a new issue. Gitea 1.14.x is much slower at repository browsing than Gitea 1.13.

Sample repo running with 1.14.1: https://gitea.ttmath.org/FreeBSD/ports Try to open any directory, for example: https://gitea.ttmath.org/FreeBSD/ports/src/branch/main/audio It takes between 50 and 150 seconds to open a page.

The same repo running with 1.13.7: https://giteaold.ttmath.org/FreeBSD/ports Try to open a similar directory, for example: https://giteaold.ttmath.org/FreeBSD/ports/src/branch/main/audio It takes about 5 seconds.

You can see the same problem on try.gitea.io: https://try.gitea.io/tsowa/FreeBSD_ports But there is a cache there, so you have to find a directory which was not opened before. Opening such a page takes 100-300 seconds.

Let me know if more info is needed.

zeripath commented 3 years ago

This is because the algorithm was changed in 1.14 due to a problem with go-git causing significant memory issues. Thank you for the test cases though because they will provide tests to improve the current algorithm.

If you are suffering significant slowdowns here, you can switch back to the go-git build by adding gogit to your TAGS when building (e.g. something like TAGS="bindata gogit" make build).

We would otherwise appreciate help in improving the performance of the algorithm for the pure git version.

tsowa commented 3 years ago

Thanks for the hint with TAGS. I don't have time to run more tests now, but I found something interesting.

When browsing my repository with Gitea, I see the following git processes in htop:

22304 root       20   0 12876  2100 S  0.0  0.0  0:00.00 daemon: /usr/bin/env[22305]
22305 git2       31   0  926M  254M S 136.  0.8  1:29.80 └─ /usr/local/sbin/gitea web
22839 git2       21   0  952M  158M S  3.3  0.5  0:01.11    ├─ /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --format=%T 9ea557779ce520c206f223f6f7b48fcc52f92dad
22840 git2       27   0 1103M  275M S 13.5  0.8  0:04.59    └─ /usr/local/bin/git -c credential.helper= -c protocol.version=2 cat-file --batch

These processes were running for about one minute, so I ran the first git command by hand:

$ cd /var/db/gitea2/gitea-repositories/freebsd/ports.git
$ /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --format=%T 9ea557779ce520c206f223f6f7b48fcc52f92dad | wc -l

and it gave me 1087346 rows. I suppose those million-plus rows are then passed to the second git process.

I piped the output from the first git command to the second:

$ /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --format=%T 9ea557779ce520c206f223f6f7b48fcc52f92dad | /usr/local/bin/git -c credential.helper= -c protocol.version=2 cat-file --batch > swinka.txt

It takes about 15 seconds, and the resulting file swinka.txt is larger than 1 GB:

$ ll -h swinka.txt 
-rw-r--r--  1 git2  git2   1,4G 10 maj 22:47 swinka.txt

so there is a lot of data passing between gitea and git. So the question is: does the first git process really need to return a million rows?
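
For illustration, here is the same pipeline wired up in Go - just a sketch of the plumbing being described, not Gitea's actual code (the commit SHA is the one from the example above):

package main

import (
	"fmt"
	"io"
	"os/exec"
)

func main() {
	// First process: list every commit reachable from the tip, with its root tree.
	revList := exec.Command("git", "rev-list", "--format=%T",
		"9ea557779ce520c206f223f6f7b48fcc52f92dad")
	// Second process: dump each object named on its stdin.
	catFile := exec.Command("git", "cat-file", "--batch")

	pipe, err := revList.StdoutPipe()
	if err != nil {
		panic(err)
	}
	catFile.Stdin = pipe
	out, err := catFile.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := revList.Start(); err != nil {
		panic(err)
	}
	if err := catFile.Start(); err != nil {
		panic(err)
	}

	// Count the bytes instead of writing a 1.4 GB file to disk.
	n, _ := io.Copy(io.Discard, out)
	catFile.Wait()
	revList.Wait()
	fmt.Printf("cat-file emitted %d bytes\n", n)
}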

zeripath commented 3 years ago

@tsowa unfortunately yes, but it should be relatively fast - the issue is that the structure of some repos will actually require that million rows to be checked more than a few times. Determining which commit a file relates to is not a simple task in git - and although there's a commit-graph, we don't have a good way of querying it.

(It shouldn't take 15s to pipe those two commands together - you're slowing things down by allocating file space - you should pipe the output to /dev/null instead.)

There are a few more improvements that can be made to that function - for a start, it is not optimised for our collapsing of directories containing a single document - and writing a commit-graph reader would be part of that.

The gogit backend does have a commit-graph reader, but it is not at all frugal with memory. I need to spend some time writing a reader that is much more frugal and stream-like, but I haven't had the time. (See the technical docs: https://github.com/git/git/blob/master/Documentation/technical/commit-graph.txt)
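
To sketch what such a reader involves (hypothetical code following the format document above - not the gogit implementation): the file starts with a small header followed by a chunk lookup table:

package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open(".git/objects/info/commit-graph")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Header: 4-byte signature "CGPH", 1-byte version, 1-byte hash version,
	// 1-byte chunk count, 1-byte number of base graphs.
	var hdr [8]byte
	if _, err := io.ReadFull(f, hdr[:]); err != nil {
		panic(err)
	}
	if string(hdr[:4]) != "CGPH" {
		panic("not a commit-graph file")
	}
	chunks := int(hdr[6])
	fmt.Printf("version=%d hash=%d chunks=%d\n", hdr[4], hdr[5], chunks)

	// Chunk lookup table: (chunks+1) entries of 4-byte chunk ID plus 8-byte
	// big-endian file offset; the extra entry marks the end of the last chunk.
	for i := 0; i <= chunks; i++ {
		var ent [12]byte
		if _, err := io.ReadFull(f, ent[:]); err != nil {
			panic(err)
		}
		fmt.Printf("chunk %q at offset %d\n", ent[:4], binary.BigEndian.Uint64(ent[4:]))
	}
	// A real reader would then use the OIDF/OIDL chunks to binary-search
	// commit IDs and CDAT to get each commit's tree, parents and generation.
}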

In the end though, we need to move the rendering of last-commit info out of repository browsing and into an AJAX call. Again, something I haven't had time to do.

zeripath commented 3 years ago

One question - have you disabled the commit cache? If so please re-enable it.

tsowa commented 3 years ago

It was enabled by default, but the 'adapter' option was set to 'memory'. Now I have installed memcached and changed the adapter to 'memcache', and the difference is visible.

Opening https://gitea.ttmath.org/FreeBSD/ports for the first time took 79766ms and for the second time only 3063ms. Opening https://gitea.ttmath.org/FreeBSD/ports/src/branch/main/audio for the first time 141221ms and later 37205ms.

But I see that you are spawning a lot of git processes, so I created a small git wrapper:

// Wrapper that logs every git invocation, then hands control to the real binary.
#include <unistd.h>
#include <fstream>

int main(int argc, char * argv[], char * envp[])
{
    // Append the full argument list of this invocation to the log file.
    std::ofstream file("/home/tomek/git.log", std::ios_base::out | std::ios_base::app);

    if( file )
    {
        file << "git ";

        for(size_t i=0 ; argv[i] ; ++i)
        {
            file << argv[i] << " ";
        }

        file << std::endl;
        file.close();
    }

    // Replace this process with the original (renamed) git binary.
    return execve("/usr/local/bin/git.org", argv, envp);
}

I moved the original /usr/local/bin/git to /usr/local/bin/git.org and compiled the above program as /usr/local/bin/git. It gives me a git.log with all git operations, and I can see that you sometimes call the git binary over 300 times in one request:

cat ~/git.log | wc -l
     335

So it cannot be fast; this reminds me of the old days when we were using CGI scripts. Is there a reason you call git directly instead of using a git library such as libgit2?

lunny commented 3 years ago

Could you also count which git commands Gitea invoked in these 335 commands?

This is because, when browsing, Gitea gets the last commit message for every dir/file in the UI. In v1.13.0 we used go-git, which is a pure Go git library; for v1.14.x we have two build variants, because the library has some memory problems on Windows. Maybe you could compile the go-git version yourself to check what the difference is between them.

fnetX commented 3 years ago

Hello there, looking at this from Codeberg's perspective (issue)

As you can see, we're also suffering from the slow repository browsing, which affects the overall performance of our machine. While setting up a Redis cache works well for us, we would like to improve the initial generation on cache misses too. Today we tried to find some more information about the bottleneck; I hope this is useful for you:

We suspect especially this command, where each folder is checked for the latest commit, executing /usr/bin/git -c credential.helper= -c protocol.version=2 -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= rev-list --format=%T <commit>, which has terrible performance (it loads the entire commit history).

The idea makes sense to us: get all commits and check whether the folder or file was touched. But the logic without gogit doesn't appear to stop once all the necessary information has been loaded; it keeps serving up the entire history, even if every file and subfolder in a folder has already been hit by a recent commit. While we don't completely understand the gogit logic yet, it appears to be a little smarter here, looking only as far back in history as necessary to retrieve the information it wants.

We assume the process should be stopped early, once all the information has been provided.
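
Roughly what we mean, as a hypothetical Go sketch (commitTouches is a stub standing in for the real tree comparison, and the entry names are made up):

package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

// commitTouches stands in for the real work: comparing this commit's tree
// entries against its parent's to see whether the path changed.
func commitTouches(revListLine, path string) bool {
	return false // stub
}

func main() {
	// Directory entries still waiting for their "last commit" attribution.
	unresolved := map[string]bool{"audio": true, "archivers": true}

	cmd := exec.Command("git", "rev-list", "--format=%T", "HEAD")
	out, err := cmd.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	scanner := bufio.NewScanner(out)
	for scanner.Scan() {
		line := scanner.Text()
		for path := range unresolved {
			if commitTouches(line, path) {
				delete(unresolved, path)
			}
		}
		if len(unresolved) == 0 {
			// Everything attributed: kill rev-list here rather than letting
			// it stream the remaining million rows.
			cmd.Process.Kill()
			break
		}
	}
	cmd.Wait()
	fmt.Printf("%d entries left unresolved\n", len(unresolved))
}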

It looks like there's a lot that could be improved in the native git backend, and some actions will probably always be slower because they cannot directly interface with git operations (e.g. directly working on git results while they are fetched, instead of on the piped input). It might be a good idea to turn back to gogit for all systems when the memory issues are resolved, or to look for another git library that is more native and maybe faster than gogit but offers a better interface (Go bindings for libgit2?).

Please let us know if we can provide further assistance in improving this performance issue.

Some other random observations that might be interesting to you:

  • multiple requests for the same resource try to generate it concurrently on a cache miss; the operation doesn't get queued (thus, requesting the same page twice results in generating it twice until the cache is filled)
  • git processes aren't stopped when the requesting TCP connection is closed

    • if a connection times out (proxy), the process is still running in the backend
    • a user reloading the page easily doubles the resources for this operation
    • it's possible to DoS a huge server by simply spamming F5 (reload) on a page with a cache miss, at minimal cost on the attacker's side (no need to keep the connection open)
  • initial generation with this method will always be very slow if some files (e.g. a README, LICENCE, .gitignore, .gitattributes, .dockerignore, etc.) weren't touched for a long time
  • pushing to a repo invalidates the full Redis cache, even if only part of the information changed (e.g. a subfolder was updated), so active repos won't profit much from the cache
  • git commands that time out (per the timeout configured in Gitea) stay listed as running in the Gitea admin monitoring section, although the processes no longer exist on the system

zeripath commented 3 years ago

Take a look at #15891

zeripath commented 3 years ago

@fnetX thanks for your long comment.


It's worth remembering that the issue precipitating the pure git backend was memory use. I've submitted a patch to go-git which should result in much lower memory load. Until that is in and working correctly, go-git will happily load huge blobs into memory - storing them in caches even when you only want to check the size of the object. It's worth being clear that that is an absolutely intolerable situation.

Further, the issues you highlight in the last section are not new to the native git backend. They're present in the go-git backend too, just in a way you can't track. I've long advocated changing to a more GitLab-like approach for this and/or passing in request contexts to terminate things - I'm really happy to work on this - but I haven't had a chance, and, to be honest, none of you are paying me.


Now going on to the get last commit algorithm.

All algorithms balance memory against time. The current algorithm is highly optimised to minimise memory use. If we are happy to use more memory, that can be improved.

We suspect especially this command, where each folder is checked for the latest commit, executing /usr/bin/git -c credential.helper= -c protocol.version=2 -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= rev-list --format=%T <commit>, which has terrible performance (it loads the entire commit history).

The idea makes sense to us: get all commits and check whether the folder or file was touched. But the logic without gogit doesn't appear to stop once all the necessary information has been loaded; it keeps serving up the entire history, even if every file and subfolder in a folder has already been hit by a recent commit. While we don't completely understand the gogit logic yet, it appears to be a little smarter here, looking only as far back in history as necessary to retrieve the information it wants.

The length of time the rev-list process runs is a bit of a distractor. Yes, the go-git process can stop once it's finished looking at all the parents and the paths, but it's a question of the memory and time spent tracking the parents. git rev-list avoids tracking those parents - and grabbing the root tree saves a lot of time - but we could add %P to the format to allow tracking of parents, and could allow termination once all appropriate parents are determined. I just don't think it's the primary cause of the delays.

The greatest speed-up in #15891 is actually preemptively passing the next tree ID to the git cat-file process as soon as we know what it's going to be. A large proportion of time appears to be spent waiting for Go to fill the read buffer from the other process. This is where the go-git algorithm can be quicker, as it avoids that by reading files in directly.
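
A sketch of that pipelining idea (hypothetical, not the actual #15891 code; the tree names are placeholders): write the request for the next tree before reading the response for the current one, so git is never idle:

package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
)

// readObject consumes one `cat-file --batch` response: a header line
// "<sha> <type> <size>", then <size> bytes of content plus a trailing LF.
func readObject(rd *bufio.Reader) (string, error) {
	header, err := rd.ReadString('\n')
	if err != nil {
		return "", err
	}
	var sha, typ string
	var size int64
	if _, err := fmt.Sscanf(header, "%s %s %d", &sha, &typ, &size); err != nil {
		return "", err
	}
	_, err = io.CopyN(io.Discard, rd, size+1) // content + trailing newline
	return header, err
}

func main() {
	cmd := exec.Command("git", "cat-file", "--batch")
	in, _ := cmd.StdinPipe()
	out, _ := cmd.StdoutPipe()
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	rd := bufio.NewReader(out)

	treeIDs := []string{"HEAD^{tree}", "HEAD~1^{tree}"} // placeholder names

	// The request for tree i+1 goes out before the response for tree i is
	// read, so git can produce the next answer while we parse this one.
	fmt.Fprintln(in, treeIDs[0])
	for i := range treeIDs {
		if i+1 < len(treeIDs) {
			fmt.Fprintln(in, treeIDs[i+1])
		}
		hdr, err := readObject(rd)
		if err != nil {
			break
		}
		fmt.Print(hdr)
	}
	in.Close()
	cmd.Wait()
}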

Some other random observations that might be interesting to you:

  • multiple requests for the same resource try to generate it concurrently on a cache miss; the operation doesn't get queued (thus, requesting the same page twice results in generating it twice until the cache is filled)
  • git processes aren't stopped when the requesting TCP connection is closed

    • if a connection times out (proxy), the process is still running in the backend
    • a user reloading the page easily doubles the resources for this operation
    • it's possible to DoS a huge server by simply spamming F5 (reload) on a page with a cache miss, at minimal cost on the attacker's side (no need to keep the connection open)
  • initial generation with this method will always be very slow if some files (e.g. a README, LICENCE, .gitignore, .gitattributes, .dockerignore, etc.) weren't touched for a long time
  • pushing to a repo invalidates the full Redis cache, even if only part of the information changed (e.g. a subfolder was updated), so active repos won't profit much from the cache
  • git commands that time out (per the timeout configured in Gitea) stay listed as running in the Gitea admin monitoring section, although the processes no longer exist on the system

These are all longstanding issues and I am aware of them. I would love to spend time fixing these but I am limited in my time and availability.
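
(For the first point, a request-coalescing wrapper would stop the duplicate generation - purely a sketch of the idea using golang.org/x/sync/singleflight, not something Gitea does today; generateLastCommitInfo is a hypothetical stand-in for the expensive walk:)

package main

import (
	"fmt"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// generateLastCommitInfo stands in for the expensive history walk.
func generateLastCommitInfo(repo, ref, dir string) (interface{}, error) {
	return fmt.Sprintf("last commits for %s@%s:%s", repo, ref, dir), nil
}

// lastCommitInfo collapses concurrent identical requests: on a cache miss
// only one caller runs the walk; the rest wait for, and share, its result.
func lastCommitInfo(repo, ref, dir string) (interface{}, error) {
	v, err, _ := group.Do(repo+"|"+ref+"|"+dir, func() (interface{}, error) {
		return generateLastCommitInfo(repo, ref, dir)
	})
	return v, err
}

func main() {
	v, _ := lastCommitInfo("FreeBSD/ports", "main", "audio")
	fmt.Println(v)
}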


Honestly I wish you'd just talked to me directly. I'm always on Discord and could have told you and kept you abreast of what was going on and my progress in trying to speed this up.

zeripath commented 3 years ago

oh my - I think I know how to seriously improve this. I think I've been way too distracted by the way it was done in the go-git implementation and there's genuinely a much quicker way to do this.

fnetX commented 3 years ago

Hey, thank you very much for the explanation.

I've submitted a patch to go-git which should result in much lower memory load. Until that is in and working correctly ...

I somehow thought this was already in and just needed some further improvements, my bad.

and to be honest none of you are paying me.

Yes, mainly we can offer our thanks, as we aren't paid for any of this either. :heart: But let's see if we can figure something out.

Honestly I wish you'd just talked to me directly. I'm always on Discord

Yeah, the others told me that too. But since Discord is a proprietary app that kept crashing my computer back when I last used it, I decided against it and went for dumping our findings somewhere, hoping they would be of some use. I chose this issue over the thread on Codeberg as it seemed to fit this topic better.

I think I know how to seriously improve this

Sounds like good news. Please tell us if there's anything we can do.

zeripath commented 3 years ago

oh my - I think I know how to seriously improve this. I think I've been way too distracted by the way it was done in the go-git implementation and there's genuinely a much quicker way to do this.

Unfortunately this doesn't work.

The idea was to use git log --format=%H --raw -t --no-abbrev --reverse COMMIT_ID -- paths, but I can't come up with a way to stop it from listing the contents of the trees - meaning that it takes even longer.

If I could figure out a way to not list the contents of the trees this would be definitely faster than the go-git version.

fnetX commented 3 years ago

What about adding -n 1? Correct me if I am completely mistaken - I neither fully understand the Gitea backend yet, nor do I know how git works internally - but this seems to give you the latest commit of a path, and gives the same result as Gitea currently gives.

fnetX commented 3 years ago

Oh, you probably still want to have the full list of the folder you're looking at, just not of all the subfolders?

zeripath commented 3 years ago

Yeah - I mean, if we could just do that n times then it would be easy and fine, but it's not like that.

Also it's not quite -n1. Consider the following commit graph:

         H
       /   \
      D     E
      |     |
      C     F
      |
      B
      |
      A

If a file wibble becomes the object with SHA deadbeef at B and at E, the correct commit to report is B, not E.

So -n1 is still not right. git describe will give the correct answer but it's too slow to be run n times.

zeripath commented 3 years ago

Could you test #15891? In my limited testing this is faster for the root directory than the go-git native version. There is still a slowdown problem in the subdirectories.

zeripath commented 3 years ago

@fnetX - I've just made another improvement in #15891 which should solve the subdirectories problem.

fnetX commented 3 years ago

Thank you. We haven't yet been able to properly backport it to our fork / rebase our patches onto this pull. We'll look into it and test then.

tsowa commented 3 years ago

@zeripath Thanks, now testing bd1455aa from your repo (cache is disabled): https://giteanew.ttmath.org/FreeBSD/ports

The speed-up is visible: browsing directories is about 5 times faster than 1.14.x. Not as fast as cgit, but much better than before. Good job.

zeripath commented 3 years ago

@tsowa does cgit even attempt to provide last commit information?

zeripath commented 3 years ago

I have a backport of the latest get-lastcommit-cache performance improvements onto 1.14, if people would like it.

fnetX commented 3 years ago

We have tested the backport you provided on codeberg-test.org, and it has ~3x the load times of go-git (15 to 17 seconds for your pull vs. ~5 seconds for go-git). We're using git version 2.29.2 - do you know if a more recent version might have better performance, or whether there are other constraints that might decrease performance? It's a single-core 2GB RAM VPS.

zeripath commented 3 years ago

Well, that's interesting - my timings appear to be similar to those of go-git.

are you sure you've built from the backport-improve-get-lastcommit branch?

The version should be 1.14.2+33-g57d45e1c2 as in SHA 57d45e1c247eaafb3a3a92ab593c31356b472d6f

Do you have commit graphs enabled for your repos?

fnetX commented 3 years ago

We deployed this branch which has your commits on top of our 1.14 patches cleanly: https://codeberg.org/Codeberg/gitea/src/branch/codeberg-try-puregit-improvements (just confirmed once more that the commit matches: 1.14.2+49-g7e9e3f364)

Yes, you can browse commit graphs on Codeberg.

zeripath commented 3 years ago

Hmm... I am very confused, as this is now just as fast as gogit for me, and possibly faster in places. Tell me there's at least some improvement here for you?

I'm almost at my limit for what I can do to speed this up further. The main slowdowns in my testing were in filling the buffers between the pipes, and adjusting when the subsequent reads occur seemed to fix this for me - perhaps my processor is just fast enough that the earlier writes provide enough time to prevent the fill lock, whereas on your processor that's not quite enough. I just don't think there's any way to avoid it - I mean, we could try an os.Pipe instead of an io.Pipe? I tried an nio.Pipe but it was just as slow. We certainly can't switch to the same algorithm as the go-git variant, as that would require even more communication and waiting for the cat-file --batch pipes to respond and fill.
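
For reference, the difference between the two pipe types (a small self-contained demonstration, not Gitea code):

package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	// io.Pipe is a synchronous, unbuffered rendezvous: every Write blocks
	// until a Read on the other side consumes the data.
	pr, pw := io.Pipe()
	go func() {
		pw.Write([]byte("blocks until read\n"))
		pw.Close()
	}()
	b, _ := io.ReadAll(pr)
	fmt.Print(string(b))

	// os.Pipe is a kernel pipe with its own buffer (typically 64 KiB on
	// Linux), so a fast writer can run ahead of a slow reader.
	rf, wf, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	wf.Write([]byte("buffered by the kernel\n")) // fits in the buffer: no block
	wf.Close()
	b, _ = io.ReadAll(rf)
	fmt.Print(string(b))
}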


By commit graph I meant the git core.commitGraph functionality. It should be enabled by default, but... I've certainly seen repos on my system that don't have a commit-graph even though they would clearly benefit.


OK, I guess we're at a point of diminishing returns - I might be better off looking at solving the problems in go-git and making last-commit info stop slowing down rendering.

ashimokawa commented 3 years ago

@zeripath

I did tests with the linux kernel and nixpkgs repos after a Gitea restart with a cold cache. It seems to be a factor of 3 on both in favor of gogit. I have no idea why the optimizations do nothing on those repos. git(new) here is the backport of your optimizations.

linux    1.14 gogit     0:21
linux    1.14 git       1:03
linux    1.14 git(new)  1:01

nixpkgs  1.14 gogit     0:06
nixpkgs  1.14 git       0:17
nixpkgs  1.14 git(new)  0:17

I also backported your optimization to 1.14 before, with the same results, but I blamed my lack of understanding of the code and a bad backport.

zeripath commented 3 years ago

@ashimokawa you'd need to backport a few other PRs to see the improvement - it's not just #15891 that is needed. I'm happy to give you a link to that backport.

lunny commented 3 years ago

gogit has a commit-graph optimization, ref #7314, but of course the git version should also read the commit-graph (#7313) if that file has been updated.

If we want to continue development based on gogit, maybe we can maintain a fork in Gitea's organization if the original project cannot merge the PR quickly.

zeripath commented 3 years ago

@ashimokawa could you double-check that these repos actually have commit graphs? The basic graph would be in .git/objects/info/commit-graph. If they don't have one, git commit-graph write will create it.

I suspect we might need to do more in gitea to forcibly write these graphs. (They're of benefit even when/if we switch back to gogit.)

lunny commented 3 years ago

Yes, we can update the commit-graph file on push for big repositories. And we also need to use modules/git/repo_commitgraph.go when the build tag is not gogit - currently it is only compiled under the gogit build tag.

Update: but we also fill the last-commit cache when pushing a new commit to the default branch. Commit-graph generation should be invoked before that cache is filled.
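
A hypothetical sketch of that post-push step (the repository path is the one from earlier in this thread; `git commit-graph write --reachable` rewrites objects/info/commit-graph for all reachable commits):

package main

import (
	"log"
	"os/exec"
)

// refreshCommitGraph regenerates the commit-graph for a bare repository,
// so a subsequent last-commit-cache fill can benefit from it.
func refreshCommitGraph(repoPath string) error {
	cmd := exec.Command("git", "-C", repoPath, "commit-graph", "write", "--reachable")
	return cmd.Run()
}

func main() {
	if err := refreshCommitGraph("/var/db/gitea2/gitea-repositories/freebsd/ports.git"); err != nil {
		log.Fatal(err)
	}
}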

zeripath commented 3 years ago

@lunny It looks like I might need to spend more time deep in go-git to come up with ways to improve it and/or avoid the memory issues there. For example, even if the repo-level interface can't help but read objects entirely into memory when stat-ing them, we could crawl packs instead of repos to read headers and find object sizes, at least allowing us to sidestep that issue.

Similarly, at least adding some control over the size of the caches could be useful.

(It'll need help to prep it for sha256 anyway I suspect.)


A hybrid approach of using gogit just for this last-commit stuff might be reasonable too. That has fewer of the memory worries - although it's still potentially considerable: every commit and its full tree (though fortunately not the blobs, in this case) that is crawled gets stored completely in the unlimited cache.


At the Gitea level, I think we need to make rendering less dependent on getting these results - e.g. GitLab's deferred results here. So I'm gonna look at that next. Passing in a cancelable context is the first step for this - the next is making a unique queue for things.
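
Something like the following hypothetical endpoint shape (a sketch only - the route, names and JSON fields are made up): the tree page renders immediately, and rows are filled in by a later request:

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// lastCommitRow is a hypothetical shape for one tree-view table row.
type lastCommitRow struct {
	Entry   string `json:"entry"`
	SHA     string `json:"sha"`
	Message string `json:"message"`
}

func main() {
	// Deferred last-commit info: the browser asks for it after page load.
	http.HandleFunc("/api/lastcommits", func(w http.ResponseWriter, r *http.Request) {
		dir := r.URL.Query().Get("dir")
		rows := []lastCommitRow{
			{Entry: dir + "/example", SHA: "deadbeef", Message: "stub data"},
		}
		json.NewEncoder(w).Encode(rows)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}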


In terms of the current native approach - although I had a Go commit-graph reader written and set up, in my testing the slowdown was still in getting data back from the cat-file --batch, again pointing to issues filling the buffers. It could be that os.Pipe is better here, or that some approach without a bufio.Reader is needed. (I'll look at contributing the split-graph stuff back up to gogit, though.)

To be honest, though, I think I'm hitting a bit of a wall here and might be at the limit of what can be done using native git.

fnetX commented 3 years ago

Commit graph:

root@codeberg-test:/data/git/gitea-repositories/bigrepos# ls nixpkgs.git/objects/info/
commit-graph  packs         
root@codeberg-test:/data/git/gitea-repositories/bigrepos# ls linux.git/objects/info/
root@codeberg-test:/data/git/gitea-repositories/bigrepos# ls freebsd-src.git/objects/info/
packs
root@codeberg-test:/data/git/gitea-repositories/bigrepos# 

The graph is present for nixpkgs, but not for the other two testing repos. But it doesn't seem to do much good here, or at least nothing noticeable.

you'd need to backport a few other PRs to see the improvement

I took the branch you sent me, pushed it to Codeberg, and ashimokawa cherry-picked the last four commits onto our branch, as the rest appeared to be plain Gitea 1.14 - hope this was okay? We can test exactly your branch, without the Codeberg patches, once more, but we didn't touch the backend (at least not knowingly).

lunny commented 3 years ago

Have you enabled last_commit_cache? I think that will resolve all non-first-visit problems.

lunny commented 3 years ago

I sent a PR to add a new cache provider, ledis, which is a local disk cache. That means it will not use much memory but will still accelerate non-first homepage visits for big repositories. Ref #16035.

That PR is a separate line of effort and will not affect the above discussion.

fnetX commented 3 years ago

Have you enabled last_commit_cache?

Sometimes disabled in the experiments, but generally yes. We are also looking into rolling out Redis on prod. We are still interested in improving the initial generation of the last commits, since even with memory caching enabled, the generation slowed down the whole instance and left it unresponsive at times. And since the cache is purged after every push, and is also not present for all the subfolders, branches, etc., it seems important to us to speed up its generation.

ashimokawa commented 3 years ago

@zeripath

This is what I used for the test - the four commits added by you on top of 1.14 (Codeberg branch):

https://codeberg.org/Codeberg/gitea/src/branch/codeberg-try-puregit-improvements

If there is more to cherry-pick, I am happy to re-run the "benchmarks" ASAP.

zeripath commented 3 years ago

OK, thinking on it, I think the only answer is to use repeated calls to git log -n1 once the number of commits reaches some high level.

If I couple that with the (in progress) deferred commit info generation pr (https://github.com/zeripath/gitea/tree/defer-last-commit-info) then we'll have a workable low memory option.

Yes, this will mean that the two backends can give slightly different results - but it's ultimately better than the current status.
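
As a sketch (hypothetical - the entry names are made up), the fallback would look like:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	commit := "HEAD"
	entries := []string{"audio", "archivers"} // directory entries (hypothetical)

	// One `git log` per entry: n cheap calls instead of one huge stream.
	// (As discussed above, history simplification means this can report a
	// different commit than the tree-walk on some merge topologies.)
	for _, entry := range entries {
		out, err := exec.Command("git", "log", "-1", "--format=%H %s",
			commit, "--", entry).Output()
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", entry, strings.TrimSpace(string(out)))
	}
}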

zeripath commented 3 years ago

OK - here's one more attempt at improving the original algorithm: #16042

We do some slight parent tracking to assert whether we are in a single-parent branch - and whether we can use path tracking in git rev-list to do some history simplification.
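
Hypothetically, the resulting invocation might look like this in Go - %P exposes each commit's parents so the reader can spot single-parent stretches, and the pathspec lets git simplify history itself (a sketch only, not the actual #16042 code):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Ask rev-list for the root tree and the parents of each commit,
	// path-limited so git itself can simplify history.
	out, err := exec.Command("git", "rev-list", "--format=%T %P",
		"HEAD", "--", "audio").Output()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes of simplified history\n", len(out))
}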

ashimokawa commented 3 years ago

@zeripath

Thanks! Right now it would be much easier if we had a backport.

I will see if we can somehow test Gitea main with this and get a proper comparison.

fnetX commented 3 years ago

What about copying the testing database into one for Gitea 1.14 and one for 1.15 on codeberg-test.org? That way we can more easily test these upstream experiments without always asking for a backport, and simply switch back to the 1.14 database for staging tests until we have this set up on the new server?

@zeripath the reasons for not using git log -n1 were the repeated calls, and that it might be a bit flaky with multi-parent commits (e.g. merges)?

zeripath commented 3 years ago

@ashimokawa go to my private gitea - branch backport-16042

@fnetX as far as I understand, git log -n1 won't necessarily give the correct answer, plus it's n calls.

ashimokawa commented 3 years ago

@zeripath Hmm, I cannot find the branch you mention...

zeripath commented 3 years ago

I've pushed it up to GitHub now - it was on my private Gitea that I gave the details to fnetX by email, but it's up at: https://github.com/zeripath/gitea/tree/backport-16042

ashimokawa commented 3 years ago

@zeripath

This one leads to an error 500:

 routers/repo/view.go:149:renderDirectory() [E] GetCommitsInfo: strconv.ParseInt: parsing ".gitignore\x00\xb1f\xa7\x8d}p㲸\xf3\x87\xac:x(ޕA\x12L100644 .version\x00b\xc3k\
zeripath commented 3 years ago

Ugh, that's what I get for not testing the backport. Currently hacking on some code in go-git - give me a few more minutes.

zeripath commented 3 years ago

OK done.

ashimokawa commented 3 years ago

@zeripath

Big improvement. Browsing nixpkgs from our slow test instance:

pure git:       17.6s
go-git:          7.7s
your backport:   6.7s

zeripath commented 3 years ago

Cool. Glad to see I've finally got somewhere!

zeripath commented 3 years ago

And I think I've also finally got a PR to go-git that will prevent it from reading huge things into memory: go-git/go-git#330

ashimokawa commented 3 years ago

@zeripath

Yes, thank you! This really seems good so far! Of course we have to do more tests before we can deploy it. Do you have any idea what we should test besides browsing a repo?