ianw closed this issue 4 years ago.
We need a more friendly log on that error.
I'm going to make an educated guess:
> I'll try to decode the log properly at some point - it's just ASCII bytes

If you could do that and put it up, that would be great, and I'll see if I can put up a PR to make the log more reasonable. (It would be a single-line PR.)
> I'll try to decode the log properly at some point - it's just ASCII bytes - if you could do that and put it up that would be great
Ha, OK that's a fun one. It works out to
ZmF0YWw6IHRoZSByZW1vdGUgZW5kIGh1bmcgdXAgdW5leHBlY3RlZGx5Cg==
which is base64 decoded
fatal: the remote end hung up unexpectedly
which is indeed what I saw from the strace. So that confirms that :) But it would be great if it correlated which repo/request it was coming from, as I still have no idea.
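For reference, the decode is a one-liner (the string is the one quoted above):

# decode the bytes captured from the gitea log
echo 'ZmF0YWw6IHRoZSByZW1vdGUgZW5kIGh1bmcgdXAgdW5leHBlY3RlZGx5Cg==' | base64 -d
# -> fatal: the remote end hung up unexpectedly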
I don't think any of the things you've suggested have changed at all. This is part of an 8-node cluster fronted by haproxy, and the other nodes don't seem to be exhibiting this issue. I'm really not sure when it started, which I know isn't helpful :(
This is a trace of the git upload-pack process that seems to die and cause this issue: 16131.trace.txt
I'm seeing the issue that @ianw is describing. We have a local repo that passes git fsck --full without errors. Talking directly (not through haproxy) to multiple gitea servers shows that upload-pack seems to block forever. I'm using git version 2.21.0 from Fedora 30. I'm happy to run whatever debugging is needed to diagnose this.
Maybe some git hook is taking too long? Git would abort on the client side, and the server would show that error once the hook finishes.
@guillep2k I don't think so; I'm not aware of any hooks, really, and you can see in the trace I put up earlier that it happens very quickly?
edit: looking again, there are no timestamps, so this is not at all obvious :) but it's not stalling, afaict
Well, the remote end (i.e. the client) seems to be the one closing the connection, so if it's not a timeout, something in the response must be upsetting it. Perhaps the client has the problem? I imagine the following test (see the sketch below): clone the repo on another machine (B) and add (B) as a remote for (A); then push your changes from (A) to (B) and see if that succeeds; after that you can also try pushing from (B) to Gitea. Otherwise it's worth looking for more info on the client side to see what's making it abort the connection.
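A rough sketch of that test; the URL, host name, paths and branch below are placeholders, so adjust them to your setup:

# on machine (B): make a bare clone of the problem repo
git clone --bare https://gitea01.opendev.org/openstack/nova.git /srv/test/nova.git

# on machine (A): add (B) as an extra remote of the existing clone and push to it
git remote add machineB ssh://user@machineB/srv/test/nova.git
git push machineB master

# if that works, try pushing the same branch from (B) back to gitea
# (run on machine B, inside /srv/test/nova.git)
git push https://gitea01.opendev.org/openstack/nova.git master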
The client still has an open socket to the gitea server:
[tony@thor ~]$ lsof -np 5978
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
git 5978 tony cwd DIR 253,3 4096 10365512 /home/tony/projects/OSF/openstack/nova_bad
git 5978 tony rtd DIR 253,1 4096 2 /
git 5978 tony txt REG 253,1 3430384 2011608 /usr/bin/git
git 5978 tony mem REG 253,3 133470802 10093721 /home/tony/projects/OSF/openstack/nova_bad/.git/objects/pack/pack-2650373f98951c72fb12e7d0a68640cd8d7fa76a.pack
git 5978 tony mem REG 253,3 681879 10093737 /home/tony/projects/OSF/openstack/nova_bad/.git/packed-refs
git 5978 tony mem REG 253,3 18517192 10093724 /home/tony/projects/OSF/openstack/nova_bad/.git/objects/pack/pack-2650373f98951c72fb12e7d0a68640cd8d7fa76a.idx
git 5978 tony mem REG 253,1 217750512 1966387 /usr/lib/locale/locale-archive
git 5978 tony mem REG 253,1 6697832 1972738 /usr/lib64/libc-2.29.so
git 5978 tony mem REG 253,1 149504 1973332 /usr/lib64/librt-2.29.so
git 5978 tony mem REG 253,1 823192 1972989 /usr/lib64/libpthread-2.29.so
git 5978 tony mem REG 253,1 131200 1966784 /usr/lib64/libz.so.1.2.11
git 5978 tony mem REG 253,1 630304 1977533 /usr/lib64/libpcre2-8.so.0.8.0
git 5978 tony mem REG 253,1 436736 1972262 /usr/lib64/ld-2.29.so
git 5978 tony 0r FIFO 0,13 0t0 12430326 pipe
git 5978 tony 1w FIFO 0,13 0t0 12430327 pipe
git 5978 tony 2u CHR 136,16 0t0 19 /dev/pts/16
git 5978 tony 3u IPv4 12430315 0t0 TCP 192.168.1.14:41436->38.108.68.172:hbci (ESTABLISHED)
The last thing git (with GIT_CURL_VERBOSE) prints is:
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/x-git-upload-pack-result
< Set-Cookie: lang=en-US; Path=/; Max-Age=2147483647
< Set-Cookie: i_like_gitea=198158244041ff80; Path=/; HttpOnly
< Set-Cookie: _csrf=SKqrpPpbdfEPlzwfA3rP7JNLK4E6MTU3NDEyNTU4NTk0ODA1Njk1MA%3D%3D; Path=/; Expires=Wed, 20 Nov 2019 01:06:25 GMT; HttpOnly
< X-Frame-Options: SAMEORIGIN
< Date: Tue, 19 Nov 2019 01:06:26 GMT
< Transfer-Encoding: chunked
<
* Connection #0 to host gitea01.opendev.org left intact
So AFAICT the client isn't dropping the connection; an strace on the client doesn't show any calls to close() on that fd.
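A check along these lines should confirm that; the flags below are just one way to watch the stuck client process:

# attach to the hung client and watch for it closing or shutting down sockets
strace -f -e trace=close,shutdown -p 5978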
OK, but you can see that:
12027 execve("/usr/bin/git", ["/usr/bin/git", "upload-pack", "--stateless-rpc", "/data/git/repositories/x/docker-machine-openstack.git"], 0xc009602190 /* 8 vars */ <unfinished ...>
...
12027 write(2, "fatal: the remote end hung up unexpectedly\n", 43 <unfinished ...>
The execve means that the process will transfer into /usr/bin/git, so anything from there on is git's doing. And that process (git) is the one printing the "fatal: ..." error. I cannot, however, pinpoint from your strace which system call is returning an error (they all seem to return normal >= 0 values).
Maybe your repo is not on a local file system?
@guillep2k all the repos are local; gitea is running in a container and has access to the repos that way. The only thing pushing to gitea is gerrit as it replicates; otherwise it's read traffic. The interesting thing is that there's a TCP load balancer (haproxy) in front of the gitea nodes. However, haproxy does not appear to be showing any sort of errors.
I don't understand: why are there two instances of gitea? A load balancer suggests a single file system shared between the instances, and that should give you all kinds of problems (not necessarily this one, however). Anyway, it doesn't seem to be Gitea but git that's spitting out the errors.
As a last measure, I'd consider upgrading to 1.9.6. There has been some rework in the way Gitea handles git resources, specifically to avoid keeping open handles (#8901, #8958). This may be related to your issue.
@guillep2k there's no shared file system in this case; each gitea node is on its own server. gerrit's repos are the one true source, and it replicates changes out to the gitea nodes -- it is the only writer. I agree that it looks like git is dying here, but it's very odd that this just started happening.
We are seeing this across multiple servers, however. It's not just one.
I agree we're not being super helpful here as I don't know exactly when this started. I've tried to dump the connection, but it's all just encrypted binary stuff which will take a whole heap of effort to decode in a tracer. But I didn't see anything obvious.
I think we'll have to try upgrading at this point, and if we see the issue continue with 1.9.6 dig deeper.
@ianw and team upgraded to 1.9.6 and the problem persists.
With the repo below we've recreated the "hang" on a number of Linux distros and git versions.
Grab the repo at https://ozlabs.org/~tony/nova_bad.tar.gz, unpack it, and then run something like:
GIT_TRACE=2 GIT_CURL_VERBOSE=2 GIT_TRACE_PERFORMANCE=2 GIT_TRACE_PACK_ACCESS=2 GIT_TRACE_PACKET=2 GIT_TRACE_PACKFILE=2 GIT_TRACE_SETUP=2 GIT_TRACE_SHALLOW=2 git -c http.sslVerify=false remote -v update
It's entirely possible that there is a problem with that repo that git fsck doesn't detect, but it seems like the client connects to gitea, which exec's git; git dies, but gitea doesn't notice and doesn't clean up the client connection.
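For what it's worth, one way to take gitea out of the loop and exercise upload-pack on the server-side repo directly might be something like this (untested sketch; the path is copied from the strace earlier in the thread, so substitute your own repo):

# on the gitea host: clone the bare repo over the file:// transport,
# which drives the same upload-pack machinery as an HTTP fetch
git clone file:///data/git/repositories/x/docker-machine-openstack.git /tmp/direct-clone

# or just ask upload-pack for its ref advertisement
git upload-pack --advertise-refs /data/git/repositories/x/docker-machine-openstack.git | head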
Any ideas of what else to try? If not, I'll try using a local gitea server at some point next week.
I suspect that I have the same problem: an existing gitea repo was working fine until some commit. After that, pushing hangs and the git remote-http process hogs a complete CPU core; after 8 minutes a timeout occurs. I deleted the repo and tried to do a clean push, same problem. Tried an aggressive git gc and a full fsck, which did not help. Pushing the very same repo to gitlab works without a problem, though it is pushed using ssh instead of https. Any way to triage this? (I run gitea with sqlite in kubernetes on ARM64 behind traefik as reverse proxy and another traefik as ingress, but I doubt that this has to do with the issue. I think it would help to know which commit triggered the problem, but I have no idea how to search for it.)
@tbreeds @everflux Is that repository a public one? If you can share it or find a similar project, that would help to find the problem.
Unfortunately not - I know it is frustrating. But if there is a script or something to triage the exact commit I am happy to help out. (Something like history replay and try-push each commit?)
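Something along those lines might work (a rough sketch; "gitea" is a placeholder remote name and "master" a placeholder branch):

# push the history one commit at a time, oldest first, and stop at the first failure
for c in $(git rev-list --reverse master); do
    git push gitea "$c:refs/heads/master" || { echo "push failed at $c"; break; }
done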
> @tbreeds @everflux Is that repository a public one? If you can share it or find a similar project, that would help to find the problem.
The repo I'm using is public and I have a tarball of my clone that seems to reliably reproduce the git issue on a variety of clients/distributions
Can you share the project for @lunny to be able to triage this?
@everflux @lunny I'm not sure exactly what you need me to share?
I assume it would be helpful if you could share your repo and identify the commit from which point the push to gitea fails.
Hi peeps! So I finally got round to looking at this.
First of all, I can't replicate this on master with sqlite. I do suspect, however, that this might be another example of the TCP port exhaustion issue, as I see that you're using MySQL.
You will need to set MAX_OPEN_CONNS, MAX_IDLE_CONNS and CONN_MAX_LIFETIME. In particular you should set MAX_OPEN_CONNS = MAX_IDLE_CONNS.
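For reference, those settings live in the [database] section of app.ini; the values below are only an illustration, not a recommendation:

[database]
; existing DB_TYPE / HOST / NAME / USER settings stay as they are
MAX_OPEN_CONNS    = 5
MAX_IDLE_CONNS    = 5
CONN_MAX_LIFETIME = 3m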
For me it seems that the issue was fixed by forcing the permissions on all files in the workdir to be reset. It was not sqlite but the git repository itself, it seems.
@everflux if the permissions on the git repository were messed up I could imagine that git would fail in this way.
Description:
One of our gitea hosts has started showing a constant error
The numbers are always the same, and it doesn't tell me what repos it is associated with. A slightly longer extract:
I have deleted all the repos and recreated them (using "Reinitialize all missing Git repositories for which records exist"), but it is still happening.
I then started to strace the gitea process to see what was going on. I managed to catch what I think is the cause of this error:
This command matches what would be running at https://github.com/go-gitea/gitea/blob/master/routers/repo/http.go#L418 and has the 128 exit code, so I think it matches up. It would be really helpful if this were expressed in the error message.
I'm investigating the host to see if there's any networking type issues that might be affecting clients, although it doesn't look like it to me.
One user who seemed to hit this said that their clone was just stuck for hours; they didn't get a specific error message. They were using git 2.21.0.