ianw closed this issue 4 years ago.
We need a more friendly log on that error.
I'm going to make an educated guess:
> I'll try to decode the log properly at some point - it's just ASCII bytes

If you could do that and put it up, that would be great, and I'll see if I can put up a PR to make the log more reasonable. (It would be a single-line PR.)
> I'll try to decode the log properly at some point - it's just ASCII bytes - if you could do that and put it up that would be great
Ha, OK that's a fun one. It works out to
ZmF0YWw6IHRoZSByZW1vdGUgZW5kIGh1bmcgdXAgdW5leHBlY3RlZGx5Cg==
which is base64 decoded
fatal: the remote end hung up unexpectedly
which is indeed what I saw from the strace. So that confirms that :) But it would be great if it correlated which repo/request it was coming from, as I still have no idea.
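For reference, the decode is a one-liner (the string is the one quoted above):

# decode the bytes captured from the gitea log
echo 'ZmF0YWw6IHRoZSByZW1vdGUgZW5kIGh1bmcgdXAgdW5leHBlY3RlZGx5Cg==' | base64 -d
# -> fatal: the remote end hung up unexpectedly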
I don't think any of the things you've suggested have changed at all. This is part of an 8-node cluster fronted by haproxy, and the other nodes don't seem to be exhibiting this issue. I'm really not sure when it started, which I know isn't helpful :(
This is a trace of the git upload-pack process that seems to die and cause this issue: 16131.trace.txt
I'm seeing the issue that @ianw is describing. We have a local repo that passes git fsck --full without errors. Talking directly (not through haproxy) to multiple gitea servers shows that upload-pack seems to block forever. I'm using git version 2.21.0 from Fedora 30. I'm happy to run whatever debugging is needed to diagnose this.
Maybe some git hook is taking too long? Git would abort on the client side, and the server would show that error once the hook finishes.
@guillep2k I don't think so; I'm not aware of any hooks, really, and you can see in the trace I put up earlier that it happens very quickly?
edit: looking again, there are no timestamps, so this is not at all obvious :) but it's not stalling, afaict
Well, the remote end (i.e. the client) seems to be the one closing the connection, so if it's not a timeout, something in the response must be upsetting it. Perhaps the client has the problem? I imagine the following test (see the sketch below): clone the repo on another machine (B) and add (B) as a remote for (A); then push your changes from (A) to (B) and see if that succeeds; after that you can also try pushing from (B) to Gitea. Otherwise it's worth looking for more info on the client side to see what's making it abort the connection.
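A rough sketch of that test; the URL, host name, paths and branch below are placeholders, so adjust them to your setup:

# on machine (B): make a bare clone of the problem repo
git clone --bare https://gitea01.opendev.org/openstack/nova.git /srv/test/nova.git

# on machine (A): add (B) as an extra remote of the existing clone and push to it
git remote add machineB ssh://user@machineB/srv/test/nova.git
git push machineB master

# if that works, try pushing the same branch from (B) back to gitea
# (run on machine B, inside /srv/test/nova.git)
git push https://gitea01.opendev.org/openstack/nova.git master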
The client still has an open socket to the gitea server:
[tony@thor ~]$ lsof -np 5978
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
git 5978 tony cwd DIR 253,3 4096 10365512 /home/tony/projects/OSF/openstack/nova_bad
git 5978 tony rtd DIR 253,1 4096 2 /
git 5978 tony txt REG 253,1 3430384 2011608 /usr/bin/git
git 5978 tony mem REG 253,3 133470802 10093721 /home/tony/projects/OSF/openstack/nova_bad/.git/objects/pack/pack-2650373f98951c72fb12e7d0a68640cd8d7fa76a.pack
git 5978 tony mem REG 253,3 681879 10093737 /home/tony/projects/OSF/openstack/nova_bad/.git/packed-refs
git 5978 tony mem REG 253,3 18517192 10093724 /home/tony/projects/OSF/openstack/nova_bad/.git/objects/pack/pack-2650373f98951c72fb12e7d0a68640cd8d7fa76a.idx
git 5978 tony mem REG 253,1 217750512 1966387 /usr/lib/locale/locale-archive
git 5978 tony mem REG 253,1 6697832 1972738 /usr/lib64/libc-2.29.so
git 5978 tony mem REG 253,1 149504 1973332 /usr/lib64/librt-2.29.so
git 5978 tony mem REG 253,1 823192 1972989 /usr/lib64/libpthread-2.29.so
git 5978 tony mem REG 253,1 131200 1966784 /usr/lib64/libz.so.1.2.11
git 5978 tony mem REG 253,1 630304 1977533 /usr/lib64/libpcre2-8.so.0.8.0
git 5978 tony mem REG 253,1 436736 1972262 /usr/lib64/ld-2.29.so
git 5978 tony 0r FIFO 0,13 0t0 12430326 pipe
git 5978 tony 1w FIFO 0,13 0t0 12430327 pipe
git 5978 tony 2u CHR 136,16 0t0 19 /dev/pts/16
git 5978 tony 3u IPv4 12430315 0t0 TCP 192.168.1.14:41436->38.108.68.172:hbci (ESTABLISHED)
The last thing git (with GIT_CURL_VERBOSE) prints is:
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/x-git-upload-pack-result
< Set-Cookie: lang=en-US; Path=/; Max-Age=2147483647
< Set-Cookie: i_like_gitea=198158244041ff80; Path=/; HttpOnly
< Set-Cookie: _csrf=SKqrpPpbdfEPlzwfA3rP7JNLK4E6MTU3NDEyNTU4NTk0ODA1Njk1MA%3D%3D; Path=/; Expires=Wed, 20 Nov 2019 01:06:25 GMT; HttpOnly
< X-Frame-Options: SAMEORIGIN
< Date: Tue, 19 Nov 2019 01:06:26 GMT
< Transfer-Encoding: chunked
<
* Connection #0 to host gitea01.opendev.org left intact
So AFAICT the client isn't dropping the connection; an strace on the client doesn't show any calls to close() on that fd.
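A check along these lines should confirm that; the flags below are just one way to watch the stuck client process:

# attach to the hung client and watch for it closing or shutting down sockets
strace -f -e trace=close,shutdown -p 5978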
OK, but you can see that:
12027 execve("/usr/bin/git", ["/usr/bin/git", "upload-pack", "--stateless-rpc", "/data/git/repositories/x/docker-machine-openstack.git"], 0xc009602190 /* 8 vars */ <unfinished ...>
...
12027 write(2, "fatal: the remote end hung up unexpectedly\n", 43 <unfinished ...>
The execve means that the process will transfer into /usr/bin/git, so anything from there on is git's doing. And that process (git) is the one printing the "fatal: ..." error. I cannot, however, pinpoint from your strace which system call is returning an error (they all seem to return normal >= 0 values).
Maybe your repo is not on a local file system?
@guillep2k all the repos are local; gitea is running in a container and has access to the repos that way. The only thing pushing to gitea is gerrit as it replicates; otherwise it's read traffic. The interesting thing is that there's a TCP load balancer (haproxy) in front of the gitea nodes. However, haproxy does not appear to be showing any sort of errors.
I don't understand: why are there two instances of gitea? A load balancer suggests a single file system shared between the instances, and that should give you all kinds of problems (not necessarily this one, however). Anyway, it doesn't seem to be Gitea but git that's spitting out the errors.
As a last measure, I'd consider upgrading to 1.9.6. There has been some rework in the way Gitea handles git resources, specifically to avoid keeping open handles (#8901, #8958). This may be related to your issue.
@guillep2k there's no shared file system in this case; each gitea node is on its own server. gerrit's repos are the one true source, and it replicates changes out to the gitea nodes -- it is the only writer. I agree that it looks like git is dying here, but it's very odd that this just started happening.
We are seeing this across multiple servers, however. It's not just one.
I agree we're not being super helpful here as I don't know exactly when this started. I've tried to dump the connection, but it's all just encrypted binary stuff which will take a whole heap of effort to decode in a tracer. But I didn't see anything obvious.
I think we'll have to try upgrading at this point, and if we see the issue continue with 1.9.6 dig deeper.
@ianw and team upgraded to 1.9.6 and the problem persists.
With the repo below we've recreated the "hang" on a number of Linux distros and git versions.
Grab the repo at https://ozlabs.org/~tony/nova_bad.tar.gz, unpack it, and then run something like:
GIT_TRACE=2 GIT_CURL_VERBOSE=2 GIT_TRACE_PERFORMANCE=2 GIT_TRACE_PACK_ACCESS=2 GIT_TRACE_PACKET=2 GIT_TRACE_PACKFILE=2 GIT_TRACE_SETUP=2 GIT_TRACE_SHALLOW=2 git -c http.sslVerify=false remote -v update
It's entirely possible that there is a problem with that repo that git fsck doesn't detect, but it seems like the client connects to gitea, which exec's git; git dies, but gitea doesn't notice and doesn't clean up the client connection.
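For what it's worth, one way to take gitea out of the loop and exercise upload-pack on the server-side repo directly might be something like this (untested sketch; the path is copied from the strace earlier in the thread, so substitute your own repo):

# on the gitea host: clone the bare repo over the file:// transport,
# which drives the same upload-pack machinery as an HTTP fetch
git clone file:///data/git/repositories/x/docker-machine-openstack.git /tmp/direct-clone

# or just ask upload-pack for its ref advertisement
git upload-pack --advertise-refs /data/git/repositories/x/docker-machine-openstack.git | head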
Any ideas of what else to try? If not, I'll try using a local gitea server at some point next week.
I suspect that I have the same problem: an existing gitea repo was working fine until some commit. After that, pushing hangs and the git remote-http process hogs a complete CPU core; after 8 minutes a timeout occurs. I deleted the repo and tried to do a clean push, same problem. Tried an aggressive git gc and a full fsck, which did not help. Pushing the very same repo to gitlab works without a problem, though it is pushed using ssh instead of https. Any way to triage this? (I run gitea with sqlite in kubernetes on ARM64 behind traefik as reverse proxy and another traefik as ingress, but I doubt that this has to do with the issue. I think it would help to know which commit triggered the problem, but I have no idea how to search for it.)
@tbreeds @everflux Is that repository a public one? If you can share it or find a similar project, that would help to find the problem.
Unfortunately not - I know it is frustrating. But if there is a script or something to triage the exact commit I am happy to help out. (Something like history replay and try-push each commit?)
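Something along those lines might work (a rough sketch; "gitea" is a placeholder remote name and "master" a placeholder branch):

# push the history one commit at a time, oldest first, and stop at the first failure
for c in $(git rev-list --reverse master); do
    git push gitea "$c:refs/heads/master" || { echo "push failed at $c"; break; }
done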
> @tbreeds @everflux Is that repository a public one? If you can share it or find a similar project, that would help to find the problem.
The repo I'm using is public and I have a tarball of my clone that seems to reliably reproduce the git issue on a variety of clients/distributions
Can you share the project for @lunny to be able to triage this?
@everflux @lunny I'm not sure exactly what you need me to share?
I assume it would be helpful if you could share your repo and identify the commit from which point the push to gitea fails.
Hi peeps! So I finally got round to looking at this.
First of all, I can't replicate this on master with sqlite. I do suspect, however, that this might be another example of the TCP port exhaustion issue, as I see that you're using MySQL.
You will need to set MAX_OPEN_CONNS, MAX_IDLE_CONNS and CONN_MAX_LIFETIME. In particular you should set MAX_OPEN_CONNS = MAX_IDLE_CONNS.
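For reference, those settings live in the [database] section of app.ini; the values below are only an illustration, not a recommendation:

[database]
; existing DB_TYPE / HOST / NAME / USER settings stay as they are
MAX_OPEN_CONNS    = 5
MAX_IDLE_CONNS    = 5
CONN_MAX_LIFETIME = 3m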
For me it seems that the issue was fixed by forcing the permissions on all files in the workdir to be reset. It was not sqlite but the git repository itself, it seems.
@everflux if the permissions on the git repository were messed up I could imagine that git would fail in this way.
Description:
One of our gitea hosts has started showing a constant error
The numbers are always the same, and it doesn't tell me what repos it is associated with. A slightly longer extract:
I have deleted all the repos and recreated them (using "Reinitialize all missing Git repositories for which records exist"), but it is still happening.
I then started to strace the gitea process to see what was going on. I managed to catch what I think is the cause of this error:
This command matches what would be running at https://github.com/go-gitea/gitea/blob/master/routers/repo/http.go#L418 and has the 128 exit code, so I think it matches up. It would be really helpful if this were expressed in the error message.
I'm investigating the host to see if there's any networking type issues that might be affecting clients, although it doesn't look like it to me.
One user who seemed to hit this said that their clone was just stuck for hours; they didn't get a specific error message. They were using git 2.21.0.