
Gitea Running as Pod in Kubernetes invoking OOM/bad pack header? #24005

Open wattsap opened 1 year ago

wattsap commented 1 year ago

Description

Hello,

I am seeing an interesting issue with cloning a repo from Gitea hosted on K3s. I recently migrated from a standalone Docker VM as part of a larger migration.

The repo in question is 694 MiB, and after the migration, when attempting to clone it to another VM outside the K3s cluster, I see the error below:

```
/usr/bin/git clone --origin origin 'https://user:pass@gitea.company.net/user/repo.git' /var/lib/awx/projects/_8__homelab
Cloning into '/var/lib/awx/projects/_8__homelab'...
remote: Enumerating objects: 4997, done.
fatal: the remote end hung up unexpectedly
fatal: protocol error: bad pack header
```

The pod's CPU and memory limits are pretty reasonable: `containers:`

and when the clone is attempted, I can see that it is not trying to cross those thresholds (the green line indicates the resource requests, not the resource limits):

(Screenshots: pod CPU and memory usage graphs during the clone attempt, 2023-04-08)
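For reference, one way to double-check the requests and limits that were actually applied to the running pod (the namespace and pod name below are placeholders, not taken from this setup):

```bash
# Print the resources block of every container in the running pod
# (namespace/pod names are placeholders for this environment):
kubectl -n gitea get pod gitea-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'
```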

Thinking it might be something with the ingress-nginx controller, I tested from other pods in the cluster and got the same result hitting the IP of the ClusterIP service directly, bypassing the ingress/external routing.

In the gitea pod logs, I see the following during the clone attempt:

```
2023/04/08 17:06:52 [64319e8c-24] router: slow POST /user/repo.git/git-upload-pack for 10.43.3.3:0, elapsed 3981.0ms @ repo/http.go:492(repo.ServiceUploadPack)
2023/04/08 17:06:59 [64319f33] router: completed GET / for 10.43.1.248:57104, 200 OK in 15.6ms @ web/home.go:33(web.Home)
2023/04/08 17:07:09 [64319f3d] router: completed GET / for 10.43.1.248:33618, 200 OK in 4.8ms @ web/home.go:33(web.Home)
2023/04/08 17:07:19 [64319f47] router: completed GET / for 10.43.1.248:57398, 200 OK in 54.5ms @ web/home.go:33(web.Home)
2023/04/08 17:07:29 [64319f51] router: completed GET / for 10.43.1.248:35500, 200 OK in 4.5ms @ web/home.go:33(web.Home)
2023/04/08 17:07:37 [64319f59] router: completed GET / for 10.43.3.3:0, 200 OK in 32.6ms @ web/home.go:33(web.Home)
2023/04/08 17:07:39 [64319f5b] router: completed GET / for 10.43.1.248:39398, 200 OK in 4.0ms @ web/home.go:33(web.Home)
2023/04/08 17:07:49 [64319f65] router: completed GET / for 10.43.1.248:60854, 200 OK in 4.9ms @ web/home.go:33(web.Home)
2023/04/08 17:07:50 [64319f28-3] router: completed POST /user/repo.git/git-upload-pack for 10.43.3.3:0, 200 OK in 61683.7ms @ repo/http.go:492(repo.ServiceUploadPack)
```

Based on various general googling, I have made the following changes to the config for the repo in /data/git/repositories/user/repo.git:

```
bash-5.1# cat config
[core]
        repositoryformatversion = 0
        filemode = true
        bare = true
        packedGitLimit = 256m

[pack]
        windowMemory = 100m
        packSizeLimit = 100m
        threads = "1"

[http]
        postBuffer = 200000000
bash-5.1#
```
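An alternative, untested here, is to set the same pack limits globally for the git user inside the container so they apply to every repository, not just this one; the keys are standard git config options and the values simply mirror the per-repo settings above:

```bash
# Run inside the gitea container as the git user (the user gitea shells out
# to git as); values mirror the per-repository config shown above:
git config --global core.packedGitLimit 256m
git config --global pack.windowMemory 100m
git config --global pack.packSizeLimit 100m
git config --global pack.threads 1
```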

The repo itself appears to be healthy; running git fsck on the gitea pod for this repo completes successfully:

```
bash-5.1# git fsck --full
Checking object directories: 100% (256/256), done.
Checking objects: 100% (3271/3271), done.
```

I'm not really sure where to go from here from a debugging perspective; I'm afraid I don't know enough about the git protocol in general. Is there a tunable I am missing in the gitea config?
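One possible next step for narrowing this down from the client side is to re-run the clone with git's standard trace variables, which show exactly where the transfer breaks off (the target path below is just an example):

```bash
# GIT_TRACE / GIT_TRACE_PACKET / GIT_CURL_VERBOSE are standard git debugging
# variables; they log the HTTP exchange and pack protocol to stderr:
GIT_TRACE=1 GIT_TRACE_PACKET=1 GIT_CURL_VERBOSE=1 \
  git clone https://gitea.company.net/user/repo.git /tmp/repo-debug
```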

Thank you very much for your time.

Gitea Version

1.18.1

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

2.25.1

Operating System

Kubernetes

How are you running Gitea?

Bare-metal k3s cluster, Gitea Docker image gitea/gitea:1.18.1, Postgres as the backend DB. Gitea and Postgres run as separate StatefulSets, each with PVCs created from PVs that mount NFS shares.

The previous working configuration was a single VM using docker-compose with the same gitea/postgres containers, mounting the same NFS shares as Docker named volumes.

Database

PostgreSQL

wxiaoguang commented 1 year ago

> Based on various general googling, I have made the following changes to the config for the repo in /data/git/repositories/user/repo.git (git config)

I guess it doesn't help, right?

wattsap commented 1 year ago

Prior to adding the [pack] section to the config, I was seeing signal 9 errors in the gitea server logs, which is what made me think it was OOM. After changing the config those messages are gone, which is encouraging, but on the client side the result during a clone is still the same.

wxiaoguang commented 1 year ago

I haven't tested this and have no idea how to fine-tune the config at the moment (sorry), but I'll share some of my thoughts: it seems that the git process itself causes the OOM (otherwise the Gitea process would have been killed). Gitea executes the git command to provide the repository content when cloning; if the git process triggers the OOM killer and gets killed, the client sees a broken connection / protocol error. Maybe Gitea also consumes some amount of memory, so the free memory left for git is not as much as before?
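If the theory is that a git child process is being killed, one way to confirm it (a sketch; it assumes shell access to the k3s node and a `gitea` namespace, both placeholders):

```bash
# On the k3s node running the gitea pod: the kernel logs every cgroup OOM
# kill, including child processes such as `git pack-objects`, even when the
# pod itself keeps running:
dmesg -T | grep -i -E "oom|killed process"

# Cluster events sometimes surface the same information:
kubectl get events -n gitea --sort-by=.lastTimestamp | grep -i oom
```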

wattsap commented 1 year ago

I wondered that also, but I was running top in the pod during the clone and it still had plenty of memory:

```
Mem: 16228700K used, 165240K free, 41992K shrd, 161464K buff, 13025436K cached
CPU:  52% usr   4% sys   0% nic   0% idle  42% io   0% irq   0% sirq
Load average: 1.77 0.68 0.30 4/484 122
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
  121   120 git      R    1513m   9%   1  43% /usr/libexec/git-core/git pack-objects --revs --thin --stdout --progress --delta-base-offset
   18    16 git      S     853m   5%   0   0% /usr/local/bin/gitea web
  120    18 git      S     5428   0%   1   0% /usr/bin/git -c protocol.version=2 -c credential.helper= -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= upload-pack --stateless-rpc /data/git/repositories/user/repo.git
   17    15 root     S     4632   0%   1   0% sshd: /usr/sbin/sshd -D -e [listener] 0 of 10-100 startups
   75    68 root     S     2596   0%   0   0% bash
  111   104 root     S     2592   0%   1   0% bash
```

It doesn't seem like the container OS is running out of memory.
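One caveat worth noting: `top` inside a container reports the node's memory, not the container's cgroup limit, and the cgroup numbers are what decide whether the kernel kills a process. A small sketch for checking them from inside the gitea container (paths differ between cgroup v1 and v2):

```bash
# cgroup v1 paths (limit and current usage):
cat /sys/fs/cgroup/memory/memory.limit_in_bytes 2>/dev/null
cat /sys/fs/cgroup/memory/memory.usage_in_bytes 2>/dev/null

# cgroup v2 paths (limit and current usage):
cat /sys/fs/cgroup/memory.max 2>/dev/null
cat /sys/fs/cgroup/memory.current 2>/dev/null
```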

wxiaoguang commented 1 year ago

I see. Can you try setting the kernel parameter vm.overcommit_memory=1?
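For reference, vm.overcommit_memory is a node-level sysctl rather than something the pod controls; a sketch of setting it on the k3s node:

```bash
# On the k3s node, as root. Apply immediately:
sysctl -w vm.overcommit_memory=1

# Persist across reboots:
echo "vm.overcommit_memory = 1" > /etc/sysctl.d/90-overcommit.conf
sysctl --system
```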

wattsap commented 1 year ago

I didn't set it that way, but it looks like it already is:

```
bash-5.1# cat /proc/sys/vm/overcommit_memory
1
```