gocd / gocd

GoCD - Continuous Delivery server main repository
https://www.gocd.org
Apache License 2.0
7.05k stars 971 forks source link

Error running garbage collection - the pack file might be corrupt: Unknown object type 0 #12701

Closed m5lapp closed 3 weeks ago

m5lapp commented 1 month ago
Issue Type
Summary

We have been running GoCD in production for several years and noticed recently that the Weekly garbage collection of the config repository is failing due to a corrupted pack file. This is only affecting our production instance and we are not sure when it happened/appeared; the longest our logs go back is early February and it was happening then, so at least a few months.

GoCD itself is generally running OK, though around every four days we have seeing high CPU-usage/throttling and slow-down over a period of several hours until GoCD eventually restarts. This may or may not be related to the issue with the corrupted pack file, possibly it is if the GC has not run for a long time? A restart always restores it to normal performance in these cases.

Environment

We run GoCD in containers in a Kubernetes cluster using a custom image that builds on top of gocd/gocd-server:v23.2.0. The file system where the config repository is saved is backed by NFS. Garbage collections are configured to run using go.config.repo.gc.periodic=Y, but we do not specify any of the other go.config.repo.gc.* values, so we take the default values for those.

"Version": "23.2.0 (16938-e2b2936f3b573008381a3702139ebcb4383dc0b1)"
"OS Information": {
    "OS Name": "Linux",
    "OS Version": "4.18.0-513.18.1.el8_9.x86_64",
    "System Architecture": "amd64",
    "Available Processors": 6,
    "Average System Load": 0.68
}
Basic environment details
Additional Environment Details
  "Server Health Information": {
    "Messages Count": 1,
    "Messages": [
      {
        "message": "Action required: Run \u0027git gc\u0027 on config.git repo",
        "detail": "Number of loose objects in your Configuration repository(config.git) has grown beyond the configured threshold. As the size of config repo increases, the config save operations tend to slow down drastically. It is recommended that you run \u0027git gc\u0027 from \u0027\u0026lt;go server installation directory\u0026gt;/db/config.git/\u0027 to address this problem. Go can do this automatically on a periodic basis if you enable automatic GC. \u003ca target\u003d\u0027_blank\u0027 href\u003d\u0027https://docs.gocd.org/23.2.0/advanced_usage/config_repo.html\u0027\u003eread more...\u003c/a\u003e",
        "level": "WARNING"
      }
    ]
  },
Steps to Reproduce
  1. Unknown, presumably at some point in time, something corrupted one of the .pack files in the db/config.git/.git/ directory
Expected Results

We expect the garbage collection for the config repository to run successfully according to the default cron schedule.

Actual Results

The config repository garbage collection fails and the three log messages listed further down are output.

Possible Fix

We took a copy of the Git repo and did some Googling to try and find solutions to this that others may have had as follows:

$ git verify-pack ./objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.idx
fatal: pack has bad object at offset 11239672: unknown object type 0

$ git fsck
Checking object directories: 100% (256/256), done.
error: ./objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack pack checksum mismatch
error: index CRC mismatch for object 5ac6a125860f908ba0a83d3b33344b14b454a2b8 from ./objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack at offset 11239672
error: unknown object type 0 at offset 11239672 in ./objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack
error: cannot unpack 5ac6a125860f908ba0a83d3b33344b14b454a2b8 from ./objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack at offset 11239672
Checking objects: 100% (621318/621318), done.
error: HEAD: invalid reflog entry 5ac6a125860f908ba0a83d3b33344b14b454a2b8
error: HEAD: invalid reflog entry 5ac6a125860f908ba0a83d3b33344b14b454a2b8
error: refs/heads/master: invalid reflog entry 5ac6a125860f908ba0a83d3b33344b14b454a2b8
error: refs/heads/master: invalid reflog entry 5ac6a125860f908ba0a83d3b33344b14b454a2b8

$ git gc
error: unknown object type 0 at offset 11239672 in ./objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack
error: Could not read 5ac6a125860f908ba0a83d3b33344b14b454a2b8
fatal: Failed to traverse parents of commit 17bcaf7acd76bf10a01f7896d2b379f2a2912b2f
fatal: failed to run repack

$ git repack
error: unknown object type 0 at offset 11239672 in ./objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack
error: Could not read 5ac6a125860f908ba0a83d3b33344b14b454a2b8
fatal: Failed to traverse parents of commit 17bcaf7acd76bf10a01f7896d2b379f2a2912b2f

$ git checkout -f HEAD

$ rm .git/index

$ git reset

$ git fsck --full --no-dangling
Checking object directories: 100% (256/256), done.
error: .git/objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack pack checksum mismatch
error: index CRC mismatch for object 5ac6a125860f908ba0a83d3b33344b14b454a2b8 from .git/objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack at offset 11239672
error: unknown object type 0 at offset 11239672 in .git/objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack
error: cannot unpack 5ac6a125860f908ba0a83d3b33344b14b454a2b8 from .git/objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack at offset 11239672
Checking objects: 100% (621318/621318), done.config.git on ¿ master [?] ¿ took 1m6s

$ git gc
error: unknown object type 0 at offset 11239672 in .git/objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack
error: Could not read 5ac6a125860f908ba0a83d3b33344b14b454a2b8
fatal: Failed to traverse parents of commit 17bcaf7acd76bf10a01f7896d2b379f2a2912b2f
fatal: failed to run repack

We were wondering if it would be possible to do something along the lines of deleting the .git/ directory of the config repository and then initialising a new one with git init, does that sounds reasonable?

Log snippets
2024-04-14 07:00:01,352 [goThreadPoolTaskScheduler-8] INFO  c.t.go.service.ConfigRepository - Before GC: {numberOfPackedRefs=0, numberOfPackFiles=2, sizeOfLooseObjects=11907390, numberOfLooseObjects=2486, numberOfPackedObjects=621318, numberOfBitmaps=280, numberOfLooseRefs=2, sizeOfPackedObjects=145915443}

2024-04-14 07:00:21,365 [goThreadPoolTaskScheduler-8] ERROR o.e.j.i.storage.file.PackDirectory - Exception caught while accessing pack file /go-working-dir/db/config.git/.git/objects/pack/pack-ece2c8151ee165af986a95391d2a16600eb80d3e.pack, the pack file might be corrupt. Caught 1 consecutive errors while trying to read this pack. java.io.IOException: Unknown object type 0.
jvm 1    |  at org.eclipse.jgit.internal.storage.file.Pack.load(Pack.java:862)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.Pack.get(Pack.java:273)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.PackDirectory.open(PackDirectory.java:223)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openPackedObject(ObjectDirectory.java:423)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openPackedFromSelfOrAlternate(ObjectDirectory.java:386)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openObjectWithoutRestoring(ObjectDirectory.java:376)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.ObjectDirectory.openObject(ObjectDirectory.java:361)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:140)
jvm 1    |  at org.eclipse.jgit.revwalk.RevWalk.getCachedBytes(RevWalk.java:1164)
jvm 1    |  at org.eclipse.jgit.revwalk.RevCommit.parseHeaders(RevCommit.java:151)
jvm 1    |  at org.eclipse.jgit.revwalk.PendingGenerator.next(PendingGenerator.java:122)
jvm 1    |  at org.eclipse.jgit.revwalk.RevWalk.next(RevWalk.java:625)
jvm 1    |  at org.eclipse.jgit.revwalk.ObjectWalk.next(ObjectWalk.java:365)
jvm 1    |  at org.eclipse.jgit.internal.storage.pack.PackWriter.findObjectsToPack(PackWriter.java:2108)
jvm 1    |  at org.eclipse.jgit.internal.storage.pack.PackWriter.preparePack(PackWriter.java:1005)
jvm 1    |  at org.eclipse.jgit.internal.storage.pack.PackWriter.preparePack(PackWriter.java:921)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.writePack(GC.java:1287)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.repack(GC.java:866)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.doGc(GC.java:286)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.gc(GC.java:235)
jvm 1    |  at org.eclipse.jgit.api.GarbageCollectCommand.call(GarbageCollectCommand.java:180)
jvm 1    |  at com.thoughtworks.go.service.ConfigRepository$2.run(ConfigRepository.java:405)
jvm 1    |  at com.thoughtworks.go.util.VoidThrowingFn.call(VoidThrowingFn.java:25)
jvm 1    |  at com.thoughtworks.go.service.ConfigRepository.doLocked(ConfigRepository.java:140)
jvm 1    |  at com.thoughtworks.go.service.ConfigRepository.garbageCollect(ConfigRepository.java:396)
jvm 1    |  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
jvm 1    |  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
jvm 1    |  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
jvm 1    |  at java.base/java.lang.reflect.Method.invoke(Unknown Source)
jvm 1    |  at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
jvm 1    |  at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
jvm 1    |  at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:81)
jvm 1    |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
jvm 1    |  at java.base/java.lang.Thread.run(Unknown Source)

2024-04-14 07:00:21,438 [goThreadPoolTaskScheduler-8] ERROR o.s.s.s.TaskUtils$LoggingErrorHandler - Unexpected error occurred in scheduled task. org.eclipse.jgit.api.errors.JGitInternalException: Garbage collection failed.
jvm 1    |  at org.eclipse.jgit.api.GarbageCollectCommand.call(GarbageCollectCommand.java:198)
jvm 1    |  at com.thoughtworks.go.service.ConfigRepository$2.run(ConfigRepository.java:405)
jvm 1    |  at com.thoughtworks.go.util.VoidThrowingFn.call(VoidThrowingFn.java:25)
jvm 1    |  at com.thoughtworks.go.service.ConfigRepository.doLocked(ConfigRepository.java:140)
jvm 1    |  at com.thoughtworks.go.service.ConfigRepository.garbageCollect(ConfigRepository.java:396)
jvm 1    |  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
jvm 1    |  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
jvm 1    |  at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
jvm 1    |  at java.base/java.lang.reflect.Method.invoke(Unknown Source)
jvm 1    |  at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
jvm 1    |  at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
jvm 1    |  at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:81)
jvm 1    |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
jvm 1    |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
jvm 1    |  at java.base/java.lang.Thread.run(Unknown Source)
jvm 1    | Caused by: org.eclipse.jgit.errors.MissingObjectException: Missing commit 5ac6a125860f908ba0a83d3b33344b14b454a2b8
jvm 1    |  at org.eclipse.jgit.internal.storage.file.WindowCursor.open(WindowCursor.java:145)
jvm 1    |  at org.eclipse.jgit.revwalk.RevWalk.getCachedBytes(RevWalk.java:1164)
jvm 1    |  at org.eclipse.jgit.revwalk.RevCommit.parseHeaders(RevCommit.java:151)
jvm 1    |  at org.eclipse.jgit.revwalk.PendingGenerator.next(PendingGenerator.java:122)
jvm 1    |  at org.eclipse.jgit.revwalk.RevWalk.next(RevWalk.java:625)
jvm 1    |  at org.eclipse.jgit.revwalk.ObjectWalk.next(ObjectWalk.java:365)
jvm 1    |  at org.eclipse.jgit.internal.storage.pack.PackWriter.findObjectsToPack(PackWriter.java:2108)
jvm 1    |  at org.eclipse.jgit.internal.storage.pack.PackWriter.preparePack(PackWriter.java:1005)
jvm 1    |  at org.eclipse.jgit.internal.storage.pack.PackWriter.preparePack(PackWriter.java:921)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.writePack(GC.java:1287)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.repack(GC.java:866)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.doGc(GC.java:286)
jvm 1    |  at org.eclipse.jgit.internal.storage.file.GC.gc(GC.java:235)
jvm 1    |  at org.eclipse.jgit.api.GarbageCollectCommand.call(GarbageCollectCommand.java:180)
jvm 1    |  ... 17 common frames omitted
Code snippets/Screenshots

N/A

Any other info

N/A

chadlwilson commented 1 month ago

Yuck, this is unfortunate. Never seen that before (probably the beauties of NFS?) It is possible to have the GoCD config repo bootstrapped from and pushed to a remote, but I've personally been unsure of the merits of that. In any case, that won't help you now.

I believe it should be possible to start fresh with git init like you suggest, but you will also need to add+commit the existing cruise-config.xml before GoCD will handle it correctly at startup. Branch will need to be master. If you want to reverse engineer from how the code behaves at start-up/initialization you can check here.

You'll definitely lose the ability to see the "Config Changed" history between pipeline runs, but I think that it will otherwise work OK.

I'd suggest you experiment with your process on a local GoCD first, either via a container with a mounted /godata or just via https://www.gocd.org/test-drive-gocd.html - doesn't have to be with your "real" config repo, perhaps just one with a few manual commits driven by the GoCD UI or a few pipeline runs with config changes between them.

As to your performance/CPU issues, it would sound odd to me if it's causing that, but it's probably wise to fix this first before digging too deeply into that.

m5lapp commented 4 weeks ago

Thanks for getting back to me, I've been having a bit of play around with our test GoCD instance and deleting and re-initialising the Git repo does seem to work as expected, I've got a bit more testing to do but it's looking promising so far.

Before I do the work in production, you mentioned that the GoCD config can be automatically pushed to a remote. Given what's happened, it sounds like a good idea for us to look into doing that. However, I was unable to find any information about how to do that anywhere, do you know of any resources that exist?

chadlwilson commented 4 weeks ago

There's no magic built in to push to a remote, but it's just a regular git repository so you could cron/supervisord something that does this periodically using command line git.

Alternatively, and probably better, just rely upon the built-in scheduled backup feature which includes zips of the configuration db/repo: https://docs.gocd.org/current/advanced_usage/one_click_backup.html ...

Or rely on lower level filesystem snapshot backups of your NFS volume.

sriramnrn commented 4 weeks ago

On Sat, 20 Apr 2024 at 1:10 AM, Chad Wilson @.***> wrote:

There's no magic built in to push to a remote, but it's just a regular git repository so you could cron/supervisord something that does this periodically using command line git.

Alternatively, and probably better, just rely upon the built-in scheduled backup feature which includes zips of the configuration db/repo: https://docs.gocd.org/current/advanced_usage/one_click_backup.html ...

Or rely on lower level filesystem snapshot backups of your NFS volume.

I have used both cron as well as inotify in the past.

Here's a ChatGPT answer : https://chat.openai.com/share/7eda067a-3708-4a94-a6c7-0ffde86eeafa

Reply to this email directly, view it on GitHub https://github.com/gocd/gocd/issues/12701#issuecomment-2066972128, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAF5JGWMCHMU6KI4EDEEJNLY6FFZ5AVCNFSM6AAAAABGJWHZJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRWHE3TEMJSHA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

sriramnrn commented 4 weeks ago

My apologies. I realised that I might delete that code snippet by accident. Here's the actual code :

#!/bin/bash

# Define your local Git repository path
repo_path="/path/to/your/git/repository"

# Define your remote repository URL
remote_url="git@github.com:user/repository.git"

# Function to push changes
push_changes() {
    cd "$repo_path" || exit
    git push origin master # or your preferred branch
}

# Watch for changes using inotifywait
inotifywait -r -m -e modify,create,delete,move "$repo_path" |
    while read -r directory events filename; do
        # Only act when changes occur in the Git repository
        if [[ "$directory/$filename" =~ ^$repo_path ]]; then
            echo "Changes detected in $directory/$filename. Pushing changes..."
            push_changes
        fi
    done
m5lapp commented 3 weeks ago

Thanks both, I've now fixed the Git repo in our production GoCD instance. I struggled to figure out the commit message format it needed with the correct checksums and things, it was always coming up with an error, but I found that just doing the git add without the git commit was sufficient and GoCD would detect this and do the commit for me.

We elected not to use a remote for the time being, if we see this issue again that we'll look to potentially do that. For now, we're chalking it up to a blip.

chadlwilson commented 3 weeks ago

If you don't have it configured already, it's still a very good idea to be using at least the built in GoCD backup function. That's relatively trivial to set up since GoCD manages the cron and would have allowed you to roll back somewhere with this without losing everything.

m5lapp commented 3 weeks ago

Thanks Chad, we do have backups configured and keep them for 14 days, but this issue with the corruption has been around since at least early February (as far back as our logs go, likely much longer), so would be no use in this case unfortunately.

chadlwilson commented 3 weeks ago

Fair enough. Could consider keeping them for longer I suppose. Shouldn't be too large in size once compressed?