[Bug] Duckdb corruption after 2.2.5 upgrade

DerekTBrown commented 5 months ago

Kubecost Version

2.2.5

Kubernetes Version

1.25

Kubernetes Platform

Other (specify in description)

Description

After upgrading to 2.2.5, the aggregator container fails to start with the following message:

2024-06-03T18:58:50.534412476Z ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: setting up migrations: opening '/var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write': could not open database: IO Error: Corrupt database file: computed checksum 2550608162518328766 does not match stored checksum 6531264450723538710 in block
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x177fe55]

Steps to reproduce

Launch kubecost and wait.

Expected behavior

No pod crashes.

Impact

No response

Screenshots

No response

Logs

No response

Slack discussion

No response

Troubleshooting

[X] I have read and followed the issue guidelines and this is a bug impacting only the Kubecost application.
[X] I have searched other issues in this repository and mine is not recorded.

DerekTBrown commented 5 months ago

This looks similar to the following issues, but those are fixed:

AjayTripathy commented 5 months ago

Hi @DerekTBrown we're working on a smoother way to handle this, but for now you should be able to delete and recreate the persistent volume and restart the pod to get out of this state.

DerekTBrown commented 5 months ago

> Hi @DerekTBrown we're working on a smoother way to handle this, but for now you should be able to delete and recreate the persistent volume and restart the pod to get out of this state.

Done, and that did seem to resolve the issue.

Is the plan to just have something that deletes the cache if it becomes corrupted?

AjayTripathy commented 5 months ago

@cliffcolvin we're pretty sure this is getting addressed in 2.3+, right?

DeepakRai94 commented 4 months ago

I had the same issue, but it started working after I recreated the aggregator-db persistent volume.

passionInfinite commented 4 months ago

Confirmed same issue with v2.2.4.

teevans commented 4 months ago

Hey there, this should be resolved in our 2.3 release. We're planning on releasing 2.3.2 sometime today or tomorrow and recommend upgrading to that when it's ready!

timchenko-a commented 2 months ago

Not sure if this is the same exact issue, but we've hit something similar with 2.3.5:

panic: failed to create ingestor: Ingestor: error creating db: setting up migrations: opening '/var/configs/waterfowl/duckdb/v0_10_3/kubecost.duckdb.write': database/sql/driver: could not open database: duckdb error: IO Error: Corrupt database file: computed checksum 4178360413824115490 does not match stored checksum 16005271743778032503 in block at location 34877440

AjayTripathy commented 2 months ago

Could you try recreating the aggregator db volume and let me know if that works, @timchenko-a?

igorbrites commented 1 month ago

Bit different message, but recreating the aggregator db PVC solves the issue (added some logs before the error for context):

2024-10-04T12:07:32.611615117Z INF Copy starting
2024-10-04T12:07:32.622755633Z ERR Failed to get migrate version: no migration
2024-10-04T12:08:37.152210689Z INF Copy finished
2024-10-04T12:08:37.381023796Z INF Ingestion starting
2024-10-04T12:08:37.38826687Z INF Using default file store as data source
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3099b71]

goroutine 2052 [running]:
github.com/kubecost/kubecost-cost-model/pkg/duckdb/internal/bun.(*Writer).InsertCloudCosts(0xc0000b4790, {0x6690990, 0xc004a41e90}, 0xc0040ad110, 0xc001064800)
        /app/kubecost-cost-model/pkg/duckdb/internal/bun/writer.go:704 +0x1d1
github.com/kubecost/kubecost-cost-model/pkg/duckdb/internal/cloudcost.(*Ingestor).run.func1(0xc0040ad110)
        /app/kubecost-cost-model/pkg/duckdb/internal/cloudcost/investor.go:201 +0x10d6
github.com/opencost/opencost/core/pkg/util/worker.(*queuedWorkerPool[...]).worker(0x0)
        /app/opencost/core/pkg/util/worker/worker.go:117 +0x42
created by github.com/opencost/opencost/core/pkg/util/worker.NewWorkerPool[...] in goroutine 1648
        /app/opencost/core/pkg/util/worker/worker.go:72 +0x13d

It's not ideal to delete the PVC every time duckdb crashes, though I have no idea why it crashes in the first place.

Using v2.3.4.

AjayTripathy commented 1 month ago

@igorbrites can you try the latest version and let me know if this persists?

chipzoller commented 1 month ago

Hello, in an effort to consolidate our bug and feature request tracking, we are deprecating using GitHub to track tickets. If this issue is still outstanding and you have not done so already, please raise a request at https://support.kubecost.com/.
