kubecost / features-bugs

A public repository for filing Kubecost feature requests and bugs. Please read the issue guidelines before filing an issue here.

[Bug] Aggregator crashing on Kubecost #72

Closed · aaj-synth closed this 6 months ago

aaj-synth commented 7 months ago

Kubecost Helm Chart Version

2.2.0

Kubernetes Version

1.29

Kubernetes Platform

EKS

Description

While trying to update Kubecost from v2.1.0 to v2.2.0, the aggregator container in the kubecost-analyzer pod started going into CrashLoopBackOff with the error pasted below.
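
For reference, crash output like the log below can usually be pulled from the failed container directly. A rough sketch, assuming Kubecost runs in the kubecost namespace and the pod carries the usual app=cost-analyzer label:

# fetch logs from the previous (crashed) run of the aggregator container
kubectl logs -n kubecost \
  $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') \
  -c aggregator --previous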

Steps to reproduce

  1. Update the kubecost helm chart from v2.1.0 to v2.2.0
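
For illustration only, a minimal sketch of that upgrade, assuming the release is named kubecost, lives in the kubecost namespace, uses the public cost-analyzer chart, and keeps its overrides in values.yaml:

# refresh the local chart repo so the 2.2.0 chart is visible
helm repo update
# upgrade the existing release in place to chart version 2.2.0
helm upgrade kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --version 2.2.0 \
  -f values.yaml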

Expected behavior

The upgrade was expected to complete successfully, but it failed with the error below.

Impact

No response

Screenshots

No response

Logs

ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: no migration found for version 20240306133000: read down for version 20240306133000 migrations: file does not exist
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x1721f75]

goroutine 49 [running]:
database/sql.(*DB).Close(0x0)
    /usr/local/go/src/database/sql/sql.go:910 +0x35
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.startIngestor(0xc0001ca600, 0xc001485960)
    /app/kubecost-cost-model/pkg/duckdb/write/writer.go:325 +0x28
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.NewWriter.func5({0x4840ae0?, 0xc000c30060?}, 0xc001569568?)
    /app/kubecost-cost-model/pkg/duckdb/write/writer.go:183 +0x1b
github.com/looplab/fsm.(*FSM).enterStateCallbacks(0xc000c44000, {0x6017be8, 0xc000c34af0}, 0xc00165b7a0)
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:470 +0x82
github.com/looplab/fsm.(*FSM).Event.(*FSM).Event.func2.func3()
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:363 +0x150
github.com/looplab/fsm.transitionerStruct.transition(...)
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:422
github.com/looplab/fsm.(*FSM).doTransition(...)
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:407
github.com/looplab/fsm.(*FSM).Event(0xc000c44000, {0x60177f0, 0x861da40}, {0x4eac1e4, 0xd}, {0x0, 0x0, 0x0})
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:390 +0x80a
github.com/kubecost/kubecost-cost-model/pkg/duckdb/write.NewWriter(0xc001485960, {0xc0013f54c0, 0x3a}, {0xc0013f55c0, 0x39})
    /app/kubecost-cost-model/pkg/duckdb/write/writer.go:241 +0x725
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.createWriter(0xc001485900)
    /app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:399 +0x33
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.NewOrchestrator.func7({0x4840ae0?, 0xc00155a930?}, 0xc000c42000)
    /app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:213 +0x25
github.com/looplab/fsm.(*FSM).enterStateCallbacks(0xc001560500, {0x6017be8, 0xc000c34050}, 0xc000c42000)
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:470 +0x82
github.com/looplab/fsm.(*FSM).Event.(*FSM).Event.func2.func3()
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:363 +0x150
github.com/looplab/fsm.transitionerStruct.transition(...)
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:422
github.com/looplab/fsm.(*FSM).doTransition(...)
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:407
github.com/looplab/fsm.(*FSM).Event(0xc001560500, {0x60177f0, 0x861da40}, {0x4edebc3, 0x1b}, {0x0, 0x0, 0x0})
    /go/pkg/mod/github.com/looplab/fsm@v1.0.1/fsm.go:390 +0x80a
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.NewOrchestrator.func6.1()
    /app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:205 +0x3e
created by github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.NewOrchestrator.func6 in goroutine 1
    /app/kubecost-cost-model/pkg/duckdb/orchestrator/orchestrator.go:204 +0x4e8

Slack discussion

No response

Troubleshooting

cliffcolvin commented 7 months ago

@aaj-synth thank you for reporting this issue, I've got an engineer looking at this today.

passionInfinite commented 7 months ago

Same thing for me as well, but with a different error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied

Full Error:

2024-04-04T15:16:31.847772518Z ERR entering state: create_read_interface_init, err: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
2024-04-04T15:16:31.847814718Z ERR after event, current state: create_read_interface_init, err: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
2024-04-04T15:16:31.847829319Z ERR error submitting event: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x2f72575]

goroutine 52 [running]:
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.(*DuckDBProvider).NewSelect(0xc000bf63c0?)
    /app/kubecost-cost-model/pkg/duckdb/orchestrator/duckdbprovider.go:63 +0x35
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation/db.(*AllocationDBQueryService).buildAbandonedWorkloadsCTE(0xc000f13ab8, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0xc00111ad80})
    /app/kubecost-cost-model/pkg/duckdb/allocation/db/abandonedworkloads.go:212 +0x386
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation/db.(*AllocationDBQueryService).QueryAbandonedWorkloadsTopLine(0xc000f13ab8, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
    /app/kubecost-cost-model/pkg/duckdb/allocation/db/abandonedworkloads.go:29 +0x185
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation.(*DuckDBAllocationQueryService).GetAllAbandonedWorkloadsTopLine(0xc0000d6d48?, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
    /app/kubecost-cost-model/pkg/duckdb/allocation/abandonedworkloads.go:11 +0x76
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).FindAbandonedWorkloadsTopLine(0x0?, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
    /app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:191 +0x77
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).summarizeAbandonedWorkloads(0x0?, 0x0?)
    /app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:237 +0x99
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).refreshSummaryCache.func1()
    /app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:66 +0x1b
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).refreshIndividualMetric(0xc00120ab40, {0xc0006a8220, 0x1e}, 0xc0000d6fa0)
    /app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:104 +0x86
AjayTripathy commented 7 months ago

@passionInfinite @aaj-synth curious if a downgrade to 2.1 resolves the issue?

aaj-synth commented 7 months ago

Downgrading from 2.2 to 2.1 does not help. I had to go back to 1.108 to have things working again.

passionInfinite commented 7 months ago

Downgrading to v2.1.0 produces this error:

2024/04/04 16:44:37 maxprocs: Updating GOMAXPROCS=10: determined from CPU quota
2024-04-04T16:44:37.889918371Z ??? Log level set to info
2024-04-04T16:44:37.889949771Z INF tracing disabled
2024-04-04T16:44:37.890124073Z ERR AllocationReportFileStore: error creating file store: open /var/configs/reports.json: permission denied
2024-04-04T16:44:37.890294475Z ERR creating file store: open /var/configs/asset-reports.json: permission denied
2024-04-04T16:44:37.890402276Z ERR AdvancedReportFileStore: error creating file store: open /var/configs/advanced-reports.json: permission denied
2024-04-04T16:44:37.890489277Z ERR CloudCostFileStore: error creating file store: open /var/configs/cloud-cost-reports.json: permission denied
2024-04-04T16:44:37.890535778Z ERR RecurringBudgetRuleFileStore: error writing file store: open /var/configs/recurring-budget-rules.json: permission denied
2024-04-04T16:44:37.890594378Z ERR BudgetFileStore: error writing file store: open /var/configs/budgets.json: permission denied
2024-04-04T16:44:37.890650979Z ERR Team.FileStore: error creating file store: open /var/configs/teams.json: permission denied
2024-04-04T16:44:37.890681179Z ERR User.FileStore: error creating file store: open /var/configs/users.json: permission denied
2024-04-04T16:44:37.89071078Z ERR Auth.ServiceAccountFileStore: error creating file store: open /var/configs/serviceAccounts.json: permission denied
2024-04-04T16:44:37.890794881Z ERR entering state: create_read_interface_init, err: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var/configs/wat
2024-04-04T16:44:37.890808981Z ERR after event, current state: create_read_interface_init, err: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var
2024-04-04T16:44:37.890818581Z ERR error submitting event: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var/configs/waterfowl/duckdb/v0_9_2: per
2024-04-04T16:44:37.890951182Z ERR error initializing file store: failed to wrtie to file: open /var/configs/collections.json: permission denied
2024-04-04T16:44:37.891031283Z ERR Failed to write trial status: open /var/configs/trialuser.kc: permission denied
Error: initializing: failed to start enterprise trial: FailedToWriteTrialStatus
passionInfinite commented 7 months ago

Downgrading from 2.2 to 2.1 does not help. I had to go back to 1.108 to have things working again.

@aaj-synth For me it is still failing on 1.108.1. Is there something you did that made it start working? Looks like the PV files are corrupted 🤔

AjayTripathy commented 7 months ago

Hi @passionInfinite, this looks like a separate issue with permissions on the PV? Can you open a ticket with support?

aaj-synth commented 7 months ago

I upgraded from v1.108.0 to v2.1.0 and it worked fine. In the meantime I saw the blog post about v2.2.0 being released, and as soon as I upgraded to it, things stopped working. I tried downgrading to v2.1.0, but that ran into the same error I mentioned in the issue. I eventually downgraded to v1.108.0, removed the upgrade.toV2 flag from the Helm chart, and it worked for me.
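
A rough sketch of that rollback, assuming the same release and namespace names as in the reproduction step above and that the flag lives at upgrade.toV2 in the chart values:

# roll back to the last known-good 1.x chart with the v2 upgrade flag disabled
helm upgrade kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --version 1.108.0 \
  -f values.yaml \
  --set upgrade.toV2=false   # or delete the upgrade.toV2 entry from values.yaml entirely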

rahul-chr commented 7 months ago

I can confirm I am facing this too on Kubecost 2.1; I did an upgrade from 1.103.5 to 2.1.1.

ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: Dirty database version 20230712171354. Fix and force version.
panic: runtime error: invalid memory address or nil pointer dereference

michaelmdresser commented 7 months ago

@rahul-chr did you upgrade directly from v1.103.5 to v2.1.1? No other upgrades/downgrades along the way before seeing that error?

michaelmdresser commented 7 months ago

@aaj-synth Downgrades can sometimes be tricky when going between particular versions of v2.x. We're working on making this a non-issue. While we do, if you'd like to get back to v2.1 or try getting onto v2.2 again, please remove the /var/configs/waterfowl folder from your kubecost-cost-analyzer PVC before upgrading to your desired version. I have reason to believe the DB file got into a bad state and needs manual intervention. This does not cause data loss.

The command you would run is this, assuming Kubecost is installed in the kubecost namespace:

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl


@rahul-chr I'm not confident that your problem is the same as @aaj-synth's problem. If you're willing to experiment, trying the same command above might help you, but it also might not.

michaelmdresser commented 7 months ago

Also, @aaj-synth and @rahul-chr do Kubecost's PV(C)s have enough space on them? Are any of them filling up or full?
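
One way to check, as a sketch that assumes the usual app=cost-analyzer label and that the container image ships df:

# show free space on the volume mounted at /var/configs in the cost-analyzer pod
kubectl exec -n kubecost \
  $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') \
  -- df -h /var/configs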

passionInfinite commented 7 months ago

@michaelmdresser In my case, I found that the permissions on the folder mounted from the PV were changed to root for some reason, but the newer version uses an fsGroup of 1001, hence the permission-denied errors?

@michaelmdresser By any chance, does etlUtils run as root? 🤔

passionInfinite commented 7 months ago

@michaelmdresser I attached the volume to another test pod and checked the permissions of /var/configs, and it was owned by root instead of 1001. I think how we upgrade matters here. Going directly is not going to work, because some versions include the securityContext change that switches the ownership from root to 1001. From my experience, the upgrade path that will work is:

  1. v1.106.5 (current version) -> v1.107.1
  2. v1.107.1 -> v1.108.1 --> this includes the securityContext change to 1001
  3. v1.108.1 -> v2.1.0 --> initial migration to the Kubecost Aggregator using DuckDB
  4. v2.1.0 -> v2.2.0 (target version) --> migration schema changes

Please correct me if anything in my understanding is wrong.
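
If the ownership really has ended up as root, one possible remediation is to reset it from a temporary pod that runs as root and mounts the same PVC, as in the test pod described above. This is only a sketch under that assumption, not an official fix:

# run inside a root pod that mounts the PVC at /var/configs:
# hand ownership back to the UID/GID the chart's securityContext (1001) expects
chown -R 1001:1001 /var/configs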

michaelmdresser commented 7 months ago

@passionInfinite Thank you for the extra information, please open a separate issue to track the file permission problems you have encountered. We are using this issue to track the original issue and related problems: ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: Dirty database version 20230712171354. Fix and force version.

michaelmdresser commented 7 months ago

I attempted an upgrade directly from v1.103.5 to v2.1.1 without incident. I suspect this issue is limited to situations where downgrades have occurred.

rahul-chr commented 7 months ago

@rahul-chr did you upgrade directly from v1.103.5 to v2.1.1? No other upgrades/downgrades along the way before seeing that error?

@michaelmdresser yes, that was a direct upgrade, no downgrades. Also, here is the output:

Defaulted container "cost-model" out of: cost-model, cost-analyzer-frontend
rm: cannot remove '/var/configs/waterfowl': No such file or directory
command terminated with exit code 1

michaelmdresser commented 7 months ago

yes that was directl upgrade.. no downgrades.

Fascinating, we're trying to look further into this.


Defaulted container "cost-model" out of: cost-model, cost-analyzer-frontend rm: cannot remove '/var/configs/waterfowl': No such file or directory command terminated with exit code 1

@rahul-chr Are you using Aggregator in a StatefulSet configuration? If so, the command I gave you is slightly wrong, and needs to be modified like so:

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl

rahul-chr commented 7 months ago

yes that was directl upgrade.. no downgrades.

Fascinating, we're trying to look further into this.

Defaulted container "cost-model" out of: cost-model, cost-analyzer-frontend rm: cannot remove '/var/configs/waterfowl': No such file or directory command terminated with exit code 1

@rahul-chr Are you using Aggregator in a StatefulSet configuration? If so, the command I gave you is slightly wrong, and needs to be modified like so:

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl

Thank you @michaelmdresser for your response! But it looks like this isn't helping either...

kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}')

error: unable to upgrade connection: container not found ("aggregator")

michaelmdresser commented 7 months ago

@rahul-chr Ah, shucks. I'm guessing that's because it's crash looping. To exec into the Pod to run the recovery despite the crash loop, we're going to have to do this:

  1. Edit the Aggregator StatefulSet via kubectl edit, e.g. kubectl edit statefulset -n kubecost kubecost-aggregator.

     Add the following right underneath name: aggregator in the Pod spec inside the StatefulSet:

     command:
       - /bin/bash
       - -c
       - |
         sleep 36000;

     This starts the pod in a sleeping mode without starting the app, meaning it will not crash.

  2. After saving the edits, the kubecost-aggregator-0 Pod should terminate and restart.
  3. Check the logs on the kubecost-aggregator-0 Pod and confirm there are none; this is expected, because the container is only sleeping.
  4. Run the command I sent earlier: kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl/duckdb
  5. Remove the command: block added in step 1. Aggregator should restart with normal log behavior.

I apologize for the trouble here. This is an unusual error situation.
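
For anyone who prefers not to hand-edit the StatefulSet, steps 1 and 5 above can also be done non-interactively with kubectl patch; this is a sketch that assumes the aggregator container is the first container in the pod template:

# step 1 equivalent: override the entrypoint so the pod just sleeps
kubectl patch statefulset -n kubecost kubecost-aggregator --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["/bin/bash","-c","sleep 36000"]}]'

# step 5 equivalent: remove the override once the cleanup command has run
kubectl patch statefulset -n kubecost kubecost-aggregator --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"}]'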

rahul-chr commented 7 months ago

@michaelmdresser I think there is still an obvious problem:

rm: cannot remove '/var/configs/waterfowl/duckdb': Device or resource busy
command terminated with exit code 1

But I have tweaked it further: I removed the volume mounts below, since they were mounted and in use by the PVC. I was then able to delete the directory, and later I added them back ;)

    - name: aggregator-db-storage
      mountPath: /var/configs/waterfowl/duckdb
    - name: aggregator-staging
      mountPath: /var/configs/waterfowl

It works now

Also, do you think this is a potential bug with this upgrade?

michaelmdresser commented 7 months ago

But i have tweaked it more,

Ah, thanks for the reminder about that bit of the volume configuration. Thanks for your patience.

It works now

The command works, great! After removing the sleep, has Aggregator started up normally without the crash behavior?

also, do you think this is a potential bug with this upgrade ?

Is this question about the original bit of this GH Issue, which is migrating up: no migration found for version 20240306133000: read down for version 20240306133000 migrations: file does not exist? I think so, given that we've seen a few reports of it so far. It's a bit troubling for me, as I haven't been able to reproduce it yet with anything except a downgrade.

rahul-chr commented 7 months ago

But i have tweaked it more,

Ah, thanks for the reminder about that bit of the volume configuration. Thanks for your patience.

It works now

The command works, great! After removing the sleep, has Aggregator started up normally without the crash behavior?

also, do you think this is a potential bug with this upgrade ?

Is this question about the original bit of this GH Issue, which is migrating up: no migration found for version 20240306133000: read down for version 20240306133000 migrations: file does not exist? I think so, given that we've seen a few reports of it so far. It's a bit troubling for me, as I haven't been able to reproduce it yet with anything except a downgrade.

Nope, this is specific to my issue. Do you want me to open a GitHub issue for that? I'm also a bit afraid: can I do this workaround (removing duckdb) in production?

michaelmdresser commented 7 months ago

Nope, this is specific to my issue. Do you want me to open a GitHub issue for that?

If you're running into a new bug, please do open a new issue.

I'm also a bit afraid: can I do this workaround (removing duckdb) in production?

Don't worry! DuckDB files are not a "source of truth" -- Aggregator builds up its datastore from what we call "ETL" files, which are stored either in object storage (e.g. S3, GCS) or in a different folder on the PV, depending on your configuration. Removing the /var/configs/waterfowl/duckdb directory will indeed cause a rebuild, but the ETL data it builds from will not be affected, so it will get you right back to where you should be once the rebuild completes. No data loss.

chipzoller commented 6 months ago

Does not appear to be an issue with the Helm chart. Transferred to the correct repository.

TomHellier commented 5 months ago

@AjayTripathy - This is marked as completed - do you know what version a fix was released in? thanks :)

AjayTripathy commented 5 months ago

2.2.5 -- let me check on what's going on in #103 though.