Closed: aaj-synth closed this issue 6 months ago.
@aaj-synth thank you for reporting this issue; I've got an engineer looking at this today.
Same thing for me as well, but with a different error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
Full Error:
```
2024-04-04T15:16:31.847772518Z ERR entering state: create_read_interface_init, err: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
2024-04-04T15:16:31.847814718Z ERR after event, current state: create_read_interface_init, err: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
2024-04-04T15:16:31.847829319Z ERR error submitting event: setting up migrations: opening '/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read': could not open database: IO Error: Cannot open file "/var/configs/waterfowl/duckdb/kubecost-1712243791.duckdb.read": Permission denied
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x2f72575]

goroutine 52 [running]:
github.com/kubecost/kubecost-cost-model/pkg/duckdb/orchestrator.(*DuckDBProvider).NewSelect(0xc000bf63c0?)
	/app/kubecost-cost-model/pkg/duckdb/orchestrator/duckdbprovider.go:63 +0x35
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation/db.(*AllocationDBQueryService).buildAbandonedWorkloadsCTE(0xc000f13ab8, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0xc00111ad80})
	/app/kubecost-cost-model/pkg/duckdb/allocation/db/abandonedworkloads.go:212 +0x386
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation/db.(*AllocationDBQueryService).QueryAbandonedWorkloadsTopLine(0xc000f13ab8, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
	/app/kubecost-cost-model/pkg/duckdb/allocation/db/abandonedworkloads.go:29 +0x185
github.com/kubecost/kubecost-cost-model/pkg/duckdb/allocation.(*DuckDBAllocationQueryService).GetAllAbandonedWorkloadsTopLine(0xc0000d6d48?, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
	/app/kubecost-cost-model/pkg/duckdb/allocation/abandonedworkloads.go:11 +0x76
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).FindAbandonedWorkloadsTopLine(0x0?, {0x2, 0x1f4, 0x0, {0x0, 0x0}, 0x0, 0x0, 0x0})
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:191 +0x77
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).summarizeAbandonedWorkloads(0x0?, 0x0?)
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:237 +0x99
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).refreshSummaryCache.func1()
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:66 +0x1b
github.com/kubecost/kubecost-cost-model/pkg/duckdb/savings.(*DuckDBSavingsQueryService).refreshIndividualMetric(0xc00120ab40, {0xc0006a8220, 0x1e}, 0xc0000d6fa0)
	/app/kubecost-cost-model/pkg/duckdb/savings/queryservice.go:104 +0x86
```
@passionInfinite @aaj-synth curious if a downgrade to 2.1 resolves the issue?
Downgrading from 2.2 to 2.1 does not help. I had to go back to 1.108 to have things working again.
Downgrading to v2.1.0 produces this error:
```
2024/04/04 16:44:37 maxprocs: Updating GOMAXPROCS=10: determined from CPU quota
2024-04-04T16:44:37.889918371Z ??? Log level set to info
2024-04-04T16:44:37.889949771Z INF tracing disabled
2024-04-04T16:44:37.890124073Z ERR AllocationReportFileStore: error creating file store: open /var/configs/reports.json: permission denied
2024-04-04T16:44:37.890294475Z ERR creating file store: open /var/configs/asset-reports.json: permission denied
2024-04-04T16:44:37.890402276Z ERR AdvancedReportFileStore: error creating file store: open /var/configs/advanced-reports.json: permission denied
2024-04-04T16:44:37.890489277Z ERR CloudCostFileStore: error creating file store: open /var/configs/cloud-cost-reports.json: permission denied
2024-04-04T16:44:37.890535778Z ERR RecurringBudgetRuleFileStore: error writing file store: open /var/configs/recurring-budget-rules.json: permission denied
2024-04-04T16:44:37.890594378Z ERR BudgetFileStore: error writing file store: open /var/configs/budgets.json: permission denied
2024-04-04T16:44:37.890650979Z ERR Team.FileStore: error creating file store: open /var/configs/teams.json: permission denied
2024-04-04T16:44:37.890681179Z ERR User.FileStore: error creating file store: open /var/configs/users.json: permission denied
2024-04-04T16:44:37.89071078Z ERR Auth.ServiceAccountFileStore: error creating file store: open /var/configs/serviceAccounts.json: permission denied
2024-04-04T16:44:37.890794881Z ERR entering state: create_read_interface_init, err: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var/configs/wat
2024-04-04T16:44:37.890808981Z ERR after event, current state: create_read_interface_init, err: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var
2024-04-04T16:44:37.890818581Z ERR error submitting event: error making directory /var/configs/waterfowl/duckdb/v0_9_2: mkdir /var/configs/waterfowl/duckdb/v0_9_2: per
2024-04-04T16:44:37.890951182Z ERR error initializing file store: failed to wrtie to file: open /var/configs/collections.json: permission denied
2024-04-04T16:44:37.891031283Z ERR Failed to write trial status: open /var/configs/trialuser.kc: permission denied
Error: initializing: failed to start enterprise trial: FailedToWriteTrialStatus
```
Downgrading from 2.2 to 2.1 does not help. I had to go back to 1.108 to have things working again.
@aaj-synth For me it is still failing on 1.108.1. Did you do something that got it working again? Looks like the PV files are corrupted 🤔
Hi @passionInfinite, this looks like a separate issue with permissions on the PV. Can you open a ticket with support?
I upgraded from v1.108.0 to v2.1.0 and it worked fine. In the meantime I saw the blog post about v2.2.0 being released, and as soon as I upgraded to that, things stopped working. I tried downgrading to v2.1.0, but that ran into the same error that I mentioned in the issue. I eventually downgraded to v1.108.0 and just removed the upgrade.toV2 flag from the Helm chart, and it worked for me.
I can confirm I am facing this too on Kubecost 2.1; I did an upgrade from 1.103.5 to 2.1.1.
ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: Dirty database version 20230712171354. Fix and force version. panic: runtime error: invalid memory address or nil pointer dereference
@rahul-chr did you upgrade directly from v1.103.5 to v2.1.1? No other upgrades/downgrades along the way before seeing that error?
@aaj-synth Downgrades can sometimes be tricky when going between particular versions of v2.X. We're working on making this not a problem. While we're working on that, if you'd like to get back to v2.1 or try getting onto v2.2 again, please remove the /var/configs/waterfowl folder from your kubecost-cost-analyzer PVC before upgrading to your desired version. I have reason to believe the DB file got into a bad state and needs manual intervention. This does not cause data loss.
The command you would run is this, assuming Kubecost is installed in the kubecost namespace:

```shell
kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl
```
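If it helps, a quick sanity check before running the removal (same namespace and label selector as above) is to confirm the directory actually exists and see who owns it:

```shell
# Sanity check only: confirm the waterfowl directory exists and inspect its ownership
kubectl exec -n kubecost \
  $(kubectl get pod -n kubecost -l app=cost-analyzer -o jsonpath='{.items[0].metadata.name}') \
  -- ls -ld /var/configs/waterfowl
```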
@rahul-chr I'm not confident that your problem is the same as @aaj-synth's problem. If you're willing to experiment, trying the same command above might help you, but it also might not.
Also, @aaj-synth and @rahul-chr do Kubecost's PV(C)s have enough space on them? Are any of them filling up or full?
@michaelmdresser In my case, I found that the permissions on the PV-mounted folder got changed to root for some reason, but the newer version uses an fsGroup of 1001, hence the permission denied errors?
@michaelmdresser By any chance, does etlUtils run as root? 🤔
@michaelmdresser I attached the volume to another test pod and checked the permissions of /var/configs, and it was root instead of 1001. I think how we upgrade matters here. Going directly is not going to work, because one of the versions includes the securityContext implementation which changes the permissions from root to 1001. From my experience, the upgrade path that will work is:

- v1.106.5 (current version) -> v1.107.1
- v1.107.1 -> v1.108.1 ---> This includes the securityContext change to 1001
- v1.108.1 -> v2.1.0 --> Initial migration with Kubecost Aggregator using DuckDB
- v2.1.0 -> v2.2.0 (target version) --> Migration schema changes

Please correct me if something is wrong in my point of view.
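For anyone following along, here is a minimal sketch of how an fsGroup in the Pod securityContext changes ownership of a mounted volume; the debug Pod and the PVC name are assumptions for illustration, not Kubecost's actual manifests:

```yaml
# Illustrative debug Pod: mounts the (assumed) kubecost-cost-analyzer PVC with fsGroup 1001,
# which makes the kubelet chgrp the volume contents to GID 1001 on mount.
apiVersion: v1
kind: Pod
metadata:
  name: pvc-permission-check   # hypothetical name
spec:
  securityContext:
    fsGroup: 1001
  containers:
    - name: inspect
      image: busybox:1.36
      command: ["sh", "-c", "ls -ln /var/configs && sleep 3600"]
      volumeMounts:
        - name: configs
          mountPath: /var/configs
  volumes:
    - name: configs
      persistentVolumeClaim:
        claimName: kubecost-cost-analyzer   # assumed PVC name
```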
@passionInfinite Thank you for the extra information, please open a separate issue to track the file permission problems you have encountered. We are using this issue to track the original issue and related problems: ERR error doing initial open of DB: error opening db at path /var/configs/waterfowl/duckdb/v0_9_2/kubecost.duckdb.write: migrating up: Dirty database version 20230712171354. Fix and force version.
I attempted an upgrade directly from v1.103.5 to v2.1.1 without incident. I suspect this issue is limited to situations where downgrades have occurred.
@rahul-chr did you upgrade directly from v1.103.5 to v2.1.1? No other upgrades/downgrades along the way before seeing that error?
@michaelmdresser yes, that was a direct upgrade.. no downgrades. And also the output:

```
Defaulted container "cost-model" out of: cost-model, cost-analyzer-frontend
rm: cannot remove '/var/configs/waterfowl': No such file or directory
command terminated with exit code 1
```
yes, that was a direct upgrade.. no downgrades.

Fascinating, we're trying to look further into this.

```
Defaulted container "cost-model" out of: cost-model, cost-analyzer-frontend
rm: cannot remove '/var/configs/waterfowl': No such file or directory
command terminated with exit code 1
```

@rahul-chr Are you using Aggregator in a StatefulSet configuration? If so, the command I gave you is slightly wrong, and needs to be modified like so:

```shell
kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl
```
Thank you @michaelmdresser for your response! But it looks like this isn't helping either..

```
kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}')
error: unable to upgrade connection: container not found ("aggregator")
```
@rahul-chr Ah, shucks. I'm guessing that's because it's crash looping. To exec into the Pod to run the recovery despite the crash loop, we're going to have to do this:

1. Edit the Aggregator StatefulSet via kubectl edit, e.g. kubectl edit statefulset -n kubecost kubecost-aggregator. Add the following right underneath name: aggregator in the Pod spec inside the StatefulSet:

   ```yaml
   command:
     - /bin/bash
     - -c
     - |
       sleep 36000;
   ```

   This will start the pod up in sleeping mode and will not start the app, meaning it will not crash.
2. The kubecost-aggregator-0 Pod should terminate and restart.
3. Check the kubecost-aggregator-0 Pod to ensure there are no logs. This is expected, because it is sleeping.
4. Run the removal command:

   ```shell
   kubectl exec -it -n kubecost $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') -- rm -r /var/configs/waterfowl/duckdb
   ```

5. Remove the command: block added in step 1. Aggregator should restart with normal log behavior.

I apologize for the trouble here. This is an unusual error situation.
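If hand-editing the StatefulSet is awkward, the same temporary override can be applied and removed with kubectl patch; this is only a sketch and assumes the aggregator container is the first container in the Pod spec:

```shell
# Add the sleep override (assumes the aggregator container is at index 0 in the Pod spec)
kubectl patch statefulset -n kubecost kubecost-aggregator --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["/bin/bash","-c","sleep 36000"]}]'

# Once the cleanup is done, drop the override so the app starts normally again
kubectl patch statefulset -n kubecost kubecost-aggregator --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/command"}]'
```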
@michaelmdresser I think there is still an obvious problem:

```
rm: cannot remove '/var/configs/waterfowl/duckdb': Device or resource busy
command terminated with exit code 1
```
But I have tweaked it more: I removed the mountPath below, as it was mounted and used by the PVC. I was able to delete the directory then, and later added it back ;)

```yaml
- name: aggregator-db-storage
  mountPath: /var/configs/waterfowl/duckdb
- name: aggregator-staging
  mountPath: /var/configs/waterfowl
```
It works now
Also, do you think this is a potential bug with this upgrade?
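For reference, since /var/configs/waterfowl/duckdb is itself a mount point, rm -r on the directory will always hit "Device or resource busy"; a sketch of an alternative that avoids touching the volumeMounts (not verified against this setup) is to clear the directory's contents instead:

```shell
# Delete the DuckDB files inside the mounted directory rather than the mount point itself
kubectl exec -it -n kubecost \
  $(kubectl get pod -n kubecost -l app=aggregator -o jsonpath='{.items[0].metadata.name}') \
  -- sh -c 'rm -rf /var/configs/waterfowl/duckdb/*'
```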
But I have tweaked it more:
Ah, thanks for the reminder about that bit of the volume configuration. Thanks for your patience.
It works now
The command works, great! After removing the sleep, has Aggregator started up normally without the crash behavior?
Also, do you think this is a potential bug with this upgrade?
Is this question about the original bit of this GH Issue, which is migrating up: no migration found for version 20240306133000: read down for version 20240306133000 migrations: file does not exist? I think so, given that we've seen a few reports of it so far. It's a bit troubling for me, as I haven't been able to reproduce it yet with anything except a downgrade.
Nope, this is specific to my issue. Do you want me to open a GitHub issue for that? I'm also wondering whether I can safely use this workaround (removing duckdb) in production?
Nope, this is specific to my issue. Do you want me to open a GitHub issue for that?
If you're running into a new bug, please do open a new issue.
I'm also wondering whether I can safely use this workaround (removing duckdb) in production?
Don't worry! DuckDB files are not a "source of truth" -- Aggregator builds up its datastore from what we call "ETL" files, which are stored either in object storage (e.g. S3, GCS) or in a different folder in the PV, depending on your configuration. Removing the /var/configs/waterfowl/duckdb directory will indeed cause a rebuild, but all of the ETL data it builds from will not be affected, so it will get you right back to where you should be once the rebuild completes. No data loss.
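If you do clear the DuckDB directory in production, one simple way to watch the rebuild (label selector assumed from the commands earlier in this thread) is to follow the Aggregator logs until normal query behavior resumes:

```shell
# Follow Aggregator logs while it rebuilds its DuckDB store from the ETL data
kubectl logs -f -n kubecost -l app=aggregator
```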
Does not appear to be an issue with the Helm chart. Transferred to the correct repository.
@AjayTripathy - This is marked as completed - do you know what version a fix was released in? thanks :)
2.2.5 -- let me check on what's going on in #103 though.
Kubecost Helm Chart Version
2.2.0
Kubernetes Version
1.29
Kubernetes Platform
EKS
Description
While trying to update kubecost from v2.1.0 to v2.2.0, the kubecost-analyzer pod's container aggregator started going into CrashLoopBackOff with the error pasted below.

Steps to reproduce
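For context, an upgrade like the one described would typically be run with Helm along these lines; the release name, namespace, and values file below are placeholders rather than the reporter's actual commands:

```shell
# Hypothetical reproduction: upgrade an existing Kubecost release from v2.1.0 to v2.2.0
helm repo update
helm upgrade kubecost kubecost/cost-analyzer -n kubecost --version 2.2.0 -f values.yaml
```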
Expected behavior
It was expected to update successfully but it threw this error.
Impact
No response
Screenshots
No response
Logs
Slack discussion
No response
Troubleshooting