kubecost / features-bugs

A public repository for filing Kubecost feature requests and bugs. Please read the issue guidelines before filing an issue here.

[Bug] Allocation query fails due to a max concurrency semaphore error #110

Open DamienMatias opened 2 months ago

DamienMatias commented 2 months ago

Kubecost Version

2.3.2

Kubernetes Version

1.28

Kubernetes Platform

EKS

Description

When calling the allocation API, it returns the following error:

{
    "code": 500,
    "data": null,
    "message": "error querying DB: error querying via querysvc append row Failure: acquiring max concurrency semaphore: context canceled"
}

I can see the same error in the logs of the aggregator container:

2024-07-11T12:41:01.327494391Z DBG NetworkInsight: Ingestor: updates flusher stopped
2024-07-11T12:41:01.327513703Z DBG NetworkInsight: Ingestor: completed run with 0 record updates
2024-07-11T12:41:01.43637521Z DBG store completed records
2024-07-11T12:41:01.436434133Z DBG completed records stored
2024-07-11T12:41:01.436386607Z DBG Allocation: Ingestor: updates flusher stopped
2024-07-11T12:41:01.436448812Z DBG Allocation: Ingestor: completed run with 0 allocations added
2024-07-11T12:41:01.442933662Z DBG store completed records
2024-07-11T12:41:01.442985026Z DBG completed records stored
2024-07-11T12:41:01.443000735Z DBG Asset: Ingestor: completed run with 0 assets ingested
2024-07-11T12:41:01.443021613Z DBG Asset: Ingestor: updates flusher stopped
2024-07-11T12:42:00.00062016Z DBG checking scheduled reports at time: Thu Jul 11 12:42:00 UTC 2024
2024-07-11T12:43:00.000163125Z DBG checking scheduled reports at time: Thu Jul 11 12:43:00 UTC 2024
2024-07-11T12:43:42.089589365Z DBG http: named cookie not present
2024-07-11T12:43:42.08964877Z DBG Auth.Groups: http: named cookie not present
2024-07-11T12:43:42.089688765Z DBG Using V1 filter language for query
2024-07-11T12:43:50.326999191Z DBG http: named cookie not present
2024-07-11T12:43:50.327117389Z DBG Auth.Groups: http: named cookie not present
2024-07-11T12:43:50.327250322Z DBG Using V1 filter language for query
2024-07-11T12:44:00.000097434Z DBG checking scheduled reports at time: Thu Jul 11 12:44:00 UTC 2024
2024-07-11T12:44:42.119047842Z DBG http: named cookie not present
2024-07-11T12:44:42.119127545Z DBG Auth.Groups: http: named cookie not present
2024-07-11T12:44:42.119161615Z DBG Using V1 filter language for query
2024-07-11T12:44:42.327072475Z ERR error querying via querysvc append row Failure: acquiring max concurrency semaphore: context canceled
2024-07-11T12:44:42.32772265Z ERR error querying via querysvc append row Failure: acquiring max concurrency semaphore: context canceled
2024-07-11T12:44:42.328931514Z ERR error querying via querysvc append row Failure: acquiring max concurrency semaphore: context canceled
2024-07-11T12:44:42.328990395Z ERR error querying via querysvc append row Failure: acquiring max concurrency semaphore: context canceled

The error doesn't happen every time; it looks like the aggregator doesn't handle multiple concurrent queries well. I tried increasing the resources available to the aggregator container, but that had no impact.

I'll add that this error appeared a few days after upgrading from 1.108.1 to 2.3.2.
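
Since the failures are intermittent, a client-side retry is one way to tolerate them in the meantime. A minimal sketch; the host, timeout, and retry counts are illustrative, not our exact setup:

import time
import requests

# Illustrative endpoint behind our nginx ingress; not the real hostname.
ALLOCATION_URL = "https://kubecost.whatever.com/model/allocation"

def query_allocation(params: dict, retries: int = 3, backoff_seconds: int = 10):
    """GET the allocation API, retrying on the intermittent 500 semaphore error."""
    for attempt in range(retries):
        resp = requests.get(ALLOCATION_URL, params=params, timeout=300)
        if resp.status_code == 200:
            return resp.json()["data"]
        if attempt < retries - 1:
            time.sleep(backoff_seconds * (attempt + 1))  # back off before retrying
    resp.raise_for_status()  # surface the final 500 if all attempts failed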

Steps to reproduce

Helm chart values

  global:
    prometheus:
      enabled: false
      fqdn: http://prometheus-server.kube-system.svc
    amp:
      enabled: true
      prometheusServerEndpoint: http://localhost:8005/workspaces/...
    grafana:
      enabled: false
      proxy: false
    networkCosts:
      enabled: true
      config:
        services:
          amazon-web-services: true
    kubecostProductConfigs:
      clusterName: ${CLUSTER_NAME}
      projectID: ${AWS_ACCOUNT_ID}
      awsSpotDataRegion: ${REGION}
      awsSpotDataBucket: example-${REGION}
      spotLabel: karpenter.sh/capacity-type
      spotLabelValue: spot
      gpuLabel: Group
      gpuLabelValue: ComputeGPU
    kubecostModel:
      etlBucketConfigSecret: example
      extraEnv:
      - name: LOG_LEVEL
        value: debug
    ingress:
      enabled: true
      className: nginx
      annotations: null
      paths: ["/"]
      pathType: Prefix
      hosts:
        - ...
    sigV4Proxy:
      region: ${REGION}
      host: aps-workspaces.${REGION}.amazonaws.com
    kubecostAggregator:
      logLevel: debug
      dbReadThreads: 4
      dbWriteThreads: 4
      resources:
        requests:
          cpu: 2
          memory: 6Gi
        limits:
          cpu: 4
          memory: 12Gi
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: ...

I'm making this GET request:

GET /model/allocation?window=7d&shareIdle=false&aggregate=label:env,label:project,label:app,job,label:user_name,pod HTTP/1.1
Host: kubecost.whatever.com

but when I change the window to 1d, it doesn't fail anymore 🤷‍♂️
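
Since window=1d succeeds where window=7d fails, one possible workaround is to break the 7d window into seven 1d requests and merge the results client-side. A rough sketch, assuming the window parameter also accepts an RFC3339 start,end pair; the host is illustrative, the aggregate string is copied from the request above, and the merge step is simplified:

from datetime import date, timedelta
import requests

ALLOCATION_URL = "https://kubecost.whatever.com/model/allocation"  # illustrative host
AGGREGATE = "label:env,label:project,label:app,job,label:user_name,pod"

def allocation_last_7_days():
    """Fetch the last 7 days one 1d window at a time and concatenate the results."""
    merged = []
    today = date.today()
    for days_back in range(7, 0, -1):
        start = today - timedelta(days=days_back)
        end = start + timedelta(days=1)
        window = f"{start:%Y-%m-%d}T00:00:00Z,{end:%Y-%m-%d}T00:00:00Z"
        resp = requests.get(ALLOCATION_URL, params={
            "window": window,
            "shareIdle": "false",
            "aggregate": AGGREGATE,
        }, timeout=300)
        resp.raise_for_status()
        merged.extend(resp.json()["data"])  # one allocation set per 1d step
    return merged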

Expected behavior

I would expect the API to return a 200 alongside the data.

Impact

We run this request hourly to store the results in our analytics backend.

Screenshots

No response

Logs

No response

Slack discussion

No response

Troubleshooting

chipzoller commented 3 weeks ago

cc @cliffcolvin

cliffcolvin commented 2 weeks ago

@DamienMatias this should be fixed in 2.3.5-rc.10.

We are working to get the full 2.3.5 release out and are hopefully doing the last QA efforts today. If you can take the RC, it should resolve the issue; if not, grab 2.3.5 very soon!