thanos-io / objstore

Go module providing a unified interface and efficient clients to work with various object storage providers like GCS, S3, Azure, SWIFT, COS and more.
Apache License 2.0

Error uploading Prometheus blocks to Azure blob store via sidecar #24

Closed hemuvemula closed 1 year ago

hemuvemula commented 2 years ago

versions

Thanos: v0.27.0

Prometheus: v2.36.2

Environment

AKS - 1.22.11

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    component: "server"
    app: prometheus
    release: prometheus
    chart: prometheus-15.11.0
    heritage: Helm
  name: prometheus-server
  namespace: monitoring
spec:
  serviceName: prometheus-server-headless
  selector:
    matchLabels:
      component: "server"
      app: prometheus
      release: prometheus
  replicas: 1
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        component: "server"
        app: prometheus
        release: prometheus
        chart: prometheus-15.11.0
        heritage: Helm
    spec:
      priorityClassName: "system-node-critical"
      enableServiceLinks: true
      serviceAccountName: prometheus-server
      containers:

        - name: prometheus-server
          image: "prometheus/prometheus:v2.36.2"
          imagePullPolicy: "IfNotPresent"
          securityContext:
            {}
          env:
            - name: CLUSTER_NAME
              value: sandbox
          args:
            - --storage.tsdb.retention.time=7d
            - --config.file=/etc/config/prometheus.yml
            - --storage.tsdb.path=/data
            - --web.console.libraries=/etc/prometheus/console_libraries
            - --web.console.templates=/etc/prometheus/consoles
            - --web.enable-lifecycle
            - --web.enable-admin-api
            - --storage.tsdb.max-block-duration=2h
            - --storage.tsdb.min-block-duration=2h
          ports:
            - containerPort: 9090
          readinessProbe:
            httpGet:
              path: /-/ready
              port: 9090
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 5
            timeoutSeconds: 4
            failureThreshold: 3
            successThreshold: 1
          livenessProbe:
            httpGet:
              path: /-/healthy
              port: 9090
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 15
            timeoutSeconds: 10
            failureThreshold: 3
            successThreshold: 1
          resources:
            limits:
              memory: 106Gi
            requests:
              cpu: 23
              memory: 106Gi
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: storage-volume
              mountPath: /data
              subPath: ""
        - name: thanos-sidecar
          args:
          - sidecar
          - --log.level=debug
          - --tsdb.path=/data/
          - --prometheus.url=http://127.0.0.1:9090
          - --objstore.config-file=/etc/thanos-object-store-config/blobStore.yml
          - --reloader.config-file=/etc/config/prometheus.yml
          - --reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yml
          - --reloader.rule-dir=/etc/config/rules
          image: quay.io/thanos/thanos:v0.27.0
          ports:
          - containerPort: 10902
            name: sidecar-http
          - containerPort: 10901
            name: grpc
          - containerPort: 10900
            name: cluster
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
            requests:
              cpu: "2"
              memory: 4Gi
          volumeMounts:
          - mountPath: /data
            name: storage-volume
          - mountPath: /etc/config
            name: config-volume
            readOnly: false
          - mountPath: /etc/prometheus-shared/
            name: prometheus-config-shared
            readOnly: false
          - mountPath: /etc/thanos-object-store-config/
            name: thanos-object-store-config
            readOnly: false
      hostNetwork: false
      dnsPolicy: ClusterFirst
      nodeSelector:
        nodeClass: monitoring
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
      tolerations:
        - effect: NoSchedule
          key: nodeClass
          operator: Equal
          value: monitoring
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-server
        - emptyDir: {}
          name: prometheus-config-shared
        - configMap:
            name: thanos-object-store-config-map
          name: thanos-object-store-config
  volumeClaimTemplates:
    - metadata:
        name: storage-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "1Ti"

Object Store Config

type: AZURE
config:
  storage_account: 'someobjectstore'
  storage_account_key: 'somesupersecretkey'
  container: 'somecontainername'
  endpoint: 'blob.core.windows.net'

Thanos logs:

level=info ts=2022-08-25T20:47:15.050946476Z caller=grpc.go:131 service=gRPC/server component=sidecar msg="listening for serving gRPC" address=0.0.0.0:10901
level=info ts=2022-08-25T20:47:15.051145277Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-08-25T20:47:15.051170677Z caller=http.go:73 service=http/server component=sidecar msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2022-08-25T20:47:15.051218278Z caller=tls_config.go:195 service=http/server component=sidecar msg="TLS is disabled." http2=false
level=debug ts=2022-08-25T20:47:15.052403388Z caller=promclient.go:623 msg="build version" url=http://127.0.0.1:9090/api/v1/status/buildinfo
level=info ts=2022-08-25T20:47:15.052956392Z caller=sidecar.go:179 msg="successfully loaded prometheus version"
level=info ts=2022-08-25T20:47:15.05638812Z caller=reloader.go:375 component=reloader msg="Reload triggered" cfg_in=/etc/config/prometheus.yml cfg_out=/etc/prometheus-shared/prometheus.yml watched_dirs=/etc/config/rules
level=info ts=2022-08-25T20:47:15.056457621Z caller=reloader.go:236 component=reloader msg="started watching config file and directories for changes" cfg=/etc/config/prometheus.yml out=/etc/prometheus-shared/prometheus.yml dirs=/etc/config/rules
level=info ts=2022-08-25T20:47:15.056485521Z caller=sidecar.go:201 msg="successfully loaded prometheus external labels" external_labels="{prometheus_group=\"sandbox\", prometheus_replica=\"$(HOSTNAME)\"}"
level=warn ts=2022-08-25T20:47:17.051936943Z caller=shipper.go:239 msg="reading meta file failed, will override it" err="failed to read /data/thanos.shipper.json: open /data/thanos.shipper.json: no such file or directory"
level=debug ts=2022-08-25T20:47:17.053628057Z caller=azure.go:381 msg="check if blob exists" blob=01GBAR8SGF29B024GFAHH9VA5E/meta.json
level=info ts=2022-08-25T20:47:17.056943484Z caller=shipper.go:334 msg="upload new block" id=01GBAR8SGF29B024GFAHH9VA5E
level=debug ts=2022-08-25T20:47:17.061069018Z caller=azure.go:396 msg="Uploading blob" blob=01GBAR8SGF29B024GFAHH9VA5E/chunks/000001
panic: send on closed channel

goroutine 176 [running]:
github.com/Azure/azure-storage-blob-go/azblob.staticBuffer.Put({0xc000b2b560, 0xc0009c80d0, 0xc000b2b500}, {0xc000e80000, 0xc0005efea8, 0x0})
/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.13.0/azblob/highlevel.go:427 +0x3c
github.com/Azure/azure-storage-blob-go/azblob.(*copier).write(0xc0001fad00, {{0xc000e80000, 0x300000, 0x300000}, {0xc0007e8780, 0x58}})
/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.13.0/azblob/chunkwriting.go:166 +0x347
github.com/Azure/azure-storage-blob-go/azblob.(*copier).sendChunk.func1()
/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.13.0/azblob/chunkwriting.go:136 +0xb3
github.com/Azure/azure-storage-blob-go/azblob.NewStaticBuffer.func1()
/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.13.0/azblob/highlevel.go:406 +0x3b
created by github.com/Azure/azure-storage-blob-go/azblob.NewStaticBuffer
/go/pkg/mod/github.com/!azure/azure-storage-blob-go@v0.13.0/azblob/highlevel.go:404 +0xd3

Issue and Analysis:

The Thanos sidecar panics when trying to upload data to the Azure blob store. This is reproducible on 0.27.0, 0.26.0 and 0.25.2 as well. All of these releases seem to use github.com/Azure/azure-storage-blob-go v0.13.0, which seems to be the source of the issue.
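For context, `send on closed channel` is a Go runtime panic raised when a goroutine sends on a channel that has already been closed. The sketch below is not the azblob code, and it makes no claim about why the channel was closed in the trace above. `trySend` is a hypothetical helper that only reproduces the same class of failure in isolation.

```go
package main

import "fmt"

// trySend sends v on ch and returns the panic message if the send panics.
// Hypothetical helper for illustration only; not part of azblob or Thanos.
func trySend(ch chan int, v int) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			// The runtime panics with an error value; render it as text.
			msg = fmt.Sprint(r)
		}
	}()
	ch <- v
	return ""
}

func main() {
	ch := make(chan int, 1)
	close(ch)                   // some code path closes the channel first
	fmt.Println(trySend(ch, 1)) // a later sender then panics
}
```

Without the `recover`, the panic terminates the whole process, which matches the sidecar exiting right after the "Uploading blob" log line.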

I am not sure whether the issue is fixed in the recently released v0.28.0-rc.0. I do see that v0.28.0-rc.0 vendors github.com/thanos-io/objstore v0.0.0-20220715165016-ce338803bc1e, which in turn uses github.com/Azure/azure-storage-blob-go v0.14.0. In theory that should have fixed the issue, but in practice I see no error and also no data being uploaded.

The sidecar process seems to hang: the log statement from the following line is the last message I see in the logs.

I am trying to run this locally and will continue to debug the issue. Any help from the community is more than appreciated.

cc: @vglafirov

phillebaba commented 2 years ago

@hemuvemula Thanos v0.28 is not using the latest objstore version with the Azure refactor. I will try to get it into the next release, and then you can loop back if the issues still persist.

hemuvemula commented 2 years ago

Thank you

phillebaba commented 2 years ago

I have created a PR thanos-io/thanos/pull/5707 to update the objstore version in Thanos now. Hopefully this will solve your problems when 0.29 is released.

phillebaba commented 1 year ago

@hemuvemula could you try the latest RC release of Thanos and see if it solves your issues? If it does, you could close this issue.

hemuvemula commented 1 year ago

Thank you. Verified, it works like a charm now.

matej-g commented 1 year ago

Thanks for the great work @phillebaba, thanks for testing it out @hemuvemula! Closing this now.