thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

store: won't start, no logs indicating why #1455

Closed asmith60 closed 4 years ago

asmith60 commented 4 years ago

Thanos, Prometheus and Golang version used Thanos: 0.6.0 Prometheus: 2.10.0

What happened The Thanos store won't start. It tries to start up, but crashes within ~30 seconds. Inspecting the pod shows that the process exited with a non-zero code. Debug-level log output is below.

level=info ts=2019-08-23T18:55:30.952906789Z caller=main.go:154 msg="Tracing will be disabled"
level=info ts=2019-08-23T18:55:30.952955736Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-08-23T18:55:30.969576127Z caller=cache.go:172 msg="created index cache" maxItemSizeBytes=4294967296 maxSizeBytes=8589934592 maxItems=math.MaxInt64
level=debug ts=2019-08-23T18:55:30.969822743Z caller=store.go:144 msg="initializing bucket store"

What you expected to happen Thanos store to start successfully.

Anything else we need to know 6 HA pairs of Prometheus instances (12 instances total) are uploading metrics to the AWS S3 bucket. The current bucket size is ~750GB. The store pod manifest is below (I removed the obj-store config, AWS IAM config, etc.).

Store pod

```
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-store
  namespace: monitoring
  labels:
    app: thanos-store
spec:
  replicas: 3
  selector:
    matchLabels:
      app: thanos-store
  serviceName: thanos-store
  template:
    metadata:
      labels:
        app: thanos-store
    spec:
      containers:
        - name: thanos-store
          imagePullPolicy: Always
          image: "improbable/thanos:v0.6.0"
          args:
            - store
            - --data-dir=/data
            - --log.level=debug
            - --index-cache-size=8GB
            - --chunk-pool-size=20GB
          ports:
            - name: http
              containerPort: 10902
              protocol: TCP
            - name: grpc
              containerPort: 10901
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /metrics
              port: http
          readinessProbe:
            httpGet:
              path: /metrics
              port: http
          resources:
            limits:
              cpu: 2000m
              memory: 32000Mi
            requests:
              cpu: 2000m
              memory: 32000Mi
          volumeMounts:
            - mountPath: /data
              name: storage-volume
  volumeClaimTemplates:
    - metadata:
        name: storage-volume
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "128Gi"
```
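An aside on the manifest above (a hedged observation, not something established in this thread): with a liveness probe pointed at `/metrics` and default timing, Kubernetes may restart a store that is still syncing its bucket index, which would look exactly like "crashes with no error logs". A sketch of relaxed probe timing to rule that out; all values here are illustrative, not taken from the issue:

```yaml
# Hypothetical probe tuning for a slow-starting store gateway.
# initialDelaySeconds / failureThreshold / periodSeconds are illustrative.
livenessProbe:
  httpGet:
    path: /metrics
    port: http
  initialDelaySeconds: 60   # give bucket-store initialization some headroom
  periodSeconds: 30
  failureThreshold: 10      # tolerate several failed checks before a restart
```

If the pod still dies with these settings, the probe can be excluded as the cause and memory pressure (OOM) becomes the more likely suspect.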

anoop2503 commented 4 years ago

Hi, any update on this issue?

I am hitting the same issue: the Thanos store gateway gets stuck at "initializing bucket store" when the container starts. No other warning or error appears in the log. Any idea why this happens, or how to find the root cause?

The logs are given below:

level=info ts=2019-09-05T14:37:53.221491945Z caller=flags.go:75 msg="gossip is disabled"
level=info ts=2019-09-05T14:37:53.222294564Z caller=factory.go:39 msg="loading bucket configuration"
level=debug ts=2019-09-05T14:37:53.223374047Z caller=store.go:128 msg="initializing bucket store"

Thanks,

bwplotka commented 4 years ago

Sorry for the delay!

On startup, the store gateway pulls a portion of the objects into memory, so if you don't have a compactor (do you have one? Is it working?) startup will be quite a long and memory-intensive process.

Most likely the store is simply OOMing in your case. Give it more memory, time-shard the store gateway (see: https://github.com/thanos-io/thanos/pull/1077), or add a compactor if it's missing (!).
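The linked PR adds time-based partitioning to the store gateway. A sketch of what sharding by time could look like, assuming a Thanos release that includes those `--min-time`/`--max-time` flags; the 8-week cutoff and the split into two StatefulSets are illustrative choices, not a recommendation from this thread:

```yaml
# "Hot" store gateway: serves only recent blocks (cutoff is illustrative).
args:
  - store
  - --data-dir=/data
  - --min-time=-8w

# "Cold" store gateway (a separate StatefulSet): serves everything older.
args:
  - store
  - --data-dir=/data
  - --max-time=-8w
```

Splitting this way keeps each store's index footprint (and therefore its startup time and memory use) smaller than one store serving the whole bucket.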

Things we are planning to do:

asmith60 commented 4 years ago

@anoop2503 I just needed to give the store more time to start up (about 5 minutes in my case). It seems that the more memory I give the store, the less time it takes to start.

GiedriusS commented 4 years ago

Also, we could and should probably be more verbose here at the debug (or info) level, so that users know which blocks we are pulling, just as Prometheus, for example, prints which blocks it finds on disk.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.