grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Helm: Support Generic ephemeral inline volumes #3453

Open paalkr opened 1 year ago

paalkr commented 1 year ago

Is your feature request related to a problem? Please describe.

Currently, if you disable persistentVolume for any component, like the ingester, compactor, ruler, or store-gateway, an emptyDir volume will automatically be mounted at /data. https://github.com/grafana/mimir/blob/d8bb72b9ee4e65e3225deec2b26c7f5f649f45f8/operations/helm/charts/mimir-distributed/templates/ingester/ingester-statefulset.yaml#L92-L95
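
For reference, the fallback the linked template renders today when ingester.persistentVolume.enabled is false is roughly this (paraphrased, not the exact chart output):

      volumes:
        # ...other volumes...
        - name: storage
          emptyDir: {}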

Describe the solution you'd like

Some cloud providers support node-local ephemeral volumes; these are cheap and very high-performance disks. A CSI driver can be deployed to enable dynamic provisioning of PVCs on top of node-local ephemeral volumes. Using node-local storage doesn't make sense with regular PersistentVolumeClaims, because the data cannot be migrated to another node.

Kubernetes supports defining inline ephemeral volumes (PVCs) that only live as long as the pod does; these Generic ephemeral inline volumes would in many cases be a better option than an emptyDir volume.

Every component that supports persistentVolume should also support this new inline PVC option.

The rendered StatefulSet might look like this ingester snippet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mimir-ingester
spec:
  template:
    spec:
      containers:
      - name: ingester
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      volumes:
      - configMap:
          items:
          - key: mimir.yaml
            path: mimir.yaml
          name: mimir-config
        name: config
      - configMap:
          name: mimir-runtime
        name: runtime-config
      - name: storage
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 30Gi
              storageClassName: ephemeral-storage-class

The implementation in values.yaml might look like this:

ingester:
  persistentVolume:
    enabled: true
    inline: true
    accessModes:
      - ReadWriteOnce
    size: 30Gi
    storageClass: ephemeral-storage-class
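
One possible way for the chart to implement this (a hypothetical template fragment, not the chart's current code) is to branch on the proposed inline flag in the ingester StatefulSet template:

{{- /* Hypothetical fragment for ingester-statefulset.yaml; the nindent values
       would need to match the indentation of the surrounding template. */}}
{{- if and .Values.ingester.persistentVolume.enabled .Values.ingester.persistentVolume.inline }}
      - name: storage
        ephemeral:
          volumeClaimTemplate:
            spec:
              accessModes:
                {{- toYaml .Values.ingester.persistentVolume.accessModes | nindent 16 }}
              resources:
                requests:
                  storage: {{ .Values.ingester.persistentVolume.size | quote }}
              {{- with .Values.ingester.persistentVolume.storageClass }}
              storageClassName: {{ . }}
              {{- end }}
{{- end }}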

Additional context

Using node-local ephemeral disks is cheap and high performance. In combination with data replication, components like the ingesters run perfectly fine on top of such disks; losing one node will not introduce data loss because the data is replicated.
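
For context, the replication this relies on is the ingester ring replication factor, which defaults to 3 in Mimir. If you want it set explicitly through the chart's structuredConfig, it would look roughly like this (illustrative snippet):

mimir:
  structuredConfig:
    ingester:
      ring:
        replication_factor: 3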

We have been running a similar setup for a while by customizing the Helm-rendered manifests, and it seems to be working very well.

dimitarvdimitrov commented 1 year ago

Interesting, thanks for sharing this.

What happens during rollouts of the ingester StatefulSet (for example, to deploy a newer version of the Helm chart)? Each of the three zones will be restarted, and new pods may or may not land on the same nodes as before, so they may or may not inherit the data from the old pods. Ingesters keep the last 2 hours of data on disk, so if new pods in more than 2 zones are scheduled onto other nodes, you may end up losing some of that data.
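
For reference, the per-zone loss tolerance I'm referring to comes from zone-aware replication on the ingester ring, which in Mimir's config looks roughly like this (illustrative values; each zone's pods get their own instance_availability_zone):

ingester:
  ring:
    zone_awareness_enabled: true
    instance_availability_zone: zone-a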

paalkr commented 1 year ago

Local ephemeral volumes will be wiped on every pod reschedule, even if a pod by accident ends up on the same node, so you need to handle that scenario. There are a few Mimir settings that help.

Snippet:

mimir:
  structuredConfig:
    querier:
      query_store_after: 15m
    blocks_storage:
      tsdb:
        retention_period: 6h
        ship_interval: 1m
        memory_snapshot_on_shutdown: true
        flush_blocks_on_shutdown: true
      bucket_store:
        # Blocks with minimum time within this duration are ignored, and
        # not loaded by store-gateway. Useful when used together with
        # -querier.query-store-after to prevent loading young blocks, because there
        # are usually many of them (depending on number of ingesters) and they are not
        # yet compacted
        ignore_blocks_within: 0s
        sync_interval: 5m
        metadata_cache:
          bucket_index_content_ttl: 1m
          tenants_list_ttl: 1m
          tenant_blocks_list_ttl: 1m
          metafile_doesnt_exist_ttl: 1m

Also, during a rolling update of the underlying nodes that host the ingester and store-gateway pods, you might want to replace nodes zone by zone.
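
When draining nodes, a PodDisruptionBudget on the ingesters also helps the drain proceed gradually and respect pod readiness. A minimal sketch (the name and label here are hypothetical; match the labels your chart release actually applies):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mimir-ingester-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      # assumed label; use the labels on your ingester pods
      app.kubernetes.io/component: ingester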

I'm fully aware that running with local ephemeral volumes isn't the mainstream setup, but it might be very cost effective in certain cases.

We have experimented with this in a staging environment by manually manipulating the resources, and it seems to be working very well. We are able to replace nodes and pods without losing metrics.