paalkr opened this issue 1 year ago
Interesting. Thanks for sharing this.
What happens during rollouts of the ingester statefulSet (for example to deploy a newer version of the helm chart)? Each of the three zones will be restarted and new pods may or may not fall on the same nodes as before, so new pods may or may not inherit the data from the old pods. Ingesters keep the last 2 hours of data on disk, so if new pods from more than 2 zones are scheduled on other nodes, then you may end up losing some of the data.
Local ephemeral volumes will be wiped on every pod reschedule, even if a pod happens to land on the same node, so you need to handle that scenario. There are a few Mimir settings that help.
A snippet of the Mimir settings we use:
```yaml
mimir:
  structuredConfig:
    querier:
      query_store_after: 15m
    blocks_storage:
      tsdb:
        retention_period: 6h
        ship_interval: 1m
        memory_snapshot_on_shutdown: true
        flush_blocks_on_shutdown: true
      bucket_store:
        # Blocks with minimum time within this duration are ignored, and
        # not loaded by store-gateway. Useful when used together with
        # -querier.query-store-after to prevent loading young blocks, because there
        # are usually many of them (depending on number of ingesters) and they are not
        # yet compacted
        ignore_blocks_within: 0s
        sync_interval: 5m
        metadata_cache:
          bucket_index_content_ttl: 1m
          tenants_list_ttl: 1m
          tenant_blocks_list_ttl: 1m
          metafile_doesnt_exist_ttl: 1m
```
Also, during a rolling update of the underlying nodes that host the ingester and store-gateway pods, you might want to replace nodes zone by zone.
I'm fully aware that running with local ephemeral volumes isn't the mainstream setup, but it might be very cost effective in certain cases.
We have experimented with this in a staging environment by manually manipulating the resources, and it seems to be working very well. We are able to replace nodes and pods without losing metrics.
Is your feature request related to a problem? Please describe.
Currently, if you disable persistentVolume for any component, like the ingesters, compactor, ruler or the store-gateway, an emptyDir volume is automatically mounted at /data. https://github.com/grafana/mimir/blob/d8bb72b9ee4e65e3225deec2b26c7f5f649f45f8/operations/helm/charts/mimir-distributed/templates/ingester/ingester-statefulset.yaml#L92-L95
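For reference, with persistentVolume disabled the linked template renders roughly the following volume (the name `storage` is taken from the chart's convention; shown only to contrast with the proposal below):

```yaml
# Current behaviour: node-local scratch space tied to the pod, no PVC involved.
volumes:
  - name: storage
    emptyDir: {}
```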
Describe the solution you'd like
Some cloud providers support node local ephemeral volumes, which are cheap and very high-performance disks. A CSI driver can be deployed to enable dynamic provisioning of PVCs on top of node local ephemeral volumes. Using node local storage doesn't make sense with regular PersistentVolumeClaims, because the data cannot be migrated to another node.
Kubernetes supports pod inline ephemeral volumes (PVCs) that only live as long as the pod lives; generic ephemeral inline volumes would in many cases be a better option than the emptyDir volume type.
Every component that supports persistentVolume should also support this new inline PVC option.
The rendered StatefulSet for the ingester might look like the snippet below.
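Roughly along these lines; the volume name `storage` matches the existing emptyDir volume, while the storage class `local-ephemeral` is a placeholder for whatever the node local CSI driver exposes, and the 50Gi request is arbitrary:

```yaml
# Pod spec of the ingester StatefulSet (only the relevant volume shown).
# The PVC is created and deleted together with the pod, so the data has
# the same lifetime as with emptyDir, but lives on the CSI-provisioned disk.
volumes:
  - name: storage
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-ephemeral   # placeholder CSI storage class
          resources:
            requests:
              storage: 50Gi
```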
The implementation in values.yaml might look like the sketch below.
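As a sketch only; the `ephemeralVolume` section and its field names are a suggestion for how the chart could expose this, mirroring the existing persistentVolume options, and none of these keys exist in the chart today:

```yaml
ingester:
  persistentVolume:
    enabled: false
  # Suggested new section; names are illustrative.
  ephemeralVolume:
    enabled: true
    storageClassName: local-ephemeral   # placeholder CSI storage class
    accessModes:
      - ReadWriteOnce
    size: 50Gi
```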
Additional context
Using node local ephemeral disks is cheap and highly performant. In combination with data replication, components like the ingesters would run perfectly fine on top of such disks; losing one node will not introduce data loss, thanks to replication.
We have been running a similar setup for a while by customizing the Helm-rendered manifests, and it seems to be working very well.