oliver006 / redis_exporter

Prometheus Exporter for ValKey & Redis Metrics. Supports ValKey and Redis 2.x, 3.x, 4.x, 5.x, 6.x, and 7.x
https://github.com/oliver006/redis_exporter
MIT License

Make redis_exporter cluster-aware #185

Closed KushalP closed 6 years ago

KushalP commented 6 years ago

Problem statement

Many cloud providers offer a Redis cluster as a service. These managed clusters don't allow access to all of the underlying Redis instances in a simple way. Instead, they expose a single endpoint that cluster-aware clients need to connect to.

Desired outcome

Make redis_exporter cluster-aware such that it can connect to a single instance in the cluster and track metrics for all nodes in the cluster.

oliver006 commented 6 years ago

Thanks for raising this issue. I don't run Redis in a clustered setup or via e.g. Amazon ElastiCache so I'm not familiar with what metrics you're missing out on that could be extracted. What does redis-cli INFO ALL look like on such a clustered setup? Does it really expose all the stats from all the nodes?

KushalP commented 6 years ago

Working with redis in clustered mode

When redis is run in "clustered" mode you can connect to a single node and run CLUSTER NODES to see all nodes in the cluster. Below is an example of a 6-node cluster, with 3 shards (masters) and one replica (slave) for each shard (master):

> CLUSTER NODES
5ad08ee539856eab6d1e42a31f6df4a0df6e112b 10.225.51.158:6379@1122 master - 0 1534326728200 3 connected 0-5461
28abc4fac6118e44341ced5adce40291a267dc70 10.225.51.113:6379@1122 slave 6b10b51298ae576bfa0d05d4f287c2d713758fac 0 1534326727193 2 connected
6b10b51298ae576bfa0d05d4f287c2d713758fac 10.225.51.48:6379@1122 master - 0 1534326730211 2 connected 10923-16383
8005f38d45750fc851ad48c4e9f4e674674b74be 10.225.51.75:6379@1122 master - 0 1534326729205 0 connected 5462-10922
6d0d97db233dd5b2546fc490461f3838e59a5b72 10.225.51.57:6379@1122 slave 8005f38d45750fc851ad48c4e9f4e674674b74be 0 1534326727000 0 connected
2e28959d10cdce16c7e8343e9102a9753dbb9068 10.225.51.103:6379@1122 myself,slave 5ad08ee539856eab6d1e42a31f6df4a0df6e112b 0 1534326728000 1 connected

You could then use this information to connect to each node and fetch any metrics you needed.
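
As a rough illustration of that step, here is a minimal Python sketch that parses the text reply of CLUSTER NODES into host/port pairs. The `ClusterNode` type and the `parse_cluster_nodes` function are hypothetical names made up for this example, not part of redis_exporter, and the sample input is taken from the output above.

```python
from typing import List, NamedTuple, Optional


class ClusterNode(NamedTuple):
    node_id: str
    host: str
    port: int
    flags: List[str]
    master_id: Optional[str]  # None when the raw field is "-" (i.e. a master)


def parse_cluster_nodes(raw: str) -> List[ClusterNode]:
    """Parse the text reply of CLUSTER NODES into a list of ClusterNode."""
    nodes = []
    for line in raw.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        node_id, address, flags, master_id = fields[:4]
        # The address looks like "ip:port@cport"; drop the cluster bus port.
        host_port = address.split("@", 1)[0]
        host, _, port = host_port.rpartition(":")
        nodes.append(ClusterNode(
            node_id=node_id,
            host=host,
            port=int(port),
            flags=flags.split(","),
            master_id=None if master_id == "-" else master_id,
        ))
    return nodes


SAMPLE = """\
5ad08ee539856eab6d1e42a31f6df4a0df6e112b 10.225.51.158:6379@1122 master - 0 1534326728200 3 connected 0-5461
2e28959d10cdce16c7e8343e9102a9753dbb9068 10.225.51.103:6379@1122 myself,slave 5ad08ee539856eab6d1e42a31f6df4a0df6e112b 0 1534326728000 1 connected
"""

for node in parse_cluster_nodes(SAMPLE):
    role = "master" if "master" in node.flags else "replica"
    print(f"{node.host}:{node.port} {role}")
```

Each resulting address could then be queried with INFO individually, assuming the nodes are reachable from wherever the script runs.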

Output of INFO ALL

The output of INFO ALL on a single node looks like the following:

> INFO ALL
# Server
redis_version:4.0.10
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:cluster
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:0.0.0
process_id:1
run_id:8cbcf867668f2cab57decbc48dd874cd6464ab7b
tcp_port:6379
uptime_in_seconds:1092
uptime_in_days:0
hz:10
lru_clock:7599966
executable:-
config_file:-

# Clients
connected_clients:5
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0

# Memory
used_memory:5721024
used_memory_human:5.46M
used_memory_rss:7548928
used_memory_rss_human:7.20M
used_memory_peak:5842976
used_memory_peak_human:5.57M
used_memory_peak_perc:97.91%
used_memory_overhead:5618390
used_memory_startup:4452192
used_memory_dataset:102634
used_memory_dataset_perc:8.09%
used_memory_lua:37888
used_memory_lua_human:37.00K
maxmemory:436469760
maxmemory_human:416.25M
maxmemory_policy:volatile-lru
mem_fragmentation_ratio:1.32
mem_allocator:jemalloc-4.0.3
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:0
rdb_bgsave_in_progress:0
rdb_last_save_time:1534325530
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:8
total_commands_processed:2853
instantaneous_ops_per_sec:1
total_net_input_bytes:117735
total_net_output_bytes:4385101
instantaneous_input_kbps:0.04
instantaneous_output_kbps:0.05
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:0
keyspace_misses:0
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:slave
master_host:10.225.51.158
master_port:6379
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:54135
repl_sync_enabled:1
slave_read_reploff:54135
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:1a7790e576771d8e386ae5df05d55d0cc1eb14cb
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:54135
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1739
repl_backlog_histlen:52397

# CPU
used_cpu_sys:0.33
used_cpu_user:0.75
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Commandstats
cmdstat_replconf:calls=963,usec=1419,usec_per_call=1.47
cmdstat_clusteradmin:calls=802,usec=135273,usec_per_call=168.67
cmdstat_info:calls=671,usec=36128,usec_per_call=53.84
cmdstat_config:calls=5,usec=68,usec_per_call=13.60
cmdstat_ping:calls=411,usec=247,usec_per_call=0.60
cmdstat_command:calls=1,usec=316,usec_per_call=316.00

# SSL
ssl_enabled:no
ssl_connections_to_previous_certificate:0
ssl_connections_to_current_certificate:0
ssl_current_certificate_not_before_date:(null)
ssl_current_certificate_not_after_date:(null)
ssl_current_certificate_serial:0

# Cluster
cluster_enabled:1

# Keyspace

The tell that you're in a redis cluster is that the above includes the following:

# Cluster
cluster_enabled:1
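
A small Python sketch of that check, assuming the INFO reply is available as plain text. `parse_info` and `is_cluster_node` are hypothetical helper names used only for illustration.

```python
from typing import Dict


def parse_info(raw: str) -> Dict[str, str]:
    """Turn the text reply of INFO into a flat {field: value} dict."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# Section" headers
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def is_cluster_node(info: Dict[str, str]) -> bool:
    """True when the node reports cluster_enabled:1 in its INFO output."""
    return info.get("cluster_enabled") == "1"


SAMPLE_INFO = """\
# Server
redis_version:4.0.10
redis_mode:cluster

# Cluster
cluster_enabled:1
"""

print(is_cluster_node(parse_info(SAMPLE_INFO)))
```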

oliver006 commented 6 years ago

Thanks, that's helpful.

Just to clarify something (again, not really familiar with Amazon ElastiCache): in CLUSTER NODES I see IP addresses like 10.225.51.103:6379 - are these valid IP addresses that I can connect to as long as my exporter runs on a VM within the same VPC?

And a more general thought: the Prometheus way of doing things is to keep the service discovery aspect outside of the exporter (see #174 for related discussion), so I'm not really sure whether we should start pulling this into the exporter or whether this should be the job of your orchestration/service discovery system.

KushalP commented 6 years ago

are these valid IP addresses that I can connect to as long as my exporter runs on a VM within the same VPC?

Yes.

the Prometheus way of doing things is to keep the service discovery aspect outside of the exporter

I would prefer to do this to find the targets, but I've not been able to figure out a sane way to query for the individual nodes. I'm going to look into it now.

KushalP commented 6 years ago

It seems it's possible to get the endpoints for all ElastiCache nodes by using the following command:

aws elasticache describe-cache-clusters --show-cache-node-info

I'm not entirely sure how to get Prometheus to run the same query, as it uses the EC2 service discovery tooling.

oliver006 commented 6 years ago

I think the way to do this would be to use file_sd_config: have a little script periodically pull the new info via the aws elasticache command, then update the file and trigger a config reload of the Prometheus server.
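
A minimal sketch of such a script in Python, under a few assumptions: the input is the JSON printed by `aws elasticache describe-cache-clusters --show-cache-node-info`, already parsed into a dict; the function names and the `cache_cluster_id` label are made up for illustration; and the targets written are the Redis node addresses themselves, so mapping them to an exporter is left to the deployment. The Prometheus documentation describes file_sd files as being watched for changes, so an explicit reload may turn out to be unnecessary.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def to_file_sd(describe_output: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert describe-cache-clusters output into file_sd target groups."""
    groups = []
    for cluster in describe_output.get("CacheClusters", []):
        targets = []
        for node in cluster.get("CacheNodes", []):
            endpoint = node.get("Endpoint")
            if endpoint:  # nodes that are still being created may lack one
                targets.append(f"{endpoint['Address']}:{endpoint['Port']}")
        if targets:
            groups.append({
                "targets": targets,
                # Hypothetical label name, chosen for this example only.
                "labels": {"cache_cluster_id": cluster.get("CacheClusterId", "")},
            })
    return groups


def write_atomically(path: str, groups: List[Dict[str, Any]]) -> None:
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
    os.replace(tmp_path, path)


# Trimmed-down example of the CLI's JSON output (addresses are placeholders).
SAMPLE_OUTPUT = {
    "CacheClusters": [
        {
            "CacheClusterId": "example-0001-001",
            "CacheNodes": [
                {"CacheNodeId": "0001",
                 "Endpoint": {"Address": "10.225.51.158", "Port": 6379}},
            ],
        },
    ],
}

print(json.dumps(to_file_sd(SAMPLE_OUTPUT), indent=2))
```

In a real setup the script would load the CLI output with `json.load` instead of using `SAMPLE_OUTPUT`, call `write_atomically` with the path listed under `file_sd_configs` in the Prometheus configuration, and run on a timer such as cron.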

KushalP commented 6 years ago

Closing this issue for now as the file_sd_config might be enough to go on.

SuperQ commented 6 years ago

One possible change is to allow a target param to the exporter's /metrics endpoint. This would create a "proxy" exporter, similar to the snmp_exporter and blackbox_exporter. This would allow Prometheus to continue to drive service discovery, while supporting several targets.
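
To make the idea concrete, here is a small Python sketch of how such a handler might read the target from the request. It only illustrates the pattern and is not how redis_exporter (written in Go) implements anything; `target_from_request` is a hypothetical helper, and the simple `host:port` check is an assumption for the example.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def target_from_request(path: str) -> Optional[str]:
    """Return the address passed as ?target=host:port, or None if absent or invalid."""
    query = parse_qs(urlsplit(path).query)
    values = query.get("target", [])
    if len(values) != 1:
        return None  # missing or repeated parameter
    target = values[0]
    host, sep, port = target.rpartition(":")
    if not sep or not host or not port.isdigit():
        return None  # expect a plain host:port pair
    return target


print(target_from_request("/metrics?target=10.225.51.158:6379"))
print(target_from_request("/metrics"))
```

Prometheus would then scrape the same exporter once per discovered node, passing each node's address as the `target` parameter, in the same way it is usually done with blackbox_exporter.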

Another option is that we could pressure Amazon to include Prometheus metrics in their cloud service. :grin:

oliver006 commented 6 years ago

The target param is worth looking into, but it would need a bit of refactoring to reduce/remove global state (see #126). If you could get Amazon to straight up export Prometheus metrics, that'd be my preferred solution ;-)

Yagyansh commented 3 years ago

Hi @oliver006. What is the conclusion for this? Have we found a way to get metrics from all the nodes of a Redis Cluster without using a script?

@KushalP How is the script thing working out for you and do you see any overheads?

Thanks. Looking for a solution for this, have the exact same use-case.