[Kong 2.0.4][Cassandra cluster] shm at full capacity

rishabh-gupta2 commented 3 years ago

@thibaultcha I'm using Kong 2.0.4 with Cassandra as the database. Referring to kong/db/strategies/cassandra/connector.lua and lib/resty/cassandra/cluster.lua:

lua-cassandra uses a shm for various operations (peers, prepare_and_execute, etc.). A few of these methods internally use shm:safe_set, which does not evict items and fails with a "no memory" error if no memory is available [Ref]

I'm observing a consistent increase in the memory occupied by this shm. It reaches close to 100 percent within 45-60 days. (For now, I rotate instances when it gets close to 100 percent, as a temporary solution.)

A few questions regarding this behaviour:

Is this consistent increase expected? Is this value never expected to go down?
What are the possible reasons for this increase?
If this shm is at full capacity, will it increase the request error rate or latencies?
What factors should be considered when tuning the capacity of this shm?

thibaultcha commented 3 years ago

Hi, the shm is also accessed via :set() and :get(), both of which call ngx_http_lua_shdict_expire to free up space based on the LRU queue. Are you using Kong in a K8S environment or other, in which C* containers are frequently being rotated? Is cassandra_refresh_frequency still set to 60s and/or have you tried smaller and larger intervals to see if the shm fills up faster or slower? How are you monitoring the memory allocated by the shm? What errors are you experiencing when the shm is full?
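
To make the difference between the evicting and non-evicting write paths concrete, here is a minimal toy model in Python. It is only a sketch: `ToyShm` is a hypothetical class, capacity is counted in items rather than bytes, and expiry times are ignored, so it is not how nginx implements shared dicts. It only mirrors the behaviour described above: `set` makes room by dropping the least recently used entry, while `safe_set` refuses to evict and reports "no memory".

```python
from collections import OrderedDict


class ToyShm:
    """Toy stand-in for a shared dict; capacity is counted in items, not bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # a read marks the entry as recently used
        return self.items[key]

    def set(self, key, value):
        """Always stores the value; returns True if a valid entry was evicted."""
        forcible = False
        if key in self.items:
            del self.items[key]
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)  # drop the least recently used entry
            forcible = True
        self.items[key] = value
        return forcible

    def safe_set(self, key, value):
        """Never evicts; returns an (ok, err) pair."""
        if key not in self.items and len(self.items) >= self.capacity:
            return False, "no memory"
        self.items.pop(key, None)
        self.items[key] = value
        return True, None


shm = ToyShm(capacity=2)
shm.set("peer:10.0.0.1", "up")
shm.set("peer:10.0.0.2", "up")
print(shm.safe_set("peer:10.0.0.3", "up"))  # (False, 'no memory')
print(shm.set("peer:10.0.0.3", "up"))       # True: the oldest entry was evicted
print(shm.get("peer:10.0.0.1"))             # None
```

In this model a full dict keeps accepting `set` calls indefinitely; only callers of `safe_set` ever see an error, which is why it matters which of the two a given code path uses.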

rishabh-gupta2 commented 3 years ago

Are you using Kong in a K8S environment or other, in which C* containers are frequently being rotated?

Yes, I'm using Kong in a K8s environment, but the containers are not rotated frequently. Containers are usually rotated only manually, when I observe low available memory for the kong_cassandra shm. This manual rotation is only a temporary solution, as I'm not sure of the impact when the shm is at full capacity.

Is cassandra_refresh_frequency still set to 60s and/or have you tried smaller and larger intervals to see if the shm fills up faster or slower?

I've had cassandra_refresh_frequency set to 300s from the start. I haven't yet experimented with it to check its effect on shm memory.

How are you monitoring the memory allocated by the shm?

Via Prometheus metrics [Ref], using this Grafana dashboard.

What errors are you experiencing when the shm is full?

That's what I want to understand: whether there is any negative impact when this shm is full. It has never actually reached full capacity. When it is around 95% full, I rotate the pods to reset it. I haven't observed any errors at 95% occupancy so far.

thibaultcha commented 3 years ago

The shm memory metrics reported by the Grafana dashboard are collected via the PDK's kong.node.get_memory_stats, which itself uses ngx.shared.DICT.free_space. As noted there, full shms are still subject to LRU eviction (in all non-safe read/write shm APIs, as noted above). There should be no reason to rotate Kong instances on the basis of this shm being full, at least none expected.
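
One reason a free-space reading can look alarming without being a problem is that free_space counts only whole free pages. The following Python sketch illustrates that, under loud simplifying assumptions: `ToySlabZone` is a hypothetical model with a single size class and an assumed 4096-byte page, whereas the real nginx slab allocator has several slot sizes and extra bookkeeping.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class ToySlabZone:
    """Toy model: memory is handed out page by page, then carved into items."""

    def __init__(self, pages):
        self.free_pages = pages  # pages not yet assigned to any items
        self.room = 0            # bytes still unused in the current page

    def free_space(self):
        # Only wholly free pages are counted, as a page-based metric would.
        return self.free_pages * PAGE_SIZE

    def alloc(self, size):
        """Reserve `size` bytes; returns False when nothing fits any more."""
        if self.room < size:
            if self.free_pages == 0:
                return False
            self.free_pages -= 1   # assign a fresh page
            self.room = PAGE_SIZE
        self.room -= size
        return True


zone = ToySlabZone(pages=1)
zone.alloc(100)
print(zone.free_space())  # 0: the only page is already assigned
print(zone.alloc(100))    # True: the assigned page still has room
```

In this model the metric hits zero after the very first write, yet further writes keep succeeding until the assigned page itself fills up, so a dashboard derived from such a number can show "close to 100 percent" well before any write would need to evict or fail.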

thibaultcha / lua-cassandra

[Kong 2.0.4][Cassandra cluster] shm at full capacity #143