thibaultcha / lua-cassandra

Pure Lua driver for Apache Cassandra
https://thibaultcha.github.io/lua-cassandra

Nginx responding 500 due to failed to acquire refresh lock: timeout #112

Closed leandromoreira closed 6 years ago

leandromoreira commented 6 years ago

Hi @thibaultcha ,

We started using your (awesome) driver in production, but we are getting too many 500s. We first noticed this error:

[error] 24174#0: *441 lua entry thread aborted: runtime error: /opt/y/luajit/lib/x.lua:175: no host details for 10.132.96.67

And then this one:

[error] 24412#0: *596 lua entry thread aborted: runtime error: /opt/y/luajit/lib/x.lua:167: could not refresh cluster: failed to acquire refresh lock: timeout

Both errors happen when we call execute. They occur at a very steady frequency, but not for all clients.

We created the shared dict in the http scope of the nginx conf with 1m; the connection is created using 8 contact points with a 1s timeout_connect.

Library versions:

The way we're using the driver:

cassandra_connection.get = function()
  local session, err = Cluster.new {
    shm = 'cassandra', -- defined by the lua_shared_dict directive
    contact_points = contact_points,
    default_port = port,
    keyspace = 'keyspace',
    timeout_connect = timeout,
  }

  if err then
    ngx.log(ngx.ERR, 'could not connect to cassandra: ', err)
    return ngx.exit(500)
  end

  return session
end

-- and in code, we just call

  local session = cassandra_connection.get()
  local s = assert(session:execute("SELECT x, t, y FROM k WHERE name = ?", {x}))

-- we don't release any connection like we used to do 
-- and we always call this code per request

Do you have any idea what could be causing this?

leandromoreira commented 6 years ago

By digging into the source code, it looks like it may be related to resty.lock. We don't override any lock options at all, and the error seems to come not from the :execute method but from :refresh and :get_peers instead.

thibaultcha commented 6 years ago

Hi,

Thanks for using the driver! I would like to start by saying that you might be using it wrong: it seems to me like you are creating a new cluster instance for each call to get(). Instead, you should cache a cluster instance in some upvalue or attribute of another instance, and reuse it to avoid constant calls to refresh() which are very expensive. This current snippet is probably drastically impacting your performance (fwiw, I can achieve about ~10k q/s with this driver on my 2013 MacBook Pro with NGINX and 3 Cassandra nodes running locally, just to give you an idea).

The second half of this example shows you one possible way of using the Cluster module and caching the instance in an upvalue for subsequent reuse. Like I said though, you can achieve the same result by caching the cluster instance in some other place (like an attribute of another instance, etc.).
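A minimal sketch of that caching pattern, assuming a Lua module file loaded once per worker (the shm name, keyspace, and timeout values are taken from the snippet earlier in the thread; `get_cluster` is a hypothetical helper name):

```lua
-- Sketch only: cache the Cluster instance in a module-level upvalue so
-- Cluster.new() runs once per worker instead of once per request.
local Cluster = require "resty.cassandra.cluster"

local cluster -- upvalue holding the cached instance

local function get_cluster()
  if cluster then
    return cluster -- reuse: no repeated, expensive refresh() calls
  end

  local err
  cluster, err = Cluster.new {
    shm = "cassandra",               -- lua_shared_dict name, as above
    contact_points = contact_points, -- same upvalue as in the snippet
    default_port = port,
    keyspace = "keyspace",
    timeout_connect = timeout,
  }
  if not cluster then
    return nil, err
  end

  return cluster
end

return { get = get_cluster }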

leandromoreira commented 6 years ago

Hi @thibaultcha, thank you, your suggestion helped us 🥇 🥇 🥇. Just to be sure: we don't need to call cassandra_cluster_session.settimeout(1000) anywhere, since that is handled by the Cluster driver, right?

By the way, I thought it was a good idea to open this issue and describe it in detail, so that any other user facing the same problem might find this solution. It can be closed now, thanks a lot!

thibaultcha commented 6 years ago

Yes, timeout settings are handled by the Cluster module (connect/send/read).

Alright, closing this then!

thibaultcha commented 6 years ago

Also: you can further optimize performance by using prepared statements, and by reusing the arguments table (the {x} table in your snippet instantiates short-lived Lua tables, which under pressure / in a hot code path can be expensive).

leandromoreira commented 6 years ago

@thibaultcha one thing I realized is that there is no prepare method on the _Cluster table, is that right? I mean, do I need to create two Cassandra sessions, one for the cluster and one for prepare?

I noticed that the code itself is prepared to receive the prepared statement through execute.

thibaultcha commented 6 years ago

@leandromoreira Correct, the Cluster module automatically prepares queries without you having to worry about anything (queries need to be prepared on each coordinator they are being run against, and need to be cached by the driver, etc... All this is automatic when using the Cluster module). You simply have to give the { prepared = true } option to cluster:execute().

See query_options and cluster:execute(). The latter documents this argument as:

options table (optional) Options from query_options.
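A minimal sketch of passing that option, reusing the query from the snippet earlier in the thread (`cluster` is assumed to be a cached Cluster instance, and `name` a previously bound value):

```lua
-- Sketch: the { prepared = true } option asks the Cluster module to
-- prepare the query on each coordinator and cache the prepared id.
local rows, err = cluster:execute(
  "SELECT x, t, y FROM k WHERE name = ?",
  { name },
  { prepared = true }
)
if not rows then
  ngx.log(ngx.ERR, "query failed: ", err)
end
```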

thibaultcha commented 6 years ago

There is no example in the documentation related to prepared queries from the Cluster module, but you can see it in action in this test. The test suite for this module is very complete and contains lots of detailed use-case examples (not necessarily production ready, as previously mentioned with regards to the caching of the cluster instance in upvalues).

leandromoreira commented 6 years ago

Thank you so much :) I was under the same impression when I read the code and saw some statements being cached in private functions through :execute! This is awesome! 👏 👏 👏 👏 👏 👏

leandromoreira commented 6 years ago

Hi @thibaultcha, we're using your driver to power the World Cup lol, but at the same time we're facing a strange behavior. I'll describe it because maybe you have faced something similar or can offer debugging tips.

We have 3 nginx-with-lua (using lua-cassandra) servers serving the same content. We're seeing some inconsistent 404s, and when we debugged further we noticed that not all the servers respond 404 to the same query, and if we keep asking for the content, the same server will then reply 200.

At first we thought it was Cassandra's (temporarily) inconsistent nature, but we had never faced such behavior before. Anyway, we did add {consistency = cassandra.consistencies.quorum} to the execute call, but the problem persisted.

  -- this is the query
  local chunks = assert(session:execute(
    "SELECT ms FROM pl WHERE sn = ? AND ct = ?",
    {sn, ct}
  ))
  -- we tried to remove the assert to see if there was any error, but no error was raised =(
  -- we tried to add {consistency = cassandra.consistencies.quorum} but it also did not work =(

Do you have any idea about what could raise such problem? or any debug tip we could try?

By the way, I cached the connection per worker (Lua global var), not in init_by_lua_block. We also noticed that one of the machines showed the following in dmesg: nginx: page allocation failure: order:2, mode:0x104020. Do you think it's possible for a memory problem to contribute to this issue?

leandromoreira commented 6 years ago

By the way, is it possible to set the consistency while using the cluster mode? I read the source code and could not find this option being used in Cluster:execute; the coordinator_opts has just two options.

thibaultcha commented 6 years ago

@leandromoreira Hi there!

Happy to hear lua-cassandra is useful to you :) Let's start with the easy one:

By the way, is it possible to set consistency while using the cluster mode?

Yes, certainly: the 3rd argument, options, is the same as the single-node module's options, as documented in https://thibaultcha.github.io/lua-cassandra/modules/resty.cassandra.cluster.html#_Cluster:execute

options table (optional) Options from query_options.
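A hedged sketch using the query from your last snippet (assuming `cluster` is the cached instance and `sn`/`ct` are bound values):

```lua
local cassandra = require "cassandra"

-- Sketch: consistency is passed through the same options table as
-- prepared, i.e. the 3rd argument to cluster:execute().
local chunks, err = cluster:execute(
  "SELECT ms FROM pl WHERE sn = ? AND ct = ?",
  { sn, ct },
  {
    prepared = true,
    consistency = cassandra.consistencies.quorum,
  }
)
```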

For the other questions you have, I am afraid I have too little information to be able to help you. It is unclear whether this driver is at fault. For what it's worth, numerous users (in Kong) rely on this driver and successfully tweak the consistency option to their needs. Did you specify it correctly (as the proper argument in the call to execute(), and without any typo)? Are those bound values (sn and ct) consistent across each of those nodes? Is each server connecting to a different Cassandra peer, or the same one? Is the cluster itself healthy and responsive (you probably want to use nodetool to answer that question)?

Also, it is unclear whether the assert succeeds or not. Are you seeing it failing, without an error being returned (thus for unknown reasons), or is chunks simply an empty table?
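One way to tell those two cases apart (sketch, using the same identifiers as your snippet):

```lua
-- Sketch: execute() returns nil + err on failure, whereas an empty
-- result set is a truthy table with zero rows, so assert() would NOT
-- fire on it.
local chunks, err = session:execute(
  "SELECT ms FROM pl WHERE sn = ? AND ct = ?",
  { sn, ct }
)
if not chunks then
  ngx.log(ngx.ERR, "query error: ", err)  -- assert() would catch this
elseif #chunks == 0 then
  -- this is the "no error, empty table" case, which would surface as 404s
  ngx.log(ngx.WARN, "no rows for sn=", sn, " ct=", ct)
end
```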

do you think that it's possible for any memory problem to help this issue to happen?

I have never seen such an error before, sorry. Googling around does yield some interesting results for this though.

leandromoreira commented 6 years ago

Did you specify it correctly (as the proper argument in the call to execute(), and without any typo)?

Yes, and it doesn't always happen =/

Are those bound values (sn and ct) consistent across each of those nodes?

I'm sorry, I couldn't understand this.

Is each server connecting to a different Cassandra peer, or the same one?

How can I check that? I set up 4 identical contact points for each nginx.

Is the cluster itself healthy and responsive (you probably want to use nodetool to answer that question)?

Yes, it is, according to our dashboard metrics as well; we even "factory reset" Cassandra.

Are you seeing it failing, without an error being returned (thus for unknown reasons), or is chunks simply an empty table?

No errors, just an empty table =/ as if it were receiving an empty response from Cassandra while the other servers respond with 200. I mean, it's not that all the servers respond 404 all the time for all the requests, which bugs me even more...

leandromoreira commented 6 years ago

Here's another creepy thing. Say we have three nodes with the behavior I described. I put up a fourth one but don't add it to the LB pool. I then run a script against all of them, and only the 4th one does not return 404s. Nice. Then we add it to the pool and, kaboom, it starts to behave exactly like the others, which makes us think this is something related to load: not Cassandra itself, but maybe nginx, some module, or the OS.

But we'll try using the single-node mode first, then proceed to other kinds of tests. This is driving me crazy.

jivco commented 6 years ago

@leandromoreira I've set up a Cassandra storage for HLS chunks using lua-cassandra driver after watching your presentation on NGINX Conf. You can check my configs to find some code examples.

leandromoreira commented 6 years ago

@jivco nice. I think you shouldn't rely on .. in your qry: use a bound query instead of concatenation, and you should use {prepared=true} as well. Another thing you should avoid at all costs is ALLOW FILTERING. Also, it may be better to use an external Lua file; it'll help with running unit tests. 👍
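For instance, instead of concatenating values into the query string, a bound, prepared query would look something like this (the table and column names here are hypothetical):

```lua
-- Sketch: bind values with `?` placeholders instead of `..` string
-- concatenation, and mark the query prepared so it is parsed once
-- per coordinator and cached by the driver.
local rows, err = cluster:execute(
  "SELECT chunk FROM chunks WHERE id = ?",  -- hypothetical schema
  { id },
  { prepared = true }
)
```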

leandromoreira commented 6 years ago

@thibaultcha thank you, I think we solved it here! :) It was a silly mistake on our part!

thibaultcha commented 6 years ago

@leandromoreira Glad to hear!

jivco commented 6 years ago

@leandromoreira The table where I'm using ALLOW FILTERING is very small and static (it just holds info for generating variant playlists) and I'm caching the response with nginx for 1 second so there is no noticeable hit on performance, but you are right - it's not the optimal way of doing that ;)