architm21 commented 7 years ago

When the IP address of Cassandra changed, the following errors appeared in the error log:

1. could not refresh cluster: failed to acquire lock: timeout
2. attempt to index field 'shm' (a nil value)

could not refresh cluster: could not set host status in shm: nil key

thibaultcha commented 7 years ago

I am going to need more information if you want me to help you. Ideally, please provide a minimal, reproducible example with a set of instructions on how you deployed your cluster, a minimal set of operations executed on it via this driver, and what are the results you are expecting. You are encouraged to use a tool like ccm to make such an example easy to reproduce.

Otherwise, I will at least need:

the version of this driver you are using
the C* version you are using
the number of nodes you spawned
the code you are executing with this driver (how is the Cluster created, what are the methods you are calling on it)
a full stacktrace of those errors (not just the error message).

Thanks

architm21 commented 7 years ago

version of driver: 1.0.0 (latest from LuaRocks)
C*: Cassandra 3.0.8.1293 | Native protocol v4
nodes: 3
code: https://github.com/architm21/thumnails/blob/master/cimage.lua
errors:

1. 2016/12/05 05:53:05 [error] 8287#0: *81196 [lua] cimage.lua:79: could not retrieve images:could not refresh cluster: could not set host status in shm: nil key,

*81215 [lua] cimage.lua:79: could not retrieve images: could not refresh cluster: failed to acquire lock: timeout, client: 10.10.0.99, server: 10.10.1.50, request: "GET /cimage/9d8190c0-b899-11e6-bf31-d92b256f53ba.

thibaultcha commented 7 years ago

I did not look thoroughly at your code because I don't have the time right now, but one thing I noticed is that you are creating a new Cluster instance on each request. That is not how this module is supposed to be used.

You should instantiate a Cluster once in the lifetime of your workers and then use its methods during the request/response lifecycle. Your current approach is very harmful to performance and, above all, leads to undefined behavior, since it is not the intended use of this driver.

thibaultcha commented 7 years ago

I am currently on mobile but I shall provide you with better usage examples, and update those in the documentation.

architm21 commented 7 years ago

That would be very helpful. Waiting for the examples.

thibaultcha commented 7 years ago

Ok, on a desktop now. The idea is that you instantiate a Cluster only once throughout the lifetime of your Nginx workers. That means instantiating it in the init or init_worker phases. For example:

init_worker_by_lua_block {
  local Cluster = require 'resty.cassandra.cluster'

  local cluster, err = Cluster.new {
    shm = 'cassandra', -- defined by the lua_shared_dict directive
    contact_points = {'192.168.8.33', '192.168.8.11','192.168.8.60'},
    keyspace = 'cordiant_images',
    connect_timeout = 1000,
    timeout_read = 1000,
  }
  if err then
    -- ...
  end

  -- declared as a global (not ideal)
  cluster_instance = cluster
}

server {
  listen 8080;

  content_by_lua_block {
    local rows, err, cql_code = cluster_instance:execute("SELECT * FROM image_details WHERE id=?", {id})
    if err then
      -- ...
    end
  }
}

Or better, without declaring a global variable:

# nginx.conf

init_by_lua_block {
  local foo = require "foo"
  foo.do_init()
}

server {
  listen 8080;

  content_by_lua_block {
    local foo = require "foo"
    foo.do_content()
  }
}

-- foo.lua
local Cluster = require 'resty.cassandra.cluster'

local cluster_instance

local _M = {}

function _M.do_init()
  local cluster, err = Cluster.new {
    shm = 'cassandra', -- defined by the lua_shared_dict directive
    contact_points = {'192.168.8.33', '192.168.8.11','192.168.8.60'},
    keyspace = 'cordiant_images',
    connect_timeout = 1000,
    timeout_read = 1000,
  }
  if err then
    -- ...
  end

  cluster_instance = cluster
end

function _M.do_content()
  local rows, err, cql_code = cluster_instance:execute("SELECT * FROM image_details WHERE id=?", {id})
  if err then
    -- ...
  end
end

return _M

Both of those solutions will only create one long-lived instance of the Cluster module, which will be much more efficient and is how this driver is supposed to be used.
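To illustrate why the second variant keeps a single instance without a global, here is a minimal, self-contained sketch in plain Lua. It relies on `require` caching a module after its first load, so the `cluster_instance` upvalue is shared by every later caller in the same Lua VM. The module name `foo_sketch` and the `new_cluster` function are hypothetical stand-ins (for `foo.lua` and `Cluster.new`), used only so the snippet runs outside of OpenResty.

```lua
-- Hypothetical placeholder for Cluster.new: it only counts how many
-- instances get created, so the sketch needs no Cassandra or OpenResty.
local created = 0
local function new_cluster()
  created = created + 1
  return { id = created }
end

-- Stand-in for foo.lua, registered through package.preload so that
-- require() can find it without a file on disk.
package.preload["foo_sketch"] = function()
  local cluster_instance -- upvalue shared by every caller of this module
  local _M = {}

  function _M.do_init()
    cluster_instance = new_cluster()
  end

  function _M.do_content()
    if not cluster_instance then
      return nil, "cluster not initialized: call do_init() first"
    end
    return cluster_instance
  end

  return _M
end

-- Simulate the init phase followed by two requests.
require("foo_sketch").do_init()
local first = require("foo_sketch").do_content()
local second = require("foo_sketch").do_content()
-- `first` and `second` refer to the same table, and `created` is 1
```

Note that in OpenResty this only holds while the Lua code cache is enabled (the default). With `lua_code_cache off`, each request gets a fresh Lua VM and the module would be loaded again.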

architm21 commented 7 years ago

Thank you for the help. I am still getting the same error: do_content(): error getting rows : could not refresh cluster: could not set host status in shm: nil key,

thibaultcha commented 7 years ago

If you want more help from me, I need you to provide me with a minimal and fully reproducible example of your use case, as originally asked for. I cannot know what is going on so far because I do not know which operations you are making on your cluster while your application is running.

Please provide an example in the form of a single Nginx config and/or Lua script without memcache (and with init scripts for your cluster), and using ccm to reproduce the operations you are making on your cluster.

Thank you.

thibaultcha commented 7 years ago

And again: please provide full stack traces of the errors you are seeing.

zhenzhenyang commented 7 years ago

Hi, do you use "ab" to test concurrency? I tried it and only get 1000 QPS after following your suggestion above to create the Cluster instance in the init phase.

zhenzhenyang commented 7 years ago

Hi, I used "ab" to test concurrency. The QPS is very low, and the test brings Cassandra down.

thibaultcha commented 7 years ago

I would need more info than that. See previous comments on this issue.

zhenzhenyang commented 7 years ago

2016/12/24 17:04:44 [error] 116424#0: *40493 lua tcp socket connect timed out, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"
2016/12/24 17:04:44 [warn] 116424#0: *40493 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 10.10.121.151 DOWN, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"

2016/12/24 17:14:03 [error] 128874#0: *125 [lua] cass.lua:41: do_content(): cluster execute failed,err=could not refresh cluster: no host details for 10.10.121.138, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"
2016/12/24 17:14:03 [notice] 128873#0: *181 [lua] test.lua:3: 333333333333, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"
2016/12/24 17:14:03 [error] 128874#0: *125 [lua] test.lua:6: could not retrieve users: could not refresh cluster: no host details for 10.10.121.138, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"

2016/12/24 17:14:08 [error] 128863#0: *3 [lua] cass.lua:41: do_content(): cluster execute failed,err=could not refresh cluster: failed to acquire refresh lock: timeout, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"
2016/12/24 17:14:08 [error] 128863#0: *3 [lua] test.lua:6: could not retrieve users: could not refresh cluster: failed to acquire refresh lock: timeout, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"

[error] 128880#0: *9 [lua] cass.lua:41: do_content(): cluster execute failed,err=could not refresh cluster: failed to acquire refresh lock: timeout, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"
2016/12/24 17:14:08 [error] 128880#0: *9 [lua] test.lua:6: could not retrieve users: could not refresh cluster: failed to acquire refresh lock: timeout, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"

2016/12/24 17:14:34 [error] 128884#0: *2338 lua tcp socket read timed out, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"
2016/12/24 17:14:34 [warn] 128884#0: *2338 [lua] cluster.lua:136: set_peer_down(): [lua-cassandra] setting host at 10.10.121.149 DOWN, client: 10.10.121.103, server: , request: "GET /cassandra HTTP/1.1", host: "10.10.121.103:8800"

Hello, I used "ab" to test concurrency. The QPS is very low and the run produces a lot of errors like the ones above. I also got the flame graph below.

code:

-- test.lua
local cass = require "cass"
local cjson = require "cjson"
local result, err, code = cass.do_content("select * from play_record.vod_play_record where partner='JS_CUCC' and mac ='00:19:f0:00:00:17' and chnname='动漫' and begintime = '2016-11-29 17:09:02';")
if not result then
  ngx.log(ngx.ERR, 'could not retrieve users: ', err)
  --return ngx.exit(500)
end
ngx.say(cjson.encode(result))

-- cass.lua
local Cluster = require 'resty.cassandra.cluster'

local cluster_instance

local _M = {}

function _M.do_init()
  local pool = {}
  table.insert(pool, "10.10.121.153:9042")
  table.insert(pool, "10.10.121.152:9042")
  table.insert(pool, "10.10.121.149:9042")
  table.insert(pool, "10.10.121.151:9042")
  table.insert(pool, "10.10.121.148:9042")
  table.insert(pool, "10.10.121.139:9042")
  table.insert(pool, "10.10.121.138:9042")
  table.insert(pool, "10.10.121.122:9042")
  table.insert(pool, "10.10.121.121:9042")
  table.insert(pool, "10.10.121.120:9042")
  table.insert(pool, "10.10.121.119:9042")
  local cluster, err = Cluster.new {
    shm = 'cassandra', -- defined by the lua_shared_dict directive
    contact_points = pool,
    keyspace = 'play_record',
    connect_timeout = 5000,
    timeout_read = 10000,
  }
  if err then
    ngx.log(ngx.ERR,"create cluster ERR=",err)
    ngx.exit(500)
  end

  cluster_instance = cluster
end

function _M.do_content(_qstr)
  if type(_qstr) ~= "string" then
    ngx.log(ngx.ERR,"do_content args err")
  end
  local rows, err, cql_code = cluster_instance:execute(_qstr)
  if err then
    ngx.log(ngx.ERR,"cluster execute failed,err=",err)
  end

  return rows,err,cql_code
end

return _M

# nginx.conf
http{
  lua_shared_dict cassandra 10m;

  init_by_lua_block {
    local cass= require "cass"
    cass.do_init()
  }
}
server {
  listen 8800;
  default_type text/html;
  access_log logs/cassandra.log main;
  location ~ /cassandra {
    error_log logs/c_err.log debug;

    # content_by_lua_file /usr/local/openresty/nginx/cassandra/query.lua;

    content_by_lua_file /usr/local/openresty/nginx/cassandra/test.lua;

  }

thibaultcha commented 7 years ago

On mobile now, but I will review this once I have some time on my hands. FYI, I have achieved 10k QPS in my benchmarks with prepared statements on the 1.0 version of this driver.
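For reference, a prepared query could look roughly like the sketch below. It assumes that `execute` on the Cluster accepts a third options table with a `prepared` flag, as I understand the 1.x API, and that `cluster` is the long-lived instance created during init. The helper name `get_image` is hypothetical, and the query is the one from the earlier example.

```lua
-- Sketch: run a parameterized query as a prepared statement.
-- Assumption: cluster:execute(query, args, options) honors
-- options.prepared, as in lua-cassandra 1.x.
-- `get_image` is a hypothetical helper name used for illustration.
local function get_image(cluster, id)
  local rows, err = cluster:execute(
    "SELECT * FROM image_details WHERE id = ?",
    { id },
    { prepared = true }
  )
  if not rows then
    return nil, "could not retrieve image: " .. tostring(err)
  end
  return rows[1] -- nil when no row matches
end
```

Taking the cluster as an argument keeps the helper independent of how the instance is stored (global or module upvalue).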

zhenzhenyang commented 7 years ago

I also found that after I reload nginx, curl gets no data and nothing is written to the nginx log: curl: (52) Empty reply from server. I have to restart nginx.

thibaultcha commented 7 years ago

FYI: the contact points do not expect the port number to be included, you should strip them out. Second: why are you specifying that many contact points? If all those nodes are part of the same cluster, this driver will retrieve them already. You only need to specify one or two contact points. I also hope all of those are part of the same Cassandra cluster.
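If your list of addresses already contains ports, something like the following sketch could strip them before passing the list as `contact_points`. The helpers `strip_port` and `strip_ports` are hypothetical and not part of this driver; the pattern only handles IPv4 addresses and hostnames, and leaves anything else (such as bare IPv6 literals) untouched.

```lua
-- Hypothetical helpers (not part of lua-cassandra): remove a trailing
-- ":port" from IPv4 or hostname contact points.
local function strip_port(addr)
  -- Matches "host:digits" only when the host part contains no colon,
  -- so bare IPv6 literals such as "::1" are returned unchanged.
  local host = addr:match("^([^:]+):%d+$")
  return host or addr
end

local function strip_ports(list)
  local out = {}
  for i, addr in ipairs(list) do
    out[i] = strip_port(addr)
  end
  return out
end

-- One or two contact points are enough; the rest are discovered.
local contact_points = strip_ports({
  "10.10.121.153:9042",
  "10.10.121.152:9042",
})
-- contact_points is now { "10.10.121.153", "10.10.121.152" }
```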

Have you ever used Cassandra with any of the Datastax drivers before? This driver is built with a similar approach. There is not much I can tell beyond that according to the bits you pasted... As long as you do not provide me with a minimalistic reproducible example (that would include the C* schema, and some test data, as well as OpenResty, Cassandra and lua-cassandra versions), I cannot help you further, nor do I have the time to.

thibaultcha commented 7 years ago

The title of this issue is "issues when ip of cassandra cluster changed.", but so far, I was not given any reproducible example about such a use-case... It would be useful if you want me to do something about it, or explain why it might not be supported.

atmesh commented 7 years ago

I am facing the same issue. I took a volume snapshot from one node and attached it to another node. Now I am getting "Nodes /X.X.X.X and /X.X.X.X have the same token 8256225600046861013." Also, I am unable to use nodetool; it gives the error "Failed to connect to '127.0.0.1:7199' - NoSuchObjectException: 'no such object in table'."

thibaultcha commented 7 years ago

@atmesh I don't think this error is related to this driver, sorry.

thibaultcha / lua-cassandra

issues when ip of cassandra cluster changed. #79
