prdn / zmq-omdp

ZMQ-OMDP : ZeroMQ Obsessive Majordomo Protocol - microservices framework for NodeJS - reliable and extensible service-oriented request-reply
https://fincluster.com

W_ERROR race? #6

Closed bmeck closed 10 years ago

bmeck commented 10 years ago

Every so often we see a W_ERROR. I think it has to do with a premature purge, but I am unsure.

Server trace when W_ERROR is made

Trace
    at Broker.workerDelete (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Broker.js:297:9)
    at /home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Broker.js:356:9
    at Array.every (native)
    at Broker.workerPurge (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Broker.js:344:28)
    at null.<anonymous> (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Broker.js:47:8)
    at wrapper [as _onTimeout] (timers.js:261:14)
    at Timer.listOnTimeout [as ontimeout] (timers.js:112:15)

Client trace:

 Error: status W_ERROR
    at Readable.<anonymous> (/home/ubuntu/Documents/nship/lib/EventBus/Client.js:33:8)
    at Readable.emit (events.js:95:17)
    at /home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Client.js:141:14
    at Object.req.finalCb (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Client.js:188:11)
    at Client.onMsg (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Client.js:119:13)
    at null.<anonymous> (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/lib/Client.js:42:20)
    at emit (events.js:106:17)
    at Socket._flush (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/node_modules/zmq/lib/index.js:509:19)
    at _zmq.onReady (/home/ubuntu/Documents/nship/node_modules/zmq-omdp/node_modules/zmq/lib/index.js:192:12)
bmeck commented 10 years ago

It might also be related to restarting a broker but keeping worker processes alive?

prdn commented 10 years ago

That error should only happen when a Worker dies while processing a request. The message is sent by the Broker, which checks Worker heartbeats. Are you sure that no Worker is dying?

bmeck commented 10 years ago

The worker process does not die and does not use worker.stop()

prdn commented 10 years ago

Are you using any timeout option in your client requests?

bmeck commented 10 years ago

nope

prdn commented 10 years ago

OK, I'll look into this.

prdn commented 10 years ago

Could you try the new release, 0.0.86?

bmeck commented 10 years ago

No fix, but I managed to isolate a different oddity based on the error you mentioned:

  1. spawn broker and worker with service name foo
  2. restart worker with service name foo
  3. attempt to send message to service foo

This causes W_ERROR. I'm trying to force a disconnect/reconnect in our worker, without restarting it, to see if it shows the same symptoms.

prdn commented 10 years ago

Please test 0.0.87. I have fixed some issues that could cause a race.

However, I'd like to clarify a key point about zmq-omdp (and zmq in general). When you reconnect a Worker, the Broker doesn't recognise the disconnection instantly. The reason is that zmq doesn't support socket disconnect events, so you have to handle this state with heartbeating. So if a Worker disconnects and reconnects

I have reduced the default heartbeat interval to 2.5 seconds, so the Broker considers a Worker dead after 2.5 * 3 = 7.5 seconds.

You can configure this option by passing an options object to the Client, Broker and Worker. Example:

    new Client(broker, { heartbeat: 500 })
    new Worker(broker, 'echo', { heartbeat: 500 })
    new Broker(..., { heartbeat: 500 })

Make sure to use the same heartbeat value for all three.

Let me know
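To make the timing above concrete, here is a minimal sketch of heartbeat-based expiry. It is not zmq-omdp's actual Broker code: `createRegistry`, `beat`, `purge`, and the worker ids are hypothetical names, and the liveness factor of 3 is taken from the "2.5 * 3" figure in the comment above. It illustrates why a restarted worker's old entry can linger for up to 7.5 seconds before the broker drops it.

```javascript
// Hypothetical sketch of heartbeat expiry; names are illustrative only.
const HEARTBEAT_LIVENESS = 3; // missed beats tolerated, per the 2.5 * 3 figure

function createRegistry(heartbeatMs) {
  const lastSeen = new Map(); // workerId -> timestamp (ms) of last heartbeat
  return {
    // Record a heartbeat from a worker.
    beat(workerId, now) {
      lastSeen.set(workerId, now);
    },
    // Remove and return the ids of workers silent for longer than the expiry window.
    purge(now) {
      const expired = [];
      for (const [id, ts] of lastSeen) {
        if (now - ts > heartbeatMs * HEARTBEAT_LIVENESS) expired.push(id);
      }
      for (const id of expired) lastSeen.delete(id);
      return expired;
    },
    has(workerId) {
      return lastSeen.has(workerId);
    }
  };
}

const registry = createRegistry(2500);
registry.beat('worker-old', 0);    // worker registers, then its process is restarted
registry.beat('worker-new', 1000); // the replacement registers under a new identity
registry.beat('worker-new', 7000); // only the replacement keeps beating

// At t = 7000 ms the old entry is still inside the 7500 ms window,
// so a request could still be routed to it.
const purgedEarly = registry.purge(7000);
// At t = 8000 ms the old entry has been silent for more than 7500 ms and expires.
const purgedLate = registry.purge(8000);
```

With a shorter heartbeat, for example 500, the same window shrinks to 1.5 seconds, at the cost of more heartbeat traffic.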

bmeck commented 10 years ago

Seems to fix the race bit. We will have to think about adding W_ERROR protection for new requests (via retry?).

prdn commented 10 years ago

Yes, we should add an automatic retry flag for requests. If the worker dies for any reason the request should be requeued and submitted to a new worker.
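Until such a flag exists, a caller could approximate it on the client side. The following is only a sketch under assumptions: `requestWithRetry` and `send` are hypothetical names, not part of zmq-omdp's API, and it assumes the failure surfaces as an `Error` whose message is `status W_ERROR`, as in the client trace earlier in this thread.

```javascript
// Hypothetical client-side retry wrapper; not part of zmq-omdp.
// `send(cb)` stands in for whatever issues one request and calls cb(err, reply).
function requestWithRetry(send, maxAttempts, done) {
  let attempt = 0;
  function tryOnce() {
    attempt += 1;
    send(function (err, reply) {
      // Assumption: a dead worker is reported as Error('status W_ERROR').
      const workerFailed = Boolean(err) && err.message === 'status W_ERROR';
      if (workerFailed && attempt < maxAttempts) return tryOnce();
      done(err, reply, attempt); // success, a different error, or attempts exhausted
    });
  }
  tryOnce();
}

// Fake transport for illustration: the first call hits a dead worker,
// the second one succeeds.
let calls = 0;
function fakeSend(cb) {
  calls += 1;
  if (calls === 1) return cb(new Error('status W_ERROR'));
  cb(null, 'pong');
}

let outcome = null;
requestWithRetry(fakeSend, 3, function (err, reply, attempts) {
  outcome = { err: err, reply: reply, attempts: attempts };
});
```

Retrying like this is only safe when the request is idempotent, because the dead worker may have partially processed it before the broker purged it.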

prdn commented 10 years ago

@bmeck can I close this issue?

bmeck commented 10 years ago

sure
