process-driven connections fail to reconnect after lengthy disconnect

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

Configure a console with a process-driven connection script to a remote
node (eg, via hp-lo100.exp).  Have conmand establish a connection to this
node, then disconnect this node from the network (simulating it being
brought down for service).  Reconnect this node to the network after a
couple of hours (to ensure keepalive has timed out the connection).

What is the expected output? What do you see instead?

conmand should re-establish the connection once the node is brought back
online.  Instead, the console enters a disabled state that requires manual
intervention to re-establish the connection.

What version of the software are you using? On what operating system?

conman-0.2.5

Please provide any additional information below.

Posted by Christopher Maestas on April 08, 2010 - 18:34:

Hello,

Suppose NODE1 is being logged by conman and the node is then taken out for
service for some amount of time.  When the node is inserted back into the
system, does conman notice it again?  It seems the daemon notices a server
is gone, times out after some number of retries, and may never initiate a
connection again after that.  I didn't know if it was a driver issue in
keepalive or something in the conman daemon process.  After it gives up,
you usually see something like:

---
<ConMan> Console [NODE1] disabled after 3 failed attempts at 04-08 09:30.

---

I've been able to re-initiate the connection (or the retry connections) by
trying a console broadcast, but wondered what the philosophy should be here.

Thanks,
-cdm

Posted by Chris Dunlap on April 08, 2010 - 23:28:

Each console type has its own hard-coded reconnect behavior.
As of 0.2.5, that behavior is:

  ipmi
    min-timeout=60s, max-timeout=1800s, exponential-backoff
  process
    timeout=60s, max-count=3, linear-backoff
  telnet
    min-timeout=15s, max-timeout=1800s, exponential-backoff
  unix
    timeout=60s, constant-backoff (or no delay if inotify is supported)

  (these constants are defined in server.h)

If the daemon's connection to a console is currently down, then
having the client connect to it should force the daemon to perform
an immediate reconnect attempt in all cases.

Based on your output, I presume the connection to NODE1 is via a script
(ie, a "process" device type).  The timeouts for the "process" type
were set low to protect against a bad script running amok.  But with
the current behavior, the console can become disabled in 3 minutes.
It's easy enough to increase the max count (eg, a value of 6 would
give a 15min window).  I could also change the backoff behavior so
there isn't a maximum count for the process type.  Any preferences?
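
The 3-minute and 15-minute figures above can be checked with a little
arithmetic.  The sketch below is only an illustration, not code from conman:
it assumes "linear-backoff" means the k-th retry waits k * timeout after the
previous attempt, and that the console is disabled once max-count attempts
have failed.  The function name `linear_backoff_window` is hypothetical.

```python
def linear_backoff_window(timeout_s: int, max_count: int) -> int:
    """Seconds from the first attempt to the last attempt before disabling.

    Assumption (not taken from server.h): the k-th retry is delayed by
    k * timeout_s, and the first attempt happens immediately.
    """
    # max_count attempts means max_count - 1 waits between them.
    return sum(k * timeout_s for k in range(1, max_count))


print(linear_backoff_window(60, 3))  # 180 seconds = 3 minutes
print(linear_backoff_window(60, 6))  # 900 seconds = 15 minutes
```

Under these assumptions, raising the count only widens the window
quadratically; any fixed count still leaves a point after which the console
stays disabled.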

Ideally, the configuration should allow for timeouts to be redefined.
But I've been holding off on that until I revamp the configuration
syntax.

-Chris

Posted by Christopher Maestas on April 09, 2010 - 22:01:

Thanks for the information.  This was using the hp-lo100.exp script.
You're right, the output points to the process method.

However, in the other cases, if telnet/ipmi is used and a node is taken
out for > 30 minutes, then conman would consider it dead.  Maybe I'm
missing how the backoff works overall and this really means it keeps
backing off indefinitely somehow.

So 3 minutes and 30 minutes may be too short overall.  That would be the
case for some of the scenarios where nodes are serviced.  I guess a good
worst case to support is a node that gets taken out of service on Friday
but doesn't get back into the mix until at least Monday (maybe).

I guess I would prefer a behavior that doesn't have a maximum count for the
process type (i.e., it just keeps trying and trying and trying).  Trying
every minute might get annoying, so perhaps trying every 15 minutes is a
good start.

I agree that ideally this would be in the configuration file instead. :)

-cdm

Posted by Chris Dunlap on April 09, 2010 - 22:43:

The process type is the only one that can lead to a disabled state.
For the others, the timeout interval stops increasing once the maximum
is reached, but a timer keeps attempting reconnects at that maximum
interval until the connection is re-established (at which point the
timeout interval is reset to 0).

For example, the telnet reconnect attempts would be at 0, 15s, 30s, 1m,
2m, 4m, 8m, 16m, 30m, 30m, 30m, ...  If a client connects to a downed
console, the reconnect timeout interval is immediately reset to 0.
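
The telnet schedule above can be reproduced with a small timer object.
This is a minimal sketch rather than conman's actual implementation: the
class `ReconnectTimer` and its method names are hypothetical, and it assumes
the interval doubles after each failed attempt, is capped at the maximum,
and returns to 0 when the connection is re-established or a client connects.

```python
class ReconnectTimer:
    """Capped exponential backoff with an immediate first attempt."""

    def __init__(self, min_s: int, max_s: int) -> None:
        self.min_s = min_s
        self.max_s = max_s
        self.interval = 0  # 0 means "try right away"

    def next_delay(self) -> int:
        """Return the delay before the next attempt, then grow the interval."""
        delay = self.interval
        if delay == 0:
            self.interval = self.min_s
        else:
            self.interval = min(delay * 2, self.max_s)
        return delay

    def reset(self) -> None:
        """Called on a successful reconnect or when a client connects."""
        self.interval = 0


timer = ReconnectTimer(min_s=15, max_s=1800)  # telnet values quoted above
print([timer.next_delay() for _ in range(11)])
# [0, 15, 30, 60, 120, 240, 480, 960, 1800, 1800, 1800]

timer.reset()
print(timer.next_delay())  # 0
```

The printed values are intervals between attempts, in seconds, matching
0, 15s, 30s, 1m, 2m, 4m, 8m, 16m, 30m, 30m, 30m.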

For the next release, I'll change the process reconnect behavior to
use an exponential backoff with min=60s, max=1800s, no max count (so
it'll be like the ipmi case).  And at some point, I'll add support
for reconnect behavior to be overridden in the config file.
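
To see what the proposed process behavior would mean in wall-clock terms,
the sketch below lists when attempts would occur after a console goes down.
It is an illustration under assumptions, not the code planned for the next
release: the helper `attempt_times` is hypothetical, and it assumes an
immediate first attempt followed by doubling intervals capped at the maximum.

```python
def attempt_times(min_s: int, max_s: int, horizon_s: int) -> list[int]:
    """Elapsed seconds at which reconnect attempts occur, up to horizon_s.

    Assumption: the first attempt is immediate, the first wait is min_s,
    and each later wait doubles until it is capped at max_s.
    """
    times = [0]
    interval = min_s
    elapsed = interval
    while elapsed <= horizon_s:
        times.append(elapsed)
        interval = min(interval * 2, max_s)
        elapsed += interval
    return times


# Proposed process settings: min=60s, max=1800s, no maximum count.
print(attempt_times(60, 1800, 4000))
# [0, 60, 180, 420, 900, 1860, 3660]
```

Because there is no maximum count, the list keeps growing with the horizon:
once the cap is reached, an attempt happens every 30 minutes for as long as
the node stays down.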

-Chris

Posted by Christopher Maestas on April 12, 2010 - 17:57:

This sounds great.  Looking forward to testing when it's available.
I think it will help many of the *.exp scripts that are using the process
method.

-cdm

Original issue reported on code.google.com by chris.m.dunlap on 2 Jun 2010 at 11:52

GoogleCodeExporter commented 9 years ago

Original comment by chris.m.dunlap on 7 Jun 2010 at 6:29

Added labels: Milestone-0.2.6

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r993.

Original comment by chris.m.dunlap on 8 Jun 2010 at 11:10

Changed state: Fixed
