This project has been retired. It is no longer actively developed or maintained. It has been transferred to Uber's uber-archive organization and kept here for posterity.
In some sense, a nanny watches kids. This module provides a cluster supervisor. A cluster is a group of worker processes that share a server or servers.
The cluster supervisor manages workers and load balancers. This package provides a round-robin load balancer. Each connection will go to the least recently connected worker.
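The rotation described above can be pictured as a queue: the worker at the front receives the next connection and then moves to the back. The following is a minimal sketch of that idea only, not Nanny's implementation; RoundRobin and its add, remove, and next methods are hypothetical names.

```javascript
'use strict';

// Hypothetical illustration of round-robin selection; not Nanny's code.
function RoundRobin() {
    this.queue = []; // front = least recently connected worker
}

RoundRobin.prototype.add = function (worker) {
    this.queue.push(worker);
};

RoundRobin.prototype.remove = function (worker) {
    var index = this.queue.indexOf(worker);
    if (index >= 0) {
        this.queue.splice(index, 1);
    }
};

// Returns the least recently connected worker and moves it to the back,
// or null if no worker is available.
RoundRobin.prototype.next = function () {
    if (this.queue.length === 0) {
        return null;
    }
    var worker = this.queue.shift();
    this.queue.push(worker);
    return worker;
};

var balancer = new RoundRobin();
balancer.add('worker-0');
balancer.add('worker-1');
balancer.add('worker-2');

var order = [];
for (var i = 0; i < 4; i++) {
    order.push(balancer.next());
}
console.log(order.join(', ')); // worker-0, worker-1, worker-2, worker-0
```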
Worker processes do not require any special programming to participate in a
cluster.
The worker will be loaded in proxy by a “thunk” worker that will establish
communication with the cluster supervisor, subvert Node.js's networking stack,
and execute the worker module.
Any attempt to listen for incoming connections with a net, http, or https
server will be intercepted and managed by the cluster.
// launcher.js
var path = require('path');
var ClusterSupervisor = require('nanny');
var supervisor = new ClusterSupervisor({
    workerPath: path.join(__dirname, 'server.js')
});
process.title = 'nodejs my-supervisor';
supervisor.start();

// server.js
process.title = 'nodejs my-worker';
// initialize worker stuff here...
Nanny exports a ClusterSupervisor function that accepts a spec object and
returns an instance.
The cluster supervisor constructor may be called with or without new.
The workerPath is the only required property of the spec.
All other properties are options.
Cluster supervisors implement Node.js's EventEmitter.
The ClusterSupervisor.prototype also has LoadBalancer and WorkerSupervisor
constructors that can be overridden by heirs.
clusterSpec.workerPath
Required. The file system path of the Node.js script that will run as the worker.
The worker script should be written as a normal Node.js program.
That is, there is no cluster module that the worker needs to load, nor does it
need to check whether it is running as a worker or supervisor.
The supervisor and worker scripts are separate.
Nanny will run a thunk module that will arrange for a seemingly unmodified
environment, except that the net.Server has been subverted and a periodic
health monitor ("pulse") has been set up.
require.main will be your worker module and process.argv will have workerPath
at index 1, just as they would if Node.js ran your worker directly.
clusterSpec.workerArgv
The command line arguments to pass to your worker script, as will appear at
index 2 and beyond of process.argv to your worker.
By default, this is empty.
clusterSpec.workerCount
The number of workers to maintain.
Each worker will be assigned its zero-based index as its logical identifier.
The default worker count is the number of logical CPUs on the host machine
as reported by os.cpus().length.
The worker count is optional but cannot be provided if you instead provide
logicalIds.
clusterSpec.logicalIds
An array of logical identifiers for each worker that the supervisor should maintain. The default logical identifiers start with 0 and are as many as the host machine's CPUs.
The logical identifiers are optional but cannot be provided if you instead
specify workerCount.
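The interplay between workerCount and logicalIds can be sketched as follows. This is an illustration under assumptions, not Nanny's code: resolveLogicalIds is a hypothetical helper, and throwing an error when both options are present is one possible reading of "cannot be provided".

```javascript
'use strict';
var os = require('os');

// Hypothetical helper: resolves the list of logical ids from a spec,
// following the rules described above. Not part of Nanny's API.
function resolveLogicalIds(spec) {
    if (spec.workerCount !== undefined && spec.logicalIds !== undefined) {
        // Assumption: treat the conflicting options as an error.
        throw new Error('Provide either workerCount or logicalIds, not both');
    }
    if (spec.logicalIds !== undefined) {
        return spec.logicalIds.slice();
    }
    var count = spec.workerCount !== undefined ?
        spec.workerCount : os.cpus().length;
    var ids = [];
    for (var index = 0; index < count; index++) {
        ids.push(index);
    }
    return ids;
}

console.log(resolveLogicalIds({workerCount: 3}));            // [ 0, 1, 2 ]
console.log(resolveLogicalIds({logicalIds: ['a', 'b', 7]})); // [ 'a', 'b', 7 ]
```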
Logical identifiers may be numbers or strings.
clusterSpec.logger
Overrides the default logger object for the cluster.
Nanny uses the methods error, warn, info, and debug, all of which must
accept the log string and an optional object containing additional contextual
information.
The fatal method might be used in a future version.
The default logger is provided by the debuglog package and all levels
are visible if you include nanny in the space-delimited NODE_DEBUG
environment variable.
NODE_DEBUG=nanny node supervisor.js
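A replacement logger only needs to honor the method signatures described above. The following is a minimal sketch of one that records entries in memory; createMemoryLogger and the entries array are hypothetical names, not part of Nanny.

```javascript
'use strict';

// Hypothetical logger that records entries in memory. Each method accepts
// the log string and an optional object with contextual information.
function createMemoryLogger() {
    var entries = [];
    function log(level) {
        return function (message, meta) {
            entries.push({level: level, message: message, meta: meta || {}});
        };
    }
    return {
        entries: entries,
        error: log('error'),
        warn: log('warn'),
        info: log('info'),
        debug: log('debug'),
        fatal: log('fatal') // reserved; might be used in a future version
    };
}

var logger = createMemoryLogger();
logger.info('cluster now active', {title: process.title});
logger.debug('worker state change');
console.log(logger.entries.length); // 2
```

An object of this shape would be passed as the logger property of the spec.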
clusterSpec.createEnvironment(logicalId)
The worker supervisor will call this method once before each time it spawns a
worker subprocess, with the logical identifier of the worker as its first
argument and with the WorkerSupervisor instance as this.
The environment creator must return an object with the entire map of
environment variables that the worker will need.
The default environment creator returns an object with a PROCESS_LOGICAL_ID
set to the worker's logical identifier.
Note that the returned environment is taken to be an exhaustive
environment, meaning that worker processes do not implicitly inherit the
supervisor process's environment.
To explicitly forward an environment, extend process.env:
var extend = require('xtend');
new ClusterSupervisor({
    createEnvironment: function (id) {
        // xtend returns a new object, so process.env itself is not modified.
        return extend(process.env, {
            MY_TITLE: 'nodejs my-worker-' + id,
            MY_WORKER_ID: id
        });
    }
});
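Because the returned environment is exhaustive, an environment creator can also forward only selected variables rather than all of process.env. The sketch below illustrates that alternative; the FORWARDED list and the extra source parameter are assumptions made so the function can be exercised without a real environment. In a spec, the function would be wrapped so that it receives only the logical identifier and reads from process.env.

```javascript
'use strict';

// Hypothetical list of variables worth forwarding to workers.
var FORWARDED = ['PATH', 'HOME', 'NODE_ENV'];

// Builds an exhaustive environment for a worker: only the listed variables
// from the given source environment, plus the worker's logical id.
function createEnvironment(logicalId, source) {
    var environment = {PROCESS_LOGICAL_ID: String(logicalId)};
    FORWARDED.forEach(function (name) {
        if (source[name] !== undefined) {
            environment[name] = source[name];
        }
    });
    return environment;
}

var environment = createEnvironment(2, {PATH: '/usr/bin', SECRET: 'hidden'});
console.log(environment); // { PROCESS_LOGICAL_ID: '2', PATH: '/usr/bin' }
```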
clusterSpec.pulse
The interval at which workers will attempt to submit a health report, in milliseconds. By default, workers will not submit health reports.
At time of writing, the health monitor will keep a process alive unless it
calls process.exit explicitly.
In a future version, stopping a worker should shut down the health checks so it
can gracefully exit.
clusterSpec.isHealthy
When a worker submits a health report, the supervisor calls this method to check whether the process is healthy with that report. The health report includes self reported memory usage and event loop metrics.
The following is a partial supervisor that checks health every second, stops a worker if it reports more than 100 MB of resident memory, and stops a worker if it fails to check in within 6 seconds (1 for the pulse, 5 for the timeout).
new ClusterSupervisor({
    isHealthy: function (report) {
        return report.memoryUsage.rss < 100e6;
    },
    pulse: 1e3,
    unhealthyTimeout: 5e3
});
- memoryUsage: as returned by Node.js's process.memoryUsage
  - rss: system memory usage in bytes
  - heapTotal: memory allocated by V8
  - heapUsed: memory in use from slabs allocated by V8
- load: the number of milliseconds (in nanosecond resolution) that an enqueued
  task had to wait before it was executed.

Health : {memoryUsage: MemoryUsage, load: Number}
MemoryUsage : {rss: Number, heapTotal: Number, heapUsed: Number}
This function is called as a method of the corresponding worker supervisor, so
for example, this.id is the corresponding worker logical id.
If isHealthy returns a falsy value, the worker will be stopped with the
intention to restart. The force stop delay, restart delay, and restart count
options apply in this case.
The default isHealthy method returns true regardless of the health report.
Note that this method will never be called, and thus unhealthy workers will not
be restarted, unless the supervisor is initialized with a pulse.
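A health check can combine both parts of the report. The following sketch follows the Health shape described above; the threshold values are hypothetical and would need tuning for a real application.

```javascript
'use strict';

// Hypothetical thresholds; tune them for the application.
var MAX_RSS_BYTES = 100e6; // 100 MB of resident memory
var MAX_LOAD_MS = 250;     // delay before an enqueued task runs

// Expects a report shaped like
// {memoryUsage: {rss, heapTotal, heapUsed}, load}.
function isHealthy(report) {
    return report.memoryUsage.rss < MAX_RSS_BYTES &&
        report.load < MAX_LOAD_MS;
}

var fine = {memoryUsage: {rss: 40e6, heapTotal: 20e6, heapUsed: 10e6}, load: 3};
var bloated = {memoryUsage: {rss: 150e6, heapTotal: 90e6, heapUsed: 80e6}, load: 3};
var stalled = {memoryUsage: {rss: 40e6, heapTotal: 20e6, heapUsed: 10e6}, load: 900};

console.log(isHealthy(fine), isHealthy(bloated), isHealthy(stalled)); // true false false
```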
clusterSpec.unhealthyTimeout
If this option is provided, the supervisor will automatically stop any worker that fails to report its health in a timely fashion. The unhealthy timeout is the number of milliseconds that the supervisor will wait after the expected check-in time. The force stop delay, restart delay, and restart count options apply in this case.
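The timing can be worked through with a little arithmetic: with a pulse of 1 second and an unhealthy timeout of 5 seconds, a worker has 6 seconds to check in. The sketch below assumes the deadline is measured from the last received report, which is an interpretation rather than a documented detail; isOverdue is a hypothetical helper.

```javascript
'use strict';

// Hypothetical helper: a worker is overdue when no report has arrived
// within one pulse interval plus the unhealthy timeout.
// Assumption: the deadline is measured from the last received report.
function isOverdue(now, lastReportAt, pulse, unhealthyTimeout) {
    return now - lastReportAt > pulse + unhealthyTimeout;
}

var pulse = 1e3;
var unhealthyTimeout = 5e3;
console.log(isOverdue(5500, 0, pulse, unhealthyTimeout)); // false
console.log(isOverdue(6500, 0, pulse, unhealthyTimeout)); // true
```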
clusterSpec.workerForceStopDelay
Any worker that is stopped (including stops with the intent to restart) will be killed with prejudice if it fails to exit gracefully before this timeout in milliseconds.
The default delay is 5 seconds and can be overridden by heirs through
ClusterSupervisor.prototype.defaultWorkerForceStopDelay.
clusterSpec.workerRestartDelay
If this option is provided, any worker that attempts to restart will be forced to wait this number of milliseconds between stopping and starting.
clusterSpec.respawnWorkerCount
If this option is provided, any time a worker is stopped with the intention to
restart (including both manual restart() calls and automatic restarts, but
not including manual stop() followed by start() calls), the worker will not
restart if this many restarts have been attempted for this worker, regardless
of whether the restarts were "successful".
The supervisor does not distinguish successful and failed starts.
clusterSpec.serverRestartDelay
If this option is provided, the load balancer will wait this number of milliseconds between when a supervisor server stops due to an error and when the supervisor resumes listening on the corresponding port.
Note that this setting applies to the socket server running in the supervisor process for a given port, not to a worker.
clusterSpec.execPath
clusterSpec.execArgv
clusterSpec.cwd
clusterSpec.encoding
clusterSpec.silent
These options are passed through to Node.js's child process fork.
Particularly, execArgv is distinct from workerArgv. The execArgv are
options for Node.js and are not visible to the worker.
These are useful for V8 and Node.js options.
A snapshot of these options is captured on each worker supervisor instance's
spec property and is queried each time the supervisor spawns a new worker,
so it is possible to manipulate these values, per worker, before each restart.
The following documentation pertains to the methods of a cluster supervisor, as
returned by the ClusterSupervisor(clusterSpec) function.
clusterSupervisor.start()
Sets the target state of each worker to "running" and initiates the sequence of operations necessary to get to that state from each worker's current state.
:warning: At time of writing, this method should only be called once. In the future it should be possible to restart a supervisor, and the start method should be idempotent.
clusterSupervisor.stop()
Sets the target state of each worker to "standby" and initiates the sequence of operations necessary to get to that state from each worker's current state.
:warning: At time of writing, a stopped cluster cannot be restarted. Individual workers can be restarted many times within the lifespan of the cluster supervisor.
clusterSupervisor.inspect()
Returns an object representing a snapshot of the system state of the entire supervisor.
The root object contains properties workers and loadBalancers, which are
each arrays of the respective state of each worker and load balancer.
Workers correspond to worker supervisors.
Load balancers correspond to ports managed by the cluster supervisor.
ClusterState : {workers: Array<WorkerState>, loadBalancers: Array<LoadBalancerState>}
See the documentation for worker supervisors and load balancers for their
respective state representations as returned by their own inspect() methods.
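A snapshot of this shape is easy to summarize. The following sketch tallies workers by state; countWorkersByState is a hypothetical helper, and the snapshot is a hand-written sample that follows the ClusterState shape, not real output.

```javascript
'use strict';

// Tallies workers by state from a ClusterState-shaped snapshot.
function countWorkersByState(clusterState) {
    var counts = {standby: 0, running: 0, stopping: 0};
    clusterState.workers.forEach(function (worker) {
        counts[worker.state] += 1;
    });
    return counts;
}

// Hand-written sample snapshot, not real output.
var snapshot = {
    workers: [
        {id: 0, state: 'running', pid: 1201, startedAt: 1000},
        {id: 1, state: 'running', pid: 1202, startedAt: 1000},
        {id: 2, state: 'standby', startingAt: null}
    ],
    loadBalancers: []
};

console.log(countWorkersByState(snapshot)); // { standby: 1, running: 2, stopping: 0 }
```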
clusterSupervisor.countWorkers()
Returns the number of allocated workers.
clusterSupervisor.countRunningWorkers()
Returns the number of workers that are currently running.
clusterSupervisor.countActiveWorkers()
Returns the number of workers that are not on standby: running or on their way to or from running.
clusterSupervisor.countRunningLoadBalancers()
Returns the number of load balancers that are listening on a port and accepting connections.
clusterSupervisor.countActiveLoadBalancers()
Returns the number of load balancers that are not on standby: running or on their way to or from running.
clusterSupervisor.forEachWorker(callback, thisp)
Calls the callback once for every worker supervisor before returning. The callback receives the worker supervisor, the logical identifier for that supervisor, and the cluster supervisor itself.
clusterSupervisor.forEachLoadBalancer(callback, thisp)
Calls the callback once for every load balancer before returning. The callback receives the load balancer, the port number, and the supervisor itself.
WorkerSupervisor(workerSupervisorSpec)
The cluster supervisor constructs and exposes worker supervisor instances.
Each of these supervisors has a state machine with "standby", "running", and
"stopping" states.
The supervisor reacts to events and commands based on its current state.
For example, the start() command on a running worker will do nothing,
but the start() command on a stopping worker will cause the worker to restart
immediately after it transitions to standby.
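The behavior of start() in each state can be modeled with a small state machine. This is a simplified, hypothetical model for illustration (WorkerModel, restartPending, and handleExit are invented names); the real worker supervisor also manages child processes, timers, and signals.

```javascript
'use strict';

// Hypothetical, simplified model of the worker state machine.
function WorkerModel() {
    this.state = 'standby';
    this.restartPending = false;
}

WorkerModel.prototype.start = function () {
    if (this.state === 'standby') {
        this.state = 'running';
    } else if (this.state === 'stopping') {
        // Remember to start again once the worker reaches standby.
        this.restartPending = true;
    }
    // start() on a running worker does nothing.
};

WorkerModel.prototype.stop = function () {
    if (this.state === 'running') {
        this.state = 'stopping';
    }
    this.restartPending = false;
};

// Called when the child process has exited.
WorkerModel.prototype.handleExit = function () {
    this.state = 'standby';
    if (this.restartPending) {
        this.restartPending = false;
        this.start();
    }
};

var worker = new WorkerModel();
worker.start();            // standby -> running
worker.stop();             // running -> stopping
worker.start();            // still stopping, restart pending
console.log(worker.state); // stopping
worker.handleExit();       // stopping -> standby -> running
console.log(worker.state); // running
```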
workerSupervisor.id
The logical identifier assigned to this worker supervisor, one of the logical
identifiers constructed or provided to the cluster supervisor as logicalIds
or inferred from the number of processors.
workerSupervisor.inspect()
Returns a JSON serializable representation of the worker's state.

- Workers in the standby state have been stopped or not started.
  - startingAt indicates the intended time to start, or may be null.
- Workers in the running state have been started and the subprocess may still be coming up.
  - pid is the operating system's process identifier for the forked worker.
  - startedAt is the number representing when the child process was
    originally forked.
- Workers in the stopping state include worker lifetime statistics, startedAt,
  stopRequestedAt, and forceStopAt.
  - stopRequestedAt is the number representing when the child process was
    requested to stop or restart.
  - forceStopAt is the number representing when the child process either
    should be or was force stopped.
  - isDebugging flag indicates that the process was not actually stopped,
    but that its debugger was activated.
    It is the responsibility of the debugging user to manually stop the process.
    When a process is in the debug state, you can, for example, connect to the
    process with node debug -p <pid> for a debug console.
  - forcedStop will be true.
    A worker that has been given the debug() command will activate the V8
    debugger and enter the "stopping" state without the intent to force stop.

WorkerState : WorkerStandbyState | WorkerRunningState | WorkerStoppingState
WorkerStandbyState : {
    id,
    state: "standby",
    startingAt: Number
}
WorkerRunningState : {
    id,
    state: "running",
    pid: Number,
    startedAt: Number,
    health?: Health
}
WorkerStoppingState : {
    id,
    state: "stopping",
    pid: Number,
    isDebugging: Boolean,
    startedAt: Number,
    stopRequestedAt: Number,
    forceStopAt: Number,
    forcedStop: Boolean,
    health?: Health
}
workerSupervisor.isHealthy(report)
Returns whether the process is healthy (should not be stopped) based on its self reported memory usage and load metrics.
By default this returns true, but can be overridden on the cluster supervisor spec.
The following are commands that set the intended stable state of the worker and initiate the operations necessary to get to that state.
workerSupervisor.start()
workerSupervisor.stop()
workerSupervisor.restart()
workerSupervisor.reload()
workerSupervisor.forceStop()
workerSupervisor.debug()
workerSupervisor.dump()
LoadBalancer(loadBalancerSpec)
The load balancer constructor accepts a spec with the following properties.
logger
port
address
backlog
restartDelay
The load balancer is an event emitter. The cluster supervisor depends on the following event.
- standby when it has stopped.

The load balancer implements the following methods.

- inspect which captures a snapshot of the load balancer state.
- addWorkerSupervisor(worker) which adds a worker supervisor to the queue of
  available workers accepting connections.
- removeWorkerSupervisor(worker) which removes a worker from the rotation of
  accepting workers.
- handleConnection(connection) which sends a connection to one of the
  workers, or buffers the connection until a worker becomes available.
- stop() which requests that the load balancer tear down its server.
  Load balancers are not restartable at time of writing, but can gracefully
  handle all workers coming and going ad nauseam.
  This method is only used for intentional teardown of the cluster for graceful
  process exit.

The cluster supervisor depends on the following properties of a load balancer.

- requestedAddress, from the specified address
- requestedBacklog, from the specified backlog

The following properties are for the load balancer implementation and your information.

- port, the requested port.
- address, the actual address received from the operating system.
- server, the Node.js server listening on the supervisor.

loadBalancer.inspect()
The load balancers can be in "standby", "starting", "running", and "stopped" states surrounding the Node.js socket server state machine. The load balancer will try to remain "running" once it has been created, until it is expressly stopped.
- port is the port that the load balancer was requested for.
  This port might be 0, in which case the corresponding actual port will
  differ.
- address is the actual address granted by the operating system to the
  cluster supervisor process.
- backlog is the requested number of connections to buffer by the first
  worker to request this port.
  This is often undefined.
  The backlog is an infrequently recognized option of Node.js servers.

LoadBalancerState : {
    state: "standby" | "starting" | "running" | "stopped",
    port: Number,
    address: Address,
    backlog: Number
}
All cluster supervisor logs include the process title as reflected by
process.title.
initing master
Reports that the cluster supervisor has been constructed and will be initialized. If you see more than one of these messages, the process is constructing too many cluster supervisors.
cluster now active
Indicates that all workers have been started.
cluster now standing by
Indicates that all workers and load balancers have stopped.
This should only occur in response to a stop signal or a manual stop() call.
cluster master received signal...killing workers
cluster checking for full stop
When a load balancer or worker supervisor changes its state, the cluster supervisor checks whether all of these have returned to standby, in which case it goes to standby itself. This debug message shows the number of active workers and load balancers. The supervisor should go to standby if both are zero.
worker state change
This message indicates that the worker has transitioned to a new state, one of
"standby", "running", or "stopping".
Regardless, the logger payload is the result of workerSupervisor.inspect().
worker fork error
This message indicates that the supervisor was unable to create a child process for the worker with the given logical identifier.
worker exited gracefully
worker exited due to signal
worker exited with error
worker post mortem
Indicates that a worker has stopped and reports various lifecycle metrics.
worker supervisor received non-object message
This indicates that the supervisor received a message from the worker processes over the Node.js IPC channel, but that message was not a regular object. The cluster supervisor has no code that would cause this, but it is possible for specialized supervisors to piggy-back on the IPC channel, and it is possible that certain race conditions particularly around exiting a process or closing the channel might produce a corrupt message.
It might be useful to review IPC interactions if this warning becomes frequent.
worker missing when sent signal
It is possible that a signal might be sent to a process after it exits. The cluster supervisor detects this case and compensates depending on whether the child process is supposed to be running. This should be unusual. A high occurrence of this warning could be an indication of thrashing or a bug.
worker server closed before it could receive error
It is possible for a worker to quickly start listening and stop listening before it receives an address or any connections from the supervisor.
// server.js
server.listen(0);
server.close();
This should be unusual since workers typically stay alive long enough to fully start. If it does occur, it should be due to user intervention.
worker server closed before it could accept a connection - redistributing
This message indicates that the following sequence of events occurred:
This should be rare, but can occur during normal operation. Unless these are frequent, no action should be necessary.
worker forced to shut down
Indicates that the worker did not shut down gracefully before the force stop delay passed and that the supervisor sent the kill signal.
There may be a defect in the worker that prevents it from shutting down in a
timely fashion, including possibly a signal handler that never follows up with
a graceful process exit.
If the process needs more time to shut down gracefully, the
workerForceStopDelay option should be extended.
worker stopping because of failure to report health
Indicates that the process failed to report its health after it was due, plus
the margin unhealthyTimeout.
This may be a symptom of being too busy, in which case the unhealthyTimeout
should be extended or load should be distributed elsewhere.
It may also be an indication of an infinite loop, either in JavaScript or
libuv.
worker unexpectedly stopped in standby state
Indicates that a child process has reported that it stopped more than once.
This is potentially dangerous if the worker restarts immediately, since the
second signal interferes with the new worker process.
If this occurs, please review the logs to ascertain what caused the first and
second handleStop state transitions and file an issue.
sending worker's server its listening address
Indicates that the supervisor now has an open server for the requested port and sends the actual address to the worker so that it can emit the listen event and store the actual address.
sending connection to worker
Indicates that the supervisor has sent a connection to a server in the worker.
spawned worker got message
Indicates that the cluster supervisor received a message that is not in its command vocabulary. This can occur normally if a specialized worker and supervisor piggy-back on the Node.js IPC channel.
connection backlog
Indicates the number of connections waiting to be distributed to workers.
The length will be 0 and grew false whenever the supervisor flushes
incoming connections to available workers.
This message will be logged each time a connection gets enqueued.
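The way a backlog grows and is flushed can be modeled in a few lines. This sketch is a hypothetical model, not Nanny's load balancer: Backlog, addWorker, and the received array on each worker are invented for illustration, and connections are represented by plain strings.

```javascript
'use strict';

// Hypothetical model of buffering connections until a worker subscribes.
function Backlog() {
    this.connections = [];
    this.workers = [];
}

// Enqueues a connection, then hands out as many as workers allow.
// Returns the remaining backlog length.
Backlog.prototype.handleConnection = function (connection) {
    this.connections.push(connection);
    return this.flush();
};

Backlog.prototype.addWorker = function (worker) {
    this.workers.push(worker);
    return this.flush();
};

// Sends every waiting connection to workers in rotation and returns
// the remaining backlog length.
Backlog.prototype.flush = function () {
    while (this.connections.length > 0 && this.workers.length > 0) {
        var worker = this.workers.shift();
        worker.received.push(this.connections.shift());
        this.workers.push(worker);
    }
    return this.connections.length;
};

var backlog = new Backlog();
console.log(backlog.handleConnection('c1')); // 1
console.log(backlog.handleConnection('c2')); // 2
var firstWorker = {received: []};
console.log(backlog.addWorker(firstWorker)); // 0
console.log(firstWorker.received);           // [ 'c1', 'c2' ]
```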
supervisor stopped listening on port
Indicates that the supervisor has stopped listening for requests to the given port.
supervisor failed to listen
Indicates that the load balancer was unable to listen on the requested port with the given error. This error gets distributed to all workers that attempt to listen on this port hereafter and the supervisor will need to be restarted.
supervisor shutting down server before confirmed to be listening
This will occur if the load balancer is stopped before it has finished starting. This should be rare in normal operation, and should only occur in response to manual shutdown.
worker subscribed to connections
Indicates that a worker has subscribed to incoming connections on this port.
supervisor requesting to listen on port
Indicates that the load balancer has requested a server on this port.
supervisor received address to listen on port
Indicates that the load balancer has a server that is now listening on the given actual address.
npm install nanny
npm test