rhettg / Tron

Next generation batch process scheduling and management

Better support for services #40

Closed. Roguelazer closed this issue 13 years ago.

Roguelazer commented 13 years ago

It would be most excellent if tron had better support for managing services. Rather than just having to start them as non-daemonizing jobs that last forever, or writing custom client-side code to handle checking daemon status and running that regularly, tron should be able to handle services directly. Here's the configuration format I had in mind:

service:
    name: worker_daemon
    count: 5
    lock_host: False
    pid: /var/run/worker_daemon_XXXXXX.pid
    command: "/usr/bin/worker_daemon --pid_file $PIDFILE --which XXX"
    interval: 1m
    respawn: True
    respawn_attempts: 3

Translation of this: a service named worker_daemon should be instantiated 5 times, not necessarily on the same host. The PID will be put in /var/run/worker_daemon_000001.pid, /var/run/worker_daemon_000002.pid, et cetera. The command will actually be run as

/usr/bin/worker_daemon --pid_file /var/run/worker_daemon_000001.pid --which 001
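In rough Python, the expansion I'm imagining would be something like this (the padding widths and the $PIDFILE substitution are just inferred from the example above, not an existing Tron feature):

    # Sketch: expand the service config into per-instance pid files and commands.
    count = 5
    pid_template = "/var/run/worker_daemon_{instance:06d}.pid"
    command_template = "/usr/bin/worker_daemon --pid_file $PIDFILE --which {instance:03d}"

    for instance in range(1, count + 1):
        pidfile = pid_template.format(instance=instance)
        command = command_template.format(instance=instance).replace("$PIDFILE", pidfile)
        print(command)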

Every minute, tron will connect to the relevant host and check for the running process (by running kill -0 on the contents of /var/run/worker_daemon_000001.pid, checking /proc/$PID/cmdline, or somesuch). If it isn't running, tron will attempt to respawn it. After failing to respawn it 3 times in a row, tron will mark it as disabled (and possibly send some sort of notification).
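A minimal sketch of that check-and-respawn step, assuming the check runs on the node that owns the instance (function names here are placeholders, not actual Tron internals):

    import os
    import subprocess

    def instance_is_running(pidfile):
        # Probe the pid from the pid file with signal 0: a no-op that only
        # tells us whether the process exists.
        try:
            with open(pidfile) as f:
                pid = int(f.read().strip())
            os.kill(pid, 0)
            return True
        except (OSError, ValueError):
            return False

    def check_instance(pidfile, command, consecutive_failures, respawn_attempts=3):
        # Run once per `interval`. Returns the new consecutive-failure count,
        # or None once the instance should be marked disabled.
        if instance_is_running(pidfile):
            return 0
        if consecutive_failures >= respawn_attempts:
            return None  # disabled; possibly send some sort of notification here
        subprocess.call(command, shell=True)
        return consecutive_failures + 1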

rhettg commented 13 years ago

Probably need to break this out and have a larger "how should services work" wiki page.

What does lock_host mean?

I would vote for somehow combining the respawn options into just one. Perhaps by default it always respawns immediately. You could then add respawn_limit to cap the number of attempts before giving up, or respawn_backoff to control how many attempts happen before we slow down how fast we try to respawn.
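To make that concrete, one possible reading of the combined options (the names and the doubling backoff are just placeholders):

    def respawn_delay(consecutive_failures, respawn_limit=None, respawn_backoff=None):
        # Default: always respawn immediately. respawn_limit caps the number of
        # attempts before giving up; respawn_backoff is how many attempts happen
        # at full speed before we start slowing down (doubling the wait, capped).
        if respawn_limit is not None and consecutive_failures >= respawn_limit:
            return None  # give up on this instance
        if respawn_backoff is None or consecutive_failures < respawn_backoff:
            return 0  # respawn immediately
        return min(60 * 2 ** (consecutive_failures - respawn_backoff), 3600)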

Roguelazer commented 13 years ago

lock_host is a boolean controlling whether all copies should be on the same host or whether tron should be allowed to spread them out over multiple hosts.

The respawn changes seem reasonable.

rhettg commented 13 years ago

Put more of a design-y document up here.

rhettg commented 13 years ago

For splitting instances up over nodes... your options would be:

If the default behavior was #1 (split evenly), then the other two could be configured by adjusting other parameters with no loss of flexibility, right? You could just change the value for count and/or configure the service on fewer nodes.
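As a sketch of what #1 (split evenly) might mean in code, with the node names purely illustrative:

    def assign_instances(count, nodes):
        # Round-robin `count` service instances across the available nodes,
        # e.g. 5 instances over 2 nodes -> 3 on the first, 2 on the second.
        return {i: nodes[(i - 1) % len(nodes)] for i in range(1, count + 1)}

    assign_instances(5, ["batch1", "batch2"])
    # {1: 'batch1', 2: 'batch2', 3: 'batch1', 4: 'batch2', 5: 'batch1'}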

rhettg commented 13 years ago

Closing this for now. Service support mostly works.