jcrist / skein

A tool and library for easily deploying applications on Apache YARN
https://jcristharif.com/skein/
BSD 3-Clause "New" or "Revised" License

Unable to start a new skein client daemon after having killed the previous one #140

Closed nateagr closed 5 years ago

nateagr commented 5 years ago

For some reason, the skein client daemon can be abruptly killed. It turns out we can't start a new skein client daemon after that, for the following reason: when a skein client daemon starts, a file ~/.skein/daemon is created containing information such as the socket address and PID of the daemon. When the daemon is abruptly killed, the process dies but the file remains. When trying to start a new skein client daemon, skein first checks whether a daemon is already running by looking for the file ~/.skein/daemon: if it exists, skein assumes a daemon is already running; otherwise it assumes none is. This issue is about making the way skein checks whether a daemon is already running more robust.

jcrist commented 5 years ago

For some reason, the skein client daemon can be abruptly killed

Can you go into more detail on this? When does this happen? Any ideas why? Have you looked at the logs from the driver to see if it's an error? I have not seen this behavior before.

This issue is about making the way skein checks whether a daemon is already running more robust.

This is a bit tricky. We don't just check if there is a file for a global driver, we also send a ping to the saved address to check that the written pid is still a skein driver process. The issue here is that the driver could die, and a new process could start up with the same pid (unlikely, but possible), so we can't assume that the pid is correct. That's why skein driver stop will error if it fails to communicate with the driver process rather than killing the PID outright.

In the recent release, you can skip the connection checking behavior with stop, and just force cleanup.

# Will ensure you have a driver running in all cases
skein driver stop --force
skein driver start

We could add behavior to check if the driver pid no longer exists and handle this case, rather than binning this in with ConnectionError, but I'm a bit wary of adding this because driver failures are unusual and the user should probably be aware they're happening. This could be handled with a warning though, so I'm not 100% against that.

Is the --force flag sufficient for you?

jcrist commented 5 years ago

We could add behavior to check if the driver pid no longer exists and handle this case

I've changed my mind on this, and done this in #141. In this specific case, running skein driver start will fix the broken state and give you a new working driver (with a user-facing warning that this has happened).
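
For the Python API the recovery should look roughly like the sketch below; this assumes the fix in #141 also applies to the global-driver helper, not just the CLI:

import skein

# If the previous driver died abruptly, this should now warn about the
# stale state and start a fresh driver instead of raising.
skein.Client.start_global_driver()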

vincent-grosbois commented 5 years ago

Can you go into more detail on this? When does this happen? Any ideas why?

It can happen if the client process is killed by kill -9 pid for instance. Then I think no proper clean-up can ever be done in this situation on the client side

jcrist commented 5 years ago

It can happen if the client process is killed by kill -9 pid for instance. Then I think no proper clean-up can ever be done in this situation on the client side

Yes, that can happen for any process. I just wanted to make sure this wasn't a bug in Skein where the process would die.

nateagr commented 5 years ago

Hi @jcrist, I should have been a bit more specific indeed. When talking about the daemon being abruptly killed, I was referring to a situation such as the one @vincent-grosbois described, not a bug in skein itself. Sorry for the misunderstanding.

I can provide a bit more context about the situation we encountered (which led to this issue). We are using the skein Python API. Before submitting jobs, we start a daemon by calling the static method "start_global_daemon". What we noticed is that we get the exception "ConnectionError: Unable to connect to daemon" while trying to ping the process whose PID is in the file (the daemon having been killed).
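
A minimal sketch of the failing sequence we hit, assuming a stale ~/.skein/daemon file left behind by the killed daemon:

import skein

# Normally a no-op if the global daemon is already running. With a stale
# ~/.skein/daemon file (daemon killed, file left behind), this ends up
# pinging a dead process and raises:
#   ConnectionError: Unable to connect to daemon
skein.Client.start_global_daemon()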

Edit: I've just seen your PR. Thanks!

What could be done is catching the exception around client = Client.from_global_daemon(). In core.py:

    @staticmethod
    def start_global_daemon(log=None):
        """Start the global daemon.

        No-op if the global daemon is already running.

        Parameters
        ----------
        log : str, bool, or None, optional
            Sets the logging behavior for the daemon. Values may be a path for logs
            to be written to, ``None`` to log to stdout/stderr, or ``False`` to
            turn off logging completely. Default is ``None``.

        Returns
        -------
        address : str
            The address of the daemon
        """
        try:
            client = Client.from_global_daemon()
        except DaemonNotRunningError:
            pass
        except ConnectionError:
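            # the daemon is assumed dead but left its file behind; remove the
            # stale file so that a new daemon can be started below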
            os.remove(os.path.join(properties.config_dir, 'daemon'))
        else:
            return client.address
        address, _ = _start_daemon(set_global=True, log=log)
        return address

What do you think? By the way, you've done an amazing job with Skein!

jcrist commented 5 years ago

Note that in the recent release, daemon was renamed to driver, as daemon was confusing for some users.


What could be done is catching the exception around client = Client.from_global_daemon(). In core.py:

That's not 100% correct, as we don't want to delete the file unless we're sure the driver has stopped (a ConnectionError could be for many reasons). The fix in #141 should fix your problem in a more robust way.
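
To illustrate (the helper names here are made up, and this is not skein's actual code from #141), the kind of guarded cleanup that makes sense is roughly: only treat the state as stale when the recorded PID no longer exists, and warn so the user knows the driver died.

import errno
import os
import warnings

def _pid_exists(pid):
    """Best-effort check that a process with the given PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check, sends nothing
    except OSError as exc:
        return exc.errno != errno.ESRCH  # ESRCH: no such process
    return True

def _maybe_remove_stale_driver_file(path, pid):
    # A ConnectionError alone is not proof the driver stopped, so only
    # clean up when the recorded process is definitely gone.
    if not _pid_exists(pid):
        warnings.warn("Previous driver appears to have died; removing stale state")
        os.remove(path)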

Before submitting jobs, we start a daemon by calling the static method "start_global_daemon".

The global driver is really to make the CLI experience more pleasant (otherwise you have to wait for the JVM to startup on every CLI command). Since you're using the Python API, is there a reason you don't just create a temporary driver process that only persists as long as your Python process?

client = skein.Client()
# your code here

nateagr commented 5 years ago

The global driver is really to make the CLI experience more pleasant (otherwise you have to wait for the JVM to startup on every CLI command). Since you're using the Python API, is there a reason you don't just create a temporary driver process that only persists as long as your Python process?

We want to limit the number of processes, so using a common driver for all of our processes that submit jobs to YARN seemed like a good idea to us.

jcrist commented 5 years ago

If you're only using a client to submit an application, they should be fairly short-lived processes. The driver should also be fairly lightweight (at least with regard to CPU/memory; #117 indicated that there may be issues with the JVM taking up too many threads without additional tuning).

Anyway, I believe #141 should have resolved your issue.

nateagr commented 5 years ago

Hi @jcrist,

I'd like to reopen this issue since we're still facing the problem.

Some context: we are using skein to submit YARN jobs. In order to limit resource usage on the local machine, we start a global driver and create skein Clients which connect to the global driver to submit YARN jobs.

Problems we are facing:

  1. For some reason (for example, a reboot of the local machine), the global driver has been killed but not correctly stopped (with stop_global_driver). As a consequence, we'd like to restart the global driver but we are unable to do so unless we delete the file .skein/driver.

  2. There is nothing to prevent users from starting several global drivers simultaneously. As a consequence, we run into a race condition when writing the PID and socket address to the file .skein/driver.

What we suggest: We'd like to be able to give a path (in /tmp) when starting a global driver. In this way:

jcrist commented 5 years ago

I'll reiterate that this is kind of an antipattern, really you want to have a single point where a global driver is started (perhaps tell the users to run skein driver start once beforehand). Is the driver ever stopped? If the info is written to a file in /tmp, what happens if the file is cleaned up before the driver is stopped? Who's responsible for stopping the driver? If you're writing to a different file in /tmp for each job, doesn't that still start multiple drivers per user?

As a consequence, we'd like to restart the global driver but we are unable to do so unless we delete the file .skein/driver.

skein.Client.stop_global_driver(force=True) is not sufficient?

There is nothing to prevent users from starting several global drivers simultaneously. As a consequence, we run into a race condition when writing the PID and socket address to the file .skein/driver.

We could also fix this race condition.


What you are asking is already possible (though with a less-nice api) - skein will use whatever directory SKEIN_CONFIG points at when writing the file (note that this is determined at import time, not at call time). We could add a kwarg for this (I'd probably call it config_dir) to start_global_driver, stop_global_driver, and from_global_driver, but I'd be skeptical of doing this.
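
For example (a sketch, with a made-up path), pointing skein at a different config directory currently looks like:

import os

# SKEIN_CONFIG is read when skein is imported, so set it first
os.environ["SKEIN_CONFIG"] = "/tmp/my-skein-config"  # hypothetical path

import skein

skein.Client.start_global_driver()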

Rather, for other projects we've found using skein.Client() to create a temporary driver that is cleaned up automatically when the python process exits to be a good model. No extra state on disk and no lingering processes. If you need to have multiple python processes connect to the driver (as you said in #167) you could connect with a new client by passing the address and security credentials to skein.Client (e.g. skein.Client(address=client.address, security=client.security)). We should make skein.Client objects pickleable, but that's how it'd work.
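
A sketch of that pattern, using the attributes and arguments described above:

import skein

# one process starts a temporary driver that lives as long as this process
client = skein.Client()

# hand the connection info to other processes however is convenient
address, security = client.address, client.security

# elsewhere: connect to the existing driver instead of starting a new one
other_client = skein.Client(address=address, security=security)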

nateagr commented 5 years ago

If the info is written to a file in /tmp, what happens if the file is cleaned up before the driver is stopped? Who's responsible for stopping the driver?

We would write something like the code below to make sure the temporary file is kept while the driver is still alive:

with tempfile.TemporaryFile() as temp_file:
    with skein.Client.start_global_driver(temp_file):
        ...  # submit jobs while the driver is alive

What do you think?

If you're writing to a different file in /tmp for each job, doesn't that still start multiple drivers per user?

Yes.

We are actually trying to tell users to start the driver once, either via the CLI or Python, but we still face situations where users start multiple drivers, so we'd like to make it work instead.

skein.Client.stop_global_driver(force=True) is not sufficient?

We could ping the driver to see if it is still alive and reachable via gRPC, and restart it if it is not?

If you need to have multiple python processes connect to the driver (as you said in #167) you could connect with a new client by passing the address and security credentials to skein.Client (e.g. skein.Client(address=client.address, security=client.security)). We should make skein.Client objects pickleable, but that's how it'd work.

What you suggest is to create a first skein Client and give its address and security to other skein Clients so they connect to the same skein Java process?

jcrist commented 5 years ago

What you suggest is to create a first skein Client and give its address and security to other skein Clients so they connect to the same skein Java process?

Yeah. The easiest way would be to make them pickleable so they could be shared via multiprocessing (I'll toss a quick PR up for this), but the code I've written as-is should work. This removes any state on disk - the driver process sticks around for as long as the originating Python process, and all child Python processes can access the original client.
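
A sketch of that with multiprocessing (assuming the address and security objects can be handed to the child process, which holds with the default fork start method on Linux and is what pickling support would make fully general):

import multiprocessing

import skein

def submit_from_worker(address, security):
    # connect to the driver started by the parent instead of creating a new one
    client = skein.Client(address=address, security=security)
    # ... build an application spec and submit it with this client ...

if __name__ == "__main__":
    parent = skein.Client()  # temporary driver, cleaned up when this process exits
    worker = multiprocessing.Process(
        target=submit_from_worker,
        args=(parent.address, parent.security),
    )
    worker.start()
    worker.join()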