Hunter in long-running processes

Michael-F-Ellis commented 7 years ago

I've got a remote Python (3.5+) app that spawns (and re-spawns if need be) a half-dozen or so child processes. The processes interact tightly with each other thru zmq, so breaking into a running process with RemotePdb is problematic.

I've got good logging and most of the time that's sufficient to deduce the problem when something goes wrong. For the times when that's not enough, I'm thinking Hunter might be the answer. So here's what I'd like to be able to do:

Shell into the remote,
Run a bash command to activate hunter in one of my processes.
View the trace in stdout (or possibly redirect to a file)
Conveniently stop hunter (without killing the process) when I've seen enough.

The easy bits are:

Import hunter in each process's modules.
Add a USER2 signal handler that toggles hunter.trace/stop

I'm less clear about how to pass the module and filtering arguments to hunter.

Any help appreciated! Mike

Michael-F-Ellis commented 7 years ago

Here's what I've got so far starting from your examples. It's very general (and very dangerous in an unsecured environment) since it uses exec to evaluate whatever script exists as /tmp/hunterscript.py. All that's needed in advance is to import signalhunter to register the signal handler.

target.py

""" The  process to be traced """
import os
import time
import signalhunter  
while True:
    os.path.join('a', 'b')
    time.sleep(3)

signalhunter.py

""" 
Installs a signal handler that execs "/tmp/hunterscript.py" on SIGUSR2.
Sending the signal again toggles the tracing off. 
"""
import signal
import hunter

_hunting = False
def handler(signum, frame):
    """ Toggle hunter tracing """
    global _hunting
    if _hunting:
        _hunting = False
        hunter.stop()
    else:
        _hunting = True
        try:
            script = open("/tmp/hunterscript.py").read()
            exec(script)
        except:
            hunter.stop()
            _hunting = False

signal.signal(signal.SIGUSR2, handler)

hunterscript.py

""" A tracing script """
import hunter
hunter.trace(hunter.Q(module='posixpath', action=hunter.VarsPrinter('path')))

ionelmc commented 7 years ago

Ah yes, the tmpfile trick :)

I'm looking now how to make activation using manhole. Seem there are some issue with piping input, once sorted out I'll be back with an example.

I assume manhole works on your platform yes?

ionelmc commented 7 years ago

So with manhole would be having something like this:

import os
import sys
import time
import manhole
manhole.install(oneshot_on='USR2', connection_handler=manhole.handle_connection_exec)

def do_stuff():
    time.sleep(1)
    print(time.time())

print('PID:', os.getpid())
while True:
    do_stuff()

Then, to enable tracing: echo 'import hunter; hunter.trace()' | scripts/manhole-cli -2 <PID> And to stop: echo 'import hunter; hunter.stop()' | scripts/manhole-cli -2 <PID>

You should pip install https://github.com/ionelmc/python-manhole/archive/0de63f717350fbb59a0573f56c7c77ae085037bb.zip first tho, as the connection_handler option ain't released yet.

Let me know how it works.

PS. You can do it without manhole, in a secure fashion. You just need to open a unix domain socket, check if connecting user is the right one (SO_PEERCRED), and read the start/stop commands from the socket in your signal handler. For the client part you can use socat/netcat.

Michael-F-Ellis commented 7 years ago

Hey, thanks for all this! I'll give it a try but it may take a while as I'm trying to get a project in good shape before going on vacation for a couple of weeks.

In the meantime, I've run into something puzzling with trying to get output from hunter. The tmpfile approach I outlined works fine on simple test cases, but when I try to run it on the real application no output is written to the terminal window. That's perhaps not so surprising since these processes are spawned with multiprocessing.Process().

So I tried writing to a file, e.g.

# -*- coding: utf-8 -*-
""" A tracing script """
import hunter
with open('/tmp/hunterout', 'w') as f:
    hunter.trace(
        hunter.Q(
            module='pi2_interface',
            action=hunter.CallPrinter(stream=f),
        )
    )

When I signal the process, e.g. kill -USR2 11119 let it run for a few seconds and then turn off tracing with the same signal, I find that /tmp/hunterout was created and ls -l shows the expected creation time, but the file is alway empty.

I know the main function in pi2_interface is running because the program is producing live data that couldn't come from anywhere else. Any idea what might be wrong?

ionelmc commented 7 years ago

Perhaps there's no module pi2_interface? Multiprocessing might be doing weird stuff (eg: module being __main__)

Michael-F-Ellis commented 7 years ago

Almost. When I put in a statement to log __name__ I found the module reports itself as modules.pi2_interface. The culprit is my own import statement rather than multiprocessing.

Michael-F-Ellis commented 7 years ago

Hey, now that I've got it working I just want to say that hunter is massively useful. I've already spotted two places where my app was doing a lot of unneeded lookups that were easily fixed with few lines of cacheing code. Thanks for writing and sharing it!

Is there (or could there be) an option to stop tracing after, say, 1000 lines of output?

ionelmc commented 7 years ago

You could subclass CallPrinter to count the events and call hunter.stop() I guess ...

What's the actual usecase here?

Michael-F-Ellis commented 7 years ago

Use cases:

Quick look at what an active process is doing.
Wait patiently for a rarely called module.

ionelmc commented 7 years ago

For 1 perhaps you only need a CLI tool to attach/redirect outputs/deattach on ctrl-c. Right?

For 2 I could provide an example on how to do it.

Michael-F-Ellis commented 7 years ago

I've got a script using tmpfiles that works pretty well if the processes have been started in a terminal session. The normal case is that the processes have been launched with nohup in rc.local. That prevents hunter's output from showing up if I try it from an ssh login. I think the right approach is to launch the master process with tmux instead of nohup so I can log in and attach.

I'm about to be traveling until the 11th or so, I'll resume this when I get back. Thanks for all your help!

ionelmc commented 7 years ago

For 1 there's https://github.com/ionelmc/python-hunter/issues/27 - considering implementation.

For 2, perhaps an api like this trace(StopAfter(1000, module='...', actions=...))

PS. New manhole package is released.

ionelmc commented 6 years ago

I think we can close this. Try the latest release.

ionelmc / python-hunter

Hunter in long-running processes #34