mlincett closed this issue 1 year ago
The tmux thing seems strange. You say you're logging to stdout/stderr, when you probably want to log to a file, so that's one of the issues that should really be fixed.
There is a case to be made for forking / opening a subprocess for the scan: if something really bad happens (a memory error or segfault), it can be contained. I'm unsure how important that is, though, since that shouldn't happen; normal errors should be caught with try/except. I will say that the bash script that invokes tmux seems far too complicated. Just calling `subprocess` should be enough if you want to keep separate processes.
But there should be more separation between the two parts here. If you're doing an external call, they really should be separate scripts, not one script calling itself with different options. At a minimum they should be broken up so they can be tested properly. (This is likely a common problem: tacking on new features without improving the structure.)
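That "just calling `subprocess`" suggestion could look roughly like the sketch below. The `--event` flag and the default `perform_scan.py` script name are taken from this thread, but the exact CLI of the real scan script may differ, so treat this as an assumption-laden illustration rather than the actual implementation:

```python
import subprocess
import sys
from pathlib import Path


def spawn_scan(event_pkl: Path, log_dir: Path,
               scan_script: str = "perform_scan.py") -> subprocess.Popen:
    """Launch one scan as a plain subprocess, logging to a per-event file.

    A sketch only: the real scan script's CLI may differ.
    """
    log_dir.mkdir(parents=True, exist_ok=True)
    log_file = open(log_dir / f"{event_pkl.stem}.log", "w")
    # sys.executable keeps the child on the same interpreter as the listener;
    # stderr is merged into the same per-event log file.
    return subprocess.Popen(
        [sys.executable, scan_script, "--event", str(event_pkl)],
        stdout=log_file,
        stderr=subprocess.STDOUT,
    )
```

This replaces the bash-plus-tmux indirection with one call, and each scan's output ends up in its own log file instead of a detached session.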
TL;DR: I see two options: handle logging and exceptions properly in a single process, or spawn each scan via `subprocess`.
I am in favour of delegating the event scan to a subprocess. Having the entire analysis logic "passed" as a function to the ZMQ receiver makes everything a bit obscure.
I agree with both of you. We need to move away from obfuscated callbacks.
I'm going to use the "known event"/"non-ZMQ"/`--event PKL_FILE` alert-listener code path for https://github.com/icecube/skymap_scanner/pull/30. I will also disable the callbacks and call `perform_scan.py` in a subprocess. We can replace the "callback functionality" afterward by detecting changed output files (or stdout) instead. I think this can serve as a good starting point for @mlincett's larger redesign/refit for this GH issue.
The `listener_overhaul` branch now has a working listener (tentatively named `alert_daemon` to avoid confusion / abuse of versioning) that just dumps a pickle file for every alert and spawns a subprocess that writes to an individual log file (no more haunted tmux sessions).
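The per-alert pickle handoff described above can be sketched as follows. The file-naming scheme and the alert dictionary keys here are illustrative assumptions, not the actual schema used by the daemon:

```python
import pickle
from pathlib import Path


def dump_alert(alert: dict, spool_dir: Path) -> Path:
    """Persist one alert as a pickle file for a scan subprocess to pick up."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per (run, event) pair.
    name = f"run{alert['run_id']:08d}.evt{alert['event_id']:012d}.pkl"
    path = spool_dir / name
    with open(path, "wb") as f:
        pickle.dump(alert, f)
    return path


def load_alert(path: Path) -> dict:
    """Inverse operation, called from the scan subprocess."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The pickle file doubles as the only interface between listener and scanner, which is exactly the rigidity discussed in the next comment.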
I am going through the existing code that handles the event / cache / state / GCD info and progressively migrating functionality to an object-oriented approach. The existing code mixes together a lot of data that ideally belongs to different entities, and there are many odd back-and-forth conversions that should/will be scrutinised.
It should be noted that the pickle-based communication between the listener/daemon and the subprocesses is rather rigid and not pleasant to maintain, especially if multiple recos may come to depend on the same daemon. (This could probably also be re-implemented as a message queue, but first I aim to have a working drop-in replacement for the current listener.)
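As a rough illustration of the message-queue idea, here is a sketch using the stdlib `queue` and worker threads as a stand-in for a real inter-process broker. The worker body is a placeholder, not an actual reco, and all names are hypothetical:

```python
import queue
import threading


def scan_worker(alerts: "queue.Queue", results: "queue.Queue") -> None:
    """Consume alerts until a None sentinel arrives; emit one result each."""
    while True:
        alert = alerts.get()
        if alert is None:
            break
        # Placeholder: a real worker would run a reco on `alert` here,
        # likely in its own process rather than a thread.
        results.put({"event_id": alert["event_id"], "status": "done"})


def run_scans(alert_list, n_workers: int = 2):
    """Fan a batch of alerts out to workers and collect the results."""
    alerts, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=scan_worker, args=(alerts, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for alert in alert_list:
        alerts.put(alert)
    for _ in workers:
        alerts.put(None)  # one shutdown sentinel per worker
    out = [results.get() for _ in alert_list]
    for w in workers:
        w.join()
    return out
```

The point of the pattern is that multiple recos could subscribe to the same queue without the daemon knowing about each consumer's file format, which is what makes it more flexible than the pickle handoff.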
@mlincett Please look at https://github.com/icecube/skymap_scanner/pull/30 in-depth before proceeding with any heavy refactoring (to code beyond the alert listener). I'm using/changing some of the "event / cache / state / GCD info" logic. I'm hoping to prevent merge conflicts as much as possible between our work.
I am tentatively moving this development effort to a new repository, as the functionalities of the two components (listener and scanner) are now much more clearly separated.
Awesome, please link it here when you have it
The alert listener (`alert_v2_listener.py`) is designed to parse events received through ZeroMQ and invoke millipede scans under certain conditions. All the logic is handled in a function passed as a `callback` method to a realtime receiver object. In other words, it is the receiver that invokes the scanner logic on each event, rather than providing an object on which the listener can operate.

The listener has an optional mode (the one we currently use in realtime) in which a new instance of itself is spawned in a dedicated tmux session, with the purpose of scanning the pending event. Each new instance is spawned by invoking a bash script that in turn creates a tmux session and invokes the same Python listener script inside it. The event is passed through as a pickle file.
@clagunas has explained to me that the main reason for this design was to:
However, the resulting structure is quite convoluted, not very readable, and rather hard to maintain.
I am thinking of refactoring the listener, but I am not sure about the best design choice. Is multiprocessing actually needed, or do we just need proper handling of the logs (perhaps using log files rather than stdout in different tmux sessions) and of exceptions (so that a failed event scan does not cause the listener to fail)?
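A minimal in-process sketch of that second alternative, under assumed names (a per-event log file plus an exception barrier, so one failed scan cannot take down the listener):

```python
import logging
from pathlib import Path


def safe_scan(event: dict, scan_fn, log_dir: Path) -> bool:
    """Run one scan in-process; failures are logged to a file, not fatal.

    `scan_fn` stands in for the millipede scan entry point.
    """
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"scan.{event['event_id']}")
    handler = logging.FileHandler(log_dir / f"{event['event_id']}.log")
    logger.addHandler(handler)
    logger.propagate = False  # keep per-event output out of the main log
    try:
        scan_fn(event)
        return True
    except Exception:
        # Full traceback goes to the per-event log; the listener survives.
        logger.exception("scan failed; listener keeps running")
        return False
    finally:
        logger.removeHandler(handler)
        handler.close()
```

This would not protect against segfaults or memory exhaustion inside the scan (the argument for a subprocess made earlier in this thread), but it covers ordinary Python errors and fixes the log separation without tmux.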
Any input / advice is welcome before I go ahead with this.