icecube / skymap_scanner

A distributed system that performs a likelihood scan of event directions for IceCube real-time alerts using CPU cluster(s) and queue-based message passing.

Refactoring of alert listener #28

Closed mlincett closed 1 year ago

mlincett commented 2 years ago

The alert listener (alert_v2_listener.py) is designed to parse events received through ZeroMQ and invoke millipede scans under certain conditions.

All the logic is handled in a function passed as a callback method to a realtime receiver object. In other words, it is the receiver that invokes the scanner logic on each event, rather than providing an object on which the listener can operate.

The listener has an optional mode (the one we currently use in realtime) in which a new instance of itself is spawned on a dedicated tmux session, with the purpose of scanning the pending event. Each new instance is spawned by invoking a bash script that in turn creates a tmux session and invokes the same python listener script inside it. The event is passed through as a pickle file.

@clagunas has explained to me that the main reason for this design was to:

  1. have a cleaner handling of the scans, separating the scan log from the listener log;
  2. prevent a crash of the scan from crashing the listener.

However, the resulting structure is quite convoluted, not very readable, and hard to maintain.

I am thinking of refactoring the listener, however I am not sure about the best design choice. Is multiprocessing actually needed or do we just need a proper handling of the logs (maybe using log files rather than stdout on different tmux sessions) and exceptions (so that a failure of an event scan does not cause the listener to fail)?
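For the "no multiprocessing" option, the log and exception handling could look roughly like this. This is a minimal sketch, not the existing code: `handle_alert()` stands in for the millipede-scan logic currently passed as a callback, and the log-file naming scheme is an assumption.

```python
import logging
import traceback

def make_event_logger(event_id: str) -> logging.Logger:
    """Give each event scan its own log file instead of a tmux session."""
    logger = logging.getLogger(f"scan.{event_id}")
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        handler = logging.FileHandler(f"scan_{event_id}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

def handle_alert(payload: dict, logger: logging.Logger) -> None:
    """Placeholder for the actual millipede scan invocation."""
    logger.info("scanning event with %d fields", len(payload))

def on_alert(event_id: str, payload) -> None:
    """Wrapper registered with the receiver: a failing scan is logged,
    but cannot take down the listener itself."""
    logger = make_event_logger(event_id)
    try:
        handle_alert(payload, logger)
    except Exception:
        logger.error("scan failed:\n%s", traceback.format_exc())
```

This covers both goals in the same process: separate per-event logs and a listener that survives any ordinary exception raised by a scan.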

Any input / advice is welcome before I start going through with this.

dsschult commented 2 years ago

The tmux thing seems strange. You say you're logging to stdout/stderr when you probably want to log to a file, so that's one of the issues that should really be fixed.

There is a case to be made for forking / opening a subprocess for the scan: if something really bad happens (a memory error or segfault), the parent process can still catch it. I'm unsure how important that is, though, since that shouldn't happen; normal errors should be caught with try/except. I will say that the bash script that invokes tmux seems way too complicated. Just calling subprocess should be enough if you want to keep separate processes.
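The "catching hard crashes" point can be made concrete: a segfault kills only the child process and surfaces in the parent as a negative return code (the signal number). A sketch, where the command and log-file handling are assumptions rather than the existing script:

```python
import subprocess

def run_scan(cmd: list, log_path: str) -> bool:
    """Run the scan command in a child process, capturing its output in
    a log file. A segfault or OOM kill shows up as a negative return
    code, which the parent can detect and survive."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    if result.returncode < 0:
        # killed by a signal, e.g. -11 for SIGSEGV
        print(f"scan killed by signal {-result.returncode}")
        return False
    return result.returncode == 0
```

A hypothetical invocation would be something like `run_scan([sys.executable, "perform_scan.py", "--event", "event.pkl"], "event.log")` — no tmux or helper bash script needed.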

But there should be more separation between the two parts here. If you're doing an external call, they really should be two different scripts, not the same script calling itself with different options. At a minimum they should be broken up so they can be tested properly. (This is likely a common problem: tacking on new features without improving the structure.)


TL;DR: I see two options:

  1. Convert to two separate scripts, and call the other with subprocess.
  2. Migrate all of this into python modules (for the separation), and keep in the same process. You can still have a script that just parses args and calls the python modules.
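For option 2, the thin entry point could look roughly like this. Everything here is a placeholder sketch: `receive_alerts` and `scan_event` are hypothetical names (shown inline as stubs so the example is self-contained; in practice they would live in importable, testable modules).

```python
# Hypothetical thin CLI: parse args, then delegate to module functions.
import argparse

def scan_event(event_path: str) -> None:
    """Stand-in for the millipede scan logic (would live in a module)."""
    print(f"scanning {event_path}")

def receive_alerts(on_event) -> None:
    """Stand-in for the ZMQ receive loop (would live in a module)."""
    for path in ("event_a.pkl", "event_b.pkl"):  # pretend incoming alerts
        on_event(path)

def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="alert listener (sketch)")
    parser.add_argument("--event", help="scan a single pickled event and exit")
    args = parser.parse_args(argv)
    if args.event:
        scan_event(args.event)               # one-shot mode
    else:
        receive_alerts(on_event=scan_event)  # listener mode

if __name__ == "__main__":
    main()
```

The point of this shape is that `scan_event` and `receive_alerts` can be unit-tested directly, while the script itself stays trivial.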
mlincett commented 2 years ago

I am in favour of delegating the event scan to a subprocess. Having the entire analysis logic "passed" as a function to the ZMQ receiver makes everything a bit obscure.

ric-evans commented 2 years ago

I agree with both of you. We need to move away from obfuscated callbacks.

I'm going to use the "known event"/"non-ZMQ"/--event PKL_FILE alert-listener code path for https://github.com/icecube/skymap_scanner/pull/30. I will also disable the callbacks and call perform_scan.py in a subprocess. We can replace the "callback functionality" afterward by detecting changes to output files (or stdout) instead. I think this can serve as a good starting point for @mlincett's larger redesign/refit for this GH issue.

mlincett commented 2 years ago

The listener_overhaul branch now has a working listener (tentatively named alert_daemon to avoid confusion / abuse of versioning) that just dumps a pickle file for every alert and spawns a subprocess that writes to an individual log file (no more haunted tmux sessions).
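The flow described above could be sketched as follows. This is an illustration, not the branch's actual code: `dispatch()`, the file naming, and the `outdir` argument are assumptions; perform_scan.py is the script mentioned elsewhere in this thread.

```python
import pickle
import subprocess
import sys
from pathlib import Path

def dispatch(event_id: str, event: dict, outdir: Path) -> subprocess.Popen:
    """Dump the alert to a pickle file and hand it to a scan subprocess
    whose output goes to a per-event log file (no tmux involved)."""
    pkl = outdir / f"{event_id}.pkl"
    with open(pkl, "wb") as f:
        pickle.dump(event, f)
    with open(outdir / f"{event_id}.log", "w") as log:
        # the child keeps its own copy of the log file descriptor
        proc = subprocess.Popen(
            [sys.executable, "perform_scan.py", "--event", str(pkl)],
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    return proc
```

Since `Popen` returns immediately, the daemon can go back to receiving alerts while the scan runs, and a crashed scan only affects its own subprocess and log file.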

I am going through the existing code that handles the event / cache / state / GCD info and progressively migrating functionality to an object-oriented approach. The existing code mixes together a lot of data that ideally belongs to different entities, and there are a lot of odd back-and-forth conversions that should/will be scrutinised.

It should be noted that the pickle-based communication between the listener/daemon and the subprocesses is rather rigid and not very pleasant to maintain, especially if we expect multiple recos to depend on the same daemon. (This could probably also be re-implemented as a message queue, but first I aim to have a working drop-in replacement for the current listener.)

ric-evans commented 2 years ago

@mlincett Please look at https://github.com/icecube/skymap_scanner/pull/30 in-depth before proceeding with any heavy refactoring (to code beyond the alert listener). I'm using/changing some of the "event / cache / state / GCD info" logic. I'm hoping to prevent merge conflicts as much as possible between our work.

mlincett commented 1 year ago

I am tentatively moving this development effort to a new repository, as the functionalities of the two components (listener and scanner) are now much more clearly separated.

ric-evans commented 1 year ago

Awesome, please link it here when you have it

mlincett commented 1 year ago

Alright, welcome SkyMist.