pmelchior / scarlet

hyperspectral galaxy modeling and deblending
MIT License
49 stars 22 forks source link

Implement a heartbeat in scarlet #252

Closed fred3m closed 2 years ago

fred3m commented 2 years ago

In https://github.com/lsst/meas_extensions_scarlet/pull/43 Arun implemented a fix on the Rubin side where scarlet can take a long time to deblend a patch, causing it to appear as if it has hung. First of all, as far as I know scarlet has never hung, at least not on ground based data, so this never bothered me, but I can see how it would be a concern to other people outside of the main developers.

In that ticket Arun implemented a fix that creates a log entry every 600 seconds (although the length of time is parameterized), but that fix only works if each blend i less than 600 seconds. Since the meas_extensions_scarlet loop is over blends, not iterations for a single blend, a large blend that takes over 600 seconds will still appear to hang. So this issue is to implement a similar fix to Arun's in scarlet, that will check for some specified length of time (eg. 600 s) and then print the current iteration (with a timestamp?) if it is still deblending.

pmelchior commented 2 years ago

Isn't that a case where education/documentation is favorable to code changes?

In addition, we have a logger anyone can listen to. And the iteration counter is updated anyway, so if people really wanted to know if scarlet is running, they already can. Suggest wontfix.

timj commented 2 years ago

The motivation is large scale processing runs with the PanDA workflow manager. That system has a Pilot process that monitors all the jobs it is responsible for and if it sees no log output after two hours it kills the job and restarts it. Obviously we could configure a longer timeout but that eventually fights against us when a hung job really is encountered. Having scarlet send a log message every N minutes doesn't seem like a bad idea in general. Isn't it good for all your users to see that something is happening rather than having them wait for hours and hope?

pmelchior commented 2 years ago

I understand the reason for the request. I wouldn't mind adding a entry to the logger that shows iteration counter and current state log-likelihood, maybe at logging.debug or .info level. So, anyone who wants to listen to it, can do so.

timj commented 2 years ago

Okay. I was responding to your "wontfix" comment and took that to mean you had no intention of adding a log message. We would not want this to be a debug message because turning on debug globally will give us millions of log records. info is fine (at Rubin we have a verbose log level between debug and info).

pmelchior commented 2 years ago

I didn't want to have yet another mechanism of reporting something. We already use the logger, so no objections to making it a little more lively.

timj commented 2 years ago

Ah. We were only ever talking about the logger.

fred3m commented 2 years ago

So my issue with #253 is that I don't think that we want (and I feel safe saying no one would want) a log entry on every single iteration, since that would completely flood the logger. I think that the better solution would be to subclass Blend on the Rubin side and overwrite Blend._callback so that it implements Arun's time based heartbeat algorithm, since time-based is typically the way that a heartbeat works.

timj commented 2 years ago

Right. Time-based or "every N iterations" makes sense. That's also why we use "verbose" logging (which we define as midway between DEBUG and INFO) such that the heartbeat won't even turn up by default. Most developers I've spoken don't want algorithms to be quiet for hours on end -- it makes them nervous -- so native scarlet solutions seems like a better idea than letting Rubin science pipelines be special.

fred3m commented 2 years ago

The problem with every N iterations is that for those very large blends, there can be upwards of thousands of sources in a single blend, some with quite large footprints due to the satellite trails, so a single iteration can take quite a long time (crowded fields will have even bigger issues). We could try to estimate N based on the total area of all of the footprints combined, but if I know @pmelchior he won't like something that customized. I feel like if we do something natively in scarlet it makes more sense to make a time based heartbeat, which is more standard.

pmelchior commented 2 years ago

Right. This very discussion is why I originally proposed wontfix because I don't want to assume what a scarlet user wants to write around the core functionality. So, it's either this PR with some N that we set or Arun can poll the internals, in particular Blend.loss, which gets updated after every iteration.