Add -recheck-with-time-limit support

hoffie commented 2 years ago

process-exporter already supports the -recheck flag which makes it run the whole matching logic on each scrape. This is very useful when trying to monitor processes which change their names shortly after start.

Sadly, -recheck carries a rather high performance penalty. At the same time, process name changes are very common directly after start, are seldomly expected during usage.

This commit introduces -recheck-with-time-limit which rechecks processes N seconds after their start and stops doing so afterwards. This combines the accuracy benefits of -recheck with the performance gains of not using -recheck.

I'd be glad if this could be accepted. Let me know if you want any changes. Thanks for working on process-exporter! :)

flixr commented 2 years ago

@hoffie this sounds great! I initially added the -recheck option and later also wanted to do something like this to limit the performance hits, but never got around to it... I was more thinking of an exponential backoff to skip full rechecks for scrapes, but I guess this would do the job just as well for our usecases. Just like you said the name usually only changes shortly after start of the process.

One question though: we are also using this on an embedded device which does not have it's own RTC battery and is usually time-synced via PTP to an external source. Meaning that system time can jump forwards or backwards at any time. I guess in this case this might not work, right? For that we probably would need to store monotonic time and compare against that, since we only care about duration for this limit.

hoffie commented 2 years ago

@flixr Thanks for the feedback!

I was more thinking of an exponential backoff to skip full rechecks for scrapes

This sounds like an interesting approach as well. It would probably require storing that iteration count per-process though, wouldn't it?

One question though: we are also using this on an embedded device which does not have it's own RTC battery and is usually time-synced via PTP to an external source. Meaning that system time can jump forwards or backwards at any time. I guess in this case this might not work, right?

Probably depends on the size of the jumps and the exact -recheck-with-time-limit value.

For that we probably would need to store monotonic time and compare against that, since we only care about duration for this limit.

Some ideas:

time.Time seems to support monotonic timestamps. The main issue will be getting a monotonic-time-enabled time.Time for the StartTime. Because watched processes may have been started before process-exporter, we can't use Now(). I'm not sure if there is a way to convert a wall clock Unix timestamp into a (guessed!) monotonic timestamp.
There's StartTimeRel which seems to be a Kernel-provided (/proc/PID/stat) monotonic timestamp measured in ticks after boot. However, I have no idea how to get the current value of ticks after boot in a sane way. In other words, I don't know what to compare with. It might be that Go's time.Now() includes exactly this value, but I've found no clear evidence that it does and -- more importantly -- that one should rely on that.
We could likely run our own monotonic clock in the process via time.Ticker. This would imply storing another int per-process though.

In the end, I'm currently unsure whether to go this path. All of it would introduce complexity and while it might solve one type of problem (jump in time for embedded devices), it might cause other inaccuracies (e.g. suspended systems). I'll leave it up to @ncabatoff to decide if and how this should be done. Maybe it would be an option to merge the current, simple approach and try refining later?

Ressources: https://man7.org/linux/man-pages/man2/clock_gettime.2.html https://blog.cloudflare.com/its-go-time-on-linux/

flixr commented 2 years ago

Agreed, this is clearly a very useful option and should cover the vast majority of use cases.

In my case the time jump can potentially even be years (no RTC, no battery, so no useful time after cold reboot). But this is a very special case and this PR should improve for 99% of cases and the option to always recheck is still there.

Probably the question is rather if there should be an additional option as you implemented it (which is great for backwards compatibility) or if this should be "merged" into one single, easier to understand option....

hoffie commented 2 years ago

In my case the time jump can potentially even be years (no RTC, no battery, so no useful time after cold reboot). But this is a very special case

Heh, yeah, especially when considering common server use cases.

Probably the question is rather if there should be an additional option as you implemented it (which is great for backwards compatibility) or if this should be "merged" into one single, easier to understand option....

Yes. In a green field implementation there should only be one option. For compatibility I chose to add a new flag and keep the existing. I don't see a way to deprecate or hide a flag using Go's standard flag package, which is used here (I guess kingpin's can do it).

Anyway, probably up to @ncabatoff as well. :)

As it stands, the following things should be decided:

[ ] Should it be merged?
[ ] Does it need support for monotonic time tracking? (My tendency: Not right now)
[ ] Should we keep -recheck working for compatibility? (My tendency: Yes. process-exporter is not 1.0, so breakage would be ok, but I guess many people rely on the existing flags)

hoffie commented 2 years ago

@ncabatoff Friendly ping -- is there anything I can do/improve to get this into a mergable state? (I'm fully aware that reviewing pull requests does need time and energy and priorities are different oftentimes!)

mika commented 1 year ago

Gentle ping @ncabatoff, any chance to get this merged? Would be nice to get it into the prometheus-process-exporter package for the upcoming Debian/bookworm stable release. :)

steinbrueckri commented 1 year ago

Gentle ping @ncabatoff again ✌️

steinbrueckri commented 5 months ago

@ncabatoff Friendly ping -- is there anything I can do/improve to get this into a mergable state?

alexmv commented 5 months ago

Just another friendly vote in favor of getting this PR merged.

steinbrueckri commented 5 months ago

It appears that there hasn't been a new release since 2021, which is disappointing given the excellent pice of this software. @alexmv @mika, is there an active branch where we can integrate this pull request? If not, I'm willing to create a fork and help maintain it.

guillemj commented 5 months ago

@steinbrueckri My impression is also that @ncabatoff might have lost interest in maintaining this (no commits either on git HEAD)? So perhaps forking and collecting interesting patches would be the way to go, and if this repo gets activity again the fork could be folded back in I guess.

For the Debian packages (https://tracker.debian.org/prometheus-process-exporter) I'd need to discuss with the other members of the Prometheus Packaging Team there, but I think it we could consider switching to another upstream repo.

mika commented 5 months ago

@steinbrueckri FTR, I'm in the same boat as @guillemj and if anyone is willing to maintain a fork, this sounds like the way to go for me. :) Thx! :pray:

ncabatoff commented 5 months ago

Folks are correct that I haven't much energy for maintaining OSS projects lately. Probably I should find a co-maintainer, or hand the project over to someone else entirely. In the meantime, I can certainly merge this change, which looks perfectly fine. Very sorry for having ignored it for so long.

mika commented 5 months ago

@ncabatoff no worries at all, I'm sure all of us can relate :) Thanks for accepting the MR, appreciated!

ncabatoff / process-exporter

Add -recheck-with-time-limit support #223