prometheus / client_rust

Prometheus / OpenMetrics client library in Rust
Apache License 2.0

Implementing a process collector #29

Open gagbo opened 2 years ago

gagbo commented 2 years ago

Hello,

I recently stumbled upon https://github.com/tikv/rust-prometheus/issues/392 and now I'm preparing a migration from rust-prometheus. I'm wondering how it would be possible to add a collector for process metrics, just like what rust-prometheus currently provides. I'm really inexperienced with these packages, and I don't really see how the ProcessCollector concept would translate here to add some metrics related to the running process.

Can you tell me what would be necessary to add support for this? I was thinking it would probably be another crate that exposes a single global function like crate::export_process_metrics(); that crate would then have to add timers to run the collection and hope that the timer runs often enough to give a precise measurement when Prometheus scrapes the endpoint?

Regards, Gerry

mxinden commented 2 years ago

Hi Gerry,

As you noted above, open-metrics-client does not itself support exposition of process metrics today.

I was thinking probably as another crate

That sounds good. I am open to whether that crate would live in this repository or not. The former would likely make development easier as one can make atomic changes across both crates at once (with one pull request).

Off the top of my head, the interface of such a crate could look like:

fn register_process_metrics(registry: &mut Registry)

The crate would register multiple custom metrics, e.g. a gauge metric exposing the number of threads. Those custom metrics would each implement EncodeMetric and collect the concrete metric values from the system on EncodeMetric::encode.
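
For illustration, here is a minimal sketch of what one such metric could look like, loosely following the pattern of the crate's custom-metric example. The exact EncodeMetric/MetricEncoder signatures have changed across releases, and reading the thread count from /proc/self/status is just a Linux-specific placeholder, so treat the details as assumptions rather than the final API:

use prometheus_client::encoding::{EncodeMetric, MetricEncoder};
use prometheus_client::metrics::MetricType;
use prometheus_client::registry::Registry;

// Hypothetical custom metric exposing the number of OS threads of this process.
#[derive(Debug)]
struct ThreadCount;

impl EncodeMetric for ThreadCount {
    fn encode(&self, mut encoder: MetricEncoder) -> Result<(), std::fmt::Error> {
        // Collect the concrete value from the system at encode (scrape) time.
        let threads = std::fs::read_to_string("/proc/self/status")
            .ok()
            .and_then(|s| {
                s.lines()
                    .find(|l| l.starts_with("Threads:"))
                    .and_then(|l| l.split_whitespace().nth(1))
                    .and_then(|v| v.parse::<i64>().ok())
            })
            .unwrap_or(0);
        encoder.encode_gauge(&threads)
    }

    fn metric_type(&self) -> MetricType {
        MetricType::Gauge
    }
}

fn register_process_metrics(registry: &mut Registry) {
    registry.register("process_threads", "Number of OS threads in use", ThreadCount);
}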

@gagbo does the above make sense? Would you be interested in contributing such a crate?

gagbo commented 2 years ago

If I can find some time it’d be nice yeah, hopefully I’ll be able to find time to dig into this.

I think I’d like a way to specify running the encode function in a separate thread (which would require the registry to be Sync?), so that all the metrics-related calls can be handled by a different core if need be. (My point is that you probably don’t need to be running in the same thread to collect process info, so offloading those calls to another core and another cache might be nice to have.)

mxinden commented 2 years ago

If I can find some time it’d be nice yeah, hopefully I’ll be able to find time to dig into this.

Great. Let me know in case you need any help!

I think I’d like a way to specify running the encode function in a separate thread

Can you expand on what you want to optimize? All metric implementations are synchronized (e.g. Counter is just an atomic integer), thus allowing metric recording and metric collection to happen in different threads.

(which would require the registry to be Sync ?)

I am guessing that you are referring to interior mutability and not the Sync marker trait? I don't see why one would need interior mutability for the Registry itself, unless one wants to register metrics after startup, which I would argue is an anti-pattern.
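
To make the threading point concrete, here is a small sketch (using current prometheus-client names, which may differ from the version discussed here): a Counter clone can be incremented from a worker thread while the Registry is encoded elsewhere through a shared reference, with no interior mutability on the Registry itself.

use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn main() {
    let mut registry = Registry::default();
    let requests: Counter = Counter::default();
    registry.register("requests", "Number of requests handled", requests.clone());

    // Record from a worker thread; Counter clones share the same atomic.
    let worker = std::thread::spawn(move || {
        for _ in 0..1_000 {
            requests.inc();
        }
    });
    worker.join().unwrap();

    // Encode (e.g. on scrape) with only a shared reference to the registry.
    let mut buffer = String::new();
    encode(&mut buffer, &registry).unwrap();
    println!("{buffer}");
}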

gagbo commented 2 years ago

Hello,

Sorry for the late answer, I've been struggling to find time these days :(

For the time being I don't see myself working much on this, as our monitoring coverage is decent now and we have other priorities, but I'll keep this in mind when it becomes an important subject again!

Can you expand on what you want to optimize?

I thought that all encode calls had to be executed in the same thread as the one recording the events, but as you clarified it shouldn't be an issue after all.

I am guessing that you are referring to interior mutability and not the Sync marker trait?

I was really thinking about the marker trait here (so that registries can be borrowed by an encode call in another thread), but it doesn't matter as it's irrelevant now; my understanding of the Registry structure was wrong anyway.

dovreshef commented 2 years ago

Hi, all

I've tried to take a stab at implementing this, since we need this as well. I looked at the current implementation for the other prometheus crate.

As part of the implementation it reads data from procfs to figure out stats on the process. In that implementation it is a custom collector, so the data is read once and then used for all metrics. With the design of the current crate I found that hard to emulate, since each metric is its own separate thing and the logic is spread out across each EncodeMetric impl (if I understood it correctly).

I think it would help if it was also possible to have something analogous to the Collect trait for a group of metrics that share a source, so to speak.

I also found issue #49, which I think shows other use cases where it would be helpful.

Just my 2c.

mxinden commented 2 years ago

@dovreshef would you be retrieving the information from the system just in time or on an interval? I think the former is the Prometheus way.

Would the custom metric example not be the interface you need? I.e. be called on scrape to retrieve and generate the metrics?

https://github.com/prometheus/client_rust/blob/master/examples/custom-metric.rs

dovreshef commented 2 years ago

@dovreshef would you be retrieving the information from the system just in time or on an interval? I think the former is the Prometheus way.

Sorry, I'm not sure I understand the question. I'll be retrieving the info in the EncodeMetric trait encode function, as demoed in the example.

There are multiple metrics that the process collector gathers, and the process of gathering each one is similar: we read the /proc file system for the calling process and extract the data from several files there. The issue is that they all share a few (relatively) expensive initial steps, and if I gather the data for each metric separately I'll be repeating those steps for each metric, which is a bit of a waste.

In the existing Prometheus client implementation all the metrics are gathered in a single collect call, and so they share those initial steps.

So I think it would help if we had a way to collect/encode multiple metrics in a single call.
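
To illustrate the shared step, a rough, Linux-only sketch in plain std (field numbers per proc(5)): a single read and parse of /proc/self/stat yields several of the values mentioned above, so repeating it once per metric duplicates the expensive part.

// One snapshot of /proc/self/stat, parsed once and reused to derive
// several process metrics (CPU time, virtual memory, RSS, ...).
struct ProcStat {
    utime_ticks: u64,
    stime_ticks: u64,
    vsize_bytes: u64,
    rss_pages: u64,
}

fn read_proc_stat() -> std::io::Result<ProcStat> {
    let stat = std::fs::read_to_string("/proc/self/stat")?;
    // The command name (field 2) may contain spaces; everything after the
    // closing ')' is whitespace separated, starting with field 3 (state).
    let rest = stat.rsplit_once(')').map(|(_, r)| r).unwrap_or(stat.as_str());
    let fields: Vec<&str> = rest.split_whitespace().collect();
    // Field numbers follow proc(5), counting from 1 for the whole line.
    let field = |i: usize| fields.get(i - 3).and_then(|v| v.parse::<u64>().ok()).unwrap_or(0);
    Ok(ProcStat {
        utime_ticks: field(14),
        stime_ticks: field(15),
        vsize_bytes: field(23),
        rss_pages: field(24),
    })
}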

mxinden commented 2 years ago

The issue is that they all share a few (relatively) expensive initial steps, and if I gather the data for each metric separately I'll be repeating those steps for each metric, which is a bit of a waste.

Ah, sorry, I forgot having this discussion in the past.

As suggested on https://github.com/prometheus/client_rust/issues/49#issuecomment-1056769576, what do you think of the option to register a Collector on a Registry? A Collector would be able to return a set of metrics where each metric can have a different metric type.

For the process collector, you would implement the Collector trait and register an instance with a Registry. On encode we would iterate the Collectors registered with the Registry, call Collector::collect and encode each returned metric.

Does that make sense @dovreshef? If so, would you like to prototype this?

As an aside, we would likely want to introduce StaticCounter, StaticGauge, ... so that you don't have to pay the cost of an AtomicU64 on each Collector::collect call.
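
Roughly, the flow being proposed could look like the following sketch. Everything here is hypothetical: neither a Collector trait nor const/static metric types existed in the crate at this point, so the names below only stand in for the idea of gathering several metrics from one snapshot and encoding them on scrape.

// Hypothetical value-plus-metadata pair, standing in for the StaticGauge idea above.
struct ConstGauge {
    name: &'static str,
    help: &'static str,
    value: i64,
}

// Simplified stand-in for the proposed Collector trait.
trait Collector {
    fn collect(&self) -> Vec<ConstGauge>;
}

struct ProcessCollector;

impl Collector for ProcessCollector {
    fn collect(&self) -> Vec<ConstGauge> {
        // The expensive step (reading the proc filesystem) happens exactly once;
        // every returned metric is derived from the same snapshot.
        // /proc/self/statm is "size resident shared text lib data dt", in pages.
        let statm = std::fs::read_to_string("/proc/self/statm").unwrap_or_default();
        let mut pages = statm.split_whitespace().map(|v| v.parse::<i64>().unwrap_or(0));
        let size = pages.next().unwrap_or(0);
        let resident = pages.next().unwrap_or(0);
        vec![
            ConstGauge {
                name: "process_virtual_memory_pages",
                help: "Virtual memory size in pages.",
                value: size,
            },
            ConstGauge {
                name: "process_resident_memory_pages",
                help: "Resident memory size in pages.",
                value: resident,
            },
        ]
    }
}

// On encode, the registry would iterate its registered collectors, call
// collect() on each and encode every returned metric.
fn main() {
    for m in ProcessCollector.collect() {
        println!("# HELP {} {}\n# TYPE {} gauge\n{} {}", m.name, m.help, m.name, m.name, m.value);
    }
}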

dovreshef commented 2 years ago

Does that make sense @dovreshef? If so, would you like to prototype this?

Sure.

So the design is:

trait Collector<'a, M>
where
    M: EncodeMetric + 'a,
    Self::List: Iterator<Item = &'a (Descriptor, M)>
{
    type List;

    fn collect(&self) -> Self::List;
}

Now I can see two ways to continue from here:

Either:

Or:

WDYT? Any other design?

mxinden commented 2 years ago

  • Have a new Collector trait that looks something like:
trait Collector<'a, M>
where
    M: EncodeMetric + 'a,
    Self::List: Iterator<Item = &'a (Descriptor, M)>
{
    type List;

    fn collect(&self) -> Self::List;
}

:+1:

Small nit, maybe type Collection; would be more intuitive.

  • Registry implements Collector, which returns RegistryIterator.

:+1:

  • text::encode calls the collect method on the registry to retrieve the iterator.

Instead of taking a Registry, text::encode could now even take some C: Collector.

Either:

* Registry no longer holds `sub_registries: Vec<Registry<M>>` but instead holds `Vec<Box<dyn Collector>>` .

* Registry will have a new function to add a subregistry as a `Box<dyn Collector>`.

* No need to add new fields.

That would be very clean in my opinion. My gut feeling tells me we will run into some trait object issues. That said, I think we should give it a try.

Thanks @dovreshef for looking into this!

mxinden commented 2 years ago

@dovreshef are you still interested in contributing the Collector pattern? :)

mxinden commented 2 years ago

Cross referencing proposal for Collector here: https://github.com/prometheus/client_rust/pull/82

dovreshef commented 2 years ago

@dovreshef are you still interested in contributing the Collector pattern? :)

Sorry, I missed that message (and disappeared); I had planned to do this for work but got pulled onto other issues.

baryluk commented 7 months ago

Is there a standardized process collector available as a library? I would like to have, as a minimum, something similar to Go and Python:

Generic - MUST haves really:

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 138100.24
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 26
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 3.7982208e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.70893894953e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.346695168e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19

(I often also add my own process_uptime_seconds)

Rust-specific, with some inspiration from Go (of course goroutines and GC do not make sense, but compiler version, thread count, and allocation statistics, i.e. allocator cache hits, fragmentation estimates, number and total size of allocations, etc., would be nice):

go_build_info{checksum="",path="",version=""} 1
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.6526e-05
go_gc_duration_seconds{quantile="0.25"} 3.1393e-05
go_gc_duration_seconds{quantile="0.5"} 4.3811e-05
go_gc_duration_seconds{quantile="0.75"} 6.8233e-05
go_gc_duration_seconds{quantile="1"} 0.003802359
go_gc_duration_seconds_sum 4.637331431
go_gc_duration_seconds_count 76360
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 13
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.20.6"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.8776528e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 2.72165399288e+11
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.615149e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 6.057485348e+09
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 8.426024e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 1.8776528e+07
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 4.145152e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 2.0430848e+07
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 205940
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.744896e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 2.4576e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.7107767458342338e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 6.057691288e+09
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 2400
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 15600
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 282880
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 326400
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 2.123056e+07
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 689603
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 589824
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 589824
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 3.62386e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 10


mxinden commented 7 months ago

Is there a standardized process collector available as a library? I would like to have, as a minimum, something similar to Go and Python:

As the above conversation says, the prometheus-client crate is still missing the process collector functionality. Contributions welcome.

gmurayama commented 4 weeks ago

Hey @mxinden, I started an implementation for this (#232), but I am not sure if I am on the right path. Before proceeding any further, can you take a look to see if it makes sense?

Thanks in advance! 😄

olix0r commented 4 weeks ago

We implemented this for Linkerd a while ago. Feel free to borrow from it, if useful:

https://github.com/olix0r/kubert/blob/main/kubert-prometheus-process/src/lib.rs

gmurayama commented 3 weeks ago

Nice, thanks a lot @olix0r! Most probably will do.

mxinden commented 3 weeks ago

@olix0r great to see the Collector trait in action. Thanks for sharing.