open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Collect metric based on /proc/locks #18829

Closed ItsLastDay closed 1 year ago

ItsLastDay commented 1 year ago

Component(s)

receiver/hostmetrics

Is your feature request related to a problem? Please describe.

We're interested in a metric that answers "how many file locks are held right now" so we can better understand the performance of our system. I did not find a receiver that exports such a metric at the system level.

Describe the solution you'd like

Extend hostmetricsreceiver's "filesystem" scraper so that it reads the /proc/locks file on Linux systems and exports the following metric:

system.filesystem.locks
Unit: {lock}
Metric type: gauge
Value type: Int

Attributes:

/proc/locks also contains the PID of the process that holds each lock and the [start, end] offsets of the lock. We're not interested in those.

Describe alternatives you've considered

Create a separate receiver just for lock-related metrics.

Additional context

Documentation: https://man7.org/linux/man-pages/man5/proc.5.html.

Here's example raw data:

$ cat /proc/locks
1: POSIX  ADVISORY  WRITE 3438608 fe:00:3815356 0 EOF
2: POSIX  ADVISORY  WRITE 3438608 fe:00:3815614 1073741824 1073742335
3: POSIX  ADVISORY  WRITE 3438608 fe:00:3815260 0 EOF
4: FLOCK  ADVISORY  WRITE 2010 00:33:87 0 EOF
5: POSIX  ADVISORY  WRITE 3438608 fe:00:3817093 0 EOF
6: POSIX  ADVISORY  WRITE 3438608 fe:00:3817103 0 EOF
7: POSIX  ADVISORY  WRITE 3438608 fe:00:3819367 0 EOF
8: POSIX  ADVISORY  WRITE 3438608 fe:00:3815247 1073741824 1073742335
9: POSIX  ADVISORY  WRITE 3438608 fe:00:3850011 0 EOF
10: POSIX  ADVISORY  WRITE 3438608 fe:00:3817087 0 EOF
11: FLOCK  ADVISORY  WRITE 1873 fe:00:664289 0 EOF
12: POSIX  ADVISORY  WRITE 3438608 fe:00:3939304 0 EOF
13: POSIX  ADVISORY  WRITE 3438608 fe:00:3815553 0 EOF
14: FLOCK  ADVISORY  WRITE 4301 00:1a:9 0 EOF
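
As a rough illustration of how the raw data above could be turned into a gauge value, here is a minimal Python sketch that counts lock entries grouped by lock type and access mode. The function name `count_locks` and the choice of (type, access) as grouping keys are hypothetical and are not taken from the proposal. The sketch also assumes that entries for blocked waiters carry a `->` marker after the ordinal, and it ignores the PID, device/inode, and offset fields.

```python
from collections import Counter


def count_locks(text):
    """Count entries in /proc/locks-style text, keyed by (lock type, access)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Every entry starts with an ordinal such as "1:"; skip anything else.
        if not fields or not fields[0].endswith(":"):
            continue
        fields = fields[1:]
        # Assumption: blocked waiters are marked with "->" after the ordinal.
        if fields and fields[0] == "->":
            fields = fields[1:]
        if len(fields) < 3:
            continue
        # fields[1] is the mode (ADVISORY or MANDATORY); it is not used here.
        lock_type, access = fields[0], fields[2]
        counts[(lock_type, access)] += 1
    return counts


# A shortened copy of the sample above.
SAMPLE = """\
1: POSIX  ADVISORY  WRITE 3438608 fe:00:3815356 0 EOF
2: POSIX  ADVISORY  WRITE 3438608 fe:00:3815614 1073741824 1073742335
4: FLOCK  ADVISORY  WRITE 2010 00:33:87 0 EOF
"""

if __name__ == "__main__":
    for (lock_type, access), value in sorted(count_locks(SAMPLE).items()):
        print(f"system.filesystem.locks type={lock_type} access={access} value={value}")
```

On a Linux host the same function could be fed the contents of /proc/locks directly, for example `count_locks(open("/proc/locks").read())`. Summing the counter's values gives the total number of lock entries.
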

ItsLastDay commented 1 year ago

Side note: is there a general way to create metrics out of any file?

I imagine that once connectors are ready to perform "logs -> metrics" conversions, I will be able to use filelogreceiver to read any file, then do some parsing with transformprocessor and create metrics like "count of lines with such and such properties".

Is that a realistic picture?

Context: I would like to implement analogues of https://collectd.org/wiki/index.php/Plugin:OpenVPN and https://collectd.org/wiki/index.php/Plugin:Tail for OpenTelemetry. I believe the overall idea is the same in both cases, "parse a file and derive some metrics", so I imagine a unified solution for them.
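
To make "count of lines with such and such properties" concrete, here is a minimal standalone Python sketch. It does not use any collector component. The function name `count_matching_lines` and the sample lines are hypothetical and serve only to illustrate the idea of deriving a number from a line-oriented file.

```python
import re


def count_matching_lines(lines, pattern):
    """Return how many lines contain a match for the regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


# Hypothetical sample input; any line-oriented text works the same way.
SAMPLE_LINES = [
    "1: POSIX  ADVISORY  WRITE 3438608 fe:00:3815356 0 EOF",
    "4: FLOCK  ADVISORY  WRITE 2010 00:33:87 0 EOF",
    "5: POSIX  ADVISORY  WRITE 3438608 fe:00:3817093 0 EOF",
]

if __name__ == "__main__":
    print(count_matching_lines(SAMPLE_LINES, r"\bPOSIX\b"))
```

A logs-to-metrics pipeline would do the equivalent of this per batch of log records, with the pattern supplied by configuration rather than hard-coded.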

andrzej-stencel commented 1 year ago

Regarding your proposal for a system.filesystem.lock_count metric, I think in general it fits the purpose of the Host Metrics receiver. Here are some questions/remarks I have to further refine the understanding of this proposal:

Regarding your question about a generic way to create metrics out of any data, here's a proposal that goes in this direction https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14753. (Haven't seen much activity on it recently though.)

ItsLastDay commented 1 year ago

The description of the metric you provided works for Linux. Are you able to provide this metric on other platforms, like Windows or MacOS?

Not sure, I lack knowledge about those systems. Maybe that's doable.

I think the metric name system.filesystem.locks would be more in line with the pluralization docs. The Unit of the metric would technically be {locks}, not 1.

I agree, thanks for the remarks! I'll edit the first message.

Regarding your question about a generic way to create metrics out of any data, here's a proposal that goes in this direction https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14753. (Haven't seen much activity on it recently though.)

Thanks, I'll subscribe to that discussion.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

dmitryax commented 1 year ago

I think this is a good metric to have. @ItsLastDay do you want to work on this?

ItsLastDay commented 1 year ago

I can work on it. However, we currently don't have demand for this. Context: our team provides a custom OTel collector distribution to our customers, and the customer that wanted the /proc/locks metric says they have other blockers to adopting our collector, so they're not interested in /proc/locks for now.

If our customer changes their mind, or if other people want to use such metrics, I'll implement it; it should be fairly straightforward (Linux only: I don't have experience with Windows and I'm not sure such data exists there). I don't want to implement something that's not needed by anyone :)

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.