openshift / origin-aggregated-logging

139 stars 230 forks source link

LOG-1213: Metric for inbound log data loss at the collector #2070

Closed pmoogi-redhat closed 3 years ago

pmoogi-redhat commented 3 years ago

Description

Currently in_tail plugin doesn't support publishing of inbound logloss - i.e. difference between total bytes written to disk (logfile) and total bytes collected or read by fluentd. This PR got changes in fluentd/lib/fluent/plugin/in_tail.rb, and fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb plugins to enable publishing of the below parameters

  1. totalbytes_logged_in_disk (written to disk per container process)
  2. totalbytes_collected_from_disk (read or collected by fluentd per container process)

/cc @alanconway @jcantrill /assign @alanconway

/cherry-pick

Links

jcantrill commented 3 years ago

/hold

pmoogi-redhat commented 3 years ago

get_gauge_or_counter is not working as per expectations, throwing some errors hence leaving it to get_gauge type for total_bytes_collected metric. require some help there. have tested the functionality by looking at prometheus dashboard if total_bytes_collected getting published correctly. it is working fine as per my stand-alone test environment.

alanconway commented 3 years ago

/approve

alanconway commented 3 years ago

Final names for the metrics:

log_logged_bytes_total: total number of bytes written to a single log file path, accounting for rotations.
log_collected_bytes_total: total number of bytes collected from a single log file.
Both metrics have the label "path" which is the absolute path to the log file.
openshift-ci-robot commented 3 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, pmoogi-redhat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/origin-aggregated-logging/blob/master/OWNERS)~~ [alanconway] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci-robot commented 3 years ago

New changes are detected. LGTM label has been removed.

alanconway commented 3 years ago

@jcantrill @pmoogi-redhat I think we can re-factor this code into a single, fully independent plug-in belonging to us, with no patches to fluentd. That will solve the labelling problems. The idea is to follow the model of the fluentd prometheus-tail-monitor: https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb#L1 That plugin scans for registered in-tail plugins and uses inside knowledge of the in-tail implementation to grab the read-position and inode values collected by in-tail. It samples on a timer, so it doesn't see every read - but that doesn't matter. In-tail keeps watching the old inode for a grace period of several seconds after the file rolls over, which gives us time to register the last read position -so we should get a full count. I've started writing this out, but I'm not sure we'll have time to get it done and tested for 5.1. I'll try to finish toda so @pmoogi-redhat can review & test it tomorrow.

openshift-ci[bot] commented 3 years ago

@pmoogi-redhat: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/unit 4c81554d63aae6c326df680829dbc0b76fc05f27 link /test unit
ci/prow/elastic-operator-e2e 4c81554d63aae6c326df680829dbc0b76fc05f27 link /test elastic-operator-e2e
ci/prow/images 4c81554d63aae6c326df680829dbc0b76fc05f27 link /test images
ci/prow/cluster-logging-operator-e2e 4c81554d63aae6c326df680829dbc0b76fc05f27 link /test cluster-logging-operator-e2e
ci/prow/clo-functional 4c81554d63aae6c326df680829dbc0b76fc05f27 link /test clo-functional
ci/prow/smoke 4c81554d63aae6c326df680829dbc0b76fc05f27 link /test smoke

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
pmoogi-redhat commented 3 years ago

Separate PR is raised to enable this new metric as a new plugin - new_plugin_Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032. Hence closing this PR which reflects in_tail and fluent-prometheus-plugin changes.