LOG-1213: Metric for inbound log data loss at the collector

pmoogi-redhat commented 3 years ago

Description

Currently in_tail plugin doesn't support publishing of inbound logloss - i.e. difference between total bytes written to disk (logfile) and total bytes collected or read by fluentd. This PR got changes in fluentd/lib/fluent/plugin/in_tail.rb, and fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb plugins to enable publishing of the below parameters

totalbytes_logged_in_disk (written to disk per container process)
totalbytes_collected_from_disk (read or collected by fluentd per container process)

/cc @alanconway @jcantrill /assign @alanconway

/cherry-pick

Links

Depending on PR(s):
Bugzilla:
Github issue:
JIRA: https://issues.redhat.com/browse/LOG-1213
Enhancement proposal: many rotations getting missed by fluentd, next enhancement proposal is to get fluentd know about all actual rotations done by CRIO/conmon process by reading extra meta data such as . Based on what rotations fluentd could track and which one got missed computing log-loss as [no-of-rotations-those-missed * maxsizeoflogfile + sum over all tracked rotations by fluentd as [totalbytes_logged_in_disk - totalbytes_collected_from_disk]

jcantrill commented 3 years ago

/hold

pmoogi-redhat commented 3 years ago

get_gauge_or_counter is not working as per expectations, throwing some errors hence leaving it to get_gauge type for total_bytes_collected metric. require some help there. have tested the functionality by looking at prometheus dashboard if total_bytes_collected getting published correctly. it is working fine as per my stand-alone test environment.

alanconway commented 3 years ago

/approve

alanconway commented 3 years ago

Final names for the metrics:

log_logged_bytes_total: total number of bytes written to a single log file path, accounting for rotations.
log_collected_bytes_total: total number of bytes collected from a single log file.
Both metrics have the label "path" which is the absolute path to the log file.

openshift-ci-robot commented 3 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, pmoogi-redhat

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/origin-aggregated-logging/blob/master/OWNERS)~~ [alanconway] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci-robot commented 3 years ago

New changes are detected. LGTM label has been removed.

alanconway commented 3 years ago

@jcantrill @pmoogi-redhat I think we can re-factor this code into a single, fully independent plug-in belonging to us, with no patches to fluentd. That will solve the labelling problems. The idea is to follow the model of the fluentd prometheus-tail-monitor: https://github.com/openshift/origin-aggregated-logging/blob/master/fluentd/vendored_gem_src/fluent-plugin-prometheus/lib/fluent/plugin/in_prometheus_tail_monitor.rb#L1 That plugin scans for registered in-tail plugins and uses inside knowledge of the in-tail implementation to grab the read-position and inode values collected by in-tail. It samples on a timer, so it doesn't see every read - but that doesn't matter. In-tail keeps watching the old inode for a grace period of several seconds after the file rolls over, which gives us time to register the last read position -so we should get a full count. I've started writing this out, but I'm not sure we'll have time to get it done and tested for 5.1. I'll try to finish toda so @pmoogi-redhat can review & test it tomorrow.

openshift-ci[bot] commented 3 years ago

@pmoogi-redhat: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/unit	4c81554d63aae6c326df680829dbc0b76fc05f27	link	`/test unit`
ci/prow/elastic-operator-e2e	4c81554d63aae6c326df680829dbc0b76fc05f27	link	`/test elastic-operator-e2e`
ci/prow/images	4c81554d63aae6c326df680829dbc0b76fc05f27	link	`/test images`
ci/prow/cluster-logging-operator-e2e	4c81554d63aae6c326df680829dbc0b76fc05f27	link	`/test cluster-logging-operator-e2e`
ci/prow/clo-functional	4c81554d63aae6c326df680829dbc0b76fc05f27	link	`/test clo-functional`
ci/prow/smoke	4c81554d63aae6c326df680829dbc0b76fc05f27	link	`/test smoke`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).

pmoogi-redhat commented 3 years ago

Separate PR is raised to enable this new metric as a new plugin - new_plugin_Metric-for-inbound-log-data-loss-at-the-collector-JIRA-ticket-LOG-1032. Hence closing this PR which reflects in_tail and fluent-prometheus-plugin changes.

openshift / origin-aggregated-logging

LOG-1213: Metric for inbound log data loss at the collector #2070

Description

Links