osism / issues

This repository is used for bug reports that are cross-project or not bound to a specific repository (or to an unknown repository).
https://www.osism.tech

FluentD log buffer not being processed properly #271

Closed: linwalth closed this issue 8 months ago

linwalth commented 2 years ago

Rolling out Fluentd via the osism-kolla common role results in Fluentd building up log buffers without properly draining them to Elasticsearch. Instead, buffer files keep accumulating unchecked in the fluentd-data volume.

Restarting Fluentd helps for a minute, raising the transmission rate to Elasticsearch, but then it gets stuck again.

Internally, the container uses 100% CPU on the Fluentd process. We experimented with raising the number of threads for the process to 8, which lowers CPU usage but does not meaningfully change the transmission rate. Instead, the buffer keeps growing and creating more files.
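
For context, the thread experiment boiled down to something like the following in the elasticsearch output's buffer section (illustrative sketch only; flush_thread_count is the standard Fluentd buffer parameter we changed, the other values are placeholders):

<match **>
   @type elasticsearch
   # host, port, credentials etc. omitted
   <buffer>
     @type file
     path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
     # more parallel flush threads lowered the CPU usage of the
     # fluentd process, but did not improve throughput towards ES
     flush_thread_count 8
     flush_interval 15s
   </buffer>
</match>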

A potential cause could be https://github.com/fluent/fluentd/issues/3817 or https://github.com/uken/fluent-plugin-elasticsearch/issues/909, but then again I would expect more people than just us to be running into this problem with kolla.

linwalth commented 2 years ago

maybe related: https://github.com/uken/fluent-plugin-elasticsearch/issues/909

matfechner commented 2 years ago

@linwalth As a first measure, it helps to reduce debug logging (Keystone). For the long term we have to keep observing it; it may be necessary to restart the fluentd container regularly.
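
For illustration, reducing Keystone's debug logging typically means an oslo.config override along these lines (sketch only; the exact file location depends on the kolla-ansible/OSISM configuration layout, e.g. the node_custom_config directory):

# keystone.conf override merged in by kolla-ansible (path is deployment-specific)
[DEFAULT]
# turning off debug-level logging reduces the volume Fluentd has to ship
debug = False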

linwalth commented 2 years ago

With some help from the Monitoring SIG (specifically https://github.com/nerdicbynature) I was able to figure out a config that has now been running for 3 weeks without trouble. I am posting it here in case someone runs into similar problems. This issue can be closed.

<match **>
    @type copy
    <store>
       @type elasticsearch
       host {{ elasticsearch_address }}
       port {{ elasticsearch_port }}
       scheme {{ fluentd_elasticsearch_scheme }}
{% if fluentd_elasticsearch_path != '' %}
       path {{ fluentd_elasticsearch_path }}
{% endif %}
       bulk_message_request_threshold 20M
{% if fluentd_elasticsearch_scheme == 'https' %}
       ssl_version {{ fluentd_elasticsearch_ssl_version }}
       ssl_verify {{ fluentd_elasticsearch_ssl_verify }}
{% if fluentd_elasticsearch_cacert | length > 0 %}
       ca_file {{ fluentd_elasticsearch_cacert }}
{% endif %}
{% endif %}
{% if fluentd_elasticsearch_user != '' and fluentd_elasticsearch_password != ''%}
       user {{ fluentd_elasticsearch_user }}
       password {{ fluentd_elasticsearch_password }}
{% endif %}
       logstash_format true
       logstash_prefix {{ kibana_log_prefix }}
       reconnect_on_error true
       request_timeout 15s
       suppress_type_name true
       reload_connections true
       reload_after 1000
       <buffer>
         @type file
         path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
         flush_thread_count 1
         flush_interval 15s
         retry_max_interval 2h
         retry_forever true
       </buffer>
    </store>
</match>

berendt commented 2 years ago

Let's try to change the upstream configuration with https://review.opendev.org/c/openstack/kolla-ansible/+/856241.

berendt commented 2 years ago

@linwalth @nerdicbynature

Could you please provide details?

Also, please document in the release notes which options are introduced and what they are meant to do.

nerdicbynature commented 2 years ago

Moin,

the modification addresses multiple issues:

1) request_timeout needs to match bulk_message_request_threshold. The HTTP POST takes longer for a bigger bulk_message_request_threshold, hence the timeout should be significantly higher than the usual upload time to ES. In our case 15 MB usually needs about 5 seconds, but sometimes needs 15 s.

2) retry_max_interval: Fluentd uses exponential backoff. If the target ES has been configured to enforce an incoming rate limit, a series of failed HTTP uploads (maybe due to (1)) may lead to a buffer size that always exceeds the rate limit, and Fluentd does not recover from it.

3) retry_forever/reload_connections/reload_after: Fluentd sometimes silently drops the connection and gets stuck without any obvious reason. These parameters may help to reduce that, but they do not actually prevent Fluentd from getting stuck; maybe that assumption is wrong. Reloading connections seems like a good idea, though (see the annotated excerpt below).
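
For reference, the points above map back to the config posted earlier in this thread as follows (condensed, annotated excerpt, not a standalone configuration):

       # (1) the request timeout has to comfortably cover uploading a
       #     full bulk_message_request_threshold worth of data
       bulk_message_request_threshold 20M
       request_timeout 15s
       # (3) reconnect and reload connections regularly instead of
       #     relying on a single long-lived connection to ES
       reconnect_on_error true
       reload_connections true
       reload_after 1000
       <buffer>
         @type file
         # (2) cap the exponential backoff so retries keep coming at a
         #     reasonable rate even after a longer outage or rate limiting
         retry_max_interval 2h
         # keep retrying instead of dropping buffered chunks
         retry_forever true
       </buffer>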

Kind regards, André.

linwalth commented 1 year ago

Is this still being pursued?

berendt commented 8 months ago

We now have a new Fluentd version. I'm closing this because I think it's no longer relevant.