@linwalth As a first measure it helps to reduce debug logging (Keystone); in the long term we need to keep observing this. It may also be necessary to restart the fluentd container regularly.
With some help from the Monitoring SIG (specifically https://github.com/nerdicbynature) I was able to figure out a config that has now run for 3 weeks without trouble. I am posting it here in case someone runs into similar problems. This issue can be closed.
<match **>
  @type copy
  <store>
    @type elasticsearch
    host {{ elasticsearch_address }}
    port {{ elasticsearch_port }}
    scheme {{ fluentd_elasticsearch_scheme }}
{% if fluentd_elasticsearch_path != '' %}
    path {{ fluentd_elasticsearch_path }}
{% endif %}
    bulk_message_request_threshold 20M
{% if fluentd_elasticsearch_scheme == 'https' %}
    ssl_version {{ fluentd_elasticsearch_ssl_version }}
    ssl_verify {{ fluentd_elasticsearch_ssl_verify }}
{% if fluentd_elasticsearch_cacert | length > 0 %}
    ca_file {{ fluentd_elasticsearch_cacert }}
{% endif %}
{% endif %}
{% if fluentd_elasticsearch_user != '' and fluentd_elasticsearch_password != '' %}
    user {{ fluentd_elasticsearch_user }}
    password {{ fluentd_elasticsearch_password }}
{% endif %}
    logstash_format true
    logstash_prefix {{ kibana_log_prefix }}
    reconnect_on_error true
    request_timeout 15s
    suppress_type_name true
    reload_connections true
    reload_after 1000
    <buffer>
      @type file
      path /var/lib/fluentd/data/elasticsearch.buffer/openstack.*
      flush_thread_count 1
      flush_interval 15s
      retry_max_interval 2h
      retry_forever true
    </buffer>
  </store>
</match>
Let's try to change the upstream configuration with https://review.opendev.org/c/openstack/kolla-ansible/+/856241.
@linwalth @nerdicbynature
Could you please provide details?
Also, please document in the release notes which new options are introduced and what they are meant to do.
Moin,
the modification addresses multiple issues:
1) request_timeout needs to match bulk_message_request_threshold. An HTTP POST takes longer for a bigger bulk_message_request_threshold, hence the timeout should be significantly higher than the usual upload time to ES. In our case 15 MB usually needs about 5 seconds, but sometimes needs 15 s.
2) retry_max_interval: Fluentd uses exponential backoff. If the target ES has been configured to enforce an incoming rate limit, a series of failed HTTP uploads (possibly due to (1)) can grow the buffer to a size that always exceeds the rate limit, so Fluentd never recovers from it (a rough numeric sketch of (1) and (2) follows after this list).
3) retry_forever/reload_connections/reload_after: Fluentd sometimes silently drops the connection and gets stuck for no obvious reason. These params may help to reduce that, but they do not actually prevent Fluentd from getting stuck; maybe it's a false assumption. Reloading connections seems like a good idea nonetheless.
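To make (1) and (2) more concrete, here is a minimal Python sketch. The throughput figure is an assumption for illustration only, and the backoff loop is just a rough model of Fluentd's retry_wait/retry_exponential_backoff_base behaviour, not its exact implementation:

# (1) request_timeout vs. bulk_message_request_threshold:
#     a 20M bulk request at an assumed ~3 MB/s takes well over 5 seconds,
#     so the timeout must leave headroom above the typical upload time.
bulk_size_mb = 20          # bulk_message_request_threshold 20M
throughput_mb_s = 3        # assumed effective upload rate to ES
upload_time_s = bulk_size_mb / throughput_mb_s
print(f"estimated upload time: {upload_time_s:.1f}s -> request_timeout 15s leaves headroom")

# (2) exponential backoff: without a cap the retry interval keeps doubling;
#     retry_max_interval 2h caps it so a long outage does not push retries
#     absurdly far apart while retry_forever keeps the data in the buffer.
retry_wait_s = 1           # rough model of the default initial wait
backoff_base = 2           # rough model of the default backoff base
retry_max_interval_s = 2 * 3600
for attempt in range(1, 16):
    interval = min(retry_wait_s * backoff_base ** (attempt - 1), retry_max_interval_s)
    print(f"retry {attempt:2d}: wait {interval:7.0f}s")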
Kind regards, André.
Is this still being pursued?
We now have a new Fluentd version. I'm closing this because I think it's no longer relevant.
Rolling out Fluentd via the osism-kolla common role results in Fluentd building up log buffers without properly draining them by sending them to ES. Instead, buffer files keep piling up in the fluentd-data volume, seemingly without end.
Restarting Fluentd helps for a minute, raising the transmission rate to ES, but then it gets stuck again.
Inside the container, the Fluentd process uses 100% of a CPU. We experimented with raising the thread count for the process to 8, which lowers CPU usage but does not meaningfully change the transmission rate. Instead, the buffer keeps growing and creating more files.
A potential cause would be https://github.com/fluent/fluentd/issues/3817 or https://github.com/uken/fluent-plugin-elasticsearch/issues/909, but then again I would suspect more people than just us running into this problem with kolla.
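For anyone hitting this, a minimal sketch to watch whether the buffer backlog is actually shrinking. The directory is taken from the buffer path in the config above; the exact chunk file naming may differ between Fluentd versions, so adjust the glob for your deployment:

import glob
import os
import time

# Sum up the size of Fluentd's on-disk buffer chunks so you can see
# whether the backlog is draining or still growing.
BUFFER_GLOB = "/var/lib/fluentd/data/elasticsearch.buffer/*"

def size_or_zero(path):
    # Chunks can vanish between glob() and stat() while Fluentd flushes.
    try:
        return os.path.getsize(path)
    except OSError:
        return 0

while True:
    chunks = glob.glob(BUFFER_GLOB)
    total_mb = sum(size_or_zero(c) for c in chunks) / (1024 * 1024)
    print(f"{time.strftime('%H:%M:%S')} chunks={len(chunks)} total={total_mb:.1f} MB")
    time.sleep(60)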