grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Missing logs in Loki #4941

Open pvlltvk opened 2 years ago

pvlltvk commented 2 years ago

Describe the bug
I use Fluent Bit, which sends logs from my Kubernetes nodes to Loki and New Relic simultaneously. I found out that some logs I could find in New Relic were missing in Loki. I can also find those missing logs on my Kubernetes nodes or with kubectl logs.
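A rough way to quantify the gap is to compare line counts for the same container and time window on both sides. A minimal sketch, assuming logcli can reach the Loki gateway used in the fluent-bit config below; the pod, namespace, and label values are placeholders:

```bash
# Count what the node/kubelet has for one container over the last hour
# (pod and namespace names are placeholders).
kubectl -n my-namespace logs my-app-pod --since=1h | wc -l

# Count what Loki returns for the same container and window, using the
# namespace_name/container_name labels set in the fluent-bit config below.
logcli --addr=https://gateway-loki.foo.com \
  query '{namespace_name="my-namespace", container_name="my-app"}' \
  --since=1h --limit=100000 --quiet | wc -l
```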

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki Distributed v2.4.1 with S3 and BoltDB Shipper backend
  2. Started Fluent Bit v1.8.6

Expected behavior
I expect all logs to be available in Loki.

Environment:

Screenshots, Promtail config, or terminal output
My Loki config:

```yaml
auth_enabled: false
server:
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
distributor:
  ring:
    kvstore:
      store: memberlist
memberlist:
  join_members:
    - loki-loki-distributed-memberlist
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
  chunk_idle_period: 1h
  chunk_target_size: 1536000
  max_chunk_age: 1h
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /var/loki/wal
    replay_memory_ceiling: 2GB
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 10m
  retention_period: 2232h
  retention_stream:
    - selector: '{cluster_name="staging"}'
      priority: 1
      period: 168h
schema_config:
  configs:
    - from: 2021-11-18
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
storage_config:
  aws:
    bucketnames: loki-logging-data
    endpoint: https://storage.endpoint.net
    region: eu-central-1
    access_key_id: access_key_id
    secret_access_key: secret_access_key
    insecure: false
  boltdb_shipper:
    active_index_directory: /var/loki/index
    shared_store: s3
    cache_location: /var/loki/cache
    index_gateway_client:
      server_address: dns:///loki-loki-distributed-index-gateway:9095
  index_queries_cache_config:
    redis:
      endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
      master_name: loki
      expiration: 6h
      db: 0
      password: password
      timeout: 1000ms
chunk_store_config:
  chunk_cache_config:
    redis:
      endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
      master_name: loki
      expiration: 6h
      db: 1
      password: password
      timeout: 1000ms
querier:
  query_timeout: 5m
  max_concurrent: 48
query_range:
  # make queries more cache-able by aligning them with their step intervals
  align_queries_with_step: true
  max_retries: 5
  # parallelize queries in 15min intervals
  split_queries_by_interval: 15m
  cache_results: true
  results_cache:
    cache:
      redis:
        endpoint: loki-redis-node-0.loki-redis-headless:26379,loki-redis-node-1.loki-redis-headless:26379,loki-redis-node-2.loki-redis-headless:26379
        master_name: loki
        expiration: 6h
        db: 2
        password: password
        timeout: 1000ms
frontend_worker:
  frontend_address: loki-loki-distributed-query-frontend-grpclb:9095
  parallelism: 12
frontend:
  log_queries_longer_than: 5s
  compress_responses: true
  tail_proxy_url: loki-loki-distributed-querier:3100
compactor:
  retention_enabled: true
```
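With a config like this, one thing worth ruling out is server-side rejection (rate limits, samples older than reject_old_samples_max_age, over-long lines). A rough check, assuming port-forward access and the service names the loki-distributed Helm chart typically generates (namespace and release name are assumptions):

```bash
# Forward the distributor's HTTP port (service name assumed from the
# loki-distributed chart; adjust namespace/release to your environment).
kubectl -n loki port-forward svc/loki-loki-distributed-distributor 3100:3100 &

# Non-zero counters here mean Loki itself refused lines (the reason label says
# why), which would explain logs arriving in New Relic but not in Loki.
curl -s http://localhost:3100/metrics | grep discarded_samples_total
```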

My fluent-bit config:

```
[SERVICE]
    Flush 1
    Daemon Off
    Log_Level warn
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    storage.path /var/log/fluent-storage/
    storage.sync normal
    storage.checksum off
    storage.backlog.mem_limit 16M
    storage.metrics on

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser docker
    Tag kube.*
    Skip_Long_Lines On
    Mem_Buf_Limit 64M
    storage.type filesystem

[INPUT]
    Name systemd
    Tag host.*
    Read_From_Tail On
    Mem_Buf_Limit 16M
    storage.type filesystem

[FILTER]
    Name record_modifier
    Match *
    Record cluster_name infra
    Record environment infra

[FILTER]
    Name record_modifier
    Match host.*
    Record log_type system

[FILTER]
    Name record_modifier
    Match kube.*
    Record log_type kubernetes

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On

[OUTPUT]
    Name loki
    Match kube.*
    host gateway-loki.foo.com
    port 443
    tls on
    tls.verify on
    labels $cluster_name, $environment, $log_type, $kubernetes['namespace_name'], $kubernetes['container_name']
    storage.total_limit_size 512M
    Retry_Limit False
    workers 1

[OUTPUT]
    Name loki
    Match host.*
    host gateway-loki.foo.com
    port 443
    tls on
    tls.verify on
    labels $cluster_name, $environment, $log_type
    storage.total_limit_size 512M
    Retry_Limit False
    workers 1

[OUTPUT]
    Name nrlogs
    Match *
    license_key ${API_KEY}
    storage.total_limit_size 512M
    Retry_Limit False
    workers 1
```
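Since the [SERVICE] section above enables the built-in HTTP server on port 2020 and storage.metrics, the sending side can also be checked for errors, failed retries, and buffer pressure. A sketch; the pod name and namespace are placeholders, and exact metric names vary a bit between Fluent Bit versions:

```bash
# Port-forward one fluent-bit pod (name/namespace are placeholders).
kubectl -n logging port-forward pod/fluent-bit-abcde 2020:2020 &

# Per-output counters: errors, failed retries, or dropped records on the two
# loki outputs would point at the client giving up on chunks.
curl -s http://localhost:2020/api/v1/metrics/prometheus | \
  grep -E 'fluentbit_output_(errors|retries_failed|dropped_records)_total'

# Filesystem buffer usage; hitting storage.total_limit_size (512M per output
# here) causes the oldest buffered chunks to be discarded.
curl -s http://localhost:2020/api/v1/storage
```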

There are also some errors from the Loki ingester service:

msg="failed to flush user" err="RequestError: send request failed\ncaused by: Put \"https://loki-logging-data.storage.yandexcloud.net/fake/8c86bbf8e29cca2b%3A17dbfea5a9d%3A17dbfed6b14%3Ae86971ff\": http: server closed idle connection"
msg="failed to flush user" err="RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
msg="failed to flush user" err="RequestError: send request failed\ncaused by: Put \"https://loki-logging-data.storage.yandexcloud.net/fake/81718d22e0c8b67d%3A17dc0c3a00c%3A17dc0fadcbb%3Aaa9ab989\": http: server closed idle connection"
rlex commented 2 years ago

Using MinIO by any chance?

DrissiReda commented 2 years ago

Same problem, using MinIO.

pvlltvk commented 2 years ago

@rlex @DrissiReda Sorry, I forgot to give an update. I don't use MinIO. In my case, the problem was solved by replacing Fluent Bit with Promtail.

DrissiReda commented 2 years ago

Didn't you figure out the problem with Fluent Bit? That's what I'm using, and I can't change it since I use it for other things.

pvlltvk commented 2 years ago

@DrissiReda No, I didn't. Maybe I'll test it again when I have more time.

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

korenlev commented 2 years ago

Seeing the same with the latest Loki and the latest Fluent Bit, configured per the docs for the 'loki output', and lots and lots of missing logs.

Ruppsn commented 2 years ago

We have the same situation. With Promtail we see all logs; with fluent-bit-loki we are missing some. It seems to me that some streams are broken after some time.

irizzant commented 2 years ago

see https://github.com/grafana/loki/issues/4221

data-dude commented 2 years ago

I'm running Loki 2.6.1 and Fluent Bit 1.9.5, and I'm missing logs. There are no error messages. Sometimes the logs are there and sometimes they aren't. I guess the workaround is to use Promtail. Unfortunately, Promtail uses a lot of CPU at times. Oh well.

My workaround is to use the fluent-bit-loki-plugin.

nlnjnj commented 2 years ago

Have the same situation.

chadgeary commented 1 year ago

Seeing the same. Fluent Bit -> CloudWatch/Loki; CloudWatch has everything, Loki does not.

maxramqvist commented 1 year ago

Same or similar: we run the latest Loki and Fluent Bit (also Promtail) with MinIO as storage... but in our case different Loki instances return different data. Consistently the same data from the same Loki instance, though.

khanh96le commented 1 year ago

Seeing a similar issue. Vector --> Elasticsearch/Loki. Elasticsearch has full logs, Loki does not.

tkblack commented 1 year ago

Similar problem. I use EFK and Grafana Loki with filesystem storage. Elasticsearch has full logs, but Grafana Loki only displays 2h-old logs. Loki version: 2.6.1.

suckatrash commented 8 months ago

I came across this issue having a similar problem with Vector.

We have a set of hosts sending logs from just a few services, and two of those hosts are test hosts where not much is logged. It seems like the two quiet hosts disappear entirely after a while (a query with `{host="foo"}` has no recent hits), while the ones that generate more frequent logs remain query-able.

Looking at the errors in the original report, it seems like Loki closes idle connections after a while so that clients need to reestablish the connection after a certain period of idleness? Maybe there's a client-side setting that will periodically wake up the connection.
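For the "quiet host disappears" symptom, it may help to distinguish ingestion loss from recency. A quick check, assuming logcli is available and reusing the host label from the query above; the 168h window is arbitrary:

```bash
# If this returns older lines but nothing recent, the stream stopped being
# ingested at some point; if it returns nothing at all, the data either was
# never written or has aged out of retention.
logcli query '{host="foo"}' --since=168h --limit=20
```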

satyamsundaram commented 6 months ago

Similar problem. We use Fluent Bit to send logs to both Fluentd and Loki. Fluentd has all the logs, but Loki is missing some. There are no error logs related to this in either Loki or Fluent Bit. Has anybody found the cause or fixed this without switching to Promtail for log collection?