grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Loki 1.2.0 crash #1547

Closed · opened by till · closed 4 years ago

till commented 4 years ago

Describe the bug

Loki & promtail daemons crash eventually.

To Reproduce

Steps to reproduce the behavior:

  1. Started Loki (SHA or version): 1.2.0
  2. Started Promtail (SHA or version): 1.2.0

I have a very basic setup — everything is stored on localhost.

When I look at Loki's logs, I see lots of messages about out-of-order events. I believe this happens because I have persistence configured for journald (to /var/log/journal), and whenever promtail starts up it tries to ingest the entire directory, not just the most recent/current journal.

Another thing I noticed is that the ingester eventually disappears, but I haven't been able to track down why. Maybe the size of the journal is the problem? It's roughly 700 MB, and the VM has plenty of resources (8 GiB RAM, 4 vCPUs).

Promtail eventually crashes as well, after (I think) too many "400" errors from Loki. I guess that is related, but it is also undesired behaviour on its own. Should I open a separate ticket for that?

Expected behavior

I expect both daemons to keep running without crashing.

Environment:

Screenshots, Promtail config, or terminal output

Loki config:

auth_enabled: true

server:
  http_listen_address: 127.0.0.1
  http_listen_port: 3100
  grpc_listen_address: 127.0.0.1
  log_level: debug

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
  - from: 2018-04-15
    store: boltdb
    object_store: filesystem
    schema: v9
    index:
      prefix: loki_index_
      period: 168h

storage_config:
  boltdb:
    directory: /tmp/loki/index

  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 840h

table_manager:
  chunk_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  index_tables_provisioning:
    inactive_read_throughput: 0
    inactive_write_throughput: 0
    provisioned_read_throughput: 0
    provisioned_write_throughput: 0
  retention_deletes_enabled: true
  retention_period: 840h

Promtail config:

server:
  http_listen_address: 127.0.0.1
  http_listen_port: 9080
  grpc_listen_address: 127.0.0.1
  grpc_listen_port: 0
  log_level: debug

# Positions
positions:
  filename: /tmp/positions.yaml

# Loki Server URL
clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push
    tenant_id: till
    external_labels:
      node: fully.qualified.domain.name
      customer: till

scrape_configs:
  - job_name: journal
    journal:
      path: /var/log/journal
      labels:
        job: journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
till commented 4 years ago

So, right around the time loki crashes:

level=warn ts=2020-01-19T10:38:52.826048866Z caller=logging.go:49 traceID=50617ce4c81f6f61 msg="POST /loki/api/v1/push (500) 170.786µs Response: \"at least 1 live ingesters required, could only find 0\\n\" ws: false; Content-Length: 748; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Scope-Orgid: till; "

And I think this ultimately causes promtail to crash:

level=error ts=2020-01-19T10:38:37.4948318Z caller=pool.go:170 msg="error removing stale clients" err="too many failed ingesters"

I see multiple of these. I adjusted my systemd units so both come back up and continue running. Is there anything I should look for? "Show context" is only semi-useful.

cyriltovena commented 4 years ago

Loki seems to not be ready, causing promtail to back off. I'm not sure why you are having issues with Loki, though; you should dig into that.
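
For reference, one quick way to confirm whether Loki considers itself ready is to query its operational HTTP endpoints directly. The port matches the config at the top of this thread; the paths are the readiness and ring pages documented for Loki:

# should return HTTP 200 / "ready" once the ingester has joined the ring
curl -s http://127.0.0.1:3100/ready

# shows the state of the ingester ring (ACTIVE, LEAVING, ...)
curl -s http://127.0.0.1:3100/ring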

till commented 4 years ago

@cyriltovena this doesn't happen right away, but only after a couple of hours. Do you have any suggestions for how to dig into it?

slim-bean commented 4 years ago

@till there is a max_age configuration for the journal scraper to limit how far back in time it will look.

But more importantly, you should probably find a directory outside of /tmp to store your files, as on most operating systems /tmp is cleared on a restart.

The current position in the journal is saved in the positions file, which should stop it from re-reading the entire contents of the journal on each start; but if that file is being deleted on reboot, promtail will have to read from the beginning again.

As @cyriltovena mentioned, though, you should also check your kernel logs and your Loki logs to see if you can figure out why Loki stops running.
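
A minimal sketch of what that could look like in the promtail config posted above; max_age limits how far back in the journal promtail will read on a fresh start, and the positions path is just an example of a location that survives reboots:

positions:
  filename: /var/lib/promtail/positions.yaml

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      path: /var/log/journal
      labels:
        job: journal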

till commented 4 years ago

@slim-bean thank you!

I'll look into max_age, and I agree that moving off /tmp is a good idea. Would you mind a PR to your docs to change that? I mostly copied and pasted to get it off the ground.

till commented 4 years ago

I don't know why it hits this limit. I feel like Loki and promtail may be doing this to themselves, since they run on the same server whose journal I am collecting.

Also, judging from the error message and the attempts to escape the {}, I guess this is a log nested in a log, in a log?

Jan 21 19:38:46 foo promtail[31256]: level=error ts=2020-01-21T18:38:46.32022603Z caller=client.go:231 component=client host=127.0.0.1:3100 msg="final error sending batch" status=429 error="server returned HTTP status 429 Too Many Requests (429): ingestion rate limit (4194304) exceeded while adding 4 lines"
Jan 21 19:38:46 foo promtail[31256]: level=debug ts=2020-01-21T18:38:46.328552651Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg="finished processing log line" labels="{job=\"journal\", unit=\"promtail.service\"}" time=2020-01-21T14:35:24.34171+01:00 entry="\\\\\\\\\\\\\\\\\\\\\\\\\\\" duration_s=9.14e-07" duration_s=1.077e-06
Jan 21 19:38:46 foo promtail[31256]: level=debug ts=2020-01-21T18:38:46.328649481Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg="finished processing log line" labels="{job=\"journal\", unit=\"promtail.service\"}" time=2020-01-21T14:35:24.341974+01:00 entry="level=debug ts=2020-01-21T13:35:24.333005376Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg=\"finished processing log line\" labels=\"{job=\\\"journal\\\", unit=\\\"loki.service\\\"}\" time=2020-01-21T14:35:23.841745+01:00 entry=\"level=debug ts=2020-01-21T13:35:23.840608232Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=68.952µs msg=\\\"gRPC (success)\\\"\" duration_s=9.65e-07" duration_s=8.67e-07
Jan 21 19:38:46 foo promtail[31256]: level=debug ts=2020-01-21T18:38:46.328733648Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg="finished processing log line" labels="{job=\"journal\", unit=\"promtail.service\"}" time=2020-01-21T14:35:24.342199+01:00 entry="level=error ts=2020-01-21T13:35:24.333314332Z caller=client.go:231 component=client host=127.0.0.1:3100 msg=\"final error sending batch\" status=429 error=\"server returned HTTP status 429 Too Many Requests (429): ingestion rate limit (4194304) exceeded while adding 3 lines\"" duration_s=7.46e-07
Jan 21 19:38:46 foo promtail[31256]: level=debug ts=2020-01-21T18:38:46.328820746Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg="finished processing log line" labels="{job=\"journal\", unit=\"promtail.service\"}" time=2020-01-21T14:35:24.34242+01:00 entry="level=debug ts=2020-01-21T13:35:24.333415616Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg=\"finished processing log line\" labels=\"{job=\\\"journal\\\", unit=\\\"loki.service\\\"}\" time=2020-01-21T14:35:23.842046+01:00 entry=\"level=debug ts=2020-01-21T13:35:23.840886512Z caller=logging.go:44 traceID=3c1c6234af9a8c4c msg=\\\"POST /loki/api/v1/push (204) 1.681578ms\\\"\" duration_s=6.86e-07" duration_s=7.5e-07

The logs look weird. 👆

There are also a ton of these (I added a few \n here):

Jan 21 19:38:46 foo promtail[31256]: level=debug ts=2020-01-21T18:38:46.329061078Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg="finished processing log line" labels="{job=\"journal\", unit=\"promtail.service\"}" time=2020-01-21T14:35:24.342693+01:00 entry="level=debug ts=2020-01-21T13:35:24.333598233Z caller=pipeline.go:97 component=journal_pipeline component=pipeline msg=\"finished processing log line\" labels=\"{job=\\\"journal\\\", unit=\\\"promtail.service\\\"}\" time=2020-01-21T14:35:23.842427+01:00 entry=\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
...
till commented 4 years ago

So far journalctl -k has nothing. Where else would you like me to look?

slim-bean commented 4 years ago

I would turn off debug logging in promtail to start. It's incredibly verbose, and I would only turn it on when actively debugging.

cyriltovena commented 4 years ago

There must be more info in the Loki logs related to that unhealthy ingester. You're also getting throttled, so you can look at increasing the limits.
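
For context, the 4194304 in the promtail 429s matches the default 4 MB per-tenant ingestion rate limit. If raising it turns out to be appropriate here, the knobs live under limits_config; the option names below follow the current Loki documentation (older releases may differ) and the values are only illustrative:

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 8          # per-tenant ingestion rate limit, in MB
  ingestion_burst_size_mb: 16   # allowed burst above the rate limit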

till commented 4 years ago

Turned off debug logging in promtail, and now this is what stands out in the Loki logs:

Jan 21 21:48:16 foo loki[31392]: level=error ts=2020-01-21T20:48:16.032551518Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:16 foo loki[31392]: level=error ts=2020-01-21T20:48:16.225725369Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:16 foo loki[31392]: level=error ts=2020-01-21T20:48:16.59779426Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:17 foo loki[31392]: level=error ts=2020-01-21T20:48:17.189600192Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:18 foo loki[31392]: level=error ts=2020-01-21T20:48:18.013711342Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:20 foo loki[31392]: level=error ts=2020-01-21T20:48:20.330454261Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:24 foo loki[31392]: level=error ts=2020-01-21T20:48:24.228279067Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:27 foo loki[31392]: level=error ts=2020-01-21T20:48:27.686594299Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:32 foo loki[31392]: level=error ts=2020-01-21T20:48:32.19312131Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:36 foo loki[31392]: level=error ts=2020-01-21T20:48:36.67301656Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 21 21:48:36 foo loki[31392]: level=error ts=2020-01-21T20:48:36.673044604Z caller=lifecycler.go:659 msg="Failed to transfer chunks to another ingester" err="terminated after 10 retries"
Jan 22 01:21:39 foo loki[21569]: level=error ts=2020-01-22T00:21:39.504564159Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:39 foo loki[21569]: level=error ts=2020-01-22T00:21:39.689093456Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:40 foo loki[21569]: level=error ts=2020-01-22T00:21:40.04620463Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:40 foo loki[21569]: level=error ts=2020-01-22T00:21:40.739517917Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:41 foo loki[21569]: level=error ts=2020-01-22T00:21:41.799068605Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:44 foo loki[21569]: level=error ts=2020-01-22T00:21:44.662210981Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:49 foo loki[21569]: level=error ts=2020-01-22T00:21:49.416478304Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:53 foo loki[21569]: level=error ts=2020-01-22T00:21:53.84908402Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:21:57 foo loki[21569]: level=error ts=2020-01-22T00:21:57.67657362Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:22:02 foo loki[21569]: level=error ts=2020-01-22T00:22:02.171713361Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:22:02 foo loki[21569]: level=error ts=2020-01-22T00:22:02.171739512Z caller=lifecycler.go:659 msg="Failed to transfer chunks to another ingester" err="terminated after 10 retries"

It looks like the ingester crashed. I am wondering if I should run the components as individual processes and hope that I can catch one crashing, or is there a better way?

cyriltovena commented 4 years ago

This means Loki is trying to shut down, but I still don't see any reason why. Are you running more than a single replica? You should not.

till commented 4 years ago

@cyriltovena I realize that a single process does not provide HA or fault tolerance. My idea was that, for lab purposes and before I go all in and provision a bunch of infrastructure, I would run a single instance to test and evaluate it. Are you suggesting a single instance is not possible at all?

This is the full cycle. There is nothing, though, that explains why the ingester goes away in the first place, which is worrying.

Log (with annotations)

The error is elevated:

Jan 22 01:22:01 foo loki[21569]: level=warn ts=2020-01-22T00:22:01.911696611Z caller=logging.go:49 traceID=2f533f5aa6b6e9ed msg="POST /loki/api/v1/push (500) 363.278µs Response: \"at least 1 live ingesters required, could only find 0\\n\" ws: false; Content-Length: 8680; Content-Type: application/x-protobuf; User-Agent: Go-http-client/1.1; X-Scope-Orgid: till; "

Apparently something is gone:

Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171545225Z caller=mock.go:113 msg=Get key=collectors/ring wait_index=0
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171619654Z caller=mock.go:159 msg=Get key=collectors/ring modify_index=1858 value="\"\\xb5*\\xf0F\\n<\\n foo\""
Jan 22 01:22:02 foo loki[21569]: level=error ts=2020-01-22T00:22:02.171713361Z caller=transfer.go:186 msg="transfer failed" err="cannot find ingester to transfer chunks to: no pending ingesters"
Jan 22 01:22:02 foo loki[21569]: level=error ts=2020-01-22T00:22:02.171739512Z caller=lifecycler.go:659 msg="Failed to transfer chunks to another ingester" err="terminated after 10 retries"

Wait, I thought it was gone already — why would it exit now?

Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171843557Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171888052Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171887226Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171905281Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171902082Z caller=flush.go:174 msg="flushing stream" userid=till fp=e0e548bc2b7dd371 immediate=true
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171931451Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.17194257Z caller=flush.go:174 msg="flushing stream" userid=till fp=9b161b1caa9875d9 immediate=true
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171946459Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171989281Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.172007961Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.172028904Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.172043288Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.172082603Z caller=flush.go:174 msg="flushing stream" userid=till fp=f4e834568b8b9fc7 immediate=true
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.17367119Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.173721679Z caller=flush.go:174 msg="flushing stream" userid=till fp=2fcf8ee929f7d34c immediate=true
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.17461308Z caller=flush.go:174 msg="flushing stream" userid=till fp=16c90c4c2a48fa44 immediate=true
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.171923408Z caller=flush.go:174 msg="flushing stream" userid=till fp=5576da15c4107576 immediate=true
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.178018353Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.178610299Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.179052352Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.180154609Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.180909522Z caller=flush.go:174 msg="flushing stream" userid=till fp=3243a09267d841f1 immediate=true
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.180944815Z caller=flush.go:163 msg="Ingester.flushLoop() exited"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.181008489Z caller=lifecycler.go:678 msg="unregistering member from ring"
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.181049839Z caller=mock.go:113 msg=Get key=collectors/ring wait_index=0
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.181075215Z caller=mock.go:159 msg=Get key=collectors/ring modify_index=1858 value="\"\\xb5*\\xf0F\\n<\\n foo\""
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.18114192Z caller=mock.go:86 msg=CAS key=collectors/ring modify_index=1858 value="\"\\x00\""

Now it's removing it from Consul (which I definitely haven't installed):

Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.181171345Z caller=lifecycler.go:438 msg="ingester removed from consul"
Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.181188953Z caller=lifecycler.go:326 msg="member.loop() exited gracefully"
Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.181205763Z caller=loki.go:157 msg=stopping module=querier
Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.181223102Z caller=loki.go:157 msg=stopping module=ring
Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.181234791Z caller=loki.go:157 msg=stopping module=store
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.181243658Z caller=mock.go:159 msg=Get key=collectors/ring modify_index=1859 value="\"\\x00\""
Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.181279468Z caller=loki.go:157 msg=stopping module=table-manager
Jan 22 01:22:02 foo loki[21569]: level=debug ts=2020-01-22T00:22:02.18128541Z caller=mock.go:113 msg=Get key=collectors/ring wait_index=1859
Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.181300445Z caller=loki.go:157 msg=stopping module=overrides
Jan 22 01:22:02 foo loki[21569]: level=info ts=2020-01-22T00:22:02.18131291Z caller=loki.go:157 msg=stopping module=server

This is when I get an alert, and it resolves:

Jan 22 01:22:02 foo systemd[1]: Stopped Loki Log Aggregator.
Jan 22 01:22:02 foo systemd[1]: Started Loki Log Aggregator.
slim-bean commented 4 years ago

All those logs still indicate Loki is shutting down. How are you starting Loki?

cyriltovena commented 4 years ago

I'm just telling you that scaling the single binary does not work; it's like Prometheus in that respect. The microservices mode can be scaled.

What does your config look like? How are you hosting it? Is there anything on this host that would request termination?

till commented 4 years ago

@slim-bean I am starting it using systemd:

[Unit]
Description=Loki Log Aggregator
Documentation=https://github.com/grafana/loki
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/loki -config.file /etc/loki/loki.yml
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target

@cyriltovena it's running in a "Container" (roughly similar to OpenVZ) on a HV, which is running Virtuozzo 7. The container/VM has no resource issues, according to Prometheus/node-exporter. :-)

My config is at the top of this thread. I've adjusted two things since I opened this ticket: log_level (a couple of times), and I moved everything to /var/lib/loki (instead of /tmp).
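
For anyone copying this setup, that adjustment presumably amounts to pointing the storage paths somewhere persistent, for example:

storage_config:
  boltdb:
    directory: /var/lib/loki/index

  filesystem:
    directory: /var/lib/loki/chunks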

cyriltovena commented 4 years ago

The config looks good. I'm pretty sure Loki is getting signaled to shut down by the host. Are you using some sort of health probe?

till commented 4 years ago

@cyriltovena no. Any idea where I would determine this?

till commented 4 years ago

Hmm, I see what you mean. When I search the previous minute I see that systemd is shutting it down, but I am not sure why. (Still looking.)

The log from the previous minute:

Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.3700528Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=96.679µs msg="gRPC (success)"
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.370292003Z caller=logging.go:44 traceID=232645115b5ac6a9 msg="POST /loki/api/v1/push (204) 1.286727ms"
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.863100262Z caller=flush.go:174 msg="flushing stream" userid=till fp=3243a09267d841f1 immediate=false
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.863174955Z caller=flush.go:174 msg="flushing stream" userid=till fp=16c90c4c2a48fa44 immediate=false
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.863684708Z caller=mock.go:113 msg=Get key=collectors/ring wait_index=0
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.863719211Z caller=mock.go:159 msg=Get key=collectors/ring modify_index=1850 value="\"\\xb3*\\xf0C\\n:\\n foo\""
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.863792125Z caller=mock.go:86 msg=CAS key=collectors/ring modify_index=1850 value="\"\\xb3*\\xf0C\\n:\\n foo\""
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.863816738Z caller=mock.go:159 msg=Get key=collectors/ring modify_index=1851 value="\"\\xb3*\\xf0C\\n:\\n foo\""
Jan 22 01:21:29 foo loki[21569]: level=debug ts=2020-01-22T00:21:29.863855084Z caller=mock.go:113 msg=Get key=collectors/ring wait_index=1851
Jan 22 01:21:30 foo loki[21569]: level=debug ts=2020-01-22T00:21:30.469404289Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=83.895µs msg="gRPC (success)"
Jan 22 01:21:30 foo loki[21569]: level=debug ts=2020-01-22T00:21:30.469684911Z caller=logging.go:44 traceID=20cfc85696fc782a msg="POST /loki/api/v1/push (204) 1.047569ms"
Jan 22 01:21:31 foo loki[21569]: level=debug ts=2020-01-22T00:21:31.569215127Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=51.141µs msg="gRPC (success)"
Jan 22 01:21:31 foo loki[21569]: level=debug ts=2020-01-22T00:21:31.569476589Z caller=logging.go:44 traceID=2bc48903934767b0 msg="POST /loki/api/v1/push (204) 816.223µs"
Jan 22 01:21:32 foo loki[21569]: level=debug ts=2020-01-22T00:21:32.669291353Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=65.635µs msg="gRPC (success)"
Jan 22 01:21:32 foo loki[21569]: level=debug ts=2020-01-22T00:21:32.669562623Z caller=logging.go:44 traceID=ea9f2fc2e36f737 msg="POST /loki/api/v1/push (204) 938.221µs"
Jan 22 01:21:33 foo loki[21569]: level=debug ts=2020-01-22T00:21:33.760008313Z caller=logging.go:44 traceID=6a855fd81a7ad361 msg="GET /metrics (200) 3.215056ms"
Jan 22 01:21:33 foo loki[21569]: level=debug ts=2020-01-22T00:21:33.76910656Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=54.53µs msg="gRPC (success)"
Jan 22 01:21:33 foo loki[21569]: level=debug ts=2020-01-22T00:21:33.769350935Z caller=logging.go:44 traceID=44458dd1f78308c5 msg="POST /loki/api/v1/push (204) 768.753µs"
Jan 22 01:21:34 foo loki[21569]: level=debug ts=2020-01-22T00:21:34.863502845Z caller=mock.go:113 msg=Get key=collectors/ring wait_index=0
Jan 22 01:21:34 foo loki[21569]: level=debug ts=2020-01-22T00:21:34.863580091Z caller=mock.go:159 msg=Get key=collectors/ring modify_index=1851 value="\"\\xb3*\\xf0C\\n:\\n foo\""
Jan 22 01:21:34 foo loki[21569]: level=debug ts=2020-01-22T00:21:34.863688323Z caller=mock.go:86 msg=CAS key=collectors/ring modify_index=1851 value="\"\\xb3*\\xf0C\\n:\\n foo\""
Jan 22 01:21:34 foo loki[21569]: level=debug ts=2020-01-22T00:21:34.863734565Z caller=mock.go:159 msg=Get key=collectors/ring modify_index=1852 value="\"\\xb3*\\xf0C\\n:\\n foo\""
Jan 22 01:21:34 foo loki[21569]: level=debug ts=2020-01-22T00:21:34.863823658Z caller=mock.go:113 msg=Get key=collectors/ring wait_index=1852
Jan 22 01:21:34 foo loki[21569]: level=debug ts=2020-01-22T00:21:34.869401788Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=70.758µs msg="gRPC (success)"
Jan 22 01:21:34 foo loki[21569]: level=debug ts=2020-01-22T00:21:34.869682946Z caller=logging.go:44 traceID=7ef5e3df22fb15cf msg="POST /loki/api/v1/push (204) 1.087953ms"
Jan 22 01:21:35 foo loki[21569]: level=debug ts=2020-01-22T00:21:35.969284945Z caller=grpc_logging.go:40 duration=79.35µs method=/logproto.Pusher/Push msg="gRPC (success)"
Jan 22 01:21:35 foo loki[21569]: level=debug ts=2020-01-22T00:21:35.969550765Z caller=logging.go:44 traceID=3e65ae166b69d279 msg="POST /loki/api/v1/push (204) 905.004µs"
Jan 22 01:21:37 foo loki[21569]: level=debug ts=2020-01-22T00:21:37.06967791Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=84.702µs msg="gRPC (success)"
Jan 22 01:21:37 foo loki[21569]: level=debug ts=2020-01-22T00:21:37.069981025Z caller=logging.go:44 traceID=201567dde696c236 msg="POST /loki/api/v1/push (204) 1.196082ms"
Jan 22 01:21:38 foo loki[21569]: level=debug ts=2020-01-22T00:21:38.169341267Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=66.647µs msg="gRPC (success)"
Jan 22 01:21:38 foo loki[21569]: level=debug ts=2020-01-22T00:21:38.169623348Z caller=logging.go:44 traceID=ab341dc1311ac9b msg="POST /loki/api/v1/push (204) 921.206µs"
Jan 22 01:21:39 foo loki[21569]: level=debug ts=2020-01-22T00:21:39.269324878Z caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=78.049µs msg="gRPC (success)"
Jan 22 01:21:39 foo loki[21569]: level=debug ts=2020-01-22T00:21:39.269529875Z caller=logging.go:44 traceID=38aeb93c3e63db2a msg="POST /loki/api/v1/push (204) 861.822µs"
Jan 22 01:21:39 foo systemd[1]: Stopping Loki Log Aggregator...
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.503905379Z caller=signals.go:54 msg="=== received SIGINT/SIGTERM ===\n*** exiting"
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.503997096Z caller=loki.go:175 msg="notifying module about stopping" module=all
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504037402Z caller=loki.go:175 msg="notifying module about stopping" module=distributor
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504053725Z caller=loki.go:175 msg="notifying module about stopping" module=ingester
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504074298Z caller=loki.go:175 msg="notifying module about stopping" module=querier
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504089073Z caller=loki.go:175 msg="notifying module about stopping" module=ring
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504101715Z caller=loki.go:175 msg="notifying module about stopping" module=store
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504111054Z caller=loki.go:175 msg="notifying module about stopping" module=table-manager
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504125108Z caller=loki.go:175 msg="notifying module about stopping" module=overrides
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504139301Z caller=loki.go:175 msg="notifying module about stopping" module=server
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504152862Z caller=loki.go:157 msg=stopping module=all
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504172896Z caller=loki.go:157 msg=stopping module=distributor
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504189606Z caller=loki.go:157 msg=stopping module=ingester
Jan 22 01:21:39 foo loki[21569]: level=info ts=2020-01-22T00:21:39.504213366Z caller=lifecycler.go:632 msg="changing ingester state from" old_state=ACTIVE new_state=LEAVING
till commented 4 years ago

Nothing. I'll deploy with Environment=SYSTEMD_LOG_LEVEL=debug, hoping to see more.
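
A systemd drop-in is probably the cleanest way to add that without editing the unit itself; the file name and path below are just an example:

# /etc/systemd/system/loki.service.d/debug.conf (example drop-in)
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug

After adding the drop-in, run systemctl daemon-reload and restart the unit to pick it up.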

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

till commented 4 years ago

I can confirm that this was something related to the HV we are running on and the virtualisation technology we use. The system effectively killed everything that was doing a lot of I/O. Closing this. Thanks for the help.

slim-bean commented 4 years ago

Thanks for the follow up! Glad you were able to figure out the cause!