grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

if loki is not reachable and loki-docker-driver is activated, container apps stop and cannot be stopped/killed #2361

Open badsmoke opened 4 years ago

badsmoke commented 4 years ago

Describe the bug We have installed the loki-docker-driver on all our devices, with the Loki server on a separate machine. If the Loki server is updated/restarted or otherwise unreachable, then after a short time all containers get stuck (docker logs no longer updates). While the Loki server is unreachable, the containers can be neither stopped/killed nor restarted.

To Reproduce Steps to reproduce the behavior:

  1. start loki server (server)
  2. install loki-docker-driver on another system (can also be tested on one and the same system) (client)
     2.1. /etc/docker/daemon.json:

     ```json
     {
       "live-restore": true,
       "log-driver": "loki",
       "log-opts": {
         "loki-url": "http://loki:3100/api/prom/push",
         "mode": "non-blocking",
         "loki-batch-size": "400",
         "max-size": "1g"
       }
     }
     ```
  3. docker run --rm --name der-container -d debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"(client)
  4. docker exec -it der-container tail -f /tmp/ts shows every second the time (client)
  5. docker logs -f der-container show numbers from 0-1000000 (client)
  6. stop loki server (server)
  7. you will see that the output on the system with the loki-driver stops and that you cannot stop the container (client)
  8. docker stop der-container (client)

Expected behavior I would like all containers to continue to run as desired even if Loki is not accessible, and containers to remain startable/stoppable even when Loki is not reachable.

Environment:

Screenshots, Promtail config, or terminal output loki-docker-driver version: loki-docker-driver:master-616771a (from this version on, the driver option "non-blocking" is supported), loki server: 1.5.0

I am very grateful for any help; this problem has caused our whole system to collapse.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

badsmoke commented 4 years ago

:-(

rkno82 commented 4 years ago

This issue is being closed without any comment/feedback?

For me/us this is a major issue/blocker.

@owen-d Can you please comment? Thank you!

ondrejmo commented 4 years ago

#2017 fixed the same problem for me

rndmh3ro commented 4 years ago

#2017 fixed the same problem for me

Do you mean setting the non-blocking mode? The OP stated that they set the mode to non-blocking but it still does not work. I'll have to try it tomorrow.

rndmh3ro commented 4 years ago

I could reproduce the problem:

```
root@loki # docker run -d --log-driver=loki \
    --log-opt loki-url="http://172.29.95.195:3101/loki/api/v1/push" \
    --log-opt loki-retries=5 \
    --log-opt loki-batch-size=400 \
    --log-opt mode=non-blocking \
    --name der-container debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

Running Loki and the above client-container, then stopping Loki, the client-container fails with:

```
error from daemon in stream: Error grabbing logs: error decoding log message: net/http: request canceled (Client.Timeout exceeded while reading body)
```

ondrejmo commented 4 years ago

#2017 fixed the same problem for me

Do you mean setting the non-blocking mode? The OP stated that they set the mode to non-blocking but it still does not work. I'll have to try it tomorrow.

Yeah, I meant the non-blocking mode; I hadn't noticed it in the original issue, sorry.

rkno82 commented 4 years ago

No response? 😢

Pandry commented 4 years ago

Hi, we are testing Loki for our architecture, and I encountered this issue too.

I found out that stopping a container (any container) incurs a "penalty" of between 5 and 15 minutes when Loki is the logging driver and the destination server (either Loki or Promtail) is unreachable. In our testing architecture, the Docker log driver pushes the logs to the Promtail container, and Promtail pushes the logs to the Loki server (I thought that, since Promtail caches, this could be a good idea):

```
+-----------------------+   +--------------------+
|    Virtual Machine 01 |   | Virtual Machine 02 |
|                       |   |                    |
|   +------+--------+   |   |                    |
|   |Loki  | Docker |   |   |                    |
|   |DRIVER|        |   |   |                    |
|   +-+---++        |   |   |                    |
|   | ^   |         |   |   | +--------+         |
|   | | +-v------+  |   |   | | Loki   |         |
|   | | |Promtail+----------->+ Server |         |
|   | | +--------+  |   |   | |        |         |
|   | |             |   |   | +--------+         |
|   | +-------+     |   |   |                    |
|   | | NGINX |     |   |   |                    |
|   | +-------+     |   |   |                    |
|   +---------------+   |   |                    |
|                       |   |                    |
+-----------------------+   +--------------------+
```

At the moment we are trying mode: non-blocking; other than slowing down the stop of the promtail container itself, it seems to be ok for the other containers, but it does not solve the problem.

Is there any viable fix available at the moment?

kavirajk commented 4 years ago

I'm investigating!

You can reproduce it by directly starting any container with the Loki logger and an unreachable loki-url:

  1. with local log driver

    docker run --log-driver local --log-opt max-size=10m alpine ping 127.0.0.1
  2. with loki log driver

    docker run --log-driver loki --log-opt loki-url="http://172.17.0.1:3100/loki/api/v1/push" alpine ping 127.0.0.1

In case 1 you can stop/kill the container immediately; in case 2 you can stop/kill the container only after 5 minutes or so.

The Docker daemon log is not that useful either:

```
level=warn ts=2020-10-28T11:55:05.178484441Z caller=client.go:288 container_id=eb8c67b975f20837210c638d5f83fa1fa011c183c725af337c1fad9ffb2d3a01 component=client host=172.17.0.1:3100 msg="error sending batch, will retry" status=-1 error="Post \"http://172.17.0.1:3100/loki/api/v1/push\": dial tcp 172.17.0.1:3100: connect: connection refused"
```

Pandry commented 4 years ago

I probably figured out why it takes so much time, and my suspicion was correct; I think this is probably intended behavior: as we can read from the source code, the message is emitted inside the backoff logic loop.

If we try to start a container with the backoff options reduced to (almost) the minimum, we can see the container stops (almost) immediately:

```
docker run --log-driver loki --log-opt loki-url="http://0.0.0.0:3100/loki/api/v1/push" --log-opt loki-timeout=1s --log-opt loki-max-backoff=800ms --log-opt loki-retries=2 alpine ping 127.0.0.1
```

(If you want to keep the log file after the container has stopped, add the --log-opt keep-file=true parameter.)
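For reference, the same reduced-backoff options can also be applied host-wide via /etc/docker/daemon.json instead of per container. This is a sketch only: the option names are those used in the command above, but the URL and values are illustrative and untested:

```json
{
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "http://loki:3100/loki/api/v1/push",
    "loki-timeout": "1s",
    "loki-max-backoff": "800ms",
    "loki-retries": "2",
    "keep-file": "true",
    "mode": "non-blocking"
  }
}
```

Note that daemon.json log-opts values must be strings, and a daemon restart is required for the change to apply to new containers.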

As far as my understanding goes, though, if the driver is unable to send the logs within the backoff window, the logs will be lost (so I would consider keep-file seriously...)

In my opinion the best thing to do would be to cache the logs locally if the client is unable to send them within the backoff window, and send them later on.

kavirajk commented 4 years ago

Agreed about the backoff logic.

I tested with the fluentd log driver and it looks the same there as well, except fluentd may have a lower default backoff time (so the container stops more quickly). And I see this in the daemon log:

```
dockerd[1476]: time="2020-10-28T17:50:12.580014937+01:00" level=warning msg="Logger didn't exit in time: logs may be truncated"
```

Also, another small improvement could be to add a check to see if the loki-url is reachable during container start and fail immediately.

kavirajk commented 4 years ago

Also, the 5-minute time limit comes from the default max-backoff we use: https://github.com/grafana/loki/blob/master/pkg/promtail/client/config.go#L19
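For intuition about why containers hang for minutes, here is a sketch (not Loki code) of the worst-case time spent retrying a single batch with exponential backoff. The 500 ms minimum backoff and 10 retries are assumed defaults for illustration, not values confirmed in this thread; only the 5-minute cap comes from the comment above:

```python
def worst_case_retry_wait(min_backoff=0.5, max_backoff=300.0, max_retries=10):
    """Sum the sleeps of an exponential backoff loop: the delay doubles on
    each retry and is capped at max_backoff seconds."""
    total = 0.0
    delay = min_backoff
    for _ in range(max_retries):
        total += min(delay, max_backoff)
        delay *= 2
    return total

print(worst_case_retry_wait())  # 511.5 seconds, i.e. over 8 minutes
```

With these assumed defaults a single undeliverable batch can block the log pipeline for several minutes, which matches the 5-15 minute stop times reported earlier.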

Pandry commented 4 years ago

Also, another small improvement could be to add a check to see if the loki-url is reachable during container start and fail immediately.

I disagree: starting a service may be more important than having its logs (and debugging may not be that easy). I would rather put this behind a feature flag, disabled by default.

As I said, in my opinion the best option would be to cache the logs and send them as soon as a Loki endpoint becomes available; in the meantime, find a way to warn the user about the unreachable endpoint while caching the logs.

lux4rd0 commented 3 years ago

Agreed that a better way of maintaining control over a Docker container when the endpoint is unavailable is critical. I've been experimenting with different Loki deployment architectures and found that even a kill of the Docker container doesn't work. Not being able to shut down/restart a container because the Loki driver can't send logs shouldn't impact my container. I will look at changing my containers' default properties to get around this.

rkno82 commented 3 years ago

Maybe we should accept the behaviour of the Docker driver plugin and send the log files to a local Promtail (a kind of DaemonSet) that supports the Loki push API?

https://grafana.com/docs/loki/latest/clients/promtail/#loki-push-api

IgorOhrimenko commented 3 years ago

The problem is still present. Preparation: uninstalled and installed the newest Loki driver following the instructions at https://grafana.com/docs/loki/latest/clients/docker-driver/

Test: first, a clean baseline. I run a container with the default log driver or --log-driver=none, like this:

```
docker run --name debian-loki-test --rm debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

Stopping is very fast:

```
time docker stop debian-loki-test
real    0m10,391s
```

But when I use the Loki driver with a fake Loki address:

```
docker run --name debian-loki-test --rm --log-driver=loki --log-opt loki-url="http://loki.fake/loki/api/v1/push" --log-opt mode=non-blocking debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

stopping takes far too long:

```
time docker stop debian-loki-test
real    12m9,488s
```

But the app (seq 0 1000000) stops early, almost at once.


When the Loki address is right:

```
docker run --name debian-loki-test --rm --log-driver=loki --log-opt loki-url="http://loki.lan/loki/api/v1/push" --log-opt mode=non-blocking debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

stopping is faster:

```
time docker stop debian-loki-test
real    0m18.450s
```

And starting the container while Loki was up, then taking Loki down and bringing it back online after a few minutes:

```
time docker stop debian-loki-test
real    3m58.299s
```

Once Loki was back online, stopping was immediate.


And I tested the fix (docker plugin install kavirajk/loki-docker-driver:latest --alias loki-fix); it does not work either.

CptDaniel commented 3 years ago

Any idea on when this will be released? Or is there a workaround so we can use loki with docker in production?

ThisDevDane commented 3 years ago

We're also experiencing this problem now, and the PR that closed this issue hasn't appeared in any changelog yet, so I'm guessing it hasn't been released?

kavirajk commented 3 years ago

The fix was already merged and released (should be available since > 2.0.1).

The fix should be available in grafana/loki-docker-driver. Checking if anything changed recently.

CptDaniel commented 3 years ago

I'm by no means a contributor to this project, but it seems the changes added in https://github.com/grafana/loki/pull/2898 are in neither 2.1.0 nor 2.0.1; at least, there is still l.client.Stop() instead of l.client.StopNow() at https://github.com/grafana/loki/blob/v2.1.0/cmd/docker-driver/loki.go#L73. If there is a way I can help get this into a released version, I'm more than happy to.

kavirajk commented 3 years ago

@CptDaniel you are right. The fix is not included in any release yet.

However, anyone using grafana/loki-docker-driver:latest (which is always in sync with latest master) should have the fix.

I'm investigating why it doesn't work on grafana/loki-docker-driver:latest. I will keep you posted.

kavirajk commented 3 years ago

Some findings.

The docker-driver (with the fix) seems to work, at least in some cases.

With grafana/loki-docker-driver:2.1.0 (without the fix), the container takes long to stop:

```
-bash5.0$ docker plugin install grafana/loki-docker-driver:2.1.0 --alias loki-2.1.0

-bash5.0$ docker run --name alpine-loki-test --rm --log-driver=loki-2.1.0 --log-opt mode=non-blocking --log-opt loki-url="http://loki.fake/loki/api/v1/push" alpine ping 127.0.0.1

-bash5.0$ docker ps
CONTAINER ID   IMAGE     COMMAND            CREATED         STATUS         PORTS     NAMES
59ee91811f72   alpine    "ping 127.0.0.1"   3 seconds ago   Up 2 seconds             alpine-loki-test

-bash5.0$ time docker kill 59ee91811f72
59ee91811f72

real    6m5.343s
user    0m0.049s
sys     0m0.051s
```

With grafana/loki-docker-driver:latest (with the fix), the container stops immediately:

```
-bash5.0$ docker plugin install grafana/loki-docker-driver:latest --alias loki

-bash5.0$ docker run --name alpine-loki-test --rm --log-driver=loki --log-opt mode=non-blocking --log-opt loki-url="http://loki.fake/loki/api/v1/push" alpine ping 127.0.0.1

-bash5.0$ docker ps
CONTAINER ID   IMAGE     COMMAND            CREATED         STATUS         PORTS     NAMES
9c396baa78d8   alpine    "ping 127.0.0.1"   3 seconds ago   Up 2 seconds             alpine-loki-test

-bash5.0$ time docker kill 9c396baa78d8
9c396baa78d8

real    0m1.298s
user    0m0.036s
sys     0m0.026s
```

But having said all this, when you run this process:

```
/bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

the fix still doesn't work.

I am currently investigating to understand this behaviour.

CptDaniel commented 3 years ago

Hi @kavirajk, is there any new information regarding this weird behaviour not covered by the fix?

nikitagashkov commented 3 years ago

Hello, I'm experiencing a similar issue with the Loki driver.

Containers become unresponsive and cannot be stopped. The only thing that helps to revive the system is to restart the Docker daemon itself via systemctl restart docker.

Maybe you'll find these logs to be useful:

```
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="main.(*loki).Log(0xc000295f80, 0xc000441030, 0xc000092000, 0x0)" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="\t/src/loki/cmd/docker-driver/loki.go:69 +0x2fb" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="\t/src/loki/cmd/docker-driver/driver.go:165 +0x4c2" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
Apr 25 10:20:42 united-dance dockerd[339888]: time="2021-04-25T10:20:42Z" level=error msg="\t/src/loki/cmd/docker-driver/driver.go:116 +0xa75" plugin=028a661f9c18e7ca1dbad2ba43459f6217f912ae65e6a3984ade60fcaca36aff
```

pgassmann commented 3 years ago

I am now using vector.dev as a log collector on Docker hosts. It collects the logs through the Docker API, does not need a driver, and does not require changing the configuration of the containers.

The following configuration will send the logs as json to loki.

/etc/vector/vector.toml

```toml
[sources.docker-local]
  type = "docker_logs"
  docker_host = "/var/run/docker.sock"
  exclude_containers = []

  # Identify zero-width space as first line of a multiline block.
  multiline.condition_pattern = '^\x{200B}' # required
  multiline.mode = "halt_before" # required
  multiline.start_pattern = '^\x{200B}' # required
  multiline.timeout_ms = 1000 # required, milliseconds

[sinks.loki]
  # General
  type = "loki" # required
  inputs = ["docker*"] # required
  endpoint = "https://loki.example.com:443" # required

  # Auth
  auth.strategy = "basic" # required
  auth.user = "username" # required
  auth.password = "asdfasdf" # required

  # Encoding
  encoding.codec = "json" # required

  # Healthcheck
  healthcheck.enabled = true # optional, default

  # Loki Labels
  labels.forwarder = 'vector'
  labels.host = '{{ host }}'
  labels.container_name = '{{ container_name }}'
  labels.compose_service = '{{ label.com\.docker\.compose\.service }}'
  labels.compose_project = '{{ label.com\.docker\.compose\.project }}'
  labels.source = '{{ stream }}'
  labels.category = 'dockerlogs'
```

Example Loki query from a Docker logs dashboard for output:

```
{host=~"$host",category="dockerlogs",compose_project=~"$project",compose_service=~"$service"} |~ "$search"
| json | line_format "{{.container_name}} {{ .source }} {{.message}}"
```

NOTE: vector currently only reads live logs; it does not collect past logs and does not support checkpointing, so you might miss some logs when vector is stopped/restarted or started later than your services. See https://github.com/timberio/vector/issues/7358

GameBurrow commented 3 years ago

Any updates on this?

jimmy0012 commented 3 years ago

Still an issue for us

egor-spk commented 3 years ago

Any updates?

jeschkies commented 3 years ago

I investigated the issue in #4082. It also relates to https://github.com/moby/moby/issues/42705.

@Pandry is correct. What happens is the following: the Docker daemon locks when it tries to write the logs of a container. If the container is killed, it waits until the logs have been drained. The Loki driver receives the logs and passes them on. If the Loki server is unreachable, it retries each batch. This can take a very long time, and the Docker daemon appears deadlocked.

The situation is even worse when you set max_retries=0, since this retries forever. Loki uses Cortex's backoff logic.

Unfortunately, one cannot simply restart the Loki driver, as the Docker daemon will not reconnect.

I'm not sure what the expected behavior of the driver would be, other than losing logs.

However, we do have a setup where Promtail runs on each node and scrapes the Docker container log files. Let me dig up the configuration.

okaufmann commented 3 years ago

Would love to see your setup 👍

jeschkies commented 3 years ago

This is just a draft, but it should pick up all Docker container logs:

```yaml
configs:
- name: default
  positions:
    filename: /opt/grafana-agent/loki-positions.yaml
  clients:
    - url: https://....grafana.net/loki/api/v1/push
      basic_auth:
        username: '29'
        password: {{ getenv "LOGS_PUBLISH_KEY" }}
  scrape_configs:
    - job_name: system
      pipeline_stages: []
      static_configs:
      - labels:
          job: docker
          host: {{ getenv "HOSTNAME" }}
          __path__: /var/lib/docker/containers/*/*-json.log
```

Sorry, this is a template. We should also extend the pipeline to add the container name. Let me bring this up in the team meeting.

As for the driver, I'm not sure how we can solve the problem without writing to disk first. Docker gives the driver a Unix FIFO; there's no real caching.

jeschkies commented 3 years ago

I've talked to the team and updated the documentation. I don't think we can currently come up with a satisfying solution in the Docker driver. It was made to avoid writing logs to disk, so I don't see a way to avoid blocking or dropping log entries. If you can, I recommend using Promtail in production as described in the documentation (see my PR).

durcon commented 3 years ago

@jeschkies Unfortunately, your template has a big disadvantage: no container_name label anymore.

The filename label is no substitute, because the containers' log file names are not human-readable, e.g.:

2ea69841bd67364754e2b352c9833abd5222a6db1f252001b7fcdb35ad029c8b-json.log

Another problem is that redeploying a container changes the container ID, so I can't search by the filename label, because that label changes too.

I can't add a container_name label (in static_configs), because I don't know the path of a specific container's log file, e.g.:

/var/lib/docker/containers/2ea69841bd67364754e2b352c9833abd5222a6db1f252001b7fcdb35ad029c8b

After redeployment the path is changed, too.

It would help if I could set the file name, but Docker's default log driver doesn't have such an option; see JSON File logging driver.

One way would be to log the container name in every log line and use a pipeline to extract the container_name label. But for most loggers you would have to change every log statement in your application; you can't just add it with a single configuration change. This is an error-prone way.

Another way is to change the logging from stdout/stderr to file logging in the application and then add a mount on the Docker host. Promtail could scrape this mounted file. But for 10 or more Docker containers this is a lot to configure and maintain. This is also an error-prone way.

It would be helpful if Promtail had a Docker discovery like the Kubernetes discovery; then Promtail could inspect the container and get the container name and log path.

My conclusion: I can't use the Loki Docker driver, because it crashes the Docker engine, and I can't use Promtail, because it is error-prone, so Loki isn't useful in a pure Docker environment.

jeschkies commented 3 years ago

Thanks for the feedback @durcon.

No label container_name anymore.

You can export the container name label to the JSON logger.

Loki Docker Driver, because it crashes the Docker engine

The driver does not crash the Docker engine; the Docker engine hangs until all logs are flushed. You can configure the driver to drop logs after a number of retries if Loki is down. This will avoid blocking the Docker engine.

I can't use Promtail, because it is error-prone

Could you give some details? What errors did you encounter?

durcon commented 3 years ago

@jeschkies Thank you for your quick answer.

You can export the container name label to the JSON logger.

Can you elaborate a little bit? I'm trying --log-opt label=XXX (see JSON File logging driver), with no success yet.

The driver does not crash the Docker engine; the Docker engine hangs until all logs are flushed. You can configure the driver to drop logs after a number of retries if Loki is down. This will avoid blocking the Docker engine.

Sorry, wrong wording. Sometimes, for various reasons, our Loki is offline for a while, and then all Docker containers stop responding. In most cases a restart of the Docker daemon helps.

I can't use Promtail, because it is error-prone

Could you give some details? What errors did you encounter?

The missing container_name label is a no-go. I am trying to solve it. I listed some workarounds above, but all are error-prone or at least less maintainable.

However, it would be nice if Grafana could solve the Docker driver problem, or at least add more support to Promtail (Docker discovery). It would be easier to use and maintain.

BTW:

As for the driver, I'm not sure how we can solve the problem without writing to disc first.

Why is that a problem? Many users use keep-file anyway. It's helpful that you can still use docker container logs.

jeschkies commented 3 years ago

Can you elaborate it a little bit?

If you use docker run, this should work:

```
docker run --label name=my-container --log-driver json-file --log-opt labels=name mingrammer/flog -l -s 20
```

Or you can tag the logs:

```
docker run --name my-container --log-driver json-file --log-opt tag="{{.Name}}" mingrammer/flog -l -s 20
```

Why is that a problem?

There might be cases when users do not want to write to disk. However, we could make it optional to flush to disk in case Loki is not reachable.

durcon commented 3 years ago

@jeschkies Thank you, again.

I could add the label:

```
docker run --rm --label name=test --log-driver json-file --log-opt labels=name --name der-container -d debian /bin/sh -c "while true; do date >> /tmp/ts ; seq 0 1000000; sleep 1 ; done"
```

Result:

{"log":"0\n","stream":"stdout","attrs":{"name":"test"},"time":"2021-10-26T14:00:24.395265092Z"}
{"log":"1\n","stream":"stdout","attrs":{"name":"test"},"time":"2021-10-26T14:00:24.395298231Z"}

I guess I still need a pipeline to extract the label from the nested structure. I will try it.
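A sketch of such a pipeline, assuming Promtail's json stage with its source option for nested objects; the container_name label name is illustrative:

```yaml
pipeline_stages:
  # First json stage: pull the nested attrs object out of the json-file line.
  - json:
      expressions:
        output: log
        attrs: attrs
  # Second json stage: parse the attrs object extracted above.
  - json:
      expressions:
        container_name: name
      source: attrs
  # Promote the extracted value to a label.
  - labels:
      container_name:
```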

durcon commented 3 years ago

@jeschkies Just one point I forgot:

Warning

The json-file logging driver uses file-based storage. These files are designed to be exclusively accessed by the Docker daemon. Interacting with these files with external tools may interfere with Docker’s logging system and result in unexpected behavior, and should be avoided.

https://docs.docker.com/config/containers/logging/json-file/

That means the Promtail solution is only a workaround, with some risk.

jeschkies commented 3 years ago

I guess I still need a pipeline to extract the label from the nested structure

You can do it in a pipeline, or later at query time. Do you want to use the container name as a stream label?

durcon commented 3 years ago

@jeschkies Yes, I want the container name as a stream label, like the default container_name label of the Loki Docker driver.

BTW: I saw a video about Loki 2.0 recommending a query instead of a pipeline, because pipelines (additional labels) can cause performance issues.

jeschkies commented 3 years ago

Pipelines (additional labels) would result in performance issues.

That's why I asked. If you have many different container names, it should be extracted at query time; otherwise you will end up with too many different log streams.
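To illustrate the query-time approach: LogQL's json parser flattens nested fields with underscores, so the attrs.name field from the json-file output shown earlier can be filtered at query time instead of being indexed as a stream label. This is a sketch; the job label and container name are assumptions:

```logql
{job="docker"} | json | attrs_name="my-container" | line_format "{{.log}}"
```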

durcon commented 3 years ago

@jeschkies

I'm still trying to read logs with Promtail instead of Loki Docker Driver. It is really annoying, because Promtail misses support.

I installed Promtail:

https://sbcode.net/grafana/install-promtail-service/

I created a user for Promtail, but this user can't read Docker's logs because Docker runs as root, so I have to run Promtail as root. That is not really good.

Another problem is that I see raw JSON messages in Grafana:

```
2021-11-08 12:45:16 | {"log":"2021-11-08 09:32:28,651 [main] INFO org.springframework.boot.web.embedded.tomcat.TomcatWebServer - Tomcat initialized with port(s): 8080 (http)\n","stream":"stdout","time":"2021-11-08T09:32:28.668710476Z"}
```

With JSON messages I can't use Grafana's time range feature; I have to use pipelines to extract the timestamp and the message:

```yaml
pipeline_stages:
- json:
    expressions:
      message:   log
      stream:    stream
      timestamp: time
- timestamp:
    source: timestamp
    format: RFC3339Nano
- labels:
    stream:
- output:
    source: message
```

Could you please ask the Promtail devs why they don't support a Docker discovery like the Kubernetes discovery? Docker has a REST API to read logs; there is no need to read logs from files. Moreover, it is not recommended to do so.

See: https://docs.docker.com/engine/api/v1.41/#operation/ContainerLogs

Thank you.

jeschkies commented 3 years ago

Thanks for your feedback.

It is really annoying, because Promtail misses support.

What do you mean, there is no support? We actively work on Promtail :slightly_smiling_face:

So I have to run Promtail as root. That is not really good.

You could grant the Promtail user access to the Docker logs. However, what is bad about running Promtail as root?

I have to use piplines to extract timestamp and message.

Which is the proper solution here. I think you've solved it pretty well.

Could you please ask the Promtail devs why they don't support a Docker discovery like the Kubernetes discovery?

That is a great feature request. I've created https://github.com/grafana/loki/issues/4703.

hervenicol commented 2 years ago

I was puzzled by the doc (https://grafana.com/docs/loki/latest/clients/docker-driver/#know-issues), which did not state clearly whether I should use loki-docker-driver or promtail, and what the pros and cons of each were.

To sum it up, and check if I understood correctly the current status:

loki-docker-driver, as described in the doc (https://grafana.com/docs/loki/latest/clients/docker-driver/#know-issues), blocks when Loki is down. :disappointed: Various config options could mitigate this, but I'm not sure about the side effects.

So the documentation advises using promtail. But then we lose all the automatic tagging that loki-docker-driver does. :cry: We can tag containers manually, but this is manual and error-prone. :scream: Also, files are still written to disk, which can have a performance impact and is avoided with loki-docker-driver.

As of today, it looks like there's no clue how to fix the issue, and no clear direction on whether loki-docker-driver has a future / is recommended or not. :shrug:

srstsavage commented 2 years ago

I'm watching #4911 with great interest. From what I gather using promtail with the new Docker target will be The New Way and loki-docker-driver will become somewhat discouraged?

In the meantime, I can verify that even with defensive values in the suggested log opts:

```
loki-retries: "2"
loki-batch-size: "5000"
loki-max-backoff: "10s"
mode: "non-blocking"
```

I am still observing Docker freeze ups on containers using the loki logging driver when the loki endpoint is unavailable (i.e. commands like docker logs, docker rm, docker inspect, etc hang until either the loki endpoint comes back up or the Docker daemon is restarted).

Since by default loki-logging-driver still writes logs to disk via json-file (unless no-file is true) it seems somewhat odd that it can't be adjusted to allow Docker operations when the loki endpoint is unavailable...but I suppose that's a use case that promtail is more suited to.

I do wonder how the newish Docker local binary log format plays into this... it seems like adjusting loki-docker-driver to play nicely with this format would be an easier lift than adjusting promtail, but I haven't looked into the details there.

Edit: after looking over #4911 it seems like the local format might play nice with promtail after all, since logs will be read from the Docker socket and not parsed from files on disk?

jeschkies commented 2 years ago

I am still observing Docker freeze ups on containers using the loki logging driver when the loki endpoint is unavailable

Hm, that's interesting. The logging driver should give up on a batch and continue and eventually flush stdout and stderr of a container. I'm wondering if there is an issue lurking here that we didn't think of.

Since by default loki-logging-driver still writes logs to disk via json-file

I'm not sure this is true. Could you point to a config or code supporting that statement?

since logs will be read from the Docker socket and not parsed from files on disk?

The change #4911 will read from the Docker socket. However, according to the API docs "This endpoint works only for containers with the json-file or journald logging driver." This is very sensible as it will allow for Promtail to pick up logs on a later point in time which is exactly what we want in case Loki is not reachable.

jeschkies commented 2 years ago

Ok, I've got the Docker service discovery working. It'll fetch the logs via the Docker daemon API. For the adventurous among you, please check out my changes in https://github.com/grafana/loki/pull/4911 and try them out :slightly_smiling_face:

llacroix commented 2 years ago

@jeschkies from my understanding, this allows promtail to discover containers and collect their logs using the Docker API. Does that mean the current Loki driver will be fixed to log like json-file first and then internally run a promtail that takes the logs from the Docker daemon and pushes them to Loki whenever possible?

Or do we have to set up a promtail service that does more or less the same job, and keep the default json-file logger?

I had an issue today with the Loki driver being unable to start because the Loki endpoint was behind the proxy. I'm unsure how I was even able to set up the log driver initially.

But yes that merge request is right on time.

srstsavage commented 2 years ago

Or we have to setup a promtail service that will do more or less the same job, and keep the default json-file logger.

This is correct. If the new promtail functionality works as expected I think the Loki Docker logging driver will basically be considered deprecated.

I suspect that the new promtail feature will work with the more efficient Docker local logging driver as well, but I haven't tested it yet.