vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.82k stars 1.58k forks

Support checkpointing and reading old logs with `docker_logs` source #7358

Open pgassmann opened 3 years ago

pgassmann commented 3 years ago

Vector Version

vector 0.13.1 (v0.13.1 x86_64-unknown-linux-gnu 2021-04-29)

Expected Behavior

Vector checkpoints its position in the Docker container logs and, after a restart, picks up the logs that were emitted while it was down.

Actual Behavior

Logs are only read from the moment Vector is started. When Vector is restarted, the logs emitted by the containers in the meantime are missing. When the host reboots and the Docker containers start before Vector, the containers' startup logs are missing.

Example Data

Additional Context

In #1107 in the description, there is this point:

  • docker source explicitly only looks at records starting at Vector start time, so no changes required

@jszwedko This is not explicitly mentioned in the docker source documentation. https://vector.dev/docs/reference/configuration/sources/docker_logs/#how-it-works

References

Vector Configuration File

# Set global options
data_dir = "/var/lib/vector"

# Read logs from Docker API and Send to loki sink

[sources.docker-local]
  type = "docker_logs"
  docker_host = "/var/run/docker.sock"
  exclude_containers = []

  # Identify zero-width space as first line of a multiline block.
  multiline.condition_pattern = '^\x{200B}' # required
  multiline.mode = "halt_before" # required
  multiline.start_pattern = '^\x{200B}' # required
  multiline.timeout_ms = 1000 # required, milliseconds

[sinks.loki]
  # General
  type = "loki" # required
  inputs = ["docker-local"] # required
  endpoint = "https://logs.example.com:443" # required

  # Auth
  auth.strategy = "basic" # required
  auth.user = "" # required
  auth.password = "" # required

  # Encoding
  encoding.codec = "json" # required

  # Healthcheck
  healthcheck.enabled = true # optional, default

  # Labels
  labels.forwarder = 'vector'
  labels.host = '{{ host }}'
  labels.container_name = '{{ container_name }}'
  labels.compose_service = '{{ label.com\.docker\.compose\.service }}'
  labels.compose_project = '{{ label.com\.docker\.compose\.project }}'
  labels.source = '{{ stream }}'
  labels.category = 'dockerlogs'
pgassmann commented 3 years ago

Related to #7336

jszwedko commented 3 years ago

Thanks @pgassmann . I agree this would be good to do. I updated it to an enhancement as I believe the current behavior is intentional (albeit undocumented) given the documentation also doesn't document the inverse, that it will pick up logs from before it started.

pgassmann commented 3 years ago

@jszwedko what are the plans to implement checkpointing and history for Docker logs? We are currently experiencing issues with Vector shipping logs from Docker to Loki, and because of this missing feature we are losing hours of logs.

jszwedko commented 2 years ago

@pgassmann unfortunately no movement on this yet. You could consider eschewing the `docker_logs` source and using the `file` source with the container log files. This would give you checkpointing, but you would lose the container metadata that the `docker_logs` source adds.
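A minimal sketch of that workaround, assuming the default `json-file` Docker log driver and its standard log path (the path and the remap step are assumptions, not from this thread; the `file` source checkpoints its offsets under `data_dir`):

```toml
# Sketch: read container logs directly from the json-file driver files.
# Offsets are checkpointed under data_dir, so logs survive Vector restarts.
[sources.docker-files]
  type = "file"
  include = ["/var/lib/docker/containers/*/*-json.log"]
  read_from = "beginning"

# Docker writes each line as JSON: {"log":"...","stream":"stdout","time":"..."}
# Parse it so downstream sinks see the actual log line and stream.
[transforms.docker-parse]
  type = "remap"
  inputs = ["docker-files"]
  source = '''
  . = parse_json!(.message)
  '''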

pgassmann commented 2 years ago

The latest release of Loki/Promtail has support for Docker service discovery, which seems to combine service discovery through the API with reading logs from the JSON files, and it supports checkpointing: https://grafana.com/docs/loki/latest/clients/promtail/configuration/#docker_sd_config

That's a major selling point for moving back to Promtail, as currently with Vector we cannot guarantee that all Docker logs are transported to Loki.

pgassmann commented 2 years ago

Checkpointing should be fairly simple to implement by using the `since` option of the Docker log API. Currently it is always set to "now", but it could instead be set to the last known timestamp, or to a timestamp calculated as `now - max_lookback_duration`.

https://github.com/vectordotdev/vector/blob/b9f661b8b6066774cae34595f1b6231dc8d4586f/src/sources/docker_logs.rs#L271-L275

The `docker_logs` source just has to keep track of the last read timestamp per container. This could be tied into acknowledgements (#7650), i.e. only update the checkpoint timestamp once the event is confirmed by the sink.

After a restart, Vector should query the logs of all containers with saved checkpoints, even if they are no longer running (e.g. stopped after Vector was stopped); their remaining logs can still be queried from the Docker API. For yet-unknown containers, Vector should also query a configurable amount of time before "now" (`max_lookback_duration`), e.g. for a container that was created/started shortly before Vector.
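The bookkeeping described above could be sketched roughly like this (a hypothetical illustration, not actual Vector code; the `Checkpoints` type and `max_lookback` name are made up for the example):

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical checkpoint store: last acknowledged log timestamp per container.
struct Checkpoints {
    last_acked: HashMap<String, u64>, // container id -> unix seconds
    max_lookback: Duration,           // fallback window for unknown containers
}

impl Checkpoints {
    fn new(max_lookback: Duration) -> Self {
        Self { last_acked: HashMap::new(), max_lookback }
    }

    /// `since` value to pass to the Docker log API: resume from the checkpoint,
    /// or fall back to `now - max_lookback` for yet-unknown containers.
    fn since(&self, container: &str, now: u64) -> u64 {
        self.last_acked
            .get(container)
            .copied()
            .unwrap_or_else(|| now.saturating_sub(self.max_lookback.as_secs()))
    }

    /// Advance the checkpoint only after the sink acknowledges the event,
    /// so a crash between read and delivery re-reads the line instead of losing it.
    fn ack(&mut self, container: &str, ts: u64) {
        let entry = self.last_acked.entry(container.to_string()).or_insert(ts);
        if ts > *entry {
            *entry = ts;
        }
    }
}

fn main() {
    let mut cp = Checkpoints::new(Duration::from_secs(4 * 3600));
    let now = 1_700_000_000;
    // Unknown container: look back max_lookback_duration.
    assert_eq!(cp.since("web", now), now - 4 * 3600);
    // After an acknowledged event, resume from its timestamp.
    cp.ack("web", now - 60);
    assert_eq!(cp.since("web", now), now - 60);
    println!("since for web = {}", cp.since("web", now));
}
```

Keeping the update in `ack` monotonic means out-of-order acknowledgements cannot move a checkpoint backwards and cause duplicate reads.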

cc @bruceg

pgassmann commented 1 year ago

@jszwedko @bruceg Can someone please look into this and give feedback on my suggestion? Vector again lost some important logs during a migration because of this issue and #16806.

jszwedko commented 1 year ago

> @jszwedko @bruceg Can someone please look into this and give feedback on my suggestion? Vector again lost some important logs during a migration because of this issue and #16806.

Ouch, that sucks. I'm sorry to hear that you lost some logs. Your suggestion for the checkpointing strategy makes sense to me. Unfortunately I don't know when exactly we would get to it, but we'd be happy to help support a PR if you (or anyone else) are motivated.

pgassmann commented 4 months ago

There hasn't been any progress on this for two years now. Our devs are getting frustrated because we lose logs during maintenance windows in which we reboot the hosts. We will now have to switch to a different log collector.

pgassmann commented 3 months ago

We have now switched to Promtail for collecting Docker container logs. Here's our Promtail configuration; you can find our full config in the Ansible role: https://github.com/teamapps-org/ansible-collection-teamapps-general/tree/main/roles/promtail

scrape_configs:
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
        # filters:
        #   - name: name
        #     values: [test-container]
    relabel_configs:
      - source_labels: ['__meta_docker_container_label_com_docker_compose_container_number']
        target_label: 'compose_container_number'
        action: 'replace'
        replacement: '${1}'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project']
        target_label: 'compose_project'
        action: 'replace'
        replacement: '${1}'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_project_working_dir']
        target_label: 'compose_project_working_dir'
        action: 'replace'
        replacement: '${1}'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_oneoff']
        target_label: 'compose_oneoff'
        action: 'replace'
        replacement: '${1}'
      - source_labels: ['__meta_docker_container_label_com_docker_compose_service']
        target_label: 'compose_service'
        action: 'replace'
        replacement: '${1}'
      - source_labels: ['__meta_docker_container_id']
        target_label: 'container_id'
        action: 'replace'
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container_name'
        regex: '/(.*)'
        action: 'replace'

      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'stream'
        action: 'replace'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'source'
        action: 'replace'
      - source_labels: ['__meta_docker_container_log_stream']
        target_label: 'source_type'
        action: 'replace'

      - target_label: 'category'
        replacement: 'dockerlogs'
      - target_label: 'job'
        replacement: 'docker'
      ## Map all labels
      # - action: labelmap
      #   regex: '__meta_docker_container_label_(.+)'
      #   replacement: 'container_labels_${1}'
    pipeline_stages:
      # combine multiline messages like stacktraces to one message,
      # needs configuration in application to prefix logs with [zero-width-space char.](https://unicode-explorer.com/c/200B)
      - multiline:
          firstline: '^\x{200B}'
          max_wait_time: 1s
      - drop:
          older_than: 4h
          drop_counter_reason: "line_too_old"