loggie-io / loggie

A lightweight, cloud-native data transfer agent and aggregator
https://loggie-io.github.io/docs-en/
Apache License 2.0

bug: the deleted file was not released as expected #629

Open · hansedong opened this issue 1 year ago

hansedong commented 1 year ago

What version of Loggie?

1.5.0 rc

Expected Behavior

The deleted files were not released as expected.


The Loggie configuration (loggie.yml) is:

loggie:
  reload:
    enabled: true
    period: 10s

  monitor:
    logger:
      period: 30s
      enabled: true
    listeners:
      filesource: ~
      filewatcher: ~
      reload: ~
      sink: ~
      queue: ~
      pipeline: ~
      normalize: ~

  discovery:
    enabled: false
    kubernetes:
      containerRuntime: containerd
      kubeconfig: ~/.kube/config
      typePodFields:
        namespace: "${_k8s.pod.namespace}"
        podname: "${_k8s.pod.name}"
        containername: "${_k8s.pod.container.name}"
        nodename: "${_k8s.node.name}"
        logconfig: "${_k8s.logconfig}"

  defaults:
    sink:
      type: dev
    interceptors:
      - type: schema
        name: global
        order: 700
        addMeta:
          timestamp:
            key: "@timestamp"
    sources:
      - type: file
        watcher:
          scanTimeInterval: 3s
          maxOpenFds: 6000
          readFromTail: true
  http:
    enabled: true

The configuration of pipelines is:

pipelines:
  - name: accesslog
    sources:
      - type: file
        name: access_log
        paths:
        - /DATA1/XXX/XXX/logs/*access.log
        ignoreOlder: 72h
        addonMeta: false
        workerCount: 64
        readBufferSize: 65536
        charset: utf-8
        multi:
          active: false
        db:
          file: "/etc/loggie/loggie.db"
          cleanScanInterval: 1h
          cleanInactiveTimeout: 336h
          flushTimeout: 2s
    interceptors:
      - type: maxbytes
        maxBytes: 65536
      - type: schema
        addMeta:
          timestamp:
          key: "@timestamp"
          location: Local
        remap:
          body:
            key: message
      - type: rateLimit
        qps: 30000
      - type: retry
        retryMaxCount: 5
    sink:
      type: kafka
      brokers: ["10.132.13.92:9092","10.132.13.94:9092","10.132.13.93:9092","10.132.13.95:9092","10.132.13.96:9092","10.132.13.97:9092"]
      topic: ms_accesslog
      balance: roundRobin
      compression: gzip
      batchBytes: 1000000
      requiredAcks: 1
  - name: accesslog_to_trace
    sources:
      - type: file
        name: access_log
        paths:
        - /DATA1/XXX/XXX/logs/*access.log
    interceptors:
      - type: transformer
        actions:
          - action: jsonDecode(body, tmp)
          - if: NOT oneOf(tmp.upstream_status,200,201,301,302,500,502,504)
            then:
              - action: dropEvent()
          - action: del(@timestamp)
          - action: timestamp(tmp.time_iso8601)
            fromLayout: "2006-01-02T15:04:05-07:00"
            fromLocation: Asia/Shanghai
            toLayout: unix_ms
          - action: move(tmp.time_iso8601, timestamp)
          - action: move(tmp.remote_addr, source_addr)
          - action: move(tmp.remote_port, source_port)
          - action: split(tmp.upstream)
            separator: ":"
            keys: ["dest_addr", "dest_port"]
          - action: move(tmp.server_addr, host_ip)
          - action: move(tmp.hostname, host_name)
          - action: add(host_role, msgateway)
          - action: add(event_type, connect)
          - action: add(protocol, tcp4)
          - action: fmt(net_ns)
            pattern: ${net_ns}
          - action: fmt(process_pid)
            pattern: ${process_pid}
          - action: add(process_name, openresty)
          - action: del(tmp)
    sink:
      type: dev
      printEvents: false
  - name: errorlog
    sources:
      - type: file
        name: error_log
        paths:
        - /DATA1/XXX/XXX/logs/*error.log
        ignoreOlder: 72h
        addonMeta: false
        workerCount: 64
        readBufferSize: 65536
        charset: utf-8
        multi:
          active: true
          pattern: '^\d{4}/\d{2}/\d{2}'
        db:
          file: "/etc/loggie/loggie.db"
          cleanScanInterval: 1h
          cleanInactiveTimeout: 336h
          flushTimeout: 2s
    interceptors:
      - type: maxbytes
        maxBytes: 65536
      - type: schema
        addMeta:
          timestamp:
          key: "@timestamp"
          location: Local
        remap:
          body:
            key: message
      - type: transformer
        actions:
          - action: regex(message)
            pattern: (?<datetime>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[(?<level>\w+)\] (?<body>.*)
          - action: copy(state.hostname, hostname)
      - type: rateLimit
        qps: 30000
      - type: retry
        retryMaxCount: 5
    sink:
      type: kafka
      brokers: ["10.132.13.92:9092","10.132.13.94:9092","10.132.13.93:9092","10.132.13.95:9092","10.132.13.96:9092","10.132.13.97:9092"]
      topic: ms_errorlog
      balance: roundRobin
      compression: gzip
      batchBytes: 1000000
      requiredAcks: 1

I did not configure fdHoldTimeoutWhenInactive or fdHoldTimeoutWhenRemove, and the default value for both parameters is 5m. Based on my understanding, then, a file's fd should never remain unreleased for more than 5 minutes after the file is deleted.
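For reference, here is a minimal sketch of how these two timeouts could be pinned explicitly in loggie.yml instead of relying on the defaults. The option names are the ones discussed above; nesting them under the watcher section, alongside the other watcher options in the defaults.sources block, is my assumption:

  defaults:
    sources:
      - type: file
        watcher:
          scanTimeInterval: 3s
          maxOpenFds: 6000
          readFromTail: true
          # Assumed placement: alongside the other watcher options.
          # 5m mirrors the defaults described in this issue.
          fdHoldTimeoutWhenInactive: 5m
          fdHoldTimeoutWhenRemove: 5m

With the values pinned explicitly, it is easy to confirm from the effective configuration that the 5-minute expectation should apply.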

hansedong commented 1 year ago

Having followed up on this issue over the past while, I have preliminarily identified the problem and modified the code to address it. The patched build has been running online for 3 days now, and the issue has not reappeared.

The code change is actually very minor, but I haven't fully understood the root cause of the problem yet.

Once I have time to thoroughly understand the underlying cause, I will follow up here.