open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

filelog receiver reads old logs even when start_at is set to end #36091


obs-gh-enricogiorio commented 2 weeks ago

Hi @djaglowski, I am still seeing an issue where filelog starts reading from the beginning of files despite a fix that was mentioned in this other issue. That issue mentions that when the maximum number of concurrent files is reached, the receiver enters "batch" mode and starts reading files from the beginning, regardless of what start_at is set to. I made sure start_at is set to end, and I have checked the receiver's logs: nothing mentions that any batching happened. Also, given the number of log files in the system, I can be fairly confident that the default limit of 1024 concurrent files has never been reached.

For instance, on 16/10/24 13:22:24.363 CEST, the receiver read a log line with timestamp 14/05/24 14:40:41.042 CEST, which is >5 months old.

[screenshot: BUNDLE_TIMESTAMP vs. log timestamp, with a delta column showing a ~4-month discrepancy]

From this screenshot you can see:

- BUNDLE_TIMESTAMP is when the log line was parsed (technically it's when it reached our system, but that's likely just a few seconds later)
- timestamp is the log's own timestamp
- delta is the discrepancy between the two; the logs you see are 4 months old

The purple bits above are where the data falls over time. You can see that this did not happen constantly but only twice, with many instances of very old logs being read all together.

This is how I configured the receiver:

```yaml
  filelog:
    exclude: '[]'
    include: '["/var/log/pods/*/*/*.log", "/var/log/kube-apiserver-audit.log"]'
    include_file_name: false
    include_file_path: true
    exclude_older_than: 24h
    operators:
    - id: container-parser
      max_log_size: 102400
      type: container
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 5m
    start_at: "end"
    storage: file_storage
```
djaglowski commented 2 weeks ago

> For instance, on 16/10/24 13:22:24.363 CEST, the receiver read a log line with timestamp 14/05/24 14:40:41.042 CEST, which is >5 months old.

Just to be clear, neither the start_at nor the exclude_older_than setting is intended to filter logs based on their timestamps. These settings are both concerned with properties of the files themselves.

The start_at setting only has an effect when the receiver first starts running. From then on, any file which appears to be a new file will be ingested from the beginning of the file. Given that, is there any chance that these files containing old logs were moved, or could new logs have been written to the files?

VihasMakwana commented 2 weeks ago

Perhaps related to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/32727. Not sure though

VihasMakwana commented 2 weeks ago

Maybe, what's happening is:

obs-gh-enricogiorio commented 2 weeks ago

I need to check if it's even possible that k8s does that. The files I am parsing are all and only kubernetes log files (pod logs).

Just to be clear: what you are saying is that k8s might create a new log file (maybe mv an old file), so the receiver will (obviously) start reading it from the beginning, but that file might contain old log lines?

And one more clarification before I start looking at the files (I think I will check whether there are files whose creation timestamp is more recent than the first log row's timestamp).

file_storage is optional: if you don't set it, the receiver won't use any external storage; it just keeps pointers in memory. This means that upon restart, start_at: end is enforced. HOWEVER, if you set an external storage (either file or db), you DON'T start at the end but pick up from where you left off, right? So is my understanding correct that start_at: end is only relevant when not setting file_storage?

obs-gh-enricogiorio commented 1 week ago

I confirm that this issue is NOT due to k8s renaming files, but it happens at startup.

I am deploying a stack of agent replicas via helm charts. When I run helm install ..., the following error shows up instantly in the agent container logs:

```
2024-11-06T10:23:57.194Z        error   internal/queue_sender.go:100    Exporting failed. Dropping data.        {"kind": "exporter", "data_type": "logs", "name": "otlphttp/observe/base", "error": "not retryable error: Permanent error: rpc error: code = Unknown desc = error exporting items, request to https://102.collect.observe-o2.com/v2/otel/v1/logs responded with HTTP Status Code 422, Message=some or all observation timestamps are too old or too far in the future: [1730547034602821748, 1730888635972996101), Details=[]", "dropped_items": 4096}
go.opentelemetry.io/collector/exporter/exporterhelper/internal.NewQueueSender.func1
        /home/runner/work/observe-agent/observe-agent/vendor/go.opentelemetry.io/collector/exporter/exporterhelper/internal/queue_sender.go:100
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
        /home/runner/work/observe-agent/observe-agent/vendor/go.opentelemetry.io/collector/exporter/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
        /home/runner/work/observe-agent/observe-agent/vendor/go.opentelemetry.io/collector/exporter/internal/queue/consumers.go:43
```

(This is our observability backend, which collects data sent via the OTel agent, rejecting the logs with a 422 because their timestamps are too old.) The error comes from the exporter, but obviously this is because filelog has read some old logs that were supposed to be left behind.

Is it possible that a change in the agent config/code makes the receiver start from the beginning of the logs at startup, even if we specify start_at: end? I don't see why it would start at the beginning at startup otherwise...

djaglowski commented 1 week ago

> Just to be clear: what you are saying is that k8s might create a new log file (maybe mv an old file), so the receiver will (obviously) start reading it from the beginning, but that file might contain old log lines?

No, the receiver will not read a moved file at all. It will recognize it as the same file it has previously seen and ignore it (unless there are new lines to read, in which case it will read those).

> file_storage is optional: if you don't set it, it won't use any external storage, it just keeps pointers in memory. This means that upon restart, the start_at: end is enforced. HOWEVER, if you set an external storage (either file or db), you DON'T start at the end, but you pick up from where you left, right?

Correct

> So is my understanding correct that start_at: end is only relevant when not setting file_storage?

Pretty much, but to be more specific, it is only relevant the first time the receiver polls for files. If there is a storage extension with previously saved state, then the receiver knows it is not the first time it's polled for files.

djaglowski commented 1 week ago

> Is it possible that a change in the agent config/code makes the receiver start from the beginning of the logs at startup, even if we specify start_at: end?

If the files appear after the receiver has already been running, then it knows they are new files and reads them from the beginning. start_at: end is just a way to say "I don't want to read logs that already exist at the time the receiver starts". If files are found later, they are read regardless of start_at because they are considered new.