vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.45k stars 1.52k forks source link

File source: wrong offset if line is written without delimiter #21084

Open mikelsid opened 4 weeks ago

mikelsid commented 4 weeks ago

A note for the community

Problem

In case a message is written partially at first (without delimiter), the offset calculation of the line is incorrect.

Configuration

The issue can be reproduced by using this script:

# fake-logger.sh
file1=tests/tdata/fake-log-1.log
echo -n "a" > $file1
sleep 3
echo "b" >> $file1
echo "c" >> $file1

file2=tests/tdata/fake-log-2.log
echo "ab" > $file2
echo "c" >> $file2

Vector config:

# vector.yaml
data_dir: tests/tdata
sources:
  log:
    include:
    - tests/tdata/fake-log*.log
    read_from: beginning
    fingerprint:
      strategy: device_and_inode
    type: file
    offset_key: offset

sinks:
  console:
    type: console
    inputs:
      - log
    encoding:
      codec: json

Version

0.41

Debug Output

❯ cargo vdev run --debug tests/vector.yaml
   Compiling file-source v0.1.0 (/home/msidorov/vector/lib/file-source)
   Compiling vector v0.41.0 (/home/msidorov/vector)
   Compiling vector-lib v0.1.0 (/home/msidorov/vector/lib/vector-lib)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 32.37s
     Running `target/debug/vector --config tests/vector.yaml`
2024-08-15T20:24:44.847421Z  INFO vector::app: Log level is enabled. level="info"
2024-08-15T20:24:44.849546Z  INFO vector::app: Loading configs. paths=["tests/vector.yaml"]
2024-08-15T20:24:44.855881Z  INFO vector::topology::running: Running healthchecks.
2024-08-15T20:24:44.856180Z  INFO vector::topology::builder: Healthcheck passed.
2024-08-15T20:24:44.856259Z  INFO vector: Vector has started. debug="true" version="0.41.0" arch="x86_64" revision=""
2024-08-15T20:24:44.856386Z  INFO source{component_kind="source" component_id=log component_type=file}: vector::sources::file: Starting file server. include=["tests/tdata/fake-log*.log"] exclude=[]
2024-08-15T20:24:44.856831Z  INFO source{component_kind="source" component_id=log component_type=file}:file_server: file_source::checkpointer: Attempting to read legacy checkpoint files.
2024-08-15T20:25:01.263737Z  INFO source{component_kind="source" component_id=log component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=tests/tdata/fake-log-1.log
2024-08-15T20:25:03.313237Z  INFO source{component_kind="source" component_id=log component_type=file}:file_server: vector::internal_events::file::source: Found new file to watch. file=tests/tdata/fake-log-2.log
{"file":"tests/tdata/fake-log-1.log","host":"...","message":"ab","offset":1,"source_type":"file","timestamp":"2024-08-15T20:25:03.314009749Z"}
{"file":"tests/tdata/fake-log-1.log","host":"...","message":"c","offset":3,"source_type":"file","timestamp":"2024-08-15T20:25:03.314240274Z"}
{"file":"tests/tdata/fake-log-2.log","host":"...","message":"ab","offset":0,"source_type":"file","timestamp":"2024-08-15T20:25:03.314342546Z"}
{"file":"tests/tdata/fake-log-2.log","host":"...","message":"c","offset":3,"source_type":"file","timestamp":"2024-08-15T20:25:03.314438288Z"}

As seen here, the offset of the first event in tests/tdata/fake-log-1.log is 1, which is unexpected

mikelsid commented 4 weeks ago

It seems like the offset was indeed set to the offset of the newly written part of the line