vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

File Source: Unicode Null Characters #19628

Open andrelec1 opened 5 months ago

andrelec1 commented 5 months ago

Problem

The .message field contains a lot of "\u0000" (null) characters before the actual line content.

[image: screenshot attached to the original issue]

Configuration

sources:
  t_log:
    type: "file"
    include: ["/***********log.tmp.log"]
    line_delimiter: "\n"
    glob_minimum_cooldown: 60000 # 1 min

transforms:
  t_log_to_json:
    type: "remap"
    inputs: ["t_log"]
    source: ".=parse_json!(.message)"

sinks:
  t_log_to_clickhouse:
    type: "clickhouse"
    inputs: ["t_log_to_json"]
    endpoint: "*******"
    database: "t_log"
    table: "{{ __ch_table }}"
    skip_unknown_fields: true
    auth:
      # ......

Version

Seen on 0.33 and 0.35.

Debug Output

The issue is too random to capture a log :/

Example Data

{"__ch_table":"metric_log","datetime":"2024-01-09 09:59:59","context_id":"1161bde682c4ed13f968000000000000","instance_id":0000,"instance_name":"eeeeeeeeeeee","method":"GET","path":"\/escaped\/path\/the\/hell\/currents","query":"","memory_usage":1123840,"nb_select":11,"nb_insert":0,"nb_update":0,"nb_delete":0,"nb_other":0,"nb_resort_select":0,"nb_resort_insert":0,"nb_resort_update":0,"nb_resort_delete":0,"nb_resort_other":0,"nb_common_select":0,"nb_common_insert":0,"nb_common_update":0,"nb_common_delete":0,"nb_common_other":0}

Additional Context

I have 2 servers that output their logs to the same NFS storage. I use a third server to run Vector, which reads the data from the NFS storage.

Someone else is having the same issue on Discord, but they are using S3 storage instead of NFS. This looks like an issue related to network storage :/

References

https://discord.com/channels/742820443487993987/1194253312212615219
https://discord.com/channels/742820443487993987/1149066915923365898

andrelec1 commented 5 months ago

From Discord:

jches: Skimming some old NFS mailing list posts, it sounds like this (reading null bytes from a file) is just something that can happen if the file is being read while it's open for writing. You'll probably need to take a similar approach as in that thread, so vector is only reading files that aren't being written to anymore https://www.spinics.net/lists/linux-nfs/msg49803.html

So this is not a 'bug' in Vector ...

But the file source needs an option to simply skip the last line it read if that line starts with null characters ...
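
In the meantime, a possible workaround is to strip the leading null characters in the remap transform before parsing the JSON. This is only a minimal, untested sketch: it reuses the t_log and t_log_to_json names from the configuration above, assumes the null characters only appear at the start of .message, and relies on VRL's replace function with a regex pattern. It does not address the underlying NFS behavior; it only cleans up what was already read.

```yaml
transforms:
  t_log_to_json:
    type: "remap"
    inputs: ["t_log"]
    source: |
      # Assumption: the null characters only appear at the start of the line.
      # Remove any run of leading NUL (\x00) characters from the raw message.
      .message = replace(string!(.message), r'^\x00+', "")
      # Parse the cleaned line as JSON, as in the original transform.
      . = parse_json!(.message)
```

If a line consists only of null characters, parse_json! will still fail on the resulting empty string, so such events would still be dropped or reported as errors depending on the transform's error handling settings.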

tamer-hassan commented 4 weeks ago

You can try this change for a similar issue, on the v0.38.0 sources: https://github.com/tamer-hassan/vector/commit/518d4e17db2a698491cc3927df39de676d7ef523 I built it only for Windows, since that is what I was concerned with.