influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Missing log entry when telegraf restarts in tail | csv #13317

Open St3f1n opened 1 year ago

St3f1n commented 1 year ago

Relevant telegraf.conf

[global_tags]

[agent]
  interval = "1s"
  round_interval = true
  metric_batch_size = 10000
  metric_buffer_limit = 100000
  collection_jitter = "1s"
  flush_interval = "11s"
  flush_jitter = "1s"
  precision = "1ms"
  debug = true
  quiet = false
  logfile = "./telegraf.log"
  logtarget = "file"
  logfile_rotation_interval = 0
  logfile_rotation_max_size = "100KB"
  logfile_rotation_max_archives = 50
  hostname = ""
  omit_hostname = true

[[inputs.tail]]
  name_override = "Product_State"
  interval = "2s"
  files = ["C://_ProduData//*//Log//ProductState//Log_ProductState_*.csv"]
  from_beginning = false
  watch_method = "poll"
  character_encoding = "utf-8"
  max_undelivered_lines = 4000
  data_format = "csv"
  csv_header_row_count = 1
  csv_column_names = ["Timestamp","ProductionRunGUID"]
  csv_column_types = ["string", "string"]
  csv_skip_rows = 0
  csv_skip_columns = 0
  csv_delimiter = ";"
  csv_trim_space = false
  csv_tag_columns = []
  csv_measurement_column = ""
  csv_timestamp_column = "Timestamp"
  csv_timestamp_format = "2006-01-02T15:04:05.999Z"
  csv_timezone = "UTC"
  csv_skip_values = [""]
  path_tag = ""
  [inputs.tail.tags]
    _in = "ProductStateTest"

[[outputs.file]]
  files = ["stdout", "./tail_test.out"]
  [outputs.file.tagpass]
   _in = ["ProductStateTest"]

Logs from Telegraf

...
2023-05-23T08:36:34Z I! Loaded inputs: tail
2023-05-23T08:36:34Z I! Loaded aggregators: 
2023-05-23T08:36:34Z I! Loaded processors: 
2023-05-23T08:36:34Z I! Loaded outputs: file
2023-05-23T08:36:34Z I! Tags enabled: 
2023-05-23T08:36:34Z I! [agent] Config: Interval:1s, Quiet:false, Hostname:"", Flush Interval:11s
2023-05-23T08:36:34Z D! [agent] Initializing plugins
2023-05-23T08:36:34Z D! [agent] Connecting outputs
2023-05-23T08:36:34Z D! [agent] Attempting connection to [outputs.file]
2023-05-23T08:36:34Z D! [agent] Successfully connected to outputs.file
2023-05-23T08:36:34Z D! [agent] Starting service inputs
2023-05-23T08:36:34Z D! [inputs.tail] Tail added for "C:\\_ProduData\\99999\\Log\\ProductState\\Log_ProductState_TEST.csv"
2023-05-23T08:36:46Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
2023-05-23T08:36:57Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
2023-05-23T08:37:09Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
2023-05-23T08:37:20Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
2023-05-23T08:37:32Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
2023-05-23T08:37:43Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
2023-05-23T08:37:55Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
2023-05-23T08:38:06Z D! [outputs.file] Wrote batch of 1 metrics in 359.1µs
2023-05-23T08:38:06Z D! [outputs.file] Buffer fullness: 0 / 100000 metrics
...

System info

Windows 10, 64-bit. The issue is verified on Telegraf versions 1.20.3 and 1.26.3.

Docker

No response

Steps to reproduce

  1. Start telegraf with a tail plugin and csv_header_row_count = 1.
  2. Read in a log file and produce log-entries "A" and "B". In this example Log_ProductState_TEST.csv with content:

Timestamp;ProductionRunGUID;
2023-05-15T04:58:24.001Z;A;
2023-05-15T04:58:25.002Z;B;

  3. Restart telegraf.
  4. Add a new log-entry "C" to the tailed log file.
  5. Add a second new log-entry "D" to the tailed log file and save the file (the resulting file content is sketched right after this list).
  6. Check the telegraf output. Only the second log-entry "D" gets shown; "C" gets missed.
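For reference, after steps 4 and 5 the tailed file would look like this (the timestamps of the appended lines are illustrative, chosen only to match the expected output below):

Timestamp;ProductionRunGUID;
2023-05-15T04:58:24.001Z;A;
2023-05-15T04:58:25.002Z;B;
2023-05-15T04:58:25.003Z;C;
2023-05-15T04:58:25.004Z;D;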

Expected behavior

Telegraf's tail CSV parsing should treat only the very first line of a file as the header, not the first new log-entry after a restart. Both log-entries "C" and "D" should get shown.

content of ./tail_test.out:

Product_State,_in=ProductStateTest ProductionRunGUID="A" 1684126704001000000
Product_State,_in=ProductStateTest ProductionRunGUID="B" 1684126705002000000
Product_State,_in=ProductStateTest ProductionRunGUID="C" 1684126705003000000
Product_State,_in=ProductStateTest ProductionRunGUID="D" 1684126705004000000

Actual behavior

Only the second log-entry "D" gets shown, the first log-entry "C" gets missed:

content of ./tail_test.out:

Product_State,_in=ProductStateTest ProductionRunGUID="A" 1684126704001000000
Product_State,_in=ProductStateTest ProductionRunGUID="B" 1684126705002000000
Product_State,_in=ProductStateTest ProductionRunGUID="D" 1684126705004000000

Additional info

Workaround 1: Use from_beginning = true. However, this is not a nice solution, as the complete files get reread whenever telegraf restarts. Unfortunately this is not an option for my use case, but maybe it is for others.
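For completeness, workaround 1 only flips an option that is already present in the tail input above:

[[inputs.tail]]
  ## Rereads the complete file on every telegraf restart, so entries "A" and "B"
  ## would be emitted again after each restart.
  from_beginning = true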

Workaround 2: One can use csv_header_row_count = 0 instead and filter out the header data later on, e.g. in a Starlark processor. However, this might limit other features and can get complex and error-prone. Therefore a fix of the bug would be highly appreciated.
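A minimal sketch of workaround 2, assuming the header line actually reaches the pipeline as a metric whose ProductionRunGUID field carries the literal header text (depending on how the timestamp column of the header line is handled, it may instead just produce a parse error); the Starlark code is illustrative only:

[[inputs.tail]]
  ## ... same tail settings as above, but treat every line as data and rely on
  ## the explicitly defined column names instead of a header row:
  csv_header_row_count = 0
  csv_column_names = ["Timestamp","ProductionRunGUID"]

[[processors.starlark]]
  namepass = ["Product_State"]
  source = '''
def apply(metric):
    # Drop any metric that still carries the literal header text.
    if metric.fields.get("ProductionRunGUID") == "ProductionRunGUID":
        return None
    return metric
'''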

srebhan commented 1 year ago

I'm not sure what you want to say. Telegraf works as configured: it starts from the end of the file and expects the first line to be the header... What you can do is define the column names by hand and remove csv_header_row_count...

St3f1n commented 1 year ago

I'm not sure what you want to say. Telegraf works as configured: it starts from the end of the file and expects the first line to be the header... What you can do is define the column names by hand and remove csv_header_row_count...

It seems to me that it rather takes the first detected line as the header, which is not the same as the very first line in the file. The result is clear and, from my point of view, a bug. The "C" line is missing, which should not be the case.

powersj commented 1 year ago

@St3f1n,

There are at least three scenarios that are at play with the tail plugin:

  1. initial - This is when we first start telegraf and read the file for the first time. Here you set the header value to 1, the first row contains headers, and the first row is skipped. No issue.
  2. overwrite - The file you are tailing can also be overwritten entirely. In this case, you probably would be writing the header again to be consistent with the previous state. No issue here either.
  3. append - Here you append new data to the file, without a header. Tail gets the new data and passes it to the CSV parser. However, how does the CSV parser know that the data it got from the tail plugin is appended data and not data from an overwrite? In your case, because the header was set to 1, we skip the first row. Hence this issue.

Telegraf tail csv should check that the header is only at the beginning of a file and not the first new log-entry.

I wanted to call these different scenarios out because it is not as simple as not skipping the header after the initial pass of the file. Additionally, the CSV parser does not know anything more than the new lines that were read in. It has no knowledge of what line it is on, and neither does telegraf. The tail library provides new lines and the time they were written.

One can use csv_header_row_count = 0 instead and filter out the header data later on, e.g. in a Starlark processor. However, this might limit other features and can get complex and error-prone. Therefore a fix of the bug would be highly appreciated.

I am open to suggestions, but using one of the workarounds or not having a header row in the first place seems to be a much more expedient option.

St3f1n commented 1 year ago

@powersj Thank you for your thoughts and the scenarios, I appreciate that very much.

However, how does the CSV parser know that the data it got from the tail plugin is appended data and not data from an overwrite? In your case, because the header was set to 1, we skip the first row. Hence this issue.

=> From my point of view the overwrite scenario might exist, but there the tail plugin is probably not the optimal plugin; the file plugin fits better. I estimate that the overwrite scenario is not the main use case for the tail plugin, maybe even the wrong one, and that the tail plugin should clearly focus on the append scenario, in particular collecting data from log files. With the stated effect, I expect that many implementations of such CSV log file collections are missing first entries after telegraf or the host restarts, without being aware of it...

I wanted to call these different scenarios out because it is not as simple as not skipping the header after the initial pass of the file. Additionally, the CSV parser does not know anything more than the new lines that were read in. It has no knowledge of what line it is on, and neither does telegraf. The tail library provides new lines and the time they were written.

Without looking into the code, I think the tail plugin knows the line number it was reading the last time, and it can probably influence this number somehow.

Other opinions or suggestions around?

powersj commented 1 year ago

Without looking into the code

This is a messy situation :) The CSV parser has some knowledge of state, but that state and the state from tail are not shared with each other. One possibility is to add an option to the tail plugin that would optionally override any header settings of the CSV parser, causing the CSV parser to parse the header only the first time. Another option is to somehow keep track of where we are in the file and, if it is not the first line, skip the header settings in the CSV parser.

We strongly dislike having these types of special cases where parsers work differently depending on the plugin, but in this case I do not see another option. We would take a PR that allows something like the above, but this isn't something we would actively work on anytime soon.

St3f1n commented 9 months ago

A messy situation, indeed :) Maybe there could be a third possibility which affects the tail plugin only and leaves the parsers untouched: define the number of leading lines of the file that tail always reads and forwards to the parser whenever tail is (re)started. By default this would be 0, so today's behavior does not change, but as an option one could increase it to the sum of metadata and header rows. Since it only counts lines, there is no direct interaction with any parser, and parsing remains the sole responsibility of the parsers themselves: tail gets the lines, the parsers get the line contents. At least for the CSV parser, the metadata and header rows are a defined number of lines. Any considerations or further ideas around?
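To make the proposal concrete, a rough configuration sketch; the initial_lines option is purely hypothetical and does not exist in telegraf today:

[[inputs.tail]]
  files = ["C://_ProduData//*//Log//ProductState//Log_ProductState_*.csv"]
  from_beginning = false
  ## Hypothetical option: whenever tail (re)starts, read and forward the first N
  ## lines of the file to the parser, so the CSV parser can consume its metadata
  ## and header rows even though tailing otherwise starts at the end of the file.
  # initial_lines = 1
  data_format = "csv"
  csv_header_row_count = 1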

srebhan commented 9 months ago

@St3f1n that might be doable. You might even be able to define

    CSVHeaderRowCount *uint64 `toml:"csv_header_row_count"`
    ...

in the tail plugin and use those instead of defining another (redundant) setting...