Open · St3f1n opened this issue 1 year ago
I'm not sure what you want to say. Telegraf works as configured, it starts from the end of the file and expects the first line to be the header... What you can do is to define the column names by hand and remove csv_header_row_count
...
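For illustration, that suggestion would roughly mean the following parser settings in the tail input (a sketch, not taken from the thread; the column names, the semicolon delimiter, and the trailing empty column name are assumptions based on the reproduction file shown further below):

  data_format = "csv"
  csv_delimiter = ";"
  ## columns defined by hand instead of being read from a header row;
  ## the empty trailing name accounts for the trailing ';' on each row
  csv_column_names = ["Timestamp", "ProductionRunGUID", ""]
  ## csv_header_row_count is removed, i.e. left at its default of 0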
I'm not sure what you want to say. Telegraf works as configured, it starts from the end of the file and expects the first line to be the header... What you can do is to define the column names by hand and remove csv_header_row_count
...
It seems to me that it rather takes the first detected line as the header, which is not the same as the very first line in the file. The result is clear and, from my point of view, a bug. The "C" line is missing, which should not be the case.
@St3f1n,
There are at least three scenarios that are at play with the tail plugin:
Telegraf tail + csv should treat the header as appearing only at the beginning of the file, not as the first new log entry.
I wanted to call these different scenarios out because it is not as simple as not skipping the header after the initial pass of the file. Additionally, the CSV parser does not know anything more than the new lines that were read in. It has no knowledge of what line it is on, and neither does Telegraf. The tail library provides new lines and the time they were written.
One can use csv_header_row_count = 0 instead and filter out the header data later, e.g. in a Starlark processor. However, this might limit other features, become rather complex, and be error prone. Therefore a fix of the bug would be highly appreciated.
I am open to suggestions, but using one of the workarounds, or not having a header row in the first place, seems to be a much more expedient option.
@powersj Thank you for your thoughts and the scenarios, I appreciate that very much.
However, how does the CSV parser know that the data it got from the tail plugin is appended data and not data from an overwrite? In your case, because the header was set to 1, we skip the first row. Hence this issue.
=> From my point of view the overwrite scenario may exist, but for it the tail plugin is probably not the optimal choice; the file plugin fits better. I would argue that overwriting is not the main use case for the tail plugin, perhaps even the wrong one, and that tail should clearly focus on the append scenario, in particular collecting data from log files. Given the effect described here, I expect that many such CSV log-file collection setups are missing the first entries after Telegraf or the host restarts, without being aware of it...
I wanted to call these different scenarios out because it is not as simple as not skipping the header after the initial pass of the file. Additionally, the CSV parser does not know anything more than the new lines that were read in. It has no knowledge of what line it is on, and neither does Telegraf. The tail library provides new lines and the time they were written.
Without looking into the code, I think the tail plugin knows the line number it last read from, and probably this number can somehow be influenced.
Are there any other opinions or suggestions?
Without looking into the code
This is a messy situation :) The CSV parser has some knowledge of state, but that state and the state from tail are not shared with each other. One possibility is to add an option to the tail plugin that would optionally override any header settings of the CSV parser, causing the CSV parser to parse the header only the first time. Another option is to somehow keep track of where you are in the file and, if it is not the first line, skip the header settings in the CSV parser.
We strongly dislike these kinds of special cases where parsers work differently depending on the plugin, but in this case I am not seeing another option. We would take a PR that allows something like the above, but this is not something we would actively work on anytime soon.
A messy situation, indeed :) Maybe there is a third possibility that would affect only the tail plugin and leave the parsers untouched: define how many of the first lines of the file tail should read and forward to the parser whenever tail starts fresh. By default this would be 0 so that today's behavior does not change, but as an option one could raise it to the total number of metadata and header rows. Since tail would only count lines, there is no direct interaction with any parser, and parsing remains the sole responsibility of the parsers themselves: tail delivers the lines, the parsers interpret their contents. At least for the CSV parser, the metadata and header rows are a well-defined number of lines. Any considerations or further ideas?
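To make the proposal concrete, such an option could look roughly like this; the option name first_lines_on_start is purely hypothetical and does not exist in Telegraf today, and the file name is made up:

[[inputs.tail]]
  files = ["./productstate.csv"]
  ## hypothetical option: whenever tail starts fresh, read and forward the
  ## first N lines of the file to the parser so that metadata/header rows
  ## are seen again; 0 would keep today's behavior
  # first_lines_on_start = 1
  data_format = "csv"
  csv_header_row_count = 1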
@St3f1n that might be doable. You might even be able to define
CSVHeaderRowCount *uint64 `toml:"csv_header_row_count"`
...
in the tail plugin and use those instead of defining another (redundant) setting...
Relevant telegraf.conf
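The original configuration was not captured in the report. A minimal sketch that matches the reproduction data and the output below could look like this (the file path, the measurement name override, and the _in tag are assumptions inferred from the output lines):

[[inputs.tail]]
  files = ["./productstate.csv"]
  from_beginning = false
  data_format = "csv"
  csv_header_row_count = 1
  csv_delimiter = ";"
  csv_timestamp_column = "Timestamp"
  csv_timestamp_format = "2006-01-02T15:04:05.000Z07:00"
  name_override = "Product_State"
  [inputs.tail.tags]
    _in = "ProductStateTest"

[[outputs.file]]
  files = ["./tail_test.out"]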
Logs from Telegraf
System info
Windows 10 64-bit; the issue is verified on Telegraf versions 1.20.3 and 1.26.3.
Docker
No response
Steps to reproduce
Timestamp;ProductionRunGUID;
2023-05-15T04:58:24.001Z;A;
2023-05-15T04:58:25.002Z;B;
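The rows for "C" and "D" are not part of the file content above; based on the discussion and the timestamps in the output below, they are presumably appended only after Telegraf has been restarted:

2023-05-15T04:58:25.003Z;C;
2023-05-15T04:58:25.004Z;D;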
Expected behavior
Telegraf tail + csv should treat the header as appearing only at the beginning of the file, not as the first new log entry. Both log entries "C" and "D" should get shown.
content of ./tail_test.out:
Product_State,_in=ProductStateTest ProductionRunGUID="A" 1684126704001000000
Product_State,_in=ProductStateTest ProductionRunGUID="B" 1684126705002000000
Product_State,_in=ProductStateTest ProductionRunGUID="C" 1684126705003000000
Product_State,_in=ProductStateTest ProductionRunGUID="D" 1684126705004000000
Actual behavior
Only the second log entry "D" gets shown; the first log entry "C" is missing:
content of ./tail_test.out:
Product_State,_in=ProductStateTest ProductionRunGUID="A" 1684126704001000000
Product_State,_in=ProductStateTest ProductionRunGUID="B" 1684126705002000000
Product_State,_in=ProductStateTest ProductionRunGUID="D" 1684126705004000000
Additional info
Workaround 1: Use from_beginning = true. However, this is not a nice solution, as the complete files get reread whenever Telegraf restarts. Unfortunately not an option for my use case, but maybe for others.
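In configuration terms, workaround 1 is simply the following (a sketch; the file path is made up):

[[inputs.tail]]
  files = ["./productstate.csv"]
  ## rereads the complete file whenever Telegraf restarts
  from_beginning = true
  data_format = "csv"
  csv_header_row_count = 1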
Workaround 2: One can use csv_header_row_count = 0 instead and filter out the header data later, e.g. in a Starlark processor. However, this might limit other features, become rather complex, and be error prone. Therefore a fix of the bug would be highly appreciated.
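A sketch of workaround 2, assuming the column names are defined by hand (with an empty trailing name for the trailing ';') and the timestamp column is left as a plain field so that the header line still parses into a metric the Starlark processor can drop; the file path is made up:

[[inputs.tail]]
  files = ["./productstate.csv"]
  data_format = "csv"
  csv_delimiter = ";"
  csv_header_row_count = 0
  csv_column_names = ["Timestamp", "ProductionRunGUID", ""]

[[processors.starlark]]
  source = '''
def apply(metric):
    # drop any metric produced from a header line, i.e. one whose
    # ProductionRunGUID field just repeats the column name
    if metric.fields.get("ProductionRunGUID") == "ProductionRunGUID":
        return None
    return metric
'''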