influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Tail input plugin does not read line-by-line and sends partial line data #11372

Open · dr3amville opened this issue 2 years ago

dr3amville commented 2 years ago

Relevant telegraf.conf

# Configuration for telegraf agent
[agent]
    interval = "60s"
    debug = false
    hostname = "<hostname>"
    round_interval = true
    flush_interval = "60s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 3000
    metric_buffer_limit = 15000
    quiet = false
    logtarget = "file"
    logfile = "/opt/logs/telegraf.log"
    logfile_rotation_interval = "24h"
    logfile_rotation_max_size = "10MB"
    logfile_rotation_max_archives = 5
    omit_hostname = false

[[outputs.influxdb]]
    urls = ["https://<influxdb>:8085"]
    database = "monitoring"
    precision = "60s"
    insecure_skip_verify = true

[[inputs.tail]]
    files = ["/path/to/file/telegraf-metrics.log*"]
    data_format = "influx"
    path_tag = ""
[inputs.tail.tags]
    metric_type = "manager"

Logs from Telegraf

2022-06-22T19:26:00Z E! [inputs.tail] Malformed log line in "/path/to/file/telegraf-metrics.log_2022-06-22-15-22": ["e.partitioncount=2048,offheap.backupcount=27,offheap.entrycount=36,offheap.primarycount=9,offheap.size=0,onheap.entrycount=0,put.rate=1111,rebalance.keyleft=0,rebalance.keyrate=0,rebalance.partitioncount=0,tx.commit.size=0,tx.commit.time=141250,tx.rollback.size=0 1655925960896000000"]: metric parse error: expected field at 1:284: "e.partitioncount=2048,offheap.backupcount=27,offheap.entrycount=36,offheap.primarycount=9,offheap.size=0,onheap.entrycount=0,put.rate=1111,rebalance.keyleft=0,rebalance.keyrate=0,rebalance.partitioncount=0,tx.commit.size=0,tx.commit.time=141250,tx.rollback.size=0 1655925960896000000"

System info

Telegraf 1.18.3 (git: HEAD 6a94f65a); RHEL 7.9 (Maipo), kernel 3.10.0-1160.31.1.el7.x86_64, and RHEL 6.10 (Santiago), kernel 2.6.32-754.35.1.el6.x86_64

Docker

No response

Steps to reproduce

  1. Application writes out InfluxDB line protocol formatted metrics to file.
  2. Tail plugin reads and publishes to configured InfluxDB output.
  3. Malformed-line messages appear in the Telegraf logs and bad measurements show up in InfluxDB.

Expected behavior

Full lines should be read and sent instead of lines being partially read and published.

Actual behavior

Lines are being partially read and sent, leading to a large number of bad measurements in InfluxDB.

Additional info

No response

powersj commented 2 years ago

Hi,

Some questions for you:

Thanks!

dr3amville commented 2 years ago

Sure. I'd rather not provide the file itself because there's a good amount of sensitive data in it, and the logs I mentioned have already rotated out, but here's another example:

Telegraf error log:

2022-06-23T14:05:00Z E! [inputs.tail] Malformed log line in "/path/to/file/telegraf-metrics.log_2022-06-23-09-59": ["sPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000"]: metric parse error: expected field at 1:115: "sPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000"

Lines from the metric log file (the error corresponds to the 4th line):

timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=samples execution.succeeded=0 1655993100897000000
timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=total execution.succeeded=0 1655993100897000000
timer,plugins=UpdateCommunityStringPlugIn,profile=uptime_profile,agg=average execution.succeeded=0 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=samples allocationRate=1,dirtyPages=1,evictionRate=1,largeEntriesPagesPercentage=1,pagesFillFactor=1,pagesReplaceRate=1,physicalMemoryPages=1,totalAllocatedPages=1 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=total allocationRate=0,dirtyPages=0,evictionRate=0,largeEntriesPagesPercentage=0,pagesFillFactor=0,pagesReplaceRate=0,physicalMemoryPages=200,totalAllocatedPages=200 1655993100897000000
memory,igniteRegion=sysMemPlc,agg=average allocationRate=0,dirtyPages=0,evictionRate=0,largeEntriesPagesPercentage=0,pagesFillFactor=0,pagesReplaceRate=0,physicalMemoryPages=200,totalAllocatedPages=200 1655993100897000000

> Have you verified that the log file does not have any extra new lines in this line?

Yes, I checked over weeks of data to make sure, because I was paranoid it was a writer issue, but the log files are correct: no hidden characters or stray line breaks.

powersj commented 2 years ago

Thanks for the example data! However, copying that data and running it through the tail and/or file input plugins did not reproduce the issue.

You do not have from_beginning set to true in your config, so I assume you are appending to this file and not re-writing it?
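
For reference, that option sits directly under the tail input and defaults to false (only newly appended lines are read). A minimal sketch, with illustrative paths:

[[inputs.tail]]
    files = ["/path/to/file/telegraf-metrics.log*"]
    data_format = "influx"
    ## Defaults to false: start at the end of the file and only read newly appended lines
    # from_beginning = false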

dr3amville commented 2 years ago

Yeah, unfortunately it's not clear what causes it to do this; it is totally random, but I feel like it should not read and flush metrics unless it is collecting the log line by line. Correct, I do not have from_beginning set to true, so it just appends, picks up the new metrics, and pushes them. I also tried to get the file to explicitly show hidden characters, but nothing showed up besides line endings where they belong.

It was suggested by @srebhan in the InfluxDB Community Slack that maybe a line-by-line vs at-once explicit reading mode definition might be worth implementing to try and resolve this, similar to what was done here: https://github.com/influxdata/telegraf/pull/11234

powersj commented 2 years ago

@srebhan @reimda thoughts on where to go from here?

dr3amville commented 2 years ago

Just a few notes as well: it seems to happen far more often on RHEL 6 than on RHEL 7, and I'm not exactly sure why. Also, for reference, the largest batch of writes to the file that I could find was approximately 621 lines, 1863 words, 88710 bytes.

I'm guessing the only way to emulate this is to set up a simulated Influx line protocol metric generator that writes to a file (maybe even just have Telegraf itself do it with a file output?) and then configure the tail plugin to read that same file and publish to the InfluxDB output. That way you have line-protocol-formatted metric data being written to the file while the same file is being read and ingested into InfluxDB.
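
A rough sketch of that kind of self-contained reproduction, assuming one Telegraf instance acts as the writer and a second one as the reader; the generator input, paths, and output URL are illustrative:

# Writer instance: generate some metrics and append them to a file in line protocol
[[inputs.cpu]]
    percpu = false
    totalcpu = true

[[outputs.file]]
    files = ["/tmp/telegraf-metrics.log"]
    data_format = "influx"

# Reader instance: tail the same file and publish to InfluxDB
[[inputs.tail]]
    files = ["/tmp/telegraf-metrics.log"]
    data_format = "influx"

[[outputs.influxdb]]
    urls = ["http://localhost:8086"]
    database = "monitoring"

Running the writer and the reader as two separate Telegraf processes keeps the file appends coming from outside the tailing process, which matches the original setup.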

powersj commented 2 years ago

Did this suddenly start happening recently or is this a new problem after some sort of upgrade?

We have plenty of users who are using the tail plugin with influx protocol. While we could start generating a bunch of metrics in a VM, this is sounding more and more like a system-specific issue given that you are seeing differences between OSes. Are these systems under heavy load, or do they have lots of disk writes going on at the same time?

edit: more questions:

dr3amville commented 2 years ago

It has been this way since we started adopting this method of publishing the metrics. We previously used a different stack and leveraged Filebeat to reformat the metrics a bit (they were in Graphite format) before publishing, and we never saw any issues there.

It is a pretty basic VM setup, nothing crazy; disk IO (SSD) is extremely light (<2% almost all the time) and CPU is 90-95% idle on all of the nodes that see this issue.

CPU (usage): [screenshot attached to the original issue]

Disk IO (SSDs and plenty of disk space): [screenshot attached to the original issue]

The ~621 metrics are written once per minute, almost instantaneously. I have tried both the inotify and poll watch methods for the tail plugin and see the same issues with either.

We run our Java applications on these nodes, but since it's a really light test system it doesn't see much usage; it's really just a test environment for this observability work.

powersj commented 2 years ago

OK, thanks for those metrics; it doesn't sound like a load issue then. I guess I could try to reproduce in a CentOS VM next.

srebhan commented 2 years ago

I think the OS does not flush line-wise, but just flushes the buffer, and that buffer might contain a partial line / line fragment. The only way I see to prevent this is to enforce waiting for a complete line (i.e. wait for the newline to appear) and hold back all partial writes... To do this I suggest implementing the approach from #11234 here... Maybe we should have a common place for this kind of "scanner"...

dr3amville commented 2 years ago

@powersj - any chance you were able to reproduce this? Do we feel like implementing #11234 might be worth a shot? I could potentially test it out.

dr3amville commented 2 years ago

Do we know how the scanner completes its reading of a line? Does it look for a specific line-ending character when reading? It could be good to enforce terminating the read on a newline instead of at EOF, if that is what is affecting the current read offset in the file.

Just another data point, though: I added namepass under the tail input to try to filter out the bad partial-line writes, and so far so good, though it may take some more soak time to really prove it out.
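
Roughly what that filter looks like in the config; the measurement names are taken from the sample data above and are illustrative:

[[inputs.tail]]
    files = ["/path/to/file/telegraf-metrics.log*"]
    data_format = "influx"
    ## Only pass metrics whose measurement name matches; partial-line fragments
    ## that happen to parse under a garbled name are dropped instead of written.
    namepass = ["timer", "memory"]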

srebhan commented 2 years ago

@dr3amville, the scanner takes a split function that decides when to return a chunk. #11234 uses ScanLines, which returns lines (i.e. waits for the newline character), and the whole file (i.e. no split function), but we are free to implement more behaviors with a custom SplitFunc...

Does that help?

dr3amville commented 2 years ago

@srebhan I think just the #11234 implementation is plenty if that could be ported over.

srebhan commented 2 years ago

Just give it a shot! ;-P

dr3amville commented 2 years ago

I will have to pick up some Go 😔

powersj commented 2 years ago

@srebhan maybe I'm not following, but tail seems to be a different animal. For example, we aren't using a scanner, but instead waiting on the external tail library to send us data via our receiver function.

In the upstream tail library, which is our fork of hpcloud/tail, data is already read newline-delimited.

Did I follow the right path or missed something else? Thanks!

srebhan commented 2 years ago

@powersj OK, I didn't look into tail when I should have... :-( If the underlying library already has a delimiter, we should use it.

jeme-0-jyskebank-dk commented 1 year ago

I had the same issue. This fixed it for me:

[[inputs.tail]]
    watch_method = "poll"