Improve documentation of syslog plugin

candlerb commented 6 years ago

I was very happy to see the inclusion of a syslog plugin. However the documentation is missing some critical information, in particular:

Does the event timestamp (influxdb timestamp) come from the time the message was received, or the timestamp contained within the message?
What happens for messages which contain the same timestamp?
Does the hostname field contain the source IP address of the syslog message, or the hostname contained within the message?

The timestamp issue is important because (a) many devices generate syslog messages containing only 1-second resolution timestamps, and (b) influxdb can only store one data point in a timeseries for a particular timestamp.

So I set out to find out by experimentation what happens. I used the sample config which receives TCP on port 6514, and the sample rsyslogd configuration which forwards to it. Then I did:

# logger foo; logger bar

tcpdump on port 6514 showed the following messages generated by rsyslogd:

60 <13>1 2018-09-27T07:41:24.310199+00:00 tick root - - -  foo
60 <13>1 2018-09-27T07:41:24.314932+00:00 tick root - - -  bar

And the results in the database:

> use telegraf
Using database telegraf
> select * from syslog where message =~ /^ (foo|bar)/
name: syslog
time                           appname facility facility_code host hostname message procid severity severity_code timestamp           version
----                           ------- -------- ------------- ---- -------- ------- ------ -------- ------------- ---------           -------
2018-09-27T07:41:24.311399024Z root    user     1             tick tick      foo           notice   5             1538034084310199000 1
2018-09-27T07:41:24.31600689Z  root    user     1             tick tick      bar           notice   5             1538034084314932000 1
>

From this I infer:

The event time (influxdb timestamp) is the time the message was received or processed, with nanosecond resolution
The timestamp field is the time contained within the message

Therefore, unless two messages are processed in the same nanosecond (unlikely), there should be no loss of messages. However any buffered or backlogged messages will be stored and ordered by time of reception.

This all seems fine to me, but it would have been very helpful if this had been documented.

The tags and fields are more-or-less in agreement with the documentation:

> show tag keys from syslog
name: syslog
tagKey
------
appname
facility
host
hostname
severity
> show field keys from syslog
name: syslog
fieldKey      fieldType
--------      ---------
facility_code integer
message       string
procid        string
severity_code integer
timestamp     integer
version       integer

However, the documentation says there is a tag "hostname", but I also see "host". So to find the difference between these, I did another test:

# echo "60 <13>1 2018-09-27T07:41:24.310199+00:00 xxxx root - - -  qux" | nc 127.0.0.1 6514

Result:

> select * from syslog where message =~ /^ qux/
name: syslog
time                           appname facility facility_code host hostname message procid severity severity_code timestamp           version
----                           ------- -------- ------------- ---- -------- ------- ------ -------- ------------- ---------           -------
2018-09-27T10:18:19.592731705Z root    user     1             tick xxxx      qux           notice   5             1538034084310199000 1

So it appears that the hostname is the hostname contained within the message. But what about host? Is it the source address of the syslog message?

The answer is no. I tried the same nc test from a different host (giving the IP of telegraf instead of 127.0.0.1), and got the same result.

2018-09-27T10:19:56.671438282Z root    user     1             tick xxxx      qux           notice   5             1538034084310199000 1

So it seems the host is just the host where telegraf itself is running. Again, would be very helpful if this was documented.

candlerb commented 6 years ago

STRUCTURED-DATA is documented, but it could be clarified with an example. I set up another test, using an example from RFC5424:

# echo '173 <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] An application event log entry...' | nc 127.0.0.1 6514

Result:

> select * from syslog where "appname" = 'evntslog' limit 1
name: syslog
time                           appname  exampleSDID@32473_eventID exampleSDID@32473_eventSource exampleSDID@32473_iut facility facility_code host hostname              message                           msgid procid severity severity_code timestamp           version
----                           -------  ------------------------- ----------------------------- --------------------- -------- ------------- ---- --------              -------                           ----- ------ -------- ------------- ---------           -------
2018-09-27T10:37:13.151962895Z evntslog 1011                      Application                   3                     local4   20            tick mymachine.example.com An application event log entry... ID47         notice   5             1065910455003000000 1

telegraf has created new field keys:

> show field keys from syslog
name: syslog
fieldKey                      fieldType
--------                      ---------
exampleSDID@32473_eventID     string
exampleSDID@32473_eventSource string
exampleSDID@32473_iut         string
facility_code                 integer
message                       string
msgid                         string
procid                        string
severity_code                 integer
timestamp                     integer
version                       integer

candlerb commented 6 years ago

There is one other thing worth mentioning in the documentation. If you follow the example to get rsyslog to forward logs to telegraf on localhost:

# forward over tcp with octet framing according to RFC 5425
*.* @@(o)127.0.0.1:6514;RSYSLOG_SyslogProtocol23Format

... you should beware that in a default Ubuntu 16.04 installation, influxdb writes query logs to stderr, which are picked up by systemd and written to syslog. You can see them in /var/log/syslog as well as journalctl -xe _SYSTEMD_UNIT=influxdb.service

Sep 27 11:32:30 tick influxd[386]: [httpd] 127.0.0.1 - - [27/Sep/2018:11:32:30 +0000] "POST /write?db=telegraf HTTP/1.1" 204 0 "-" "telegraf" 0494c172-c249-11e8-b2aa-000000000000 13481

Those logs go to rsyslogd, which forwards them to telegraf, which writes them to influxdb... you get the picture.

Oddly, this doesn't cause the explosion of logs I was expecting, but I still haven't worked out why - maybe there is buffering/batching. Still: it might be wise to point out the possible risk.

danielnelson commented 6 years ago

Thanks @candlerb, these are some good observations that will really help getting started with this plugin.

maybe there is buffering/batching.

Telegraf writes in batches of metric_batch_size which is 1000 by default.

As mentioned on #4762, the host tag is a tag we set on all metrics by default, and represents the host running Telegraf. I'll comment further on the other issue on what we should do with hostname.

We will improve the structured data and timestamp documentation as suggested.

danielnelson commented 6 years ago

I added documentation about timestamp vs timestamp field and structured data to the syslog readme.

influxdata / telegraf

Improve documentation of syslog plugin #4759