Broken core product promise: missing documentation on how to set up more than one input stream

wolfgangr commented 3 weeks ago

Use Case

rely on telegraf documentation
Aggregation of multiple sensor data streams into different buckets
slowly build up system and experience
experiment with new streams in running environment

e.g. my home automation scenario on a farm:

weather (a local and a remote station)
energy
grain store (multiple temp/hum sensors)
maybe some machines in the future

Expected behavior

After reading the documentation, I can install and configure telegraf to deliver its promised functions

Actual behavior

documentation is incomprehensible
with an autogenerated template (by Influx), manual edits and a lot of educated guessing, I got my first productive stream running (reading grid power from mqtt traffic and merging it via Influx and Grafana with evcc data)
when I tried test data from Influx generated template, I found my productive stream clobbered
when I tried to test another stream (pull weather data from a mysql database) telegraf screwed up (details not important here)

Additional info

Thank you, friends, for delivering this maybe great product. :bow: But - sorry to say - imho largely inappropriate documentation (really sorry) is prone to spoil all your diligent work. :sweat:
Please allow me some humble feedback to assist you in improving it.

Quick fix ahead: :pray: Put a "further reading"-Link to this blog post onto every documentation page that contains anything related to "config*" https://www.influxdata.com/blog/telegraf-best-practices/

Make clear in the documentation

that by default, all input streams are lumped together and sent to all output streams
that some kind of "routing" exists and how to avoid mixing up streams by implementing it
that for mixing experimental and productive streams, multiple telegraf instances are considered good practice (not an evil hack, as with most other server software)
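
A minimal sketch of what such routing could look like, following the tag approach from the best-practices blog post linked above. The plugin choices, broker address, topic, the `stream` tag, the bucket names and the `INFLUX_TOKEN` variable are assumptions for illustration only; `tagpass` is the Telegraf metric filter that lets a plugin accept only metrics carrying a matching tag.

```toml
# Hypothetical example: two independent streams in one Telegraf instance.
# Without the tagpass filters below, every input would feed every output.

[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]   # assumed broker address
  topics = ["energy/grid/power"]       # assumed topic
  data_format = "value"
  data_type = "float"
  [inputs.mqtt_consumer.tags]
    stream = "energy"                  # routing tag, name chosen freely

[[inputs.cpu]]
  [inputs.cpu.tags]
    stream = "test"                    # experimental stream

# Productive output: accepts only metrics tagged stream = "energy".
[[outputs.influxdb_v2]]
  urls = ["${INFLUX_URL}"]
  token = "${INFLUX_TOKEN}"            # assumed environment variable
  organization = "${INFLUX_ORG}"
  bucket = "energy"                    # assumed bucket name
  # Filter tables belong at the end of the plugin definition.
  [outputs.influxdb_v2.tagpass]
    stream = ["energy"]

# Experimental output: accepts only metrics tagged stream = "test".
[[outputs.influxdb_v2]]
  urls = ["${INFLUX_URL}"]
  token = "${INFLUX_TOKEN}"
  organization = "${INFLUX_ORG}"
  bucket = "playground"                # assumed bucket name
  [outputs.influxdb_v2.tagpass]
    stream = ["test"]
```

With a layout like this, adding a new experimental input should not touch the productive bucket as long as it carries its own `stream` tag.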

"somewhere on the internet" I remember a chart with input / processing / aggregation / output streams, but even searching twice, I can't find it in the documentation any more. Would be a good start, though.

I'd consider moving /lib/systemd/system/telegraf.service to /etc/systemd/system, since that is where I'd expect service files intended to be fiddled with. Anything in /lib/.. imho is commonly considered "only to be fiddled with by brave developers on a testing environment"

In particular, I found this section completely misleading https://docs.influxdata.com/telegraf/v1/configuration/#telegraf-doesnt-support-partial-configurations

what is a "valid" configuration? From context, I inferred that it means an independent stream (globals, input, output). Only hard experience and the mentioned "best practice" blog taught me otherwise.
"effective configuration is the union of all the files" - I had to learn the hard way that this means lumping all input streams together
obviously the [global_tags] / [agent] section is nevertheless allowed only once

Putting [global_tags] into every config file (as one might infer from "valid configuration") yields the misleading W! Overlapping settings in multiple agent tables are not supported: may cause undefined behavior
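
A sketch of a file layout that should avoid this warning, assuming the default Debian package paths /etc/telegraf/telegraf.conf and /etc/telegraf/telegraf.d/. The `site` tag, intervals, broker address and topic are made up for illustration. The idea is that the global sections live in exactly one file and the drop-in files contain plugins only.

```toml
# --- /etc/telegraf/telegraf.conf: the only file with global sections ---
[global_tags]
  site = "farm"                        # hypothetical tag

[agent]
  interval = "10s"                     # assumed values
  flush_interval = "10s"

# --- /etc/telegraf/telegraf.d/weather.conf: plugins only ---
# No [agent] and no [global_tags] here; they are inherited from above.
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]   # assumed broker address
  topics = ["weather/#"]               # assumed topic
  data_format = "value"
  data_type = "float"

# --- /etc/telegraf/telegraf.d/test.conf: plugins only ---
[[inputs.mem]]
```

The file boundaries are shown as comments so the whole block still parses as one TOML document.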

This comes close to "Following the official documentation causes undefined behaviour". Not really what I expect from a celebrated product... :face_with_spiral_eyes:

Searching the internet for e.g. this message or for "telegraf multiple config files" yields loads of other victims crying about the same problem. Most of them were only partially answered, or not at all. Some got a gracious "send me your configs and I, the Master, will save you". Well, fine, for one.

Only after hours of searching, dozens of config changes and restarts, buckets & api-tokens deleted and recreated, and close to trashing your great piece of work and reverting to good old Perl daemons, I luckily encountered said "best practice" article.

So why not teach telegraf users how to fish instead of handing them a piece of fish?

srebhan commented 3 weeks ago

@wolfgangr thanks for your honest feedback! We are well aware that we are weak on the documentation side. Any help there is appreciated as developers are often not able to take a (novice) user's view on things.

This being said, I'm currently discussing with our docs team on how we can auto-generate the website documentation from what we have on GitHub to avoid misalignment and make the repository the single source of truth. This will probably take a while and the documentation structure might change a few times (split markdown files, add missing information in plugins, separate out developers docs etc.). However we will try to preserve the content.

So if you want to contribute and don't mind we are moving stuff around, please feel more than welcome to improve what we have!

wolfgangr commented 2 weeks ago

This sounds like an invitation, eh? :thinking:

developers are often not able to take a (novice) user's view on things.

I can sympathize. Ages ago, when I was engaged in IT training, I received the best marks when, as a trainer, I was just one lesson ahead of my clients :see_no_evil:

So I've learned that I'm not the best trainer, either. While I can force myself to take a novice's view, I have no intrinsic disposition to do so, as I've seen in "born" trainers. I'd rather jump up the learning curve and quickly leave others behind.

So for the moment, I'll simply trace my learning here as an example of where I found the core hurdles - and solutions. Maybe it can serve as a starting point / collection bin for some kind of "description of the big picture", which I feel is missing.

wolfgangr commented 2 weeks ago

The next thing I encountered with "eureka" feelings, after said "best practice" blog post, was the genuine source documentation:

https://github.com/influxdata/telegraf/tree/release-1.32/docs
in particular
https://github.com/influxdata/telegraf/blob/release-1.32/docs/CONFIGURATION.md

There is also an ASCII template for the "big picture" I've mentioned
https://github.com/influxdata/telegraf/blob/release-1.32/docs/AGGREGATORS_AND_PROCESSORS.md and even some idea of routing glimpses through the text

Up to that finding, I (mistakenly?) considered https://docs.influxdata.com/telegraf/v1/
as the authoritative source.

So maybe just pointing to the genuine / authoritative GitHub docs in the footer of the autogenerated (?) ("bee and flower"-type?) manuals might ease the jump up the learning ladder?

wolfgangr commented 2 weeks ago

The most recent "eureka" out of the "config syntax quagmire", which I found this morning:

https://github.com/influxdata/telegraf/blob/release-1.32/docs/TOML.md
https://toml.io/en/v1.0.0

And to get a live feeling while playing with configs, I installed a command line TOML parser ..... apt-get install yq

With that I can e.g. print a logical (JSON ?) representation of my config:

{
  "outputs": {
    "influxdb_v2": [
      {
        "urls": [
          "${INFLUX_URL}"
        ],
        "token": "foo-bar-very-secret-asdf==",
        "organization": "${INFLUX_ORG}",
        "log_level": "debug",
        "bucket": "lost_and_found",
        "bucket_tag": "target_bucket",
        "exclude_bucket_tag": false
      }
    ]
  }
}

I can also strip it down to some kind of minimalistic (maybe even canonical?) TOML.
So I can keep all the comments in my configs and easily zoom out to the forest without dropping all the trees.

$ tomlq  -t . generic_influx2_out.conf
[outputs]
[[outputs.influxdb_v2]]
urls = [ "${INFLUX_URL}",]
token = "foo-bar-very-secret-asdf=="
organization = "${INFLUX_ORG}"
log_level = "debug"
bucket = "lost_and_found"
bucket_tag = "target_bucket"
exclude_bucket_tag = false
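
To illustrate what `bucket_tag` does in the output above: the influxdb_v2 output takes the bucket name from the value of that tag and falls back to `bucket` when the tag is missing. A sketch of matching inputs, assuming a hypothetical broker address, topic and bucket name `grain_store`:

```toml
# Metrics from this input carry target_bucket = "grain_store",
# so the output above writes them to the bucket "grain_store".
[[inputs.mqtt_consumer]]
  servers = ["tcp://localhost:1883"]   # assumed broker address
  topics = ["grainstore/#"]            # assumed topic
  data_format = "value"
  data_type = "float"
  [inputs.mqtt_consumer.tags]
    target_bucket = "grain_store"      # assumed bucket name

# No target_bucket tag here, so these metrics fall back to the
# default bucket "lost_and_found" configured in the output.
[[inputs.mem]]
```

Since `exclude_bucket_tag = false`, the `target_bucket` tag stays on the stored metrics; setting it to `true` would strip the tag before writing.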

wolfgangr commented 2 weeks ago

Next thing I'd look for is a comprehensive guide to routing.
Is there any such thing around already?
If not, what would be the proper place and format to create one?
Could I expect assistance for answering questions and proof reading?
@srebhan , does this sound like a "may be yes" ?

The (obvious?) picture in mind when opting for the telegraf/influx/grafana chain was

collecting data from different mqtt, logread, sql databases, rrd databases, api(like evcc), modbus sensors,
stream buffering (as e.g. in MySQL replication) for redundancy and to protect against partial component downtimes
conditional filtering and processing
storing to different contexts
downsampling /aggregating
retrieving / backup / charting

At the top (where data comes from) and from the middle (storage = Influx) to the bottom (grafana charting, influx UI and Flux for casual queries) the picture takes shape.

The opaque section is how the variety of sources captured by telegraf find their way through filters and aggregates to Influx buckets.

Can the namedrop and tag routing mentioned here
https://www.influxdata.com/blog/telegraf-best-practices/
be extended throughout the whole chain?
Particularly to filters and aggregates, too?
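
As far as the CONFIGURATION.md linked earlier in this thread describes it, metric filters can be set per input, output, processor, or aggregator. A minimal sketch under that reading; the measurement name pattern `grain_*`, the field names and the period are assumptions for illustration:

```toml
# Processor that only touches measurements matching grain_*.
[[processors.rename]]
  namepass = ["grain_*"]
  [[processors.rename.replace]]
    field = "value"                    # assumed original field name
    dest = "temperature"               # assumed new field name

# Aggregator that only sees the same measurements.
[[aggregators.basicstats]]
  period = "5m"
  drop_original = false                # keep the raw points as well
  stats = ["mean", "min", "max"]
  namepass = ["grain_*"]
```

`namedrop` would work the other way round, i.e. the plugin handles everything except the listed measurement names.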

wolfgangr commented 2 weeks ago

Is this

cat *.conf | tomlq .
cat *.conf | tomlq . -t

what the clause

"effective configuration is the union of all the files"

boils down to?

So to understand how config pieces are merged, we search the web for " merge TOML", OK?

As far as I could figure out now

we can add deeper nested levels to some config section just by throwing them into some kind of telegraf.d
we can NOT do so with simple key-value pairs
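
This matches what plain TOML rules would predict for a concatenated file. The sketch below illustrates the TOML v1.0.0 rules only; whether Telegraf's loader behaves exactly like `cat *.conf` is an assumption not confirmed in this thread.

```toml
# Arrays of tables ([[...]]) may repeat. Every occurrence appends one
# more plugin instance, so snippets from several files simply add up.
[[inputs.cpu]]
  percpu = false

[[inputs.cpu]]
  percpu = true

# A plain table ([...]) may be defined only once per TOML document.
[agent]
  interval = "10s"

# Repeating it, as concatenating two files that both contain [agent]
# would do, is invalid TOML, so it is left commented out here:
# [agent]
#   flush_interval = "30s"
```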

srebhan commented 2 weeks ago

@wolfgangr sorry for the late reply, but even Telegraf maintainers need some rest on weekends. ;-)

My initial post was meant as an invite. ;-) So yes, I would love to see a "Getting started" type of guide in our docs including the schematic interplay of plugins, description of differences between "listener"/"consumer" type of event-driven plugins and "normal" plugins that query on a fixed interval, routing, filtering etc with some pointers to additional documentation.

As several of your posts have piled up, could you collect your questions and post them here or in the #telegraf channel on our Slack?
