influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Error starting telegraf for azure eventhub_consumer with persistence file #13322

Open AntonSigur opened 1 year ago

AntonSigur commented 1 year ago

I have successfully created a telegraf stream for device data from azure iot hub, using the eventhub_consumer.

However, to avoid re-reading and consuming all of the (millions of) messages in the stream's 7-day buffer, I opted to use file persistence rather than in-memory persistence only, as per the documentation.

After careful configuration, multiple tests, and a code review, I believe there is a bug.

I get the following error...

E! [telegraf] Error running agent: starting input inputs.eventhub_consumer: creating receiver for partition "1": open [FILEPATH]/[IOTHUB-NAME]-24996945-d1f3232547_hanp1iottest_influxdb_0: no such file or directory

...in the log (tested with multiple file locations and permissions).

The problem is that the persister does not create the initial persistence files and then cannot open them. I am not sure where in the code the bug is, or whether it is the result of some sort of race condition.

Found a "silly" Workaround: Created the mentioned files in the log, withing the directory, with the content {} (empty JSON) and it seems to work as expected now, the files are persisting the state between restarts. This could easily break when adding new partitions in the eventhub, so you need to add new files as you add partitions.

Using the latest Telegraf agent on Ubuntu 22: Telegraf 1.26.3 (git: HEAD@90f4eb29) @ Ubuntu 22.04.2 LTS

AntonSigur commented 1 year ago

This is probably caused by this breaking change in the upstream library: https://github.com/Azure/azure-event-hubs-go/commit/2a12765e337b95d1edeb5c390f67431accf8d938

Reading the persistence before any has been written now results in an error, where it previously returned nil. ...

srebhan commented 1 year ago

Upstream bug report https://github.com/Azure/azure-event-hubs-go/issues/280.

NuMove-JonathanSchmidt commented 12 months ago

Given the length of time this report has been open on azure-event-hubs-go, would it be acceptable to filter this specific error and handle it in the plugin?
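For illustration, plugin-side handling might look roughly like the sketch below, assuming the azure-event-hubs-go persist package (readCheckpoint is a hypothetical helper, not existing plugin code):

package example

import (
    "errors"
    "os"

    "github.com/Azure/azure-event-hubs-go/v3/persist"
)

// readCheckpoint treats a missing persistence file as "no checkpoint yet"
// instead of failing receiver creation, restoring the pre-change behavior.
func readCheckpoint(p persist.CheckpointPersister, namespace, name, group, partitionID string) (persist.Checkpoint, error) {
    cp, err := p.Read(namespace, name, group, partitionID)
    if err != nil {
        if errors.Is(err, os.ErrNotExist) {
            // no file on disk yet: fall back to the start of the stream
            return persist.NewCheckpointFromStartOfStream(), nil
        }
        return persist.Checkpoint{}, err
    }
    return cp, nil
}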

powersj commented 12 months ago

There are at least two open issues around Azure Event Hub that need to be sorted:

  1. Your new issue from today: https://github.com/influxdata/telegraf/issues/14162
  2. #13322 (this issue)

Both issues stem from the azure-event-hubs-go library and client. The library should handle these kinds of failures, or provide a way to handle them, so that our code does not have to special-case specific error conditions.

What we should focus on instead is migrating away from the azure-event-hubs-go library to the azeventhubs library, as the old library's README recommends. I don't think we can expect support or updates for the existing library anyway.

I would be very happy to review a PR that first migrates to the new library. Then we can have you re-test to see whether these issues still exist. If they do, new issues should be opened in the new repo to gauge the response.

NuMove-JonathanSchmidt commented 12 months ago

I agree with the root cause.

If I were fluent in Go, or had someone on my team who was, I'd have been very happy to provide such a PR. No such luck, however.

I assume this plugin isn't high on your priority list?

powersj commented 12 months ago

> I assume this plugin isn't high on your priority list?

The cloud-service plugins are much more difficult to test and keep compatible, since we are not direct end-users of them. However, I do know I can find users via these issues, and you seem to be able to test these changes out easily.

NuMove-JonathanSchmidt commented 12 months ago

If you'd like a limited Event Hub sandbox with read and write keys to test things out, please don't hesitate to contact me directly; Influx sales has my contact info. Otherwise, I'd be happy to test things out in a production-like environment.

NuMove-JonathanSchmidt commented 11 months ago

A further possibility: given that Event Hub exposes a Kafka-compatible surface, do you think it would be worthwhile to switch to a Kafka output plugin instead?

powersj commented 11 months ago

I was unaware of that. It might be an option worth trying, but I can't say I know enough about Event Hub in general to judge the possible trade-offs.

NuMove-IT commented 10 months ago

Hi @powersj,

The trade-off should be minimal, as the behavior of the two services is quite close.

I was able to connect to Event Hub as a producer with the following configuration:

[[outputs.kafka]]
  brokers = ["<namespace>.servicebus.windows.net:9093"]
  topic = "<topic-name>"
  routing_tag = "host"
  compression_codec = 0
  required_acks = -1  # set to 0 if occasional data loss is acceptable
  max_retry = 3
  max_message_bytes = 1000000
  enable_tls = true
  insecure_skip_verify = true
  sasl_mechanism = "PLAIN"
  sasl_username = "$$ConnectionString"
  sasl_password = "{the actual connection string}"
  sasl_version = 0
  data_format = "influx"

And as a consumer with the following:

[[inputs.kafka_consumer]]
  brokers = ["<namespace>.servicebus.windows.net:9093"]
  topics = ["<topic name>"]
  version = "1.0.0"
  sasl_mechanism = "PLAIN"
  sasl_username = "$$ConnectionString"
  sasl_password = "{the actual connection string}"
  enable_tls = true
  sasl_version = 0
  consumer_group = "<consumer group name>"
  compression_codec = 0
  offset = "oldest"
  connection_strategy = "startup"
  max_message_len = 1000000
  data_format = "influx"

All of these values can either be inferred from the connection string or, like the consumer group, are parameters the Event Hub consumer already takes.

Assuming you want to move forward with a migration of the Event Hub consumer, it should be fairly straightforward to parse the connection string and replicate the functionality with a maintained library behind the plugin.
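As a rough illustration, deriving the pieces the Kafka plugins need from an Event Hubs connection string could look like the Go sketch below (the function and the exact mapping are my own; the full connection string itself becomes the SASL password, as in the configurations above):

package main

import (
    "fmt"
    "strings"
)

// kafkaFromConnString derives the Kafka broker and topic from an Event Hubs
// connection string of the form
// "Endpoint=sb://<ns>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=<hub>".
func kafkaFromConnString(cs string) (broker, topic string) {
    for _, part := range strings.Split(cs, ";") {
        switch {
        case strings.HasPrefix(part, "Endpoint=sb://"):
            host := strings.TrimSuffix(strings.TrimPrefix(part, "Endpoint=sb://"), "/")
            broker = host + ":9093" // Event Hubs exposes its Kafka surface on port 9093
        case strings.HasPrefix(part, "EntityPath="):
            topic = strings.TrimPrefix(part, "EntityPath=")
        }
    }
    return broker, topic
}

func main() {
    b, t := kafkaFromConnString("Endpoint=sb://example.servicebus.windows.net/;SharedAccessKeyName=key;SharedAccessKey=secret;EntityPath=myhub")
    fmt.Println(b, t) // example.servicebus.windows.net:9093 myhub
}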

Alternatively, deprecating the Event Hub plugin and documenting the configuration needed to use it through the Kafka API would keep people from hitting the same issues.

The only outstanding question is with index persistence, as I don't know if that functionality is supported by the Kafka consumer input plugin.

Best regards, Jonathan Schmidt

NuMove-IT commented 10 months ago

Persistence in this case is handled directly by the Kafka API on a per-consumer-group basis.

The one piece of functionality that can't be replicated is the ability to start consuming from an arbitrary datetime. The Kafka consumer is limited to the oldest or newest offset when there is no persisted offset to fetch.

powersj commented 10 months ago

Thank you very, very much for digging into this! It is fantastic to see that a user can use the existing Kafka plugins. In either case, this is something we should document. Would you mind putting up a PR with a brief explanation?

> The only outstanding question is with index persistence, as I don't know if that functionality is supported by the Kafka consumer input plugin.

I briefly looked at the Event Hubs docs, including their Kafka migration guide, and didn't see this called out. That doesn't mean it won't be problematic, though :\

> Assuming you want to move forward with a migration of the Event Hub consumer, it should be fairly straightforward to parse the connection string and replicate the functionality with a maintained library behind the plugin.

I also looked at what the new azeventhubs library provides with respect to the Event Hubs output plugin. It looks like the new client can use azeventhubs.NewProducerClientFromConnectionString to create a producer client and generate batches to send. One difference is that the partition key is set per batch, not per metric, unless I missed something. I assume we would need to do some grouping then.
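For illustration, that grouping against the azeventhubs API might look roughly like this sketch (sendGrouped and its inputs are my own; handling of full batches is elided):

package example

import (
    "context"

    "github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

// sendGrouped sends pre-serialized metrics that have been grouped by
// partition key, one batch per key, since the new client sets the
// partition key per batch rather than per event.
func sendGrouped(ctx context.Context, client *azeventhubs.ProducerClient, byKey map[string][][]byte) error {
    for key, payloads := range byKey {
        key := key // take a stable address for the options struct
        batch, err := client.NewEventDataBatch(ctx, &azeventhubs.EventDataBatchOptions{PartitionKey: &key})
        if err != nil {
            return err
        }
        for _, p := range payloads {
            // a real implementation would send and start a new batch
            // when AddEventData reports the batch is full
            if err := batch.AddEventData(&azeventhubs.EventData{Body: p}, nil); err != nil {
                return err
            }
        }
        if err := client.SendEventDataBatch(ctx, batch, nil); err != nil {
            return err
        }
    }
    return nil
}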

For the input plugin, I have not looked into it in detail, but the number of clients we create is a bit more involved. However, the key points would be to:

I do think we should try to migrate, even if we have to deprecate or ignore some options.