@alexmc1510 I do have a few questions... You are saying
Start to publish "single line" very high frequency data to a topic
What order of magnitude are we talking about here? Are you using any additional processors or aggregators? What are your settings for the output batch size and flush interval?
Hello, sorry for the late response.
Could you rephrase your question? What I mean by "Start to publish "single line" very high frequency data to a topic" is that the device publishes data to a specific topic at very high frequency, and the end goal of the config is to create batches of lines (multiline messages) in the output in order to avoid network load and packet loss. Regarding additional processors or aggregators, as you can see in the config file, I have dedup; there is no more config than the one I have mentioned. The settings for the output are mentioned in the config file, and if you refer to the general Telegraf configuration part, here you are:
# Global tags can be specified here in key="value" format.
[global_tags]
# dc = "us-east-1" # will tag all metrics with dc=us-east-1
# rack = "1a"
## Environment variables can be used as tags, and throughout the config file
# user = "$USER"
# Configuration for telegraf agent
[agent]
interval = "10s"
round_interval = true
metric_batch_size = 1000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "10s"
flush_jitter = "0s"
precision = ""
debug = false
quiet = false
hostname = ""
omit_hostname = false
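For reference, since the goal described above is batched multiline messages on the output side, the MQTT output plugin's batch option is the usual lever for that. This is a minimal sketch, not the poster's actual config; the server address and topic are placeholders:

[[outputs.mqtt]]
  servers = ["tcp://edge-broker:1883"]  # placeholder broker address
  topic = "telegraf/batched"            # placeholder topic
  ## When true, all metrics in a flush are sent as a single MQTT
  ## message instead of one message per metric
  batch = true
  data_format = "influx"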
@alexmc1510,
Could you rephrase your question?
@srebhan is asking how many metrics you are sending when you see this occur.
Based on what you have provided so far, I have no insight into what is missing, how much data you think is missing, or even why you think data is missing.
[[processors.dedup]]
You are using a dedup processor, which means if something is considered to be a duplicate it would be dropped.
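For reference, a minimal dedup configuration looks like the following; the interval value here is illustrative, not taken from the poster's config:

[[processors.dedup]]
  ## Maximum time to suppress re-sending of unchanged field values
  dedup_interval = "600s"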
Hello. Sorry for the late reply. I have debugged the issue in depth, but without a more detailed log I don't know how to continue. At least I can see that the error is constrained to Telegraf. Let me explain the test environment: an IoT device with the following configuration:
In both scenarios I have detected missing data, and I have debugged each component one by one:
No missing data in the edge broker
No missing data in the bridge or edge Telegraf
I forwarded the output to a local file in order to rule out InfluxDB as a potential culprit.
I see missing data in the Telegraf instance in charge of collecting data from MQTT and sending it to InfluxDB, in both scenarios at exactly the same time.
I see the following error in the telegraf log:
2024-05-13T07:30:10Z E! [inputs.mqtt_consumer::inputs_mqtt] Error in plugin: connection lost: pingresp not received, disconnecting
2024-05-13T07:30:10Z D! [inputs.mqtt_consumer::inputs_mqtt] Disconnected [tcp://xxxxxxxxxx:1883]
I see the following "CPU" glitch in telegaf container:
I don't see any network glitch in telegraf container.
Important to mention that telegraf and mosquitto are running as containers in the same computer so...there should not be any network error.
Could someone give me an idea of what I could debug in order to find the root cause?
Thanks in advance
I forwarded the output to a local file in order to rule out InfluxDB as a potential culprit.
Forwarded the output of what? Telegraf? If not, have you used outputs.file to check that?
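For reference, a minimal outputs.file configuration for this kind of check might look like the following; the file path is illustrative:

[[outputs.file]]
  ## Write a copy of every metric to a local file for comparison
  files = ["/tmp/metrics.out"]
  data_format = "influx"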
I see the following error in the telegraf log:
I don't see any network glitch in the telegraf container.
These two statements disagree with each other. That error looks like you could be missing some metrics. Did you track this down further? Have you enabled client_trace = true in your mqtt_consumer config to get all the debug statements from the MQTT client itself?
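For reference, client_trace sits directly in the consumer's plugin block; the servers and topics values below are placeholders:

[[inputs.mqtt_consumer]]
  servers = ["tcp://172.17.0.4:1883"]  # broker address as seen later in this thread
  topics = ["devices/#"]               # placeholder topic filter
  ## Log the Paho MQTT client's internal messages at debug level
  client_trace = true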
It is important to mention that Telegraf and Mosquitto are running as containers on the same computer, so there should not be any network error.
That is not a safe assumption. You could absolutely have some misconfigured item, some sort of rate limiting, a DNS interruption, etc. that occurs while running in Docker.
Nothing so far points at an issue that is actionable or actually in Telegraf. Without additional information this issue will be closed.
Hello,
I have dug deeper into the details of the error by doing the following:
2024-05-14T12:38:54Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] startIncoming Received Message
2024-05-14T12:38:54Z E! [outputs.influxdb::outputs_influxdb_prod] When writing to [https://myinfluxserver:8086]: received error write failed: partial write: field type conflict: input field "value" on measurement "Err Servo Y" is type integer, already exists as type float dropped=4; discarding points
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [pinger] ping check14.997487436
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [pinger] pingresp not received, disconnecting
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] internalConnLost called
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] stopCommsWorkers called
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] internalConnLost waiting on workers
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] stopCommsWorkers waiting for workers
2024-05-14T12:38:55Z D! [outputs.influxdb::outputs_influxdb_prod] Wrote batch of 1000 metrics in 315.35891ms
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] logic waiting for msg on ibound
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] startIncomingComms: got msg on ibound
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] startIncomingComms: received publish, msgId:0
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] logic waiting for msg on ibound
2024-05-14T12:38:55Z D! [outputs.influxdb::outputs_influxdb_prod] Buffer fullness: 1490 / 10000 metrics
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] incoming complete
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] startIncomingComms: ibound complete
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] startIncomingComms goroutine complete
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] outgoing waiting for an outbound message
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] outgoing waiting for an outbound message
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] outgoing waiting for an outbound message
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] outgoing comms stopping
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [net] startComms closing outError
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [router] matchAndDispatch exiting
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] incoming comms goroutine done
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] startCommsWorkers output redirector finished
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] stopCommsWorkers waiting for comms
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] stopCommsWorkers done
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] internalConnLost workers stopped
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] BUG BUG BUG reconnection function is nil<nil>
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [msgids] cleaned up
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] internalConnLost complete
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] [client] status is already disconnected
2024-05-14T12:38:55Z E! [inputs.mqtt_consumer::inputs_mqtt] Error in plugin: connection lost: pingresp not received, disconnecting
2024-05-14T12:38:55Z D! [inputs.mqtt_consumer::inputs_mqtt] Disconnected [tcp://172.17.0.4:1883]
2024-05-14T12:38:56Z D! [outputs.file::outputs_file_mydevice1] Wrote batch of 203 metrics in 2.439502ms
2024-05-14T12:38:56Z D! [outputs.file::outputs_file_mydevice1] Buffer fullness: 0 / 10000 metrics
2024-05-14T12:38:56Z D! [outputs.file::outputs_file_mydevice2] Wrote batch of 133 metrics in 6.940891ms
2024-05-14T12:38:56Z D! [outputs.file::outputs_file_mydevice2] Buffer fullness: 0 / 10000 metrics
2024-05-14T12:38:56Z D! [outputs.file::outputs_file] Wrote batch of 665 metrics in 8.032494ms
2024-05-14T12:38:56Z D! [outputs.file::outputs_file] Buffer fullness: 0 / 10000 metrics
2024-05-14T12:38:56Z E! [outputs.influxdb::outputs_influxdb_prod] When writing to [https://myinfluxserver:8086]: received error write failed: partial write: field type conflict: input field "value" on measurement "Err Servo X" is type float, already exists as type integer dropped=3; discarding points
2024-05-14T12:38:57Z D! [outputs.influxdb::outputs_influxdb_prod] Wrote batch of 1000 metrics in 396.769104ms
2024-05-14T12:38:57Z E! [outputs.influxdb::outputs_influxdb_prod] When writing to [https://myinfluxserver:8086]: received error write failed: partial write: field type conflict: input field "value" on measurement "Err Servo X" is type float, already exists as type integer dropped=1; discarding points
2024-05-14T12:38:57Z D! [outputs.influxdb::outputs_influxdb_prod] Wrote batch of 493 metrics in 291.718099ms
2024-05-14T12:38:57Z D! [outputs.influxdb::outputs_influxdb_prod] Buffer fullness: 0 / 10000 metrics
2024-05-14T12:39:00Z D! [inputs.mqtt_consumer::inputs_mqtt] Connecting [tcp://172.17.0.4:1883]
And answering your question:
Forwarded the output of what? Telegraf? If not, have you used outputs.file to check that?
By forwarding I mean sending the same data to an outputs.file output.
Now I am sure that the error is constrained to Telegraf, and I don't really understand why a ping check time shorter than the others crashes with the error "pingresp not received, disconnecting".
Could you suggest how to continue the debugging?
Thanks in advance
[inputs.mqtt_consumer::inputs_mqtt] [pinger] pingresp not received, disconnecting
As this looks like a networking issue, my suggestion is to simplify your setup first. Move things out of the containers and make sure the networking config in this setup is not the source of your issues.
What ports do you have open? Do you have a very strict firewall between these containers? Can you reproduce this behavior when both run in the same container, or when both run outside containers?
Hello, thanks for your quick response. Forgive me, but what do you mean by "move things out of the containers" and by "same container or outside"? My configuration is Telegraf running in one container (latest image) and Mosquitto (latest image) in a different one. The port between the containers is open, and I have tried configuring the MQTT server URL first with an external URL and then with the internal IP of the Docker bridge network. Same behavior in both configurations.
I'm suggesting running this outside containers, to remove networking or any configuration between the containers as a potential cause. I say this because this is a common setup for users.
I will try to run Telegraf outside a container; nevertheless, it will not help to answer the question: why is a pingresp shorter than some others crashing the connection chain?
Regards
I am not an expert on mosquitto and can't provide any insight into that.
Hello, sorry for the late reply. Over the weekend I have done some tests and made progress on the problem:
Based on @srebhan's message on the InfluxData Community forum, I have configured my Telegraf with the following:
Telegraf will collect messages until either flush_interval (± jitter) is reached or metric_batch_size metrics have arrived. So in your case, you will receive 5k messages (due to your max_undelivered_messages setting) and then Telegraf waits for the 30 seconds (flush_interval) to pass.
So in your case, you are filling the 5k messages in the first 5 seconds and then waiting 25 seconds to flush the metrics, as the metric_batch_size is never reached.
As a solution I would increase your max_undelivered_messages to, say, twice the metric_batch_size. Make sure that metric_buffer_limit is still greater than the batch size by a margin (e.g. a factor of 2 or more). You can additionally reduce the flush_interval to control the maximum latency of your metrics in case the rate drops for some reason.
interval = "10s"
metric_batch_size = 2500
metric_buffer_limit = 10000
qos = 1
max_undelivered_messages = 5000
Now it is working like a charm. Nevertheless, I modified the parameters based on that message without really understanding their meaning. Could you clarify a bit how they impact the data capture? Why were the default values not working properly?
Thanks in advance
max_undelivered_messages = 5000
The MQTT consumer input plugin will consume messages as fast as it can. The max_undelivered_messages setting puts an upper limit on the number of messages it will have in flight at one time. By default this is 1000; you have increased it to 5000, which allows the input to consume more messages at any given time.
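A sketch of how those consumer settings fit together, using the values from this thread (the topic filter is a placeholder):

[[inputs.mqtt_consumer]]
  servers = ["tcp://172.17.0.4:1883"]
  topics = ["devices/#"]  # placeholder topic filter
  qos = 1
  ## Upper limit on messages held in flight, i.e. read from the broker
  ## but not yet delivered to the outputs
  max_undelivered_messages = 5000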
interval = 10s
This setting is essentially ignored by the MQTT consumer, as it reads messages as they arrive. We do some connection checking at each interval, but the plugin does not read or generate metrics on this interval.
metric_buffer_limit = 10000 metric_batch_size = 2500
These are the buffer limit, i.e. how many metrics Telegraf will buffer at any given time, and the batch size, i.e. how many metrics Telegraf will send at each flush interval (default 10s).
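Putting the numbers from this thread together at the agent level; the comments restate the explanation above:

[agent]
  ## Every flush_interval, each output writes at most metric_batch_size metrics
  flush_interval = "10s"
  metric_batch_size = 2500
  ## Metrics kept per output while waiting to be written or retried;
  ## should exceed the batch size by a comfortable margin
  metric_buffer_limit = 10000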
Why were the default values not working properly?
This is still not clear to me either, as your charts don't really explain what data you were capturing or why it might not get captured.
Glad you got it working so I'll close this.
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.29.5, Windows 10
Docker
No response
Steps to reproduce
Expected behavior
An error in debug mode, or no missing information
Actual behavior
Random missing packets of data
Additional info
Both signals show the same packet loss, meaning the issue is related to packet loss. The time window is exactly 4 seconds, the size of one packet.
Full log:
telegraf.2024-03-26-1711458535.log