Closed rceara closed 1 year ago
Hi,
No collection start, collection end and any other message related to this input data:
Why are those fields important? In terms of timing, the timestamp is set based on the timestamp field of the data, if it exists.
What I see with the other collectors (2 messages rather than 1)
Telegraf generates metrics. I would be careful not to compare it to a data collector. If you take a step back, every metric produced by telegraf will use some set of fields. In general, we do not pass through all data collected in all plugins. Someone at some point made a decision about what is relevant information, what data to collect, and how to craft a metric out of that data.
The issue seems to be very clear: the data is not coming complete with all the messages that involve the metadata of a gRPC message.
It would be far more helpful if you called out what additional fields you think are missing and need to be added, and more importantly why those fields need to be added.
Thanks!
I just added a comment explaining what we are missing. Those fields are important because the device doesn't send all data in 1 message, so the collection start time and the collection end time are not in the same message. This raw data can be sent to an outbound file to verify whether the raw data is coming complete and not in pieces. It's very hard to troubleshoot this with tcpdump (which we did) to verify that some data was missing when we dump it to a text file in Telegraf. We needed to use another collector to verify this as well, and it confirmed that Telegraf was missing some important fields that were not populated in the text file.
What we are missing in the outbound message shown in Telegraf: The source of the original message with the port number: Source": "10.93.178.70:59841 The encoding_path of the 1st and last message: Cisco-IOS-XE-wireless-access-point-oper:access-point-oper-data/capwap-data and collection_start_time: 1678740019758 and collection_end_time: 1678740019763. Please notice that the metadata is coming in 2 different messages, so the 1st message doesn't have the collection_end_time.
The source of the original message with the port number: Source": "10.93.178.70:59841
This sounded familiar and a lot like https://github.com/influxdata/telegraf/issues/11920 Is that the same issue?
The encoding_path of the 1st and last message: Cisco-IOS-XE-wireless-access-point-oper:access-point-oper-data/capwap-data
This is stored as the tag path here, which is in your example above. Is your point that this second message does not have it? Is this message actually in the 2nd message?
collection_start_time: 1678740019758 collection_end_time: 1678740019763
Are these ever collected by telegraf in any message?
I shared 2 different examples: one example from the telegraf collector and another example from another collector. The output that I sent from telegraf is different from the output I shared from the other collector, which shows the raw data that is coming via gRPC. So you can see the difference in the output and what is missing.
The issue in #11920 seems to be different, because what I'm explaining is that we are missing information that is not being populated to the output text file when telegraf processes the data. The Expected behavior section (testing with the other collector) shows what I verified using that collector (not telegraf), which brings all the raw data as it was received from the Cisco device.
Telegraf is not passing to the output textfile document all the information and messages received as I explained above: The source of the original message with the port number: Source": "10.93.178.70:59841 The encoding_path of the 1st and last message: Cisco-IOS-XE-wireless-access-point-oper:access-point-oper-data/capwap-data and collection_start_time: 1678740019758 and collection_end_time: 1678740019763.
The collection_start_time and collection_end_time are coming in 2 different messages.
Please let me know if this is clear and makes sense.
Please let me know if this is clear and makes sense.
I am sorry, but I am not following at all. You had identified some fields that may or may not be missing and coming in different messages. I think you are making some assumption that you want telegraf to combine these messages or wait for the second to have all the data, but again not following
It's pretty simple: In my original message I shared the output that telegraf is dumping to the output text file (Logs from Telegraf). If you compare the "Logs from Telegraf" with what I shared in the "Expected behavior" from another collector I'm using (not the telegraf collector), you will notice that we are missing in the telegraf output: the port number: Source": "10.93.178.70:59841 The encoding_path of the 1st and last message: Cisco-IOS-XE-wireless-access-point-oper:access-point-oper-data/capwap-data and collection_start_time: 1678740019758 and collection_end_time: 1678740019763. Does it make sense?
What I shared on the "Expected behavior" from another collector
Show me what you expect to get from Telegraf, not some other collector please.
I'm expecting to see in the output.textfile the source with the port of the sender, the encoding_path from the 1st and last message that contains the data, the collection start and collection end time.
Example: The source of the original message with the port number: Source": "10.93.178.70:59841 The encoding_path of the 1st and last message: Cisco-IOS-XE-wireless-access-point-oper:access-point-oper-data/capwap-data and collection_start_time: 1678740019758 and collection_end_time: 1678740019763.
Let me jump into this discussion... @rceara you are saying that your device sends one metric in two parts as two separate messages, the first one basically contains all data that is currently translated to a metric by Telegraf and a second message that contains some additional meta-data like the collection-end time. Is this understanding correct?
If so, I would be interested to learn how Telegraf can know that there is a second message following before it arrives. This is important for Telegraf to "hold-back" the first metric and fuse it with the second one (e.g. override the collection_end_time with the value of the second message).
Furthermore, how can Telegraf know which messages belong together? It seems like the msg_timestamp is different between the two messages while node_id_str, subscription_id_str and collection_id are identical. This also touches my previous question: How do we know we should fuse the two messages? When do we start a new metric?
Your statement and explanation are correct. We don't send all the data in one message but in multiple messages. Therefore, we start all messages with a collection_start_time that is the same, but don't set the collection_end_time until all messages of that specific group/collection are sent. Each message we send carries a timestamp to track when it was sent out of the device. The collection_start_time and collection_end_time are unique for each set/group of messages the device sends with the data. My question is: why doesn't Telegraf write the source:port, collection_start_time and collection_end_time to an outbound text file if those fields are also part of the gRPC message? Would it be possible for Telegraf to process all the data as received (raw data) and dump it to the outbound.textfile, so that any troubleshooting can be done by verifying the collection_start_time and collection_end_time? As I said before, if I do a packet capture with tcpdump I can see all this information in the info header of the TCP packet, but it is extremely difficult to troubleshoot a data collection problem that way. It would be awesome if Telegraf could show how the raw data is coming, rather than showing consumable data (metrics) only. Maybe we can have a new plugin called outbound.rawdatafile showing how the raw data is coming to the collector? Just floating some ideas, so we keep the existing outbound.file the same and create a new outbound plugin for that purpose :)
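To make the grouping rule described in this comment concrete, here is a minimal Python sketch of how a consumer could hold back messages and fuse them once the closing message arrives. It is not Telegraf code and not the plugin's behavior. The Message, Collection and CollectionAssembler names are hypothetical, the node, subscription and collection IDs and the msg_timestamp values are made up, and treating a value of 0 as "field absent" is an assumption. Only the two collection timestamps are taken from this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Message:
    """Hypothetical stand-in for one decoded MDT message (header plus rows)."""
    node_id_str: str
    subscription_id_str: str
    collection_id: int
    msg_timestamp: int              # epoch milliseconds
    collection_start_time: int = 0  # epoch milliseconds, 0 assumed to mean absent
    collection_end_time: int = 0    # epoch milliseconds, 0 assumed to mean absent
    rows: List[dict] = field(default_factory=list)


@dataclass
class Collection:
    """All rows of one collection plus its start and end times."""
    start_time: int
    end_time: int
    rows: List[dict]

    @property
    def duration_ms(self) -> int:
        return self.end_time - self.start_time


class CollectionAssembler:
    """Buffers messages per collection until the closing message arrives."""

    def __init__(self) -> None:
        self._pending: Dict[Tuple[str, str, int], List[Message]] = {}

    def add(self, msg: Message) -> Optional[Collection]:
        # Messages that share these three values are treated as one group.
        key = (msg.node_id_str, msg.subscription_id_str, msg.collection_id)
        group = self._pending.setdefault(key, [])
        group.append(msg)
        if msg.collection_end_time == 0:
            return None  # collection still open, keep holding back
        del self._pending[key]
        # Take the first non-zero start time seen in the group.
        start = next(
            (m.collection_start_time for m in group if m.collection_start_time), 0
        )
        rows = [row for m in group for row in m.rows]
        return Collection(start, msg.collection_end_time, rows)


assembler = CollectionAssembler()
first = Message("node-a", "101", 42, msg_timestamp=1678740019760,
                collection_start_time=1678740019758, rows=[{"name": "ap-1"}])
last = Message("node-a", "101", 42, msg_timestamp=1678740019763,
               collection_end_time=1678740019763)
pending = assembler.add(first)  # nothing emitted yet
done = assembler.add(last)      # closing message completes the collection
```

A real implementation would also need a timeout or size limit, since a closing message that never arrives would otherwise keep its group buffered forever.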
@rceara I think you are somehow misinterpreting the intention or concept behind Telegraf. Let me clarify:
Telegraf is an agent for collecting, processing, aggregating, and writing metrics
This is the first line in the Readme. Telegraf is not a "collect whatever data-source you have and write the raw data to a file" agent. This being said, the idea is to query different data-sources (or provide listeners as for the cisco_telemetry_mdt input) and convert/standardize/transform it to a metric. This transformation might be lossy for some inputs but it allows a downstream user to compute using those metrics (e.g. computing statistics over different types of devices). So an outbound.rawdatafile plugin is out-of-scope for Telegraf. If you only require this for debugging, feel free to add a PR to enable debug logging the raw data.
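As an illustration of what debug logging of the raw data could look like, here is a small Python sketch. It is not Telegraf code. The logger name, the log_raw_message function and the 64-byte truncation limit are all hypothetical choices made for this example.

```python
import logging

logger = logging.getLogger("mdt_debug")  # hypothetical logger name


def log_raw_message(source: str, payload: bytes, limit: int = 64) -> str:
    """Format a raw payload as hex, log it at DEBUG level, and return the line."""
    shown = payload[:limit].hex()
    suffix = "..." if len(payload) > limit else ""
    line = f"raw message from {source}: {len(payload)} bytes: {shown}{suffix}"
    logger.debug(line)  # only emitted when DEBUG logging is enabled
    return line


example_line = log_raw_message("10.93.178.70:59841", b"\x0a\x02hi")
```

Logging at DEBUG level keeps the normal metric output unchanged and only exposes the raw bytes when someone is actively troubleshooting.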
This being said, we are currently missing the fields you mentioned because they are not extracted in the current code. You can correct this by opening a pull-request to add the missing fields to the metric created by Telegraf. I'm more than happy to review such a PR.
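A rough Python sketch of the idea, for illustration only: copy the header values discussed in this thread into the field set of a metric, skipping the ones a given message does not carry. The header_fields function and the dictionary layout are hypothetical and do not reflect the plugin's actual code. The encoding_path is left out because, as noted earlier in the thread, it is already stored as the path tag.

```python
from typing import Any, Dict


def header_fields(source: str, header: Dict[str, Any]) -> Dict[str, Any]:
    """Build extra metric fields from a message's source address and header."""
    fields: Dict[str, Any] = {"source": source}
    for name in ("collection_start_time", "collection_end_time"):
        value = header.get(name)
        if value:  # skip fields that are absent or zero in this message
            fields[name] = value
    return fields


# The first message of a collection carries the start time but no end time.
first_fields = header_fields(
    "10.93.178.70:59841",
    {"collection_start_time": 1678740019758, "collection_end_time": 0},
)
# The closing message carries the end time.
last_fields = header_fields(
    "10.93.178.70:59841",
    {"collection_end_time": 1678740019763},
)
```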
Regarding the merging of the multiple messages you mention, I'm not sure if your approach is the general approach for all devices supporting Cisco MDT. Can you elaborate on this? If this is the way all devices handle message splitting, I'd like to see a PR for fixing Telegraf. If not, you can still submit a PR and enable this type of message merging by adding a new option.
Ok, I understood your message and it sounds good to me! I will open a PR internally to enable debug logging of the raw data for cisco_telemetry_mdt. I think that will make sense. Again, the information you are collecting and sending to any outbound (Chronograf, influxdb, textfile) is correct, but we are missing some important fields (source:port, encoding_path, collection_start_time and collection_end_time) that are very important for debugging purposes.
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and request that it be re-opened. Thank you!
Relevant telegraf.conf
Logs from Telegraf
I believe this is an issue with the cisco_telemetry_mdt input plugin, which is only processing consumable information (data of the measurements) and not all the messages/metadata that are received from the device by the Telegraf collector.
An example of what I see in Telegraf (only 1 message) when sending the data to an output.file. No collection start, collection end or any other message related to this input data:
$ tail -F /etc/telegraf/telegraf-grpc-mtls.log
System info
Telegraf 1.21.4+ds1-0ubuntu2, Ubuntu 22.04
Docker
No response
Steps to reproduce
If you need to test with a collector please go to the following link to grab one Cisco device: https://devnetsandbox.cisco.com/RM/Diagram/Index/f2e2c0ad-844f-4a73-8085-00b5b28347a1?diagramType=Topology
Snippet of the config to be loaded:
Expected behavior
What I see with the other collectors (2 messages rather than 1) providing more details such as: start and end of the collection. The timestamp is inside the message but that is different from the start/end of when the data was sent from the device to the collector.
What we are missing in the outbound message shown in Telegraf: The source of the original message with the port number: Source": "10.93.178.70:59841 The encoding_path: Cisco-IOS-XE-wireless-access-point-oper:access-point-oper-data/capwap-data and collection_start_time: 1678740019758 and collection_end_time: 1678740019763. Please notice that the metadata is coming in 2 different messages, so the 1st message doesn't have the collection_end_time.
message 1 with all the data of the measurement
message 2 which is correlated to the 1st message
Actual behavior
I believe this is an issue with the cisco_telemetry_mdt input plugin, which is only processing consumable information (data of the measurements) and not all the messages/metadata that are received from the device by the Telegraf collector.
An example of what I see in Telegraf (only 1 message) when sending the data to an output.file. No collection start, collection end or any other message related to this input data:
$ tail -F /etc/telegraf/telegraf-grpc-mtls.log
Additional info
I provided additional information about a sandbox device you can use from developer.cisco.com in case you want to reproduce the problem with a collector. The issue seems to be very clear: the data is not coming complete with all the messages that involve the metadata of a gRPC message.