Closed loganmc10 closed 9 months ago
Hi, thanks for the issue. To clarify my understanding, if a field is going to be given a value that contains invalid characters, is the request that the field is dropped in this instance? Or another form of sanitization?
I guess sanitize would be good. Perhaps this is something that would fit best in a processor, giving the option to sanitize inputs via a processor before they are sent to the output plugin.
Go has a ToValidUTF8
function that allows you to replace any invalid characters with a replacement character (which could just be nothing). Some kind of a UTF8 sanitizer processer would be very helpful
I agree that a processor would make a lot of sense, there is currently a strings
processor that we likely could place this functionality into.
Not sure if it would work at all, but there is a starlark
processor which allows you to write your own processor script, if there's some easy way for you to identify for when your data should be replaced or removed. I'm not sure if starlark
supports non-utf-8 characters though.
Ok I've submitted a PR to add an option to the strings processor to sanitize strings using Go's ToValidUTF8 function: https://github.com/influxdata/telegraf/pull/9118
This issue is based on an incorrect assumption. Let's look at the OID in question jnxCmRescueChgDate
. This OID is from JUNIPER-CFGMGMT-MIB
and has syntax/type of DateAndTime
:
jnxCmRescueChgDate OBJECT-TYPE
SYNTAX DateAndTime
MAX-ACCESS read-only
STATUS current
DESCRIPTION
"The date and time when the rescue configuration was last
changed."
::= { jnxCmRescueChg 2 }
DateAndTime
is imported from SNMPv2-TC
, where we find its definition as:
DateAndTime ::= TEXTUAL-CONVENTION
DISPLAY-HINT "2d-1d-1d,1d:1d:1d.1d,1a1d:1d"
STATUS current
DESCRIPTION
"A date-time specification.
field octets contents range
----- ------ -------- -----
1 1-2 year* 0..65536
2 3 month 1..12
3 4 day 1..31
4 5 hour 0..23
5 6 minutes 0..59
6 7 seconds 0..60
(use 60 for leap-second)
7 8 deci-seconds 0..9
8 9 direction from UTC '+' / '-'
9 10 hours from UTC* 0..13
10 11 minutes from UTC 0..59
* Notes:
- the value of year is in network-byte order
- daylight saving time in New Zealand is +13
For example, Tuesday May 26, 1992 at 1:30:15 PM EDT would be
displayed as:
1992-5-26,13:30:15.0,-4:0
Note that if only local time is known, then timezone
information (fields 8-10) is not present."
SYNTAX OCTET STRING (SIZE (8 | 11))
You can see here that this field is not a "string" at all, rather an array of bytes that encodes a representation of a timestamp. For example the first two bytes, in network-byte order, are not two ASCII or UTF-8 characters, they are a 16-bit integer that represents the year.
This brings us to the incorrect assumption. While the syntax (or type) OCTET STRING
has the word string in it, it is an array of bytes. Unlike what the name suggests, an OCTET STRING
is not limited to string values but can store any byte based data type, including binary data. To somehow remove or replace values in an OCTET STRING
because they are not printable would destroy the data itself, rendering it useless. The OID might as well be thrown away entirely.
The real issue here isn't figuring out how to print things that are non-printable. It is that Telegraf's SNMP Trap input doesn't properly handle OCTET STRING
s (and some other things) as they are intended to be handled. Deleting or replacing chunks of data that Telegraf can't print isn't going to fix that.
@robcowart yes you are totally correct, I've updated the issue title and description (pointing to your explanation).
In my case, it's still useful for the time being to just sanitize the whole thing as utf-8, because the system I'm sending the data to (Grafana Loki) fails to parse anything if any part of any message in a search isn't utf-8. However, a proper solution would involve parsing this type of data into readable text.
For the time being, You could opt to drop that specific field from the measurement, so you can use it with Grafana and Loki.
Closing since linked PR got merged
Okay, could you clarify what is still missing then?
I already did that above... https://github.com/influxdata/telegraf/issues/9113#issuecomment-829585208
If you are not going to handle the data properly, at least don't destroy it by replacing non-printable characters with junk. It would be better to transform the raw byte slice into a printable form using something like hex.EncodeToString()
https://pkg.go.dev/encoding/hex#EncodeToString. This way you at least preserve the option for another application to properly transform the data at a later stage of processing.
Where do those non-printable characters get replaced? I'm willing to add functionality to transform bytes to hex encoding, but It's difficult to find out when to do it and when not. I don't think it is always the case when SYNTAX is OCTET STRING
?
any update on this? we still cannot decode SystemClock fields from snmp traps from standard routers like Cisco Juniper "systemClock": "\u0007\ufffd\u0003\u0015\u000b-*\u0001-\u0004\u0000",
Hi, we were running into the same issue, in this case jnxCmCfgChgLatestDate (1.3.6.1.4.1.2636.3.18.1.3.0), which is also an SNMP DateAndTime object. Using snmpwalk we got:
$ snmpwalk -v2c -cpublic 10.0.0.1 1.3.6.1.4.1.2636.3.18.1.3.0
SNMPv2-SMI::enterprises.2636.3.18.1.3.0 = Hex-STRING: 07 E6 0A 04 10 2A 30 00 2D 04 00
We are using the json serializer which gave us:
\u0007\ufffd\n\u0004\u0010*0\u0000-\u0004\u0000
Comparing the two outputs, somehow the first two bytes 07 E6
are mangled by telegraf and replaced with \u0007\ufffd
(maybe part of an internal data structure of telegraf?). This corresponds well to the observation of @adrianhardkor.
\u0007 ???
\ufffd ???
\n 0A
\u0004 04
\u0010 10
* 2A
0 30
\u0000 00
- 2D
\u0004 04
\u0000 00
We now worked around the issue by base64 encoding the value using json_transformation
, basically from:
json_transformation = '''
metrics{
tags.agent_host: {
"LastConfChange": fields.InfoLastConfChange,
}
}
'''
to:
json_transformation = '''
metrics{
tags.agent_host: {
"LastConfChange": $base64encode(fields.InfoLastConfChange),
}
}
'''
and base64 decoding the value in our application.
Looking at this case from the point of view of the json_test.go in the repo, I observed the following
The JSON serialiser prepares to handle all incoming data through interface
, which in the case of byte array data coming in first appears to assume it's interface {}([]uint8)
format but after calling metric.New, the uint8 array gets translated into a interface {}(string)
. This step still retains all the original data, though it does get translated strangely in the case of 0x07 which is common in year 2000+ timestamps. The second half 0xe8 of the year number becomes\xe8
which still contains the original data
Later, the Golang json.Marshal is called to actually serialise that data, but as that code follows this principle
// String values encode as JSON strings coerced to valid UTF-8,
// replacing invalid bytes with the Unicode replacement rune.
// So that the JSON will be safe to embed inside HTML <script> tags,
// the string is encoded using HTMLEscape,
// which replaces "<", ">", "&", U+2028, and U+2029 are escaped
// to "\u003c","\u003e", "\u0026", "\u2028", and "\u2029".
// This replacement can be disabled when using an Encoder,
// by calling SetEscapeHTML(false).
the 0xe8 becomes Unicode replacement character
0xFFFD since 0xe8 is considered an invalid byte. This is the stage where data is lost
So unless these byte array timestamps are desired to be unpacked inside telegraf into their textual representation, the byte array content should be passed through as-is without attempting char conversion of the bytes, just dump the bytes with hex.EncodeToString()
before calling the metric.New
System info:
1.18.1 running in docker
Steps to reproduce:
When using the snmp_trap input plugin, some devices send traps with non utf-8 encoded strings, for instance:
Expected behavior:
See https://github.com/influxdata/telegraf/issues/9113#issuecomment-829585208
Actual behavior:
See https://github.com/influxdata/telegraf/issues/9113#issuecomment-829585208
Additional info: