Closed neiser closed 4 years ago
I'm not able to reproduce this:
@ahe-humio Thanks for your reply. Is there anything I can do to help reproduce this issue? I've just checked that the Unicode code point U+00E4 is indeed ä.
I also noticed that your test case differs from my situation in that @rawstring is not a simple string (which would display the value correctly) but "stringified JSON". A stripped-down example of the content of @rawstring is:
{"@timestamp":"2020-08-01T12:59:24.955Z","log":"{\"@timestamp\":1596286764954,\"level\":\"DEBUG\",\"thread\":\"...\",\"logger\":\"...\",\"message\":\"some str1(some str2(value=stell f\\u00E4lscher))\",\"traceId\":\"...\",\"spanId\":\"...\",\"tenant\":\"...\",\"testing\":false,\"userId\":\"...\"}\n","stream":"stdout","time":"2020-08-01T12:59:24.955308654Z", ... }
You can see that inside this @rawstring JSON, the message string contains the encoded value \\u00E4. Maybe you can adapt your test case accordingly and see if you can reproduce the issue then?
I can reproduce it provided I use the following parser:
parseJson() | parseJson(log)
I upload it using this script:
json='"{\"@timestamp\":\"2020-08-12T05:59:24.955Z\",\"log\":\"{\\\"@timestamp\\\":1597219143000,\\\"level\\\":\\\"DEBUG\\\",\\\"thread\\\":\\\"...\\\",\\\"logger\\\":\\\"...\\\",\\\"message\\\":\\\"some str1(some str2(value=stell f\\\\u00E4lscher))\\\",\\\"traceId\\\":\\\"...\\\",\\\"spanId\\\":\\\"...\\\",\\\"tenant\\\":\\\"...\\\",\\\"testing\\\":false,\\\"userId\\\":\\\"...\\\"}\\n\",\"stream\":\"stdout\",\"time\":\"2020-08-01T12:59:24.955308654Z\"}"'
curl -v -X POST https://cloud.humio.com/api/v1/ingest/humio-unstructured \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer TOKEN-REMOVED" \
  -d @- << EOF
[
  {
    "fields": {
      "host": "webhost1"
    },
    "messages": [$json]
  }
]
EOF
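The hand-escaped payload in the script above is easy to get wrong, so here is a minimal Python sketch of how the same nesting could be generated with `json.dumps` instead. It is an illustration, not the script used in this thread: the field values are abbreviated, the variable names are made up, and nothing is sent to the ingest endpoint. Note that `json.dumps` emits the lowercase escape `\u00e4`, which is equivalent to `\u00E4`.

```python
import json

# Inner application log entry (abbreviated; only the fields relevant here).
inner = {
    "@timestamp": 1597219143000,
    "level": "DEBUG",
    "message": "some str1(some str2(value=stell fälscher))",
}

# ensure_ascii=True (the default) writes ä as the escape sequence \u00e4.
log_line = json.dumps(inner) + "\n"

# Outer container-log wrapper: the inner JSON becomes a plain string value,
# so its backslashes and quotes are escaped a second time.
outer = {
    "@timestamp": "2020-08-12T05:59:24.955Z",
    "log": log_line,
    "stream": "stdout",
}
raw_event = json.dumps(outer)

# Request body in the same shape as the curl payload: the whole outer
# document is passed as one message string (a third level of escaping).
body = json.dumps([{"fields": {"host": "webhost1"}, "messages": [raw_event]}])
```

Each `json.dumps` call adds one level of escaping, which is why the raw event text ends up containing a backslash-escaped `\u00e4` rather than the character ä.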
I think the problem is that @rawstring is JSON-encoded: it literally contains the characters that you see in the result of the regex. However, if I tell regex() to search in the field message, it returns the expected result.
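The difference can be illustrated outside Humio with a small Python sketch. It only mimics the behaviour described here and is not Humio's implementation: the raw text is a trimmed version of the example event, the two `json.loads` calls stand in for `parseJson() | parseJson(log)`, and Python's `re` needs `(?P<psst>...)` where the query uses `(?<psst>...)`.

```python
import json
import re

# Raw event text: outer JSON whose "log" value is itself a JSON document
# (trimmed from the example in this thread).
raw = r'{"log":"{\"message\":\"some str1(some str2(value=stell f\\u00E4lscher))\"}\n","stream":"stdout"}'

pattern = re.compile(r"some str1\(some str2\(value=(?P<psst>.*)\)\)")

# Matching against the raw text captures the escape sequence as literal
# characters, because nothing has decoded it yet.
from_raw = pattern.search(raw).group("psst")

# Two decoding passes: outer document first, then the nested "log" string.
log = json.loads(raw)["log"]
message = json.loads(log)["message"]

# Matching against the decoded field captures the real character.
from_message = pattern.search(message).group("psst")

print(from_raw)
print(from_message)
```

The first capture still contains the backslash-escaped `u00E4`, while the second contains `ä`, which matches the conclusion that the regex should be applied to the decoded field.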
@ahe-humio Thanks for investigating this. I think this can be closed now. The solution is to apply the regex to the field which has the correctly decoded values, right?
@neiser yes, that is what my reproduction leads me to conclude.
I'm currently not able to close bugs here, so thank you for closing it yourself 🙂
When using the regex() function in a query with named field extraction (psst in the example), the string is not displayed correctly if German umlauts (here ä) are present. When using the content of the field the regex is applied to, everything is fine (see screenshots).
regex("some str1\(some str2\(value=((?<psst>.*))\)\)")