humio / issues

Issue Tracker for Humio

regex() doesn't support UTF-8 correctly #119

Closed neiser closed 4 years ago

neiser commented 4 years ago

When the regex() function is used in a query with named field extraction (psst in the example), the extracted string is not displayed correctly if German umlauts (here ä) are present. The field that the regex is applied to displays correctly on its own (see screenshots).

ahe-humio commented 4 years ago

I'm not able to reproduce this:

[Screenshot: Screen Shot 2020-08-11 at 10 15 54]

neiser commented 4 years ago

@ahe-humio Thanks for your reply. Is there anything I can do to help reproduce this issue? I've just checked that the Unicode code point U+00E4 is indeed ä.

I also noticed that your test case differs from my situation: my @rawstring is not a simple string (which displays the value correctly) but "stringified JSON". A stripped-down example of the content of @rawstring is:

{"@timestamp":"2020-08-01T12:59:24.955Z","log":"{\"@timestamp\":1596286764954,\"level\":\"DEBUG\",\"thread\":\"...\",\"logger\":\"...\",\"message\":\"some str1(some str2(value=stell f\\u00E4lscher))\",\"traceId\":\"...\",\"spanId\":\"...\",\"tenant\":\"...\",\"testing\":false,\"userId\":\"...\"}\n","stream":"stdout","time":"2020-08-01T12:59:24.955308654Z", ... }

You can see that inside this @rawstring JSON, the message string contains the encoded value \\u00E4. Maybe you can adapt your test case accordingly and see if you can reproduce the issue then?

ahe-humio commented 4 years ago

I can reproduce it provided I use the following parser:

parseJson() | parseJson(log)

I upload it using this script:

json='"{\"@timestamp\":\"2020-08-12T05:59:24.955Z\",\"log\":\"{\\\"@timestamp\\\":1597219143000,\\\"level\\\":\\\"DEBUG\\\",\\\"thread\\\":\\\"...\\\",\\\"logger\\\":\\\"...\\\",\\\"message\\\":\\\"some str1(some str2(value=stell f\\\\u00E4lscher))\\\",\\\"traceId\\\":\\\"...\\\",\\\"spanId\\\":\\\"...\\\",\\\"tenant\\\":\\\"...\\\",\\\"testing\\\":false,\\\"userId\\\":\\\"...\\\"}\\n\",\"stream\":\"stdout\",\"time\":\"2020-08-01T12:59:24.955308654Z\"}"'

curl -v -X POST https://cloud.humio.com/api/v1/ingest/humio-unstructured \
-H "Content-Type: application/json" \
-H "Authorization: Bearer TOKEN-REMOVED" \
-d @- << EOF
[
  {
    "fields": {
      "host": "webhost1"
    },
    "messages": [$json]
  }
]
EOF
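
For anyone who would rather not hand-escape the nested JSON, here is a minimal Python sketch that builds an equivalent request body by encoding three times: the inner log object, the event that carries it as a string, and the ingest payload that carries the event as a message. The field values are shortened from the script above, and nothing is sent; passing the result to curl or another HTTP client is left out. Note that Python writes the escape with lowercase hex digits (\u00e4), which decodes to the same character.

```python
import json

# Inner application log line. json.dumps escapes non-ASCII characters by
# default, so the umlaut becomes the six characters \u00e4.
inner = {
    "level": "DEBUG",
    "message": "some str1(some str2(value=stell f\u00e4lscher))",
}
log_value = json.dumps(inner) + "\n"

# Outer event: the inner JSON is carried as a plain string in "log",
# so its backslash gets escaped once more.
event = json.dumps({
    "@timestamp": "2020-08-12T05:59:24.955Z",
    "log": log_value,
    "stream": "stdout",
})

# Ingest payload: the whole event is one string in "messages",
# which adds a third level of escaping.
payload = json.dumps([
    {"fields": {"host": "webhost1"}, "messages": [event]}
])

print(payload)
```

The printed body contains the umlaut as four backslashes followed by u00e4, the same depth of escaping as in the hand-written json variable.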

ahe-humio commented 4 years ago

I think the problem is that @rawstring is JSON-encoded: it literally contains the characters that you see in the result of the regex. However, if I tell regex() to search in the field message, it returns the expected result:

[Screenshot: Screen Shot 2020-08-12 at 10 11 08]
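
The same effect can be shown outside Humio with a small Python sketch. It is only an illustration: the raw text is shortened from the example earlier in the thread, the pattern is a hypothetical stand-in for the query's regex (only the group name psst comes from the report), and the two json.loads calls merely mirror what parseJson() | parseJson(log) does conceptually.

```python
import json
import re

# Raw event text, shortened from the example in this thread. In the raw
# text the umlaut appears as backslash, backslash, u00E4.
raw = r'{"log":"{\"level\":\"DEBUG\",\"message\":\"some str1(some str2(value=stell f\\u00E4lscher))\"}\n","stream":"stdout"}'

# Hypothetical pattern with a named group, standing in for the query's regex.
pattern = re.compile(r"value=(?P<psst>[^)]*)")

# Matching on the raw text captures the escape sequence verbatim.
from_raw = pattern.search(raw).group("psst")

# Decode twice: first the outer event, then the JSON string held in "log".
outer = json.loads(raw)
inner = json.loads(outer["log"])

# Matching on the decoded message field captures the real character.
from_message = pattern.search(inner["message"]).group("psst")

print(from_raw)
print(from_message)
```

The first capture still holds the escape sequence as plain characters, while the second holds ä, because only the decoded field has had its JSON escapes resolved.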

neiser commented 4 years ago

@ahe-humio Thanks for investigating this. I think this can be closed now. The solution is to apply the regex to the field which has the correctly decoded values, right?

ahe-humio commented 4 years ago

@neiser yes, that is what my reproduction leads me to conclude.

I'm currently not able to close bugs here, so thank you for closing it yourself 🙂