logstash-plugins / logstash-output-elasticsearch

https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
Apache License 2.0

Handling non UTF-8 data. #1168

Closed mashhurs closed 4 months ago

mashhurs commented 4 months ago

Description

Current buggy behaviours:

Logstash information:

Please include the following information:

  1. Logstash version (e.g. bin/logstash --version) - any, including main (v8.14) branch, es-output-v11.22.2
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker) - any, including main (v8.14) branch, es-output-v11.22.2
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes)
  4. How was the Logstash Plugin installed - default, current es-output-v11.22.2

JVM (e.g. java -version):

If the affected version of Logstash is 7.9 (or earlier), or if it is NOT using the bundled JDK or using the 'no-jdk' version in 7.10 (or higher), please provide the following information:

  1. JVM version (java -version)
  2. JVM installation source (e.g. from the Operating System's package manager, from source, etc).
  3. Value of the JAVA_HOME environment variable if set.

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

Steps to reproduce:

Please include a minimal but complete recreation of the problem, including (e.g.) pipeline definition(s), settings, locale, etc. The easier you make it for us to reproduce, the more likely it is that somebody will take the time to look at it.

  1. Use the following pipeline config and save it as encoding_test.conf in the config folder:

    input { generator { count => 1 } }
    filter { ruby { code => 'str = "\xAC"; event.set("message", str)' } }
    output {
      elasticsearch {
        cloud_id => "cloud_id"
        cloud_auth => "elastic:{pwd}"
        http_compression => "${HTTP_COMPRESSION}"
      }
      stdout { }
    }
  2. Run with HTTP compression enabled (HTTP_COMPRESSION=true bin/logstash -f config/encoding_test.conf) and observe that ES rejects the event because of the invalid UTF-8 payload.

  3. Run with HTTP compression disabled (HTTP_COMPRESSION=false bin/logstash -f config/encoding_test.conf) and observe that ES indexes the event without issue.
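
For reference, the value that the ruby filter assigns to `message` can be inspected in plain Ruby, outside Logstash. This is only a minimal sketch: the string is built from the raw byte (rather than the `"\xAC"` literal) so that the result does not depend on the source-file encoding. The byte 0xAC has the bit pattern `10101100`, which UTF-8 reserves for continuation bytes, so it can never start a sequence.

```ruby
# Build the one-byte string from the pipeline above and tag it as UTF-8,
# which is how Logstash events normally carry their string fields.
str = [0xAC].pack("C").force_encoding(Encoding::UTF_8)

# The string claims to be UTF-8 but its bytes are not a valid sequence.
is_valid   = str.valid_encoding?                 # => false
first_byte = format("0x%02x", str.bytes.first)   # => "0xac"

# 0xAC matches 10xxxxxx, the UTF-8 continuation-byte pattern.
is_continuation = (str.bytes.first & 0b1100_0000) == 0b1000_0000  # => true

puts "valid UTF-8: #{is_valid}, first byte: #{first_byte}, continuation byte: #{is_continuation}"
```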

Provide logs (if relevant):

# HTTP_COMPRESSION=true bin/logstash -f config/encoding_test.conf --enable-local-plugin-development

[2024-03-15T15:22:19,117][DEBUG][org.apache.http.impl.conn.PoolingHttpClientConnectionManager][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Connection released: [id: 0][route: {s}->https://host.elastic-cloud.com:443][total available: 1; route allocated: 1 of 100; total allocated: 1 of 1000]
[2024-03-15T15:22:19,119][ERROR][logstash.outputs.elasticsearch][main][999000c22ac1744372923039d3bee405a92df01b3dafcd64f0830a24ad60acc6] Encountered a retryable error (will retry with exponential backoff) {:code=>400, :url=>"https://host.elastic-cloud.com:443/_bulk?filter_path=errors,items.*.error,items.*.status", :content_length=>248, :body=>"{\"error\":{\"root_cause\":[{\"type\":\"parse_exception\",\"reason\":\"Failed to parse content to type\"}],\"type\":\"parse_exception\",\"reason\":\"Failed to parse content to type\",\"caused_by\":{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0xac\\n at [Source: (byte[])\\\"{\\\"@version\\\":\\\"1\\\",\\\"host\\\":{\\\"name\\\":\\\"MacBook-Pro.local\\\"},\\\"@timestamp\\\":\\\"2024-03-15T22:22:18.892422Z\\\",\\\"message\\\":\\\"�\\\",\\\"event\\\":{\\\"original\\\":\\\"Hello world!\\\",\\\"sequence\\\":0},\\\"data_stream\\\":{\\\"type\\\":\\\"logs\\\",\\\"dataset\\\":\\\"generic\\\",\\\"namespace\\\":\\\"default\\\"}}\\\"; line: 1, column: 117]\"}},\"status\":400}"}

# HTTP_COMPRESSION=false bin/logstash -f config/encoding_test.conf
{
        "host" => {
        "name" => "MacBook-Pro.local"
    },
         "event" => {
        "original" => "Hello world!",
        "sequence" => 0
    },
      "@version" => "1",
       "message" => "\xAC",
    "@timestamp" => 2024-03-15T22:27:03.706976Z
}

Acceptance Criteria

Regardless of HTTP compression mode, the behaviour should be the same: the event is either always rejected or always accepted. Accepting is probably the better option, since it benefits users in many ways. However, silently filtering out invalid byte sequences would be somewhat dangerous.
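
To make the trade-off concrete, here is a rough sketch of the three possible behaviours in plain Ruby. The helper names (`reject_invalid`, `accept_with_replacement`, `accept_by_dropping`) are hypothetical and are not part of the plugin; the sketch only illustrates the options and is not how es-output handles the payload.

```ruby
# U+FFFD REPLACEMENT CHARACTER; a \u escape always yields a UTF-8 string.
REPLACEMENT = "\uFFFD"

# Option 1 (hypothetical): always reject payloads that are not valid UTF-8.
def reject_invalid(str)
  raise ArgumentError, "invalid UTF-8 payload" unless str.valid_encoding?
  str
end

# Option 2 (hypothetical): always accept, marking each invalid byte visibly.
def accept_with_replacement(str)
  str.scrub(REPLACEMENT)
end

# Option 3 (hypothetical): always accept, dropping invalid bytes.
# This loses information without leaving any trace in the indexed event.
def accept_by_dropping(str)
  str.scrub("")
end

# Sample value: "a", the invalid byte 0xAC, then "b", tagged as UTF-8.
sample = [0x61, 0xAC, 0x62].pack("C*").force_encoding(Encoding::UTF_8)

replaced = accept_with_replacement(sample)  # => "a\uFFFDb"
dropped  = accept_by_dropping(sample)       # => "ab"
```

Whichever option is chosen, applying it before the compressed and uncompressed paths diverge would keep both modes consistent.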