magnusbaeck / logstash-filter-verifier

Apache License 2.0
191 stars 27 forks source link

lfv_fields_not_found error when using codec that outputs multiple events #147

Open jgough opened 2 years ago

jgough commented 2 years ago

When building from source from master, when a codec is specified that outputs multiple events then the fields.json file that is generated may not have enough fields, resulting in a lfv_fields_not_found tag.

For example with the following input:

  stdin {
    id => "input"
    codec => multiline {
      pattern => "^\s"
      what => "previous"
    }
  }

and the following test:

input_plugin: "input"
fields:
  test: true
ignore:
  - "@timestamp"
testcases:
  - input:
    - |
      This is message 1 and
        this is message 1 also
      This is message 2 and
        this is message 2 also
    expected:
      - message: "This is message 1 and\n  this is message 1 also"
        tags: ["multiline"]
        test: true
      - message: "This is message 2 and\n  this is message 2 also"
        tags: ["multiline"]
        test: true

The following is output:

☑ Comparing message 1 of 2 from test1.yml
☐ Comparing message 2 of 2 from test1.yml:
--- /tmp/1371322176/test1.yml/2/expected        2021-09-24 16:59:01.007682196 +0000
+++ /tmp/1371322176/test1.yml/2/actual  2021-09-24 16:59:01.007682196 +0000
@@ -1,7 +1,7 @@
 {
   "message": "This is message 2 and\n  this is message 2 also",
   "tags": [
-    "multiline"
-  ],
-  "test": true
+    "multiline",
+    "lfv_fields_not_found"
+  ]
 }

Summary: ☐ All tests: 1/2
         ☐ test1.yml: 1/2
logstash-filter-verifier: error: failed test cases

This seems to be because fields.json is set to contain {"0":{"test":true}} and there is no second value for the second message. I know LFV has no concept that there are two events here. Is there any solution to this problem?

You could argue that this should be two different testcase input values, but given logstash produced two outputs for one input here it could be argued that this should be supported. What are your thoughts?

Dockerfile to repro:

# syntax=docker/dockerfile:1.3-labs
FROM golang:1.17 as lfv_builder

RUN git clone https://github.com/magnusbaeck/logstash-filter-verifier.git
RUN apt update && apt install -y protobuf-compiler
RUN useradd -m -s /bin/bash go && chown -R go /go

USER go
WORKDIR /go/logstash-filter-verifier
RUN make && make check test

#------------------------------------------------------#

FROM docker.elastic.co/logstash/logstash:7.10.2

COPY --from=lfv_builder /go/logstash-filter-verifier /usr/bin

RUN <<EOF
mkdir tests
mkdir pipeline/pipeline1

cat <<EOT > /usr/share/logstash/config/pipelines.yml
- pipeline.id: pipeline1
  path.config: "pipeline/pipeline1/*.conf"
EOT

cat <<EOT > /usr/share/logstash/tests/test1.yml
input_plugin: "input"
fields:
  test: true
ignore:
  - "@timestamp"
testcases:
  - input:
    - |
      This is message 1 and
        this is message 1 also
      This is message 2 and
        this is message 2 also
    expected:
      - message: "This is message 1 and\n  this is message 1 also"
        tags: ["multiline"]
        test: true
      - message: "This is message 2 and\n  this is message 2 also"
        tags: ["multiline"]
        test: true
EOT

cat <<EOT > /usr/share/logstash/pipeline/pipeline1/01-input.conf
input {
  stdin {
    id => "input"
    codec => multiline {
      pattern => "^\s"
      what => "previous"
    }
  }
}

output { stdout {} }
EOT

cat <<EOT > /usr/share/logstash/run_tests.sh
logstash-filter-verifier daemon start --no-cleanup &
sleep 5
logstash-filter-verifier daemon run --pipeline /usr/share/logstash/config/pipelines.yml --pipeline-base /usr/share/logstash/ --testcase-dir /usr/share/logstash/tests/test1.yml --add-missing-id
EOT

EOF

CMD ["/bin/bash", "/usr/share/logstash/run_tests.sh"]
jgough commented 2 years ago

Further thoughts on this - really what I'm trying to test here is that my multiline codec works correctly. By splitting up into multiple tests I'm not really testing this in a way that makes logical sense to me (although it may indeed be equivalent)

breml commented 2 years ago

This might be related to #149. The generator input approach does not work well for cases with codecs as well as with inputs that have "special" characters (from the Logstash config format point of view). As mentioned in #149, we might need to replace the generator input with something else (e.g. the file input in read mode).

nicolasreich commented 2 years ago

I encountered a similar issue trying to test a configuration looking like this:

input {
  file {
    path => "/var/log/db_logs_logstash/*.csv"
    mode => "read"
    file_completed_action => "delete"
    id => "file"
  }
}

filter {
  csv {
      separator => ";"
      autodetect_column_names => true
      skip_empty_columns => true
      id => csv
  }
}

It worked by setting the input over two lines, i.e. looking like this:

---
ignore:
  - "@timestamp"
testcases:
  - input:
      - "TYP_INST;DBNAME;OS_USERNAME;USERNAME;USERHOST;TERMINAL;LOGON_TIME;LOGOFF_TIME;OS_PROCESS;CLIENT_ID"
      - "Prod;myBeautifulDB;myOS;myUsername;myUserhost;;22/08/2021 14:14:43;22/08/2021 14:14:43;4468;"
    expected:
      - message: "Prod;myBeautifulDB;myOS;myUsername;myUserhost;;22/08/2021 14:14:43;22/08/2021 14:14:43;4468;"
        LOGSTASH_DB_LOGS: "true"
        TYP_INST: "Prod"
        DBNAME: "myBeautifulDB"
        OS_USERNAME: "myOS"
        USERNAME: "myUsername"
        USERHOST: "myUserhost"
        LOGON_TIME: "22/08/2021 14:14:43"
        LOGOFF_TIME: "22/08/2021 14:14:43"
        OS_PROCESS: "4468"
input_plugin: file

So in the example from the initial post, it would be:

input_plugin: "input"
fields:
  test: true
ignore:
  - "@timestamp"
testcases:
  - input:
    - "This is message 1 and"
    - "  this is message 1 also"
    - "This is message 2 and"
    - "  this is message 2 also"
    expected:
      - message: "This is message 1 and\n  this is message 1 also"
        tags: ["multiline"]
        test: true
      - message: "This is message 2 and\n  this is message 2 also"
        tags: ["multiline"]
        test: true

Although I agree it is somewhat counterintuitive.

breml commented 2 years ago

x input lines can lead to y input events can lead to z output events

There is no direct relationship between:

The number of events emitted from Logstash (z) can be evaluated from the number of expected events. But there is currently not yet a good way to reliably determine the number of events, that will be generated from the input plugin (including the codec). This makes it hard to prepare the correct number of fields, that should be added to the event.

Maybe it is necessary to separate the logic to add the global fields (to the event) from the per test case fields. Additionally it might become necessary to create a separate input for each test case.

Related #142

breml commented 2 years ago

In #156 I currently have a version, which handles multiline codec better and also correctly maps the fields to the correct input by having a separate input plugin for each input.

The down-side of this approach is, that for a multiline input, always the following form needs to be used:

  - input:
      - |-
        TYP_INST;DBNAME;OS_USERNAME;USERNAME;USERHOST;TERMINAL;LOGON_TIME;LOGOFF_TIME;OS_PROCESS;CLIENT_ID
        Prod;myBeautifulDB;myOS;myUsername;myUserhost;;22/08/2021 14:14:43;22/08/2021 14:14:43;4468;

this form does no longer work (because the two lines are not processed by the same input and therefore not by the same codec):

  - input:
      - TYP_INST;DBNAME;OS_USERNAME;USERNAME;USERHOST;TERMINAL;LOGON_TIME;LOGOFF_TIME;OS_PROCESS;CLIENT_ID
      - Prod;myBeautifulDB;myOS;myUsername;myUserhost;;22/08/2021 14:14:43;22/08/2021 14:14:43;4468;

As mentioned in the previous comment, there are a lot of different combinations, that work against each other and it sort of looks like there is no solution, where all the combinations work.

jgough commented 2 years ago

The first format makes more logical sense to me, all those lines being part of one input. Does this mean that in my test above you would have to use:

    - |
      This is message 1 and
        this is message 1 also
    - |
      This is message 2 and
        this is message 2 also

For two inputs?