JSON parsing error when there is a single quote in a YAML multiline string

nicolasreich commented 2 years ago

Description

We make heavy use of the YAML multiline string format for the input of tests, to be able to define them in readable JSON.

However, with LFV2 (and the necessary adaptations), there are json errors when there's a single quote in the input, which wasn't the case with LFV1, and the test fails (example below).

Exemplification

Logstash config:

input {
  stdin {
    id => json_lines
    codec => json
  }
}

filter {
  mutate { 
    rename => { "before" => "after" }
    id => "mutate"
  }
}
output {
  stdout {
    id => "stdout"
  }
}

Test 1:

---
ignore:
  - "@timestamp"
  - host
input_plugin: "json_lines"
testcases:
  - input:
      - >
        {
          "before": "hello"
        }
    expected:
      - after: "hello"

Test 1 output:

☑ Comparing message 1 of 1 from mutate.test.yaml                                                                                                                                              

Summary: ☑ All tests: 1/1                                                                                                                                                                     
         ☑ mutate.test.yaml: 1/1

Test 2:

---
ignore:
  - "@timestamp"
  - host
input_plugin: "json_lines"
testcases:
  - input:
      - >
        {
          "before": "'hello'"
        }
    expected:
      - after: "hello"

(Same as Test1 with single quotes in the before field)

Test2 output:

Test input line "{\n  \"before\": \"'hello'\"\n}\n" contains unescaped double and single quotes, single quotes will be escaped automatically
☐ Comparing message 1 of 1 from mutate2.test.yaml:                                                                                                                                            
--- /tmp/292855641/mutate2.test.yaml/1/expected 2021-10-04 11:52:49.843940479 +0200                                                                                                           
+++ /tmp/292855641/mutate2.test.yaml/1/actual   2021-10-04 11:52:49.843940479 +0200                                                                                                           
@@ -1,3 +1,6 @@                                                                                                                                                                               
 {                                                                                                                                                                                            
-  "after": "hello"                                                                                                                                                                           
+  "message": "{\n  \"before\": \"\\'hello\\'\"\n}\n",                                                                                                                                        
+  "tags": [                                                                                                                                                                                  
+    "_jsonparsefailure"                                                                                                                                                                      
+  ]                                                                                                                                                                                          
 }                                                                                                                                                                                            

Summary: ☐ All tests: 0/1                                                                                                                                                                     
         ☐ mutate2.test.yaml: 0/1                                                                                                                                                             
logstash-filter-verifier: error: failed test cases

Benefits

The ability to test data containing single quotes.

Possible Drawbacks

It might be hard to implement cleanly. Maybe there is already a way to do this, but my tests are poorly designed.

breml commented 2 years ago

This one is related to https://github.com/elastic/logstash/issues/1645. The problem is, that back in the days, the escaping in the Logstash config format was not available. Since https://github.com/elastic/logstash/pull/7442, there is a way to enable proper escaping for the Logstash config format, but this is disabled by default for historical reasons and therefore this is also disabled for LFV. The problem is, that whenever we have a string with " or ' we need to decide, how to enclose the string (see the examples in the linked issue).

In LFV 1 this is not a problem, because the input is passed as literal string (via stdin or unix socket) and not via the Logstash configuration.

Would maybe the fields field in the test case configuration be an option for you in this case?

breml commented 2 years ago

I just learned, that the logstash file input does support multiple mode (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html#plugins-inputs-file-mode), one being read (instead of tail). This feature was introduced 3 years ago and did not exist when I checked last. So we might be able to fix this issue by replacing the generator input with a file input.

breml commented 2 years ago

I quickly check since when the above mentioned mode feature in the file input is present. May tests show, that it is present at least from version 6.7.2 as well as 7.0.0, so I guess we would be good to go for this approach.

That being said, I feel like the necessary refactoring would be pretty massive.

breml commented 2 years ago

I created a PR to fix this issue. While working on the PR, I realized a nasty detail about yaml.

- >
  {
    "before": "hello"
  }

translates to

[
  "{\n  \"before\": \"hello\"\n}\n"
]

while

- >
  {
  "before": "hello"
  }

translates to

[
  "{ \"before\": \"hello\" }\n"
]

The difference is subtle, in the first case, there are actually 3 lines of input while in the second case it is only one line (and therefore really json_lines.

I made it work for both cases, but this has cost me some more gray hairs.

marcus-crane commented 9 months ago

The difference is subtle, in the first case, there are actually 3 lines of input while in the second case it is only one line (and therefore really json_lines.

Thanks for pointing this out! I had been following the multi-line test case example in the README which indents fields only to get one event per field and it was driving me crazy 😅

magnusbaeck / logstash-filter-verifier