redpanda-data / connect

Fancy stream processing made operationally mundane
https://docs.redpanda.com/redpanda-connect/about/

bug: key_values() order is nondeterministic #1391

Open gena01 opened 2 years ago

gena01 commented 2 years ago

While putting together unit tests, I ran into an issue where key_values() doesn't always return its entries in the same order, which makes the unit tests fail randomly.

input:
  label: file
  file:
    paths:
      - ./input1.json
      - ./input2.json

pipeline:
  processors:
    - label: rewrite_message
      bloblang: |
          root = this.locations.map_each(line -> {
            "state": line.state,
            "location": {
              "id": line.id,
              "name": line.name
            }
          })
    - label: split
      unarchive:
        format: json_array

output:
  broker:
    pattern: fan_out
    outputs:
    - stdout:
        codec: lines

    batching:
      byte_size: 1024
      period: 100ms

      processors:
      - label: join
        archive:
          format: json_array

      - label: merge_state_groups
        bloblang: |
          root = {
            "time": now().format_timestamp(tz: "UTC"),
            "data": this.map_each(msg -> {msg.state: [msg.location]}).squash().
                               key_values().map_each(kv -> {"state":kv.key,"locations":kv.value})
          }

And here is the unit test:

tests:
  - name: output processor test
    target_processors: /output/broker/batching/processors
    input_batch:
      - json_content: |
          {"location":{"id":1,"name":"New York"},"state":"NY"}
      - json_content: |
          {"location":{"id":2,"name":"Bellevue"},"state":"WA"}
      - json_content: |
          {"location":{"id":3,"name":"Olympia"},"state":"WA"}
      - json_content: |
          {"location":{"id":4,"name":"Seattle"},"state":"WA"}

    output_batches:
      - - json_contains: |
            {
            "data": [
                    {
                        "locations": [
                            {
                                "id": 1,
                                "name": "New York"
                            }
                        ],
                        "state": "NY"
                    },
                    {
                        "locations": [
                            {
                                "id": 2,
                                "name": "Bellevue"
                            },
                            {
                                "id": 3,
                                "name": "Olympia"
                            },
                            {
                                "id": 4,
                                "name": "Seattle"
                            }
                        ],
                        "state": "WA"
                    }
                ]
            }

mihaitodor commented 2 years ago

Thanks for raising this! I believe json_contains should take an extra optional parameter which allows users to instruct it to ignore element order in arrays.
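
Until something like that exists, one possible workaround is to make the mapping's own output deterministic by sorting the pairs returned by key_values() before reshaping them. The fragment below is a minimal sketch of the merge_state_groups processor from the report with a sort_by(kv -> kv.key) step added. The `grouped` variable is introduced here only for readability and is not part of the original config, and the sketch has not been run against the test above.

```yaml
      - label: merge_state_groups
        bloblang: |
          # Group locations by state, as in the original mapping.
          let grouped = this.map_each(msg -> {msg.state: [msg.location]}).squash()

          root.time = now().format_timestamp(tz: "UTC")

          # key_values() makes no ordering guarantee, so sort the pairs
          # by key before turning them into the final array.
          root.data = $grouped.key_values().sort_by(kv -> kv.key).map_each(kv -> {"state": kv.key, "locations": kv.value})
```

With the pairs sorted by key, "NY" should come before "WA", which is the order the json_contains expectation in the test assumes. This only pins the order of the state groups; it does not change how locations are ordered within each group.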