minio / simdjson-go

Golang port of simdjson: parsing gigabytes of JSON per second
Apache License 2.0
1.8k stars 85 forks source link

A premature end when walking over objects whose values are arrays? #83

Closed adamroyjones closed 1 year ago

adamroyjones commented 1 year ago

This is probably user error, but I'm bumping into issues with the following code of mine. (I'll describe the issues at the bottom.)

package main

import (
    "testing"

    simdjson "github.com/minio/simdjson-go"
    "github.com/stretchr/testify/require"
)

func censor(in string) (string, error) {
    pj, err := simdjson.Parse([]byte(in), nil)
    if err != nil {
        return "", err
    }

    root := pj.Iter()
    iter := root
TAPELOOP:
    for {
        switch iter.AdvanceInto().Type() {
        case simdjson.TypeNone:
            break TAPELOOP
        case simdjson.TypeString:
            val, err := iter.String()
            if err != nil {
                return "", err
            }
            if val[0] == 'a' {
                iter.SetString("x")
            }
        default:
            continue
        }
    }

    ret, err := root.MarshalJSON()
    if err != nil {
        return "", err
    }

    return string(ret), nil
}

func TestCensor(t *testing.T) {
    type testCase struct {
        name   string
        in     string
        expOut string
    }

    testCases := []testCase{
        {
            name:   "naked_string",
            in:     `"foo"`,
            expOut: `"foo"`,
        },
        {
            name:   "homogeneous_object,_no_match",
            in:     `{"foo": "bar", "baz": "qux"}`,
            expOut: `{"foo": "bar", "baz": "qux"}`,
        },
        {
            name:   "homogeneous_object,_match",
            in:     `{"afoo": "bar", "baz": "aqux"}`,
            expOut: `{"x": "bar", "baz": "x"}`,
        },
        {
            name:   "inhomogeneous_object,_no_match",
            in:     `{"foo": ["bar"], "baz": "qux"}`,
            expOut: `{"foo": ["bar"], "baz": "qux"}`,
        },
        {
            name:   "inhomogeneous_object,_match",
            in:     `{"aoo": ["bar"], "baz": "aqux"}`,
            expOut: `{"x": ["bar"], "baz": "x"}`,
        },
        {
            name:   "homogeneous_object,_array_values,_no_match",
            in:     `{"foo": ["bar"], "baz": ["qux"]}`,
            expOut: `{"foo": ["bar"], "baz": ["qux"]}`,
        },
        {
            name:   "homogeneous_object,_array_values,_match",
            in:     `{"afoo": ["bar"], "baz": ["aqux"]}`,
            expOut: `{"x": ["bar"], "baz": ["x"]}`,
        },
    }

    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            out, err := censor(tc.in)
            require.NoError(t, err)
            require.JSONEq(t, tc.expOut, out)
        })
    }
}

Running the tests above leads to:

--- FAIL: TestCensor (0.00s)
    --- FAIL: TestCensor/naked_string (0.00s)
        main_test.go:92:
                Error Trace:    /home/adamjones/tmp/simdjson-inhomogeneous/main_test.go:92
                Error:          Received unexpected error:
                                Failed to find all structural indices for stage 1
                Test:           TestCensor/naked_string
    --- FAIL: TestCensor/inhomogeneous_object,_match (0.00s)
        main_test.go:93:
                Error Trace:    /home/adamjones/tmp/simdjson-inhomogeneous/main_test.go:93
                Error:          Not equal:
                                expected: map[string]interface {}{"baz":"x", "x":[]interface {}{"bar"}}
                                actual  : map[string]interface {}{"baz":"aqux", "x":[]interface {}{"bar"}}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,3 +1,3 @@
                                 (map[string]interface {}) (len=2) {
                                - (string) (len=3) "baz": (string) (len=1) "x",
                                + (string) (len=3) "baz": (string) (len=4) "aqux",
                                  (string) (len=1) "x": ([]interface {}) (len=1) {
                Test:           TestCensor/inhomogeneous_object,_match
    --- FAIL: TestCensor/homogeneous_object,_array_values,_match (0.00s)
        main_test.go:93:
                Error Trace:    /home/adamjones/tmp/simdjson-inhomogeneous/main_test.go:93
                Error:          Not equal:
                                expected: map[string]interface {}{"baz":[]interface {}{"x"}, "x":[]interface {}{"bar"}}
                                actual  : map[string]interface {}{"baz":[]interface {}{"aqux"}, "x":[]interface {}{"bar"}}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -2,3 +2,3 @@
                                  (string) (len=3) "baz": ([]interface {}) (len=1) {
                                -  (string) (len=1) "x"
                                +  (string) (len=4) "aqux"
                                  },
                Test:           TestCensor/homogeneous_object,_array_values,_match

The intention with the example code is to walk the JSON and censor any matching strings it finds. The issues are twofold.

  1. My reading of RFC 8259 is that naked strings are legitimate JSON texts ("A JSON text is a serialized value. Note that certain previous specifications of JSON constrained a JSON text to be an object or an array. A JSON value MUST be an object, array, number, or string, or one of the following three literal names [...]"), but parsing fails with the first example. Is that right?
  2. The tape walker looks like it hits a premature end (simdjson.TypeNone) at the end of the array in the cases where the values of the object are arrays. Is that right? Have I done something stupid? (Is there a way to continue stepping?)

This may well be user error in both cases—does anyone have any ideas?

adamroyjones commented 1 year ago

(I'm sorry for the noise above and below—I made some mistakes which I've tried to tidy away.)

klauspost commented 1 year ago
  1. We do not have plans to support single value JSON.
  2. TypeNone is returned when an object or array ends at the current level.

So when you get TypeNone, the array ends and you must continue parsing the object. When using bare recursive iterators you will need to keep track of your parse depth.

I added an @ where a TypeNone will be returned: {"foo": ["bar"]@, "baz": {"qux": "bah"}@}@@

Or better, use Object().ForEach and Array() to parse arrays and objects.

Here is a simple fix:

func censor(in string) (string, error) {
    pj, err := simdjson.Parse([]byte(in), nil)
    if err != nil {
        return "", err
    }

    root := pj.Iter()
    iter := root
    var depth int
TAPELOOP:
    for {
        switch iter.AdvanceInto().Type() {
        case simdjson.TypeNone:
            if depth == 0 {
                break TAPELOOP
            }
            depth--
        case simdjson.TypeArray, simdjson.TypeObject:
            depth++
        case simdjson.TypeString:
            val, err := iter.String()
            if err != nil {
                return "", err
            }
            if val[0] == 'a' {
                iter.SetString("x")
            }
        default:
            continue
        }
    }

    ret, err := root.MarshalJSON()
    if err != nil {
        return "", err
    }

    return string(ret), nil
}
adamroyjones commented 1 year ago
  1. I understand.
  2. That's enormously helpful—I was being totally dim-witted for not seeing that. Thanks for your help and sorry for wasting your time!
klauspost commented 1 year ago

@adamroyjones Using bare iterators is not easy. That is why there are abstractions for handling objects and arrays.

When using "AdvanceInto" you are basically traversing as it is literally written in the JSON.