ugorji / go

idiomatic codec and rpc lib for msgpack, cbor, json, etc. msgpack.org[Go]
MIT License
1.83k stars 294 forks source link

Suspected data corruption issue in v1.2.3 and v1.2.4 #357

Closed jim-minter closed 3 years ago

jim-minter commented 3 years ago

Hi @ugorji I suspect a race condition which is causing us data corruption with concurrent JSON decodes/encodes of a type implementing codec.InterfaceExt . I think v1.2.3 and v1.2.4 may have the bug; v1.2.2 and master may not.

This is the test case I'm working on at the moment - it is not minimal at this stage, but I hope the data race is clear enough for you to see and understand the issue. The ugorji/go version used in this report is v1.2.4.

package main

import (
    "context"
    "encoding/json"
    "fmt"
    "reflect"
    "testing"

    "github.com/ugorji/go/codec"

    "github.com/Azure/ARO-RP/pkg/api"
    "github.com/Azure/ARO-RP/pkg/database"
    "github.com/Azure/ARO-RP/pkg/util/encryption"
)

func f(i int, h *codec.JsonHandle) error {
    root := &api.OpenShiftClusterDocument{
        OpenShiftCluster: &api.OpenShiftCluster{
            Properties: api.OpenShiftClusterProperties{
                AROServiceKubeconfig: api.SecureBytes(fmt.Sprintf("%d", i)),
                CreatedBy:            fmt.Sprintf("%d", i),
            },
        },
    }

    var wantDoc *api.OpenShiftClusterDocument
    for {
        var b []byte
        e := codec.NewEncoderBytes(&b, h)
        err := e.Encode(root)
        if err != nil {
            return err
        }

        var doc *api.OpenShiftClusterDocument
        d := codec.NewDecoderBytes(b, h)
        err = d.Decode(&doc)
        if err != nil {
            return err
        }

        if wantDoc == nil {
            wantDoc = doc
        } else {
            if !reflect.DeepEqual(wantDoc, doc) {
                wantDocB, _ := json.Marshal(wantDoc)
                docB, _ := json.Marshal(doc)
                return fmt.Errorf("%d\nwant:\n%s\n%s\ngot:\n%s\n%s",
                    i,
                    string(wantDoc.OpenShiftCluster.Properties.AROServiceKubeconfig),
                    string(wantDocB),
                    string(doc.OpenShiftCluster.Properties.AROServiceKubeconfig),
                    string(docB),
                )
            }
        }
    }
}

func TestDB(t *testing.T) {
    aead, err := encryption.NewXChaCha20Poly1305(context.Background(), make([]byte, 32))
    if err != nil {
        t.Fatal(err)
    }

    h, err := database.NewJSONHandle(aead)
    if err != nil {
        t.Fatal(err)
    }

    errch := make(chan error)
    for i := 0; i < 2; i++ {
        go func(i int) {
            errch <- f(i, h)
        }(i)
    }

    t.Fatal(<-errch)
}

Here is the output with v1.2.4:

$ go test -race ./test/
==================
WARNING: DATA RACE
Write at 0x000002a0c620 by goroutine 10:
  github.com/ugorji/go/codec.rvZeroAddrTransientAnyK()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/helper_unsafe.go:284 +0xf1
  github.com/ugorji/go/codec.rvZeroAddrTransientK()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/helper_unsafe.go:293 +0x4cd
  github.com/ugorji/go/codec.(*Decoder).interfaceExtConvertAndDecode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:2017 +0x424
  github.com/ugorji/go/codec.(*jsonDecDriver).DecodeExt()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/json.go:995 +0x224
  github.com/ugorji/go/codec.(*Decoder).ext()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:322 +0x1be
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1776 +0x37d
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).kStruct()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:653 +0x107c
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1783 +0x4d9
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).kStruct()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:653 +0x107c
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1783 +0x4d9
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).kStruct()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:653 +0x107c
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1783 +0x4d9
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).decode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1732 +0x1dc
  github.com/ugorji/go/codec.(*Decoder).MustDecode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1544 +0xf4
  github.com/ugorji/go/codec.(*Decoder).Decode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1529 +0xa4
  github.com/Azure/ARO-RP/test.f()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/test/test_test.go:37 +0x4ab
  github.com/Azure/ARO-RP/test.TestDB.func1()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/test/test_test.go:74 +0x44

Previous write at 0x000002a0c620 by goroutine 11:
  github.com/ugorji/go/codec.rvZeroAddrTransientAnyK()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/helper_unsafe.go:284 +0xf1
  github.com/ugorji/go/codec.rvZeroAddrTransientK()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/helper_unsafe.go:293 +0x4cd
  github.com/ugorji/go/codec.(*Decoder).interfaceExtConvertAndDecode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:2017 +0x424
  github.com/ugorji/go/codec.(*jsonDecDriver).DecodeExt()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/json.go:995 +0x224
  github.com/ugorji/go/codec.(*Decoder).ext()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:322 +0x1be
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1776 +0x37d
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).kStruct()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:653 +0x107c
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1783 +0x4d9
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).kStruct()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:653 +0x107c
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1783 +0x4d9
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).kStruct()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:653 +0x107c
  github.com/ugorji/go/codec.(*Decoder).decodeValueNoCheckNil()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1783 +0x4d9
  github.com/ugorji/go/codec.(*Decoder).decodeValue()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1750 +0xc9
  github.com/ugorji/go/codec.(*Decoder).decode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1732 +0x1dc
  github.com/ugorji/go/codec.(*Decoder).MustDecode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1544 +0xf4
  github.com/ugorji/go/codec.(*Decoder).Decode()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/vendor/github.com/ugorji/go/codec/decode.go:1529 +0xa4
  github.com/Azure/ARO-RP/test.f()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/test/test_test.go:37 +0x4ab
  github.com/Azure/ARO-RP/test.TestDB.func1()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/test/test_test.go:74 +0x44

Goroutine 10 (running) created at:
  github.com/Azure/ARO-RP/test.TestDB()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/test/test_test.go:73 +0x292
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:1123 +0x202

Goroutine 11 (running) created at:
  github.com/Azure/ARO-RP/test.TestDB()
      /data/home/jminter/go/src/github.com/Azure/ARO-RP/test/test_test.go:73 +0x292
  testing.tRunner()
      /usr/lib/golang/src/testing/testing.go:1123 +0x202
==================
--- FAIL: TestDB (0.02s)
    test_test.go:78: 0
        want:
        0
        {"openShiftCluster":{"properties":{"createdAt":"0001-01-01T00:00:00Z","createdBy":"0","clusterProfile":{},"consoleProfile":{},"servicePrincipalProfile":{},"networkProfile":{},"masterProfile":{},"apiserverProfile":{},"aroServiceKubeconfig":"MA=="}}}
        got:
        1
        {"openShiftCluster":{"properties":{"createdAt":"0001-01-01T00:00:00Z","createdBy":"0","clusterProfile":{},"consoleProfile":{},"servicePrincipalProfile":{},"networkProfile":{},"masterProfile":{},"apiserverProfile":{},"aroServiceKubeconfig":"MQ=="}}}
    testing.go:1038: race detected during execution of test
FAIL
FAIL    github.com/Azure/ARO-RP/test    0.076s
FAIL

I think it would be good to consider adding more test coverage to prevent a regression, creating a v1.2.5 release and perhaps marking v1.2.3 and v1.2.4 as bad.

jim-minter commented 3 years ago

Yikes, using master (v1.2.5-0.20210320190651-a2bb12368408) if I increase the concurrency of the above test from 2 to 100, I see:

runtime: pointer 0xc0006c0880 to unused region of span span.base()=0xc00085a000 span.limit=0xc00085c000 span.state=1
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

I don't seem to see this with v1.2.2, or with master and -tags safe

jim-minter commented 3 years ago

Hmm, I think I've got myself confused with the precise versions that work and don't. Here is a simplified test with no dependencies; this seems to pass here with v1.2.2 and v1.2.3, but not v1.2.4 or master. When n = 1 (no concurrency), it passes on v1.2.4 and master.

package ugorji

import (
    "bytes"
    "encoding/base64"
    "fmt"
    "reflect"
    "sync/atomic"
    "testing"
    "time"

    "github.com/ugorji/go/codec"
)

type secureBytes []byte

type secureBytesExt struct{}

func (secureBytesExt) ConvertExt(v interface{}) interface{} {
    // encryption code removed

    return base64.StdEncoding.EncodeToString(v.(secureBytes))
}

func (secureBytesExt) UpdateExt(dest interface{}, v interface{}) {
    // decryption code removed

    b, err := base64.StdEncoding.DecodeString(v.(string))
    if err != nil {
        panic(err)
    }

    *dest.(*secureBytes) = b
}

var done int32

func testExtensionsConcurrency(i int, h *codec.JsonHandle) error {
    want := secureBytes(fmt.Sprintf("%d", i))

    for {
        var b []byte
        e := codec.NewEncoderBytes(&b, h)
        err := e.Encode(want)
        if err != nil {
            return err
        }

        var got secureBytes
        d := codec.NewDecoderBytes(b, h)
        err = d.Decode(&got)
        if err != nil {
            return err
        }

        if !bytes.Equal(want, got) {
            return fmt.Errorf("%d: want: %s, got: %s", i, string(want), string(got))
        }

        // reading from a closed channel perturbs the scheduling too much and
        // masks unsafe errors we're currently seeing in ugorji/go master
        if atomic.LoadInt32(&done) != 0 {
            break
        }
    }

    return nil
}

func TestExtensionsConcurrency(t *testing.T) {
    const n = 100

    h := &codec.JsonHandle{
        BasicHandle: codec.BasicHandle{
            DecodeOptions: codec.DecodeOptions{
                ErrorIfNoField: true,
            },
        },
    }

    err := h.SetInterfaceExt(reflect.TypeOf(secureBytes{}), 1, secureBytesExt{})
    if err != nil {
        t.Fatal(err)
    }

    errch := make(chan error, n)
    for i := 0; i < n; i++ {
        go func(i int) {
            errch <- testExtensionsConcurrency(i, h)
        }(i)
    }

    time.AfterFunc(5*time.Second, func() {
        atomic.StoreInt32(&done, 1)
    })

    for i := 0; i < n; i++ {
        err = <-errch
        if err != nil {
            t.Fatal(err)
        }
    }
}
ugorji commented 3 years ago

Bad bug. Great catch. And thanks for the excellent test case that I could quickly use to reproduce the issue.

Fix coming soon, and I will tag this with need test so I remember to put in a test case.

ugorji commented 3 years ago

@jim-minter please let me know that it resolves the problem on your end, and I will cut a new release this weekend.

ugorji commented 3 years ago

BTW the race issue you found was fixed by: 76603559030f291b11839a54bae70045f07d583f

jim-minter commented 3 years ago

Hi @ugorji, using 11d01daad36cb5b406765bc9936bd96e3e18cb31 I am no longer able to trigger the bug, the race detector or the "found bad pointer in Go heap" error, so this lgtm. Thanks for the quick turnaround!

Looking at it, I'm a little wary of the unsafe code now, and I'm wondering whether we might start building with -tags codec.safe. I think that less performance and more safety is probably a better tradeoff for what we're doing. Regardless of that statement, thanks so much for this project! Among other things, MissingFielder is priceless.

ugorji commented 3 years ago

Looking at it, I'm a little wary of the unsafe code now, and I'm wondering whether we might start building with -tags codec.safe. I think that less performance and more safety is probably a better tradeoff for what we're doing. Regardless of that statement, thanks so much for this project! Among other things, MissingFielder is priceless.

Treating "transient" values is the only place I know that might interact negatively with the GC. The reason is that we were using the same memory space for things that looked like pointers and things that didn't, leading GC to track an allocated memory space as a pointer, and barf when it saw something else there.

The changes fix this: we now only use transient memory space for numbers/bool and struct/array with no internal pointers, and use string/slice shaped value for the string/slice case.

The main benefit is that it can dramatically reduce allocations, which would have been stack-allocated in non-reflection mode.

My advice: you know your risk tolerance, but please keep using it and help flesh out any issues. I don't anticipate any more, but best to find them all.

Thanks.

ugorji commented 3 years ago

@jim-minter FYI I cut v1.2.5 today - see https://github.com/ugorji/go/releases/tag/v1.2.5 and https://pkg.go.dev/github.com/ugorji/go/codec