UTF-8 validation isn't applied to CBOR text strings nested inside an indefinite-length text string

benluddy commented 1 year ago

From https://www.rfc-editor.org/rfc/rfc8949.html#section-3.2.3:

If any definite-length text string inside an indefinite-length text string is invalid, the indefinite-length text string is invalid. Note that this implies that the UTF-8 bytes of a single Unicode code point (scalar value) cannot be spread between chunks: a new chunk of a text string can only be started at a code point boundary.

Currently, when ValidateUnicode is set, an indefinite-length text string is validated as UTF-8 only after all of its chunks have been concatenated, so a code point split across two chunks is accepted instead of rejected. I have a test that spreads one code point across two chunks here: https://github.com/benluddy/ugorji-go/commit/c38a86cde35370b6b00be0e15406e12593c95ee4, which fails with:

--- FAIL: TestCborIndefiniteLengthTextStringChunksAreUTF8 (0.00s)
    cbor_test.go:126: expected error but decoded to: "£"

ugorji commented 1 year ago

Fixed with f7f63a0a821cb85bc908002b89754aa954ed76ea

ugorji / go

UTF-8 validation isn't applied to CBOR text strings nested inside an indefinite-length text string #404