goccy / go-yaml

YAML support for the Go language
MIT License

Tokenizer does not handle double quoted strings well #455

Open nieomylnieja opened 2 months ago

nieomylnieja commented 2 months ago

Describe the bug

lexer.Tokenize() produces invalid tokens when given certain double quoted strings. The issue becomes apparent with double quoted JSON that contains a : character at certain positions. The faulty behaviour comes from this line: https://github.com/nobl9/go-yaml/blob/b2a8cc696a9efea74ec54695625870937ce24797/scanner/scanner.go#L837

if ctx.currentCharWithSkipWhitespace() == ':' {
    continue
}

The check passes because ctx.idx is not updated properly: after the double quoted string has been scanned, the index points at a different column than the actual end of the string, so the lookahead can see a : that is still inside the quoted text. This looks like a bug in how scanDoubleQuote updates ctx.idx, but I might be missing the reason why ctx.idx is not updated uniformly in that function.
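To make the suspected mechanism concrete, here is a minimal standalone sketch. It is not the library's code: scanCtx, its methods, and the countEscapes switch are hypothetical names made up for illustration, and the assumption that the index falls one short per escape sequence is only one possible way for ctx.idx to lag. It shows how a stale index makes a "is the next character a colon?" check fire on a colon inside the quoted string.

```go
package main

import "fmt"

// scanCtx is a hypothetical stand-in for the scanner context.
// It is NOT the library's type.
type scanCtx struct {
	src []rune
	idx int // position of the next unread character in src
}

// currentCharSkippingWhitespace returns the first non-space character at or
// after idx, or 0 if the input is exhausted.
func (c *scanCtx) currentCharSkippingWhitespace() rune {
	for i := c.idx; i < len(c.src); i++ {
		if c.src[i] != ' ' {
			return c.src[i]
		}
	}
	return 0
}

// scanDoubleQuote decodes the double quoted scalar starting at c.idx and
// returns its value. With countEscapes=false (the assumed bug), idx advances
// once per decoded character instead of once per source character, so every
// two-character escape such as \" leaves idx one position short.
func (c *scanCtx) scanDoubleQuote(countEscapes bool) string {
	start := c.idx
	value := []rune{}
	consumed := 1 // the opening quote
	i := start + 1
	for i < len(c.src) {
		ch := c.src[i]
		if ch == '\\' && i+1 < len(c.src) {
			value = append(value, c.src[i+1])
			i += 2
			if countEscapes {
				consumed += 2
			} else {
				consumed++
			}
			continue
		}
		i++
		consumed++
		if ch == '"' {
			break // closing quote
		}
		value = append(value, ch)
	}
	c.idx = start + consumed
	return string(value)
}

// looksLikeMappingKey mirrors the check quoted above: is the next
// significant character a ':'?
func looksLikeMappingKey(c *scanCtx) bool {
	return c.currentCharSkippingWhitespace() == ':'
}

func main() {
	const quoted = `"\"expression\": \"thi:\""`
	for _, countEscapes := range []bool{true, false} {
		c := &scanCtx{src: []rune(quoted)}
		value := c.scanDoubleQuote(countEscapes)
		fmt.Printf("countEscapes=%v value=%q idx=%d colonNext=%v\n",
			countEscapes, value, c.idx, looksLikeMappingKey(c))
	}
}
```

With the index advanced per source character, idx ends at 26 (past the closing quote) and the colon check is false. With the assumed drift, idx ends at 22, which is the : right after thi, so the check is true even though the decoded value is identical. In the full line that position is column 29, the same column as the stray String token in the dump; whether the real scanner drifts for this exact reason is not confirmed here.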

The faulty tokens dump looks as follows (the last token should not appear at all):

- &{Type:String CharacterType:Miscellaneous Indicator:NotIndicator Value:json Origin:json Position:[level:0,line:1,column:1,offset:1] Next:0xc0000ae050 Prev:<nil>}
- &{Type:MappingValue CharacterType:Indicator Indicator:BlockStructure Value:: Origin:: Position:[level:0,line:1,column:5,offset:5] Next:0xc0000ae0a0 Prev:0xc0000ae000}
- &{Type:DoubleQuote CharacterType:Indicator Indicator:QuotedScalar Value:"expression": "thi:" Origin: "\"expression\": \"thi:\"" Position:[level:0,line:1,column:7,offset:7] Next:0xc0000ae0f0 Prev:0xc0000ae050}
- &{Type:String CharacterType:Miscellaneous Indicator:NotIndicator Value::\"" Origin: "\"expression\": \"thi:\"":\"" Position:[level:0,line:1,column:29,offset:29] Next:<nil> Prev:0xc0000ae0a0}

To Reproduce

json: "\"expression\": \"thi:\""

Expected behavior

The tokenizer produces a correct set of tokens: the first three tokens from the dump above, without the trailing String token.

Version Variables