encoding/json: add line number to SyntaxError

AlexanderYastrebov commented 3 years ago

What version of Go are you using (`go version`)?

$ go version
go version go1.15.6 linux/amd64

Does this issue reproduce with the latest release?

Yes

What did you do?

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "os"
)

func main() {
    dec := json.NewDecoder(os.Stdin)
    for {
        var m map[string]interface{}
        if err := dec.Decode(&m); err == io.EOF {
            break
        } else if err != nil {
            fmt.Printf("%#v\n", err)
            break
        }
    }
}

$ echo -e '{"x":\ny}' | go run jsonline.go 
&json.SyntaxError{msg:"invalid character 'y' looking for beginning of value", Offset:7}

At the moment SyntaxError provides an offset of the character that caused parsing error. The proposal is add the line number as well.

I could not find such an issue therefore opened this one, please close if it was already resolved.

mvdan commented 3 years ago

Could you expand a bit on why the line number would help? Naively, an offset is enough to work out the line number if you were parsing a chunk of bytes, a buffer, or a file. Just count the number of newlines up to the offset.

If you were parsing a stream of data, then I'm not convinced that a line number would actually help anyone. A line number is usually helpful for a human to go look at the input in the right place, but that is not possible if the stream isn't saved somewhere.

Keeping track of extra information like line numbers also has an inherent cost, so I want to make sure the arguments in favor are solid.

AlexanderYastrebov commented 3 years ago

I agree, it should be possible to count lines having input at hand and without input it does not makes much sense. I thought about it as of convenience feature but (after I've attempted to implement it) I am not sure its worth the cost of implementing/maintaining.

The proposal is coming from the usecase where parsing module does not have input at hand but bubbles up the error https://github.com/kubernetes/kubernetes/issues/97651

mvdan commented 3 years ago

encoding/json's syntax errors aren't very end-user-friendly, so it's entirely expected that a user-facing piece of software would wrap them or decorate them to better show the user what's going on. Calculating the line number if there is a syntax error should take just a couple of lines of code, so I agree it's not worth adding straight to SyntaxError.

networkimprov commented 3 years ago

It would be nice to have a type LineCountReader in io. It's easy to write yourself, but seems like a common utility.

AlexanderYastrebov commented 3 years ago

@networkimprov As far as I can tell line counter would not help in this case due to buffering around here https://github.com/golang/go/blob/3e1e13ce6d1271f49f3d8ee359689145a6995bad/src/encoding/json/stream.go#L146-L169 i.e. the parsing error might not necessary happen after last new line read.

mvdan commented 3 years ago

Indeed, most parsers/decoders read in big chunks, because doing one read for every single byte consumed would be incredibly expensive. So a io.Reader layer like @networkimprov suggests won't work.

mvdan commented 3 years ago

I don't think this needs the attention from the proposal review committee. They already have a backlog of hundreds of proposals as it is. We can make this decision amongst the encoding/json maintainers; CC @dsnet @rsc @bradfitz.

My opinion is still that we should reject this idea. If anyone has thoughts on a helper API to "prettify" json errors for end users, perhaps we could consider that, but one could always write that in a few lines of simple Go code.

networkimprov commented 3 years ago

Well two possible APIs are

dc := json.NewDecoder(fd)

dc.SetOpts(&json.DecodeOpts{...}) // applies to all Decode()s
err := dc.Decode(&v)

err := dc.DecodeOpts(&v, &json.DecodeOpts{...})

I'd use this, as my apps import long (hundreds of lines) test sequences from JSON files.

AlexanderYastrebov commented 3 years ago

Line counting without input at hand may be implemented like:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "os"
    "sort"
)

type (
    lineCounter struct {
        r       io.Reader
        d       *json.Decoder
        bytes   int64
        lines   int
        offsets []offset
    }

    offset struct {
        index  int
        offset int64
    }
)

func (c *lineCounter) Read(data []byte) (n int, err error) {
    // remove newline offsets less than input offset
    i := c.search(c.d.InputOffset())
    c.offsets = c.offsets[i:]

    n, err = c.r.Read(data)
    for i, b := range data[:n] {
        if b == '\n' {
            c.offsets = append(c.offsets, offset{c.lines, c.bytes + int64(i)})
            c.lines++
        }
    }
    c.bytes += int64(n)
    return
}

func (c *lineCounter) newlinesBefore(offset int64) int {
    if len(c.offsets) == 0 {
        return c.lines
    }
    i := c.search(offset)
    if i == len(c.offsets) {
        return c.lines
    }
    return c.offsets[i].index
}

func (c *lineCounter) search(offset int64) int {
    return sort.Search(len(c.offsets), func(i int) bool { return c.offsets[i].offset >= offset })
}

func main() {
    c := &lineCounter{r: os.Stdin}
    dec := json.NewDecoder(c)
    c.d = dec

    for {
        var m map[string]interface{}
        if err := dec.Decode(&m); err == io.EOF {
            break
        } else if se, ok := err.(*json.SyntaxError); ok {
            fmt.Printf("%#v\n", err)
            //fmt.Printf("%#v\n", c)
            fmt.Printf("Line: %d\n", c.newlinesBefore(se.Offset)+1)
            break
        }
    }
}

Not trivial but does not require Decoder change.

networkimprov commented 3 years ago

I updated my previous comment to suggest encoding/json type DecodeOpts struct {...}

dsnet commented 3 years ago

Line counting (given the entire input buffer as jsonData) from the offset is as simple as:

line := 1 + strings.Count(jsonData[:offset], "\n")
column := 1 + offset - (strings.LastIndex(jsonData[:offset], "\n") + len("\n"))

Given it's simplicity, it calls into question whether we must provide this feature. The main advantage of having encoding/json handling it is that it would work for io.Reader inputs as well, but it doesn't seem that difficult to copy the entire input to a []byte so that you have access to the original input after the fact.

Stepping back, I personally, have no problem with the concept of adding line:column information and even explored an implementation last month where that's provided [1]. However, as @mvdan mentioned earlier, the issue is the performance cost it imparts onto the Decoder implementation. It's one thing to advance through all whitespace characters, it's another to also count the number of newlines that occurs. I don't remember the numbers, but there was a non-trivial performance cost for the common case where the line:column information is not even used.

dc.CountLines(true)

I don't think another API endpoint just for more verbose errors justifies its addition. Also, it's not clear to me that this fully avoids any performance slow down. I've been regularly surprised by how simply checking a flag for whether some branch should be taken itself incurs a performance slow down.

Given the strong desire that we avoid slowing down decoding in the common case, I propose two possible approaches: 1) We add an example (or helper function) with the snippet above that shows how easy it is to manually compute line:column information. 2) Provide line and column information in SyntaxErrors only if the entire []byte input is provided since it's trivial to compute this after the fact. However, we don't do it for inputs reading from an io.Reader since we want to avoid buffering the entire input. However, I'm not entirely fond of the inconsistent behavior between []byte and io.Reader inputs.

[1] This is in reference to an entirely new Decoder implementation that is an order of magnitude faster than the current one. See my comment elsewhere. Today's implementation is slow enough that the cost of line-counting is probably overshadowed by other inefficiencies. However, the existence of other inefficiencies today should not be justification for adding a performance cost that can never be removed since it leaks to the API.

networkimprov commented 3 years ago

When reading a file or network message, you almost always use json.Decoder, not json.Unmarshal. It would be crazy in many cases to save the whole stream to a buffer in case of error.

I don't think a performance cost is unreasonable when the feature is triggered by an API. When you really can't afford the cost, you simply don't incur it. Note that I suggested an extensible API which can accommodate other decoding options. Maybe accept_comments? :-)

dsnet commented 3 years ago

When you really can't afford the cost, you simply don't incur it.

That's not quite true. We still need to check a flag for whether the feature is enabled or not and checking the flag itself is a non-zero cost.

Note that I suggested an extensible API which can accommodate other decoding options.

This is outside the scope of this specific issue and has been explored quite a bit in other issues related to encoding/json. It's likely that the json API will evolve towards an options struct, but there's a lot of other issues that need to be considered simultaneously.

networkimprov commented 3 years ago

Worst case, you check the flag once per Decode(), to select a parsing function, no?

Is there an issue proposing a decoding options type? If not, this could be turned into one.

What other decoding options have been requested to date?

mvdan commented 3 years ago

We add an example (or helper function) with the snippet above that shows how easy it is to manually compute line:column information.

I agree with this solution, if we want a solution.

What other decoding options have been requested to date?

https://github.com/golang/go/issues?q=is%3Aissue+is%3Aopen+encoding+json+in%3Atitle+label%3AProposal is a good start.

Is there an issue proposing a decoding options type? If not, this could be turned into one.

Personally, I don't think a GitHub thread is a good place to bikeshed rethinking the encoding/json options API. I think this thread should stay on topic.

networkimprov commented 3 years ago

The only solution suitable for long json streams is the one Alexander suggested here: https://github.com/golang/go/issues/43513#issuecomment-755285465. It's not very easy.

Perhaps we should tag this as ProposalHold; to be considered when a DecodeOptions type emerges.

dsnet commented 3 years ago

The only solution suitable for long json streams

Providing line:column information for large inputs is a weak use case.

The primary purpose of line:column position information is for human consumption. By nature, line:column information is a poor mechanism for programmatically identifying the location of an error since it fundamentally requires O(n) processing to count the number of preceding lines. In contrast, a byte offset is always O(1) since it identifies exactly where in the input a problem occurred and it's suitable for machine consumption as you can always accurately convert it to a line:column position. Thus, line:column information is generally only useful for inputs that are only a few KiBs in size. Even many text editors today start choking when trying to open a text file that are a few MiB in size.

Worst case, you check the flag once per Decode(), to select a parsing function, no?

That technique assumes that there is a specialized parsing function for each possible combination of features. Since combinations are fundamentally O(n!), this incurs significant implementation complexity (and associated binary bloat). As someone who would potentially be responsible for maintaining encoding/json for the long-run, I'm fairly opposed to over-specialization.

Perhaps we should tag this as ProposalHold; to be considered when a DecodeOptions type emerges.

I don't believe that is necessary. It presumes that an API just to opt into logic for line-number tabulation is the direction we want to head. Personally, I don't think an option just to opt-into line:column tabulation is justified.

jimsmart commented 1 year ago

I think it would be good to update the json.Decoder to feature an InputPos method — to bring it in line with both xml.Decoder.InputPos and csv.Reader.FieldPos

It's easy to argue that OffsetPos is enough, because JSON is generally streamed and not read from a file. But I believe that ignores an important common usage pattern here...

— The first thing one often does when faced with an issue when decoding a large complex JSON stream that fails somewhere in the middle, is to capture/save it to disk, as a test/debug case, and then open the file in a text editor (because these are human-readable files, by design), and go and take a look at the exact place where it has failed, to see what's what, and understand what it is that one's JSON-based decoder has fallen over on (whether it is the decoder or whatever, perhaps external tool, is creating the JSON).

IMO not having the decoder be able to return line+column info simply adds friction to working with complex JSON in Go.

All I want to be able to do, is easily find out where exactly my JSON stream decoder is failing, i.e. what incoming tag combination it is failing on (that it doesn't handle or handles incorrectly). And not having a simple API call to do this, when other decoders (XML and CSV) already feature this, seems like a painful oversight. Particularly when the main argument against it is performance, it's not a fast parser anyhow — particularly in the light that most folk are relying on built-in reflection to unmarshal / decode stuff. Personally, I'd rather take the additional minor performance hit parsing every JSON file, than have to jump through additional hoops when it comes to debugging some problem that should be much simpler to debug than it currently is.

Please bring json.Decoder inline with the other text-based decoders here — if adding such functionality to those decoders (ostensibly only for debugging / problem solving), at the cost of some performance was acceptable there, then it should also be perfectly acceptable here.

Related: https://github.com/golang/go/issues/45628 — where @rsc says:

I'm skeptical that the counting fundamentally can't be done efficiently even in a faster JSON. But that's not the issue today.

golang / go