golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.13k stars 17.69k forks source link

x/net/html: Tokenizer could track Line/Column of token #31312

Open erinpentecost opened 5 years ago

erinpentecost commented 5 years ago

What version of Go are you using (go version)?

$ go version
go version go1.12.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build083125343=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I wanted to get the line and column for the current Token, which forced me to fork the package. The key change is adding something like the following to the end of readByte() in token.go:

    // Increment the line and column tracker
    if x == '\n' {
        z.currentLine++
        z.currentColumn = 0
    } else {
        z.currentColumn++
    }

What did you expect to see?

I'd like to see a public method on Tokenizer that returns the starting and ending line/column of the current Token. The method could return a new struct with these four values.

What did you see instead?

There isn't a way to figure out where the token is in the input aside from byte offset. I could feed that byte offset into user code to determine where the line/column is, but then I'd have to parse the input twice and build up that lookup table first.

bradleypeabody commented 5 years ago

The golang.org/x/net/html package already has an option mechanism, an option could be added to track the position. @bcmills @bradfitz How about this as a design:

I also have an another option I'd like to add which would be to disable the lower casing of element and attribute names. This could be added using a ParseOptionPreserveCase() function, no other type changes would be needed.

These changes seem fairly low impact. They would not break existing users of the package. And if not used would only add a relatively few bytes to the Token and Node structs. (And if space is a concern, the field could be defined as Position *Position to be smaller in memory for the case where it's not used) We'll need to see but performance impact should be fairly minimal when these features are enabled, and should be no different from existing code if not used.

I will probably end up putting together a prototype of this as I need this case preservation option (and the line numbers would definitely be nice) for some functionality being added to github.com/vugu/vugu - any feedback on the approach would be greatly appreciated, so we improve the possibility of getting the changes merged back in at some point.

nigeltao commented 5 years ago

I wanted to get the line and column for the current Token, which forced me to fork the package.

I'm not sure if a fork is necessary. The Tokenizer takes an io.Reader, and that io.Reader can build the mapping from byte offset to line:col numbers. If I understand https://github.com/golang/go/issues/34302 correctly, its LineCounter does exactly that.