x/net/html: Tokenizer could track Line/Column of token

erinpentecost commented 5 years ago

What version of Go are you using (`go version`)?

$ go version
go version go1.12.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build083125343=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I wanted to get the line and column for the current Token, which forced me to fork the package. The key change is adding something like the following to the end of readByte() in token.go:

    // Increment the line and column tracker
    if x == '\n' {
        z.currentLine++
        z.currentColumn = 0
    } else {
        z.currentColumn++
    }

What did you expect to see?

I'd like to see a public method on Tokenizer that returns the starting and ending line/column of the current Token. The method could return a new struct with these four values.

What did you see instead?

There isn't a way to figure out where the token is in the input aside from byte offset. I could feed that byte offset into user code to determine where the line/column is, but then I'd have to parse the input twice and build up that lookup table first.

bradleypeabody commented 5 years ago

The golang.org/x/net/html package already has an option mechanism, an option could be added to track the position. @bcmills @bradfitz How about this as a design:

A ParseOptionPosition() ParseOption function is added.
A struct Position is added, with three fields: Line, Column, Offset int (Line and Column are numbered from 1 in the usual convention, Offset is a 0-based byte offset into the original input)
A Position Position field is added to both Token and Node.
When ParseOptionPosition is used with ParseWithOptions or ParseFragmentWithOptions it populates the Position field on Token and Node as applicable. The Position field is otherwise left blank if option not used.

I also have an another option I'd like to add which would be to disable the lower casing of element and attribute names. This could be added using a ParseOptionPreserveCase() function, no other type changes would be needed.

These changes seem fairly low impact. They would not break existing users of the package. And if not used would only add a relatively few bytes to the Token and Node structs. (And if space is a concern, the field could be defined as Position *Position to be smaller in memory for the case where it's not used) We'll need to see but performance impact should be fairly minimal when these features are enabled, and should be no different from existing code if not used.

I will probably end up putting together a prototype of this as I need this case preservation option (and the line numbers would definitely be nice) for some functionality being added to github.com/vugu/vugu - any feedback on the approach would be greatly appreciated, so we improve the possibility of getting the changes merged back in at some point.

nigeltao commented 5 years ago

I wanted to get the line and column for the current Token, which forced me to fork the package.

I'm not sure if a fork is necessary. The Tokenizer takes an io.Reader, and that io.Reader can build the mapping from byte offset to line:col numbers. If I understand https://github.com/golang/go/issues/34302 correctly, its LineCounter does exactly that.

golang / go