golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

proposal: x/net/html: add offset & line tracking, case preservation #34302

Open bradleypeabody opened 5 years ago

bradleypeabody commented 5 years ago

After looking at https://github.com/golang/go/issues/31312 closer and trying some things out, I wanted to open a new issue with a more specific and succinct proposal:

Features

At a high level, the desired features are:

  1. Line and column tracking (actually offset tracking, see below)
  2. A way to get at the original element and attribute name text before lower casing.

Motivation

The motivation for 1 is cases where a caller needs to know the context of a token or node so it can be reported to the user: indicating the position of an error, showing where in the source HTML a particular element comes from, and so on.

The motivation for 2 is situations where the caller simply needs to know what was originally there before lower-casing. It could be useful in error reporting, for example, to show the original element name. (My specific case is code generation based on HTML: allowing mixed case in a tag name lets the user specify the exact name of a Go struct in CamelCase. The documents in question are mostly standard HTML, but these elements are of course treated differently.)

Observations

Changes

Here's a summary of how this could work (how it works in the prototype):

Working Prototype

The working code with these modifications is here: https://github.com/vugu/html .

The main commit that touches the internal stuff is: https://github.com/vugu/html/commit/da33d265c2a8fa501665a122efbf1441162a6b3b

Some basic tests are in there; I'm sure more could be done. (I also haven't looked at the code that writes out HTML to see how the originally-cased data could be used there, or whether that would be optional.) I also understand if some of this needs to be split up - this proposal lumps a few different things together.

I believe these features could be generally useful and so I wanted to see if this is something that could potentially end up being merged, and if so discuss what's required to get to that point.

toothrot commented 5 years ago

/cc @nigeltao @namusyaka

nigeltao commented 5 years ago

The short answer is that I'm not enthusiastic.

Re Token offsets, you might already be able to report the line/column of a token, for error tracking: https://github.com/golang/go/issues/28343#issuecomment-455388101

Node offsets (as opposed to Token offsets) are a trickier concept. A large part of the HTML5 parsing algorithm is fixing up HTML that wouldn't be well-formed XML, such as dealing with improperly nested tags. This can create nodes that have no corresponding token. What offset should we give those? Also, some nodes are formed from multiple tokens: e.g. separate start and end tags. Will Node.Offset be sufficient, or would you also need Node.StartOffset and Node.EndOffset?

Re OrigData, what's the garbage cost of all the extra strings? In any case, HTML is case insensitive. If you're trying to use HTML in a case sensitive way, then you're not really using HTML in an HTML5 way, and this package is about HTML as specified by HTML5. If you really must have case sensitive names, then you could possibly have a separate processing step to find all the <fooBar> tags and build a map from "foobar" to "fooBar", without requiring any changes to the html package. It's also not clear, when you say "code generation based on HTML", whether that HTML is hand-written or tool-written. If it's tool-written, the tool could emit additional things (e.g. tag attributes), without having to fight HTML5's case insensitivity.
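
For concreteness, a minimal sketch of such a pre-processing pass using only the existing tokenizer API (the caseMap helper and its parsing of z.Raw() are assumptions for illustration, not part of the package):

// Package casemap is an illustrative sketch: it maps lower-cased tag names
// back to their original spelling, recovered from z.Raw() before anything
// lower-cases the tokenizer's buffer.
package casemap

import (
    "strings"

    "golang.org/x/net/html"
)

func caseMap(src string) map[string]string {
    m := make(map[string]string)
    z := html.NewTokenizer(strings.NewReader(src))
    for {
        switch z.Next() {
        case html.ErrorToken:
            return m
        case html.StartTagToken, html.SelfClosingTagToken:
            raw := string(z.Raw()) // e.g. `<fooBar baz="qux">`, case intact
            name := strings.TrimPrefix(raw, "<")
            if i := strings.IndexAny(name, " \t\r\n/>"); i >= 0 {
                name = name[:i]
            }
            m[strings.ToLower(name)] = name
        }
    }
}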

bradleypeabody commented 5 years ago

Thanks @nigeltao, I've looked at this in even more detail now and it seems you're right - the things I'm asking for can in fact be done with the existing package, although I'd argue it's far from obvious that this is the case.

Following your suggestion from https://github.com/golang/go/issues/28343 and tracking len(z.Raw()) gives us something like:

offset := 0
for tt := z.Next(); tt != html.ErrorToken; tt = z.Next() {
    // z.Raw() is the raw bytes of the token just returned by Next, so the
    // running total is the byte offset at which the next token starts.
    offset += len(z.Raw())
}
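
For the error-reporting case, a byte offset like this can then be turned into a line/column pair. A minimal sketch, where lineCol is a hypothetical helper rather than anything the package provides:

// lineCol is a hypothetical helper: given the full source and a byte offset,
// it returns the 1-based line and column at that offset.
func lineCol(src []byte, offset int) (line, col int) {
    line, col = 1, 1
    for _, b := range src[:offset] {
        if b == '\n' {
            line++
            col = 1
        } else {
            col++
        }
    }
    return line, col
}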

In this case, is offset guaranteed to always have the correct value? (I.e., are there any cases where the tokenizer skips over bytes without including them in a token?) I did some simple test cases and looked at the code, and it seems correct, but without reading through the tokenizer in detail and in full I can't know for sure. If it is guaranteed, I think it should at a minimum be documented, so at least the next person looking for an offset knows they can safely calculate it themselves.

The other issue is that this line https://github.com/golang/net/blob/master/html/token.go#L1158 lower-cases the attribute key in place before returning it. This is solvable in various ways (for example, making a copy of z.Raw() before getting any attributes, then unpacking the slice header of the lower-cased key to determine its offset and looking at those bytes in the original copy), but it's quite obscure. I'm still looking at this, and if I come up with a simple, low-impact way to solve it I will post an update.
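
To make the obscurity concrete, here is a very rough sketch of that workaround; it assumes (as an internal detail, not an API guarantee) that the key returned by TagAttr and the slice returned by Raw share one backing buffer, and it needs the unsafe package:

raw := z.Raw()
rawCopy := append([]byte(nil), raw...) // copy before TagAttr lower-cases in place
key, _, _ := z.TagAttr()
if len(key) > 0 {
    // key and raw are sub-slices of the same tokenizer buffer, so the pointer
    // difference is the key's offset within the raw token bytes.
    off := uintptr(unsafe.Pointer(&key[0])) - uintptr(unsafe.Pointer(&raw[0]))
    origKey := rawCopy[off : off+uintptr(len(key))] // original, un-lower-cased key
    _ = origKey
}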

EDIT: Bold meant only for readability of the key points.

nigeltao commented 5 years ago

is offset guaranteed to always have the correct value... it should be documented

You're right. I sent out https://go-review.googlesource.com/c/net/+/198357

nigeltao commented 5 years ago

I'm going to close this issue. Re lower-casing keys, I'll repeat that HTML5 is specified to be case-insensitive, and in any case, it sounds like you can work around it. If you can't, feel free to re-open this issue with further thoughts.

bradleypeabody commented 5 years ago

Thanks @nigeltao for the documentation update, appreciated! That handles the offset issue.

Regarding the lower-casing, I'll split my reply up into three clear parts - the issue (what remains of it), the proposal and the motivation. I don't see a way for me to re-open this issue. I can make a new one if that's useful.

Remaining Issue

Using Tokenizer.Raw() it is trivial to obtain the original element text, pre-lower-casing. However, doing the same for attribute names is not possible without either parsing the result of Tokenizer.Raw() manually, or carefully examining the key []byte returned from TagAttr and using unsafe to find its offset into what Tokenizer.Raw() held before TagAttr was called (since that call lower-cases the attr key). It's possible to do, but it depends heavily on the internals of this package and is not easy - painful and brittle.

Proposal

How about adding the following method:

func (z *Tokenizer) RawTagAttr() (key, val []byte) 

It would have the same behavior as TagAttr except it would return the key and val as-is, without calling lower (and without unescaping the value), and it would not increment z.nAttrReturned. The implementation seems trivial (mostly copied from TagAttr):

    // Same lookup as TagAttr, but the key and val sub-slices of z.buf are
    // returned untouched: no lower-casing, no unescaping, and z.nAttrReturned
    // is not incremented.
    if z.nAttrReturned < len(z.attr) {
        switch z.tt {
        case StartTagToken, SelfClosingTagToken:
            x := z.attr[z.nAttrReturned]
            key = z.buf[x[0].start:x[0].end]
            val = z.buf[x[1].start:x[1].end]
            return key, val
        }
    }
    return nil, nil
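
For illustration, a hypothetical usage sketch (it assumes the proposed method exists and that fmt is imported): the caller peeks at the raw spelling of each attribute just before consuming it with TagAttr.

name, hasAttr := z.TagName()
for hasAttr {
    rawKey, rawVal := z.RawTagAttr() // proposed: raw bytes, does not advance
    key, val, more := z.TagAttr()    // existing: lower-cases key, unescapes val
    fmt.Printf("<%s> %s=%q (originally %s=%q)\n", name, key, val, rawKey, rawVal)
    hasAttr = more
}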

Motivation

I fully understand the logic: per the HTML spec, attribute keys are case-insensitive, so lower-casing them should have no negative impact on programs using the parser. If the lower-cased version is semantically the same, why not make it consistent - I get that.

My main counter-argument is that if allowing access to the original input does not muck up the library too much, I don't see a reason to prevent callers from getting at it. I don't have a way of knowing how many other people run into this issue, but I can summarize my specific use case:

I am the primary author of Vugu (https://github.com/vugu/vugu), which is a library that (among other things) takes a hand-edited HTML document and writes corresponding Go code. This Go code runs in various environments (there's an in-browser WebAssembly case and a server-side static file output case) to output a document (based on the original input) in that environment. Certain attributes, e.g. vg-if-"expr" trigger corresponding functionality in the emitted Go code. In some more complex cases, the attribute names correspond to Go struct field names, and thus I need to know the original/un-lower-cased input.

Documents are 100% valid HTML, Vugu just derives additional meaning and performs additional actions based on certain elements and attributes.

I know this isn't the most common use case of this library, but I also don't think it's the most obscure.

--

Hopefully this proposal is more workable. It is a much lower-impact change, just a single method added with no modifications to tokenizer state, no struct fields added, nothing like that.

Please let me know what you think.

nigeltao commented 5 years ago

Certain attributes, e.g. vg-if-"expr" trigger corresponding functionality in the emitted Go code.

Should the second - in vg-if-"expr" be an = instead? If so, the "expr" part doesn't get lower-cased:

package main

import (
    "fmt"
    "log"
    "os"

    "golang.org/x/net/html"
)

func main() {
    f, err := os.Open("vugufmt/testdata/ok/root.vugu")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    for z := html.NewTokenizer(f); ; {
        tt := z.Next()
        if tt == html.ErrorToken {
            break 
        }   
        if (tt != html.StartTagToken) && (tt != html.SelfClosingTagToken) {
            continue
        }
        tok := z.Token()
        for _, a := range tok.Attr {
            if a.Key != "vg-if" {
                continue
            }
            fmt.Printf("%s vg-if %q\n", tok.Data, a.Val)
        }
    }
}

This program prints:

div vg-if "data.isLoading"
div vg-if "len(data.bpi.BPI) > 0"

and e.g. the "bpi" and "BPI" cases are preserved. The html package lower-cases the attribute keys but not the values.

the attribute names correspond to Go struct field names

Can you give an example, then? As far as I can tell, vg-if doesn't actually have a case-sensitivity problem.

I looked at https://www.vugu.org/doc/files/markup but didn't see anything that needed case-sensitive attribute keys. I'm assuming that the click in <button @click="etc"> isn't case-sensitive.

bradleypeabody commented 5 years ago

@nigeltao Sorry about the confusion - the docs have not yet been updated to correspond with the latest code on master and this issue is coming up as part of a substantial refactor I did there. The relevant info is here: https://github.com/vugu/vugu/wiki/Refactor-Notes---Sep-2019

Example HTML from new root.vugu:

<html><body><ul>
<main:DemoLine vg-for="i := 0; i < c.ItemCount; i++" vg-key="i" :Num="i"></main:DemoLine>
</ul></body></html>

In this case I need main:DemoLine from the element, and :Num from the attribute, as these correspond to emitted Go code that creates and populates a struct, declared as type DemoLine struct { Num int }.
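
Roughly speaking (a sketch of the idea only, not actual Vugu output), the generated code for that element looks something like this, which is why the exact identifiers DemoLine and Num have to survive tokenization:

// Sketch only, not actual Vugu output: DemoLine and Num are taken verbatim
// from main:DemoLine and :Num in the markup.
for i := 0; i < c.ItemCount; i++ {
    _ = &DemoLine{Num: i}
}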

nigeltao commented 5 years ago

I'll re-open the issue to consider adding a Tokenizer.RawTagAttr or similar method.

Nonetheless, I would recommend something like <vg type="main.DemoLine" bind="Num:i" etc> instead of <main:DemoLine :Num="i" etc>, for a few reasons:

  1. This will play better with other HTML tools that assume that, as per the spec, HTML is case-insensitive. For example, pretty-printers might canonicalize tags and attribute keys to lower case.
  2. The main.DemoLine looks more Go-like (a package-qualified type name) than main:DemoLine.
  3. You don't have to worry about any HTML/XML special-case weirdness re svg:foobar tags if your Go package happens to be called svg.

bradleypeabody commented 5 years ago

Thanks, I get what you mean on the naming. Finding syntax for the various Vugu features that is concise, readable, works with Go identifier conventions, doesn't conflict with other HTML and is also internally consistent is an interesting challenge. I'll be reviewing this and taking these points into account.

josharian commented 1 year ago

I am making surgical alterations to HTML, similar (it turns out) to vugu, but for translation purposes.

Using a full parse + re-render is overkill. (I also hit some early implementation issues with it, but sadly I no longer recall what they were.)

To move beyond text nodes to things like input placeholders, I too need to know the precise byte offsets of each attribute's name and value.

Tokenizer.RawTagAttr would suffice for my purposes, assuming it provided all the relevant bytes (whitespace, the equals sign, quotes etc.).

Maybe this proposal can be revived?