Open bradleypeabody opened 5 years ago
/cc @nigeltao @namusyaka
The short answer is that I'm not enthusiastic.
Re Token offsets, you might already be able to report the line/column of a token, for error tracking: https://github.com/golang/go/issues/28343#issuecomment-455388101
Node offsets (as opposed to Token offsets) are a tricker concept. A large part of the HTML5 parsing algorithm is fixing up HTML that wouldn't be well formed XML, such as dealing with improperly nested tags. This can create nodes that have no corresponding token. What offset should we give those? Also, some nodes are formed form multiple tokens: e.g. separate start and end tags. Will Node.Offset be sufficient, or would you also need Node.StartOffset and Node.EndOffset?
Re OrigData, what's the garbage cost of all the extra strings? In any case, HTML is case insensitive. If you're trying to use HTML in a case sensitive way, then you're not really using HTML in an HTML5 way, and this package is about HTML as specified by HTML5. If you really must have case sensitive names, then you could possibly have a separate processing step to find all the <fooBar>
tags and build a map from "foobar" to "fooBar", without requiring any changes to the html package. It's also not clear, when you say "code generation based on HTML", whether that HTML is hand-written or tool-written. If it's tool-written, the tool could emit additional things (e.g. tag attributes), without having to fight HTML5's case insensitivity.
Thanks @nigeltao, I've looked at this in even more detail now and it seems you're right - the things I'm asking can in fact be done with the existing package, although I would argue it's very not-obvious that this is the case.
Following your suggestion from here https://github.com/golang/go/issues/28343 and tracking len(t.Raw())
, gives us something like:
offset := 0
for tt := z.Next(); tt != html.ErrorToken {
offset += len(z.Raw())
}
In this case is offset
guaranteed to always have the correct value? (i.e. there are no cases where the tokenizer skips over any bytes for any reason without including them in a token?) I did some simple test cases and looked at the code and it seems like this is correct, but without reading through the tokenizer in detail and in full I can't know for sure. If this is the case I think at a minimum it should be documented, so at least the next person looking for an offset knows they can safely calculate it themselves.
The other issue is that this line https://github.com/golang/net/blob/master/html/token.go#L1158 lower cases the attribute key in-place before returning it. So while this is solvable in various ways (for example making a copy of z.Raw()
before getting any attributes and then unpacking the slice header for that lower-cased key and determining the offset and looking at those bytes in my original byte copy), it's quite obscure. I'm still looking at this and if I come up with a simple and low-impact way to solve this I will give an update with that.
EDIT: Bold meant only for readability of the key points.
is offset guaranteed to always have the correct value... it should be documented
You're right. I sent out https://go-review.googlesource.com/c/net/+/198357
I'm going to close this issue. Re lower-casing keys, I'll repeat that HTML5 is specified to be case-insensitive, and in any case, it sounds like you can work around it. If you can't, feel free to re-open this issue with further thoughts.
Thanks @nigeltao for the documentation update, appreciated! That handles the offset issue.
Regarding the lower-casing, I'll split my reply up into three clear parts - the issue (what remains of it), the proposal and the motivation. I don't see a way for me to re-open this issue. I can make a new one if that's useful.
Using Tokenizer.Raw()
it is trivial to obtain the original element text, pre-lower-casing. However, doing the same for attribute names is not possible without either parsing the result of Tokenizer.Raw()
manually, or carefully examining the key []byte
returned from TagAttr
and using unsafe
to look at the offset into what Tokenizer.Raw() had before TagAttr was called (since that call lower cases the attr key). It's possible to do but highly dependent upon the internals of this package and not easy. Painful and brittle.
How about adding the following method:
func (z *Tokenizer) RawTagAttr() (key, val []byte)
It would have the same behavior as TagAttr except it would return the key and val as-is, without calling lower
(and without unescaping the value), and it would not increment z.nAttrReturned
. The implementation seems trivial (mostly copied from TagAttr):
if z.nAttrReturned < len(z.attr) {
switch z.tt {
case StartTagToken, SelfClosingTagToken:
x := z.attr[z.nAttrReturned]
key = z.buf[x[0].start:x[0].end]
val = z.buf[x[1].start:x[1].end]
return key, val
}
}
return nil, nil
I fully understand that the logic that per the HTML spec, attribute keys are case-insensitive, thus lower-casing them should have no negative impact on programs using the parser. If the lower-cased version is semantically the same, why not make it consistent - I get that.
My main counter argument is that if allowing access to the original input does not muck up the library too much, then I don't see a reason to disallow callers to get at it. I don't have a way of knowing how many other people run into this issue, but I can summarize my specific use case:
I am the primary author of Vugu (https://github.com/vugu/vugu), which is a library that (among other things) takes a hand-edited HTML document and writes corresponding Go code. This Go code runs in various environments (there's a in-browser WebAssembly case and server-side static file output case) to output a document (based on the original input) in that environment. Certain attributes, e.g. vg-if-"expr"
trigger corresponding functionality in the emitted Go code. In some more complex cases, the attribute names correspond to Go struct field names, and thus I need to know the original/un-lower-cased input.
Documents are 100% valid HTML, Vugu just derives additional meaning and performs additional actions based on certain elements and attributes.
I know this isn't the most common use case of this library, but I also don't think it's the most obscure.
--
Hopefully this proposal is more workable. It is a much lower-impact change, just a single method added with no modifications to tokenizer state, no struct fields added, nothing like that.
Please let me know what you think.
Certain attributes, e.g.
vg-if-"expr"
trigger corresponding functionality in the emitted Go code.
Should the second -
in vg-if-"expr"
be an =
instead? If so, the "expr"
part doesn't get lower-cased:
package main
import (
"fmt"
"log"
"os"
"golang.org/x/net/html"
)
func main() {
f, err := os.Open("vugufmt/testdata/ok/root.vugu")
if err != nil {
log.Fatal(err)
}
defer f.Close()
for z := html.NewTokenizer(f); ; {
tt := z.Next()
if tt == html.ErrorToken {
break
}
if (tt != html.StartTagToken) && (tt != html.SelfClosingTagToken) {
continue
}
tok := z.Token()
for _, a := range tok.Attr {
if a.Key != "vg-if" {
continue
}
fmt.Printf("%s vg-if %q\n", tok.Data, a.Val)
}
}
}
This program prints:
div vg-if "data.isLoading"
div vg-if "len(data.bpi.BPI) > 0"
and e.g. the "bpi" and "BPI" cases are preserved. The html package lower-cases the attribute keys but not the values.
the attribute names correspond to Go struct field names
Can you give an example, then, if vg-if
doesn't actually have a case-sensitivity problem, AFAICT?
I looked at https://www.vugu.org/doc/files/markup but didn't see anything that needed case-sensitive attribute keys. I'm assuming that the click
in <button @click="etc">
isn't case-sensitive.
@nigeltao Sorry about the confusion - the docs have not yet been updated to correspond with the latest code on master and this issue is coming up as part of a substantial refactor I did there. The relevant info is here: https://github.com/vugu/vugu/wiki/Refactor-Notes---Sep-2019
Example HTML from new root.vugu:
<html><body><ul>
<main:DemoLine vg-for="i := 0; i < c.ItemCount; i++" vg-key="i" :Num="i"></main:DemoLine>
</ul></body></html>
In this case I need main:DemoLine
from the element, and :Num
from the attribute, as these correspond to emitted Go code that creates and populates a struct, declared as type DemoLine struct { Num int }
.
I'll re-open the issue to consider adding a Tokenizer.RawTagAttr
or similar method.
Nonetheless, I would recommend something like <vg type="main.DemoLine" bind="Num:i" etc>
instead of <main:DemoLine :Num="i" etc>
, for a few reasons:
main.DemoLine
looks more Go-like (a package-qualified type name) than main:DemoLine
.svg:foobar
tags if your Go package happens to be called svg
.Thanks, I get what you mean on the naming. Finding syntax for the various Vugu features that is concise, readable, works with Go identifier conventions, doesn't conflict with other HTML and is also internally consistent is an interesting challenge. I'll be reviewing this and taking these points into account.
I am making surgical alterations to HTML, similar (it turns out) to vugu, but for translation purposes.
Using a full parse + re-render is overkill. (I also some hit early implementation issues with it, but sadly I no longer recall what they were.)
To move beyond text nodes to things like input placeholders, I too need to know the precise byte offsets of each attribute's name and value.
Tokenizer.RawTagAttr
would suffice for my purposes, assuming it provided all the relevant bytes (whitespace, the equals sign, quotes etc.).
Maybe this proposal can be revived?
After looking at https://github.com/golang/go/issues/31312 closer and trying some things out, I wanted to open a new issue with a more specific and succinct proposal:
Features
At a high level, the desired features are:
Motivation
The motivation for 1 is cases where a caller needs to know the context of a token or node so this can be reported to the user. Indicating the position of an error, or where in the source HTML a particular element comes from, etc.
The motivation for 2 is for situations where the caller simply needs to know what was originally there before lower-casing. It could be useful in error reporting to show the original element name for example. (The case for me specifically is I'm doing code generation based on HTML and so allowing the user to use mixed case in a tag name means they can specify the exact name of a Go struct with CamelCase - the documents in question are mostly standard HTML but of course these elements are treated differently.)
Observations
Tracking the byte offset of each token and node is simple and unambiguous.
Line and column information can be derived from this offset, but raises other questions like the definition of a line ending, how do tabs impact the reported column position, multi-byte characters, etc.
Thus these concerns would seem to be better handled outside the parser - if the parser reports the byte offset into the original input, that's enough.
Providing an option to keep the original case of an element in the
Data
field seems, in retrospect, dangerous. I can't prove that this won't cause things to break (the parser does many weird and wonderful things internally to handle special cases with different tags - for which I have a new found respect after diving deeper into the source :) ). A new field seems much more sensible.Both tracking offset information and preserving the original text require fields to be added to the appropriate structs (see below). The performance impact of them seems negligible. I thus opted to not implement them as ParserOptions, as it would provide little benefit to disable these features - the fields would still be there just unpopulated.
Changes
Here's a summary of how this could work (how it works in the prototype):
Add
OrigData string; Offset int
toToken
. OrigData is the same as data but before lower casing. (If they are the same they point to the exact same bytes.)Add
OrigData string; Offset int
toNode
, same behavior (fields are just copied from Token to Node).Add
OrigKey
toAttribute
. Original Key before lower casing. (In retrospect Attribute probably could have an Offset too, I might add that.)And basically just the book keeping to fill out those fields.
Separately, a LineCounter struct is added which is wraps an io.Reader and records the positions of every line ending and provides a call to go from an offset to line number and line start offset. Callers can use this to reconstruct "line:col" given a byte offset from Node.Offset, etc. Notice that the question of how to deal with tabs or multi-byte characters becomes the caller's responsibility; this is intentional. (I didn't bother to implement alternate line endings, it assumes \n for now, but this could be done easily.) I'm not sure if it is appropriate to have a LineCounter inside the HTML package, it could just as well be somewhere else.
Working Prototype
The working code with these modifications is here: https://github.com/vugu/html .
The main commit that touches the internal stuff is: https://github.com/vugu/html/commit/da33d265c2a8fa501665a122efbf1441162a6b3b
Some basic tests are in there, I'm sure more could be done. (I also haven't looked at code that writes out HTML to see how the originally-cased data can be used there too or if this would be an option or what.) I also understand if some of this needs to be split up, this proposal is a few different things lumped together.
I believe these features could be generally useful and so I wanted to see if this is something that could potentially end up being merged, and if so discuss what's required to get to that point.