golang / go


proposal: x/net/html: parser hooks #70275

Open stroiman opened 1 week ago

stroiman commented 1 week ago

Proposal Details

The following is not something I need right now, but my use case made me think that there is a valid reason for the code calling html.Parse to be able to influence the parsing process.

I'm playing around with a crazy idea, to create a headless browser written in Go (I have a POC that proves I can write the DOM in Go, and expose the objects to JavaScript - I have a v8 engine integrated).

I have used x/net/html to parse HTML, but there are some rules for constructing the DOM in the browser that the library doesn't support (and I don't think it should implement those rules, but rather let the caller deal with them).

For example, when a <script> element is connected, the script is executed, which means the script doesn't see the entire "HTML source" (or, to be exact, the DOM representation of the entire source). Scripts aren't the only type of element with this kind of rule either.

I can work around that right now, but it makes me want to ask whether support should be added to x/net/html that allows the caller to install hooks in the process. Events like inserted, connected, and removed seem like candidates.

As per the DOM tree specs:

An HTML element can have specific HTML element insertion steps, HTML element post-connection steps, and HTML element removing steps, all defined for the element's local name.
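
To make the idea concrete, here is a rough sketch of what a "connected" hook might look like. It deliberately does not use x/net/html: `Node`, `Hooks`, `OnConnected`, `Builder`, and `appendChild` are hypothetical names made up for illustration, not an existing or proposed API. The sketch only shows why a script that runs when it is connected sees a partial tree.

```go
package main

import "fmt"

// Node is a hypothetical, minimal stand-in for a parsed DOM node.
type Node struct {
	Tag      string
	Parent   *Node
	Children []*Node
}

// Hooks is a hypothetical set of callbacks a tree builder could invoke.
// Nothing like this exists in x/net/html today.
type Hooks struct {
	// OnConnected is called right after a node is attached to its parent.
	OnConnected func(n *Node)
}

// Builder is a toy tree builder that fires hooks as nodes are inserted.
type Builder struct {
	Hooks Hooks
}

// appendChild attaches child to parent and then fires OnConnected, if set.
func (b *Builder) appendChild(parent, child *Node) {
	child.Parent = parent
	parent.Children = append(parent.Children, child)
	if b.Hooks.OnConnected != nil {
		b.Hooks.OnConnected(child)
	}
}

// countNodes returns the number of nodes in the tree rooted at n.
func countNodes(n *Node) int {
	total := 1
	for _, c := range n.Children {
		total += countNodes(c)
	}
	return total
}

// buildExample simulates building <body><script></script><p></p></body>.
// It returns how many nodes existed when the script was connected, and how
// many exist once the whole tree has been built.
func buildExample() (seenByScript, final int) {
	body := &Node{Tag: "body"}
	b := &Builder{}
	b.Hooks.OnConnected = func(n *Node) {
		if n.Tag == "script" {
			// A real caller would execute the script here.
			seenByScript = countNodes(body)
		}
	}
	b.appendChild(body, &Node{Tag: "script"})
	b.appendChild(body, &Node{Tag: "p"})
	return seenByScript, countNodes(body)
}

func main() {
	seen, final := buildExample()
	fmt.Printf("script saw %d nodes; final tree has %d\n", seen, final)
	// prints "script saw 2 nodes; final tree has 3"
}
```

The hook fires before the following `<p>` is appended, so the script observes two nodes while the finished tree has three.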

My current approach

The path I have decided to take forward is to first use the x/net/html package to construct a node tree, and then iterate that tree to construct my own tree. So basically I'll perform two passes of DOM processing, and I can implement the necessary rules in the 2nd pass.

I will most likely use the html.Node instance as a "backing field" for DOM data, as it gives the benefits:

Even if I could hook into the parser process I still need to create wrapper types for various reasons. E.g., I need to expose a different interface to JavaScript, and I need a binding layer to JavaScript.

What benefits would the hooks give

If the package supported hooks you could generate the "correct DOM" in one pass.

In my case, the wrapper objects could then just be created lazily.

Alternate solution

The suggested hooks are just one possible solution to the problem.

Another potential solution could be to let the caller pass in some kind of "Node Factory". But that does seem more complex.

A bit of background about the idea

The idea was basically to be able to test HTTP apps written in Go where client-side script plays an important role. The standard library provides everything necessary if the test only needs to inspect the HTTP response, e.g. response headers or body. But if you want to verify the behavior of the JavaScript, you need something more. An example could be using HTMX and Go, a tech combo that seems to be gaining some popularity. Here, the behavior of the app depends on setting the right attributes on specific HTML elements, and on making sure that the returned HTTP responses play well with those tags.

Using a real browser in headless mode has a huge overhead, which tends to discourage TDD, a practice that otherwise provides a fast feedback loop.

Being written in Go, it can easily bypass the TCP layer entirely and connect directly to the root net/http.Handler. This makes it easy to let tests run in parallel, with each test using its own HTTP handler, which could have different dependencies mocked or stubbed for different test cases.
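
As a rough illustration of bypassing TCP, the standard library's `net/http/httptest` package can already drive a handler directly. In the sketch below, `newApp` and `fetch` are hypothetical helper names, and the handler is a stand-in for the real app under test.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newApp returns a hypothetical root handler standing in for the app under
// test. Each test can construct its own, with its own stubbed dependencies.
func newApp(greeting string) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html")
		fmt.Fprintf(w, "<p>%s</p>", greeting)
	})
	return mux
}

// fetch sends a request straight to the handler. No listener or socket is
// involved; the response is captured in memory.
func fetch(h http.Handler, path string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	res := rec.Result()
	defer res.Body.Close()
	body, _ := io.ReadAll(res.Body)
	return res.StatusCode, string(body)
}

func main() {
	status, body := fetch(newApp("hello"), "/")
	fmt.Println(status, body) // prints "200 <p>hello</p>"
}
```

Since every call to `newApp` yields an independent handler, tests built this way share no port or server state and can run in parallel.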

The project as it stands (an extremely early prototype) is at https://github.com/stroiman/go-dom

ianlancetaylor commented 1 week ago

In order to make a decision we'll need to see a reasonably complete description of the new API. That suggests that this should start out as a third-party package, perhaps based on x/net/html. Once that is working and useful we can see whether it makes sense to incorporate in the x/net repo.

stroiman commented 1 week ago

Sounds reasonable. I did consider forking at some point, but then adjusting the fork entirely to my own needs.

But I think there's a really good case for separation: the current x/net/html can deal with the issues of malformed HTML, a non-trivial problem on its own, and keep that logic completely separate from the rules for handling specific element insertion steps.