philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License
2.05k stars 155 forks

Proposal: Add Floki.Doc #457

Open wojtekmach opened 1 year ago

wojtekmach commented 1 year ago

Hi!

I maintain a tiny Floki wrapper called EasyHTML which adds a struct around nodes and thus we can implement protocols and behaviours. Here's an example:

Mix.install([:easyhtml])

html = """
<!doctype html>
<html>
<body>
  <p class="headline">Hello, World!</p>
</body>
</html>
"""

doc = EasyHTML.parse!(html)

doc
#=> #EasyHTML[<html><body><p class="headline">Hello, World!</p></body></html>]

doc["p.headline"]
#=> #EasyHTML[<p class="headline">Hello, World!</p>]

doc["#bad"]
#=> nil

to_string(doc)
#=> "Hello, World!"

I'd like to add a Floki.Doc struct and a Floki.Doc.parse!/1 function.
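
To make the idea concrete, here is a minimal sketch of what such a wrapper might look like. It is not Floki's API: the module name DocSketch is hypothetical, and the choice to implement Access and String.Chars is an assumption modelled on the EasyHTML behaviour shown above. Only Floki.parse_document/1, Floki.find/2 and Floki.text/1 are existing Floki functions.

```elixir
Mix.install([:floki])

defmodule DocSketch do
  # Hypothetical stand-in for the proposed Floki.Doc; not part of Floki.
  @behaviour Access
  defstruct nodes: []

  # Parse an HTML string and wrap the resulting Floki tree in the struct.
  def parse!(html) when is_binary(html) do
    {:ok, nodes} = Floki.parse_document(html)
    %__MODULE__{nodes: nodes}
  end

  # doc["selector"] goes through Access.fetch/2; an empty result becomes nil.
  @impl Access
  def fetch(%__MODULE__{nodes: nodes}, selector) when is_binary(selector) do
    case Floki.find(nodes, selector) do
      [] -> :error
      found -> {:ok, %__MODULE__{nodes: found}}
    end
  end

  @impl Access
  def get_and_update(_doc, _selector, _fun) do
    raise ArgumentError, "updates are not supported in this sketch"
  end

  @impl Access
  def pop(_doc, _selector) do
    raise ArgumentError, "pop is not supported in this sketch"
  end
end

# to_string/1 returns the text content of the wrapped nodes.
defimpl String.Chars, for: DocSketch do
  def to_string(%DocSketch{nodes: nodes}), do: Floki.text(nodes)
end

doc = DocSketch.parse!(~s(<html><body><p class="headline">Hello, World!</p></body></html>))
IO.puts(doc["p.headline"])  # prints the paragraph text
IO.inspect(doc["#bad"])     # no match, so Access returns nil
```

Because the tree lives behind a struct, the internal representation could later change without affecting callers that only use the struct's functions and protocols.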

Feedback appreciated!

wojtekmach commented 1 year ago

@philss I remember we talked a little bit about it but I don't remember much. :) I think the main concern was that we obviously cannot return this from Floki.parse* functions, as it would be a major breaking change. I think we can solve this with a separate module.

If we go with the struct, I'm curious whether the Floki.attr and Floki.attribute functions would work on it or whether we should have equivalents on the struct module.

Btw, is the distinction between document and fragment such that the former always contains exactly one root element? If so, the struct could have an attributes field, which would make accessing these super convenient. But then again I'd guess working with fragments is more common. So maybe we have two different structs after all?

Hey maybe I do remember parts of our earlier conversations. :)

philss commented 1 year ago

> I'd like to add a Floki.Doc struct and a Floki.Doc.parse!/1 function.

> I think the main concern was we obviously cannot return this from Floki.parse* functions as it would be a major breaking change.

@wojtekmach yeah, I think it's aligned with what we discussed. We wanted to avoid this breaking change, but I think in the future this "Doc.parse" could be the main API. I'm not sure if we discussed what would be the struct, but I imagine it would be the tree representation, like we have in Floki.HTMLTree. Is this what you are thinking?

> If we go with the struct, I'm curious whether Floki.attr and Floki.attribute functions would work on it or we should have equivalents on the struct module.

We would probably want to add support for the new struct on these functions.
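
One way this could look, sketched outside of Floki itself: a function head that unwraps the struct and delegates to the existing implementation. The DocSketch struct and the AttrSketch module are hypothetical names used only for illustration; Floki.attribute/3 and Floki.parse_fragment/1 are the existing functions.

```elixir
Mix.install([:floki])

defmodule DocSketch do
  # Hypothetical struct wrapping a Floki tree; not part of Floki.
  defstruct nodes: []
end

defmodule AttrSketch do
  # Struct input: unwrap the tree, then delegate to Floki.
  def attribute(%DocSketch{nodes: nodes}, selector, name),
    do: Floki.attribute(nodes, selector, name)

  # Plain tree input keeps working exactly as before.
  def attribute(tree, selector, name),
    do: Floki.attribute(tree, selector, name)
end

{:ok, nodes} = Floki.parse_fragment(~s(<p class="headline">Hello, World!</p>))
doc = %DocSketch{nodes: nodes}

IO.inspect(AttrSketch.attribute(doc, "p", "class"))
IO.inspect(AttrSketch.attribute(nodes, "p", "class"))
```

Both calls should return the same list of attribute values, so existing callers that pass a plain tree would not be affected.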

> Btw, is the distinction between document and fragment such that the former always contains exactly one root element?

Structurally speaking, yes. But semantically, a document is something whose root element is "<html>", and the specs say that we need a <!doctype html> as well (we are just ignoring this part today). Fragments don't have this restriction, but I'm not sure if we should have another struct for them.

Something that can help us if we go for two structs is the specs (they are too complex, so we shouldn't worry that much):

> Hey maybe I do remember parts of our earlier conversations. :)

:D

wojtekmach commented 1 year ago

Sorry, I wasn’t aware of the HTMLTree struct. I didn’t really look into the internals at all. 😅

viniciusmuller commented 1 year ago

In case this gets implemented, I would suggest naming it Floki.Document instead of Floki.Doc, since I read this issue and thought it was something documentation-related.

wojtekmach commented 1 year ago

If, per https://github.com/philss/floki/issues/463, we have maps as attributes and we add an ~HTML sigil (as a macro) we'd get these map match semantics for free:

html = ~HTML"""
<p class="p1">foo</p>
<p class="p2">bar</p>
"""

# these two are equivalent
assert ~HTML[<p class="p2">bar</p>] = html[".p2"]
assert ~HTML[<p>bar</p>] = html[".p2"]

assert html[".p2"] == ~HTML[<p class="p2">bar</p>]

which is potentially very interesting for testing.
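
The "for free" part comes from how Elixir matches maps, and it can be seen without any sigil. The sketch below uses plain tuples; the node shape with a map in the attributes position is an assumption based on the proposal in #463, not Floki's current output.

```elixir
# Hypothetical node shape with attributes stored as a map (assumed, per #463).
p2 = {"p", %{"class" => "p2"}, ["bar"]}

# Pattern matching: a map pattern only requires the listed keys to be present,
# so a pattern with fewer attributes still matches the same node.
{"p", %{"class" => "p2"}, ["bar"]} = p2
{"p", %{}, ["bar"]} = p2

# Equality is strict: the attribute maps must be identical.
same = p2 == {"p", %{"class" => "p2"}, ["bar"]}
different = p2 == {"p", %{}, ["bar"]}

IO.inspect(same)       # true
IO.inspect(different)  # false
```

With attributes as a list of tuples, the second pattern would need an empty list in the attributes position and would fail to match, which is why the map representation matters here.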

mischov commented 1 year ago

@wojtekmach This is pretty similar to how Meeseeks already works. https://github.com/mischov/meeseeks/blob/8ac9b48b6f8b1daae18f9b0773882cf83c094777/lib/meeseeks/document.ex#L26-L50

wojtekmach commented 1 year ago

Similar how?

FWIW, EasyHTML mentioned at the beginning uses the "floki ast", the one returned from Floki.parse* functions. The querying-optimised one in Meeseeks is very interesting. I guess the point is that if we use a struct, we can consider the ast an implementation detail and pick either!

mischov commented 1 year ago

Similar in that it already implements the output of both parsing and selection in terms of structs (and provides a nice toolkit for working with those structs), meaning the building blocks are in place for something like EasyHTML.

wojtekmach commented 1 year ago

Ah, makes sense!

mischov commented 1 year ago

It also goes beyond a single Node struct and has a top level Document struct, as well as Comment, Data, Doctype, Element, ProcessingInstruction, and Text structs, which is something else to consider.