Create a built in HTML parser

philss commented 9 years ago

Floki needs a built-in HTML parser in order to remove the mochiweb dependency. This will allow more flexibility and better control over the parsing step.

The parser goals are:

[ ] support HTML5;
[ ] support HTML snippets;
[ ] be able to parse large files, such as 15 MB;
[ ] be easy to traverse;
[ ] tolerate some errors, such as missing closing tags.

philss commented 8 years ago

Here is a test case with an example of an error that Floki does not handle today: https://github.com/henrik/sipper/commit/49a4c09afa8773f9253401608f89c8d1545124cf

Thanks @henrik for the example!

gmile commented 8 years ago

@philss creating an HTML parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as https://github.com/google/gumbo-parser?

philss commented 8 years ago

@gmile yeah, I thought about that, but I want to avoid relying on an external dependency. This comes from some frustration with the Nokogiri Ruby gem, which uses libxml2 and FFI as the bridge. It failed to compile for me so many times that I disliked the experience.

But this is not ruled out. I also think Servo's HTML parser is a good option.

gmile commented 8 years ago

> But this is not ruled out

@philss that said, are you specifically looking into Servo's HTML implementation? Otherwise, I could play with gumbo-parser integration and see how it goes.

philss commented 8 years ago

@gmile I'm not looking into this right now. So, please go for it. 👍

baron commented 8 years ago

I was wondering what the expected behavior of a native HTML parser would be. Right now :mochiweb_html.parse always returns empty lists either in the middle or at the end (depending on the level of nesting in the HTML). I'm not sure if this is a bug or a feature, but it was confusing when I first started using the library because I was hoping for some kind of "to_hash"-like function, as in Ruby.

iex(33)> htm = """
...(33)> <ul>
...(33)> <li>fooo</li>
...(33)> <li>bar</li>
...(33)> </ul>
...(33)> """
"<ul>\n<li>fooo</li>\n<li>bar</li>\n</ul>\n"
iex(34)> :mochiweb_html.parse(htm)
{"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}

Would a replacement function recreate this behavior for backwards compatibility or break the API?
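For readers puzzled by the same output: each element is a `{tag, attributes, children}` tuple, so the empty lists are the (empty) attribute lists. A "to_hash"-style view can be built on top of that shape. The sketch below is only an illustration; `TreeToMap` is a hypothetical helper, not part of Floki or mochiweb.

```elixir
defmodule TreeToMap do
  # Hypothetical helper, not part of Floki or mochiweb.
  # Converts {tag, attributes, children} tuples into nested maps.
  def to_map({tag, attrs, children}) when is_binary(tag) do
    %{
      tag: tag,
      attributes: Map.new(attrs),
      children: Enum.map(children, &to_map/1)
    }
  end

  # Text nodes (binaries) and anything else are returned unchanged.
  def to_map(other), do: other
end

# The tree printed in the iex session above.
tree = {"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}
IO.inspect(TreeToMap.to_map(tree))
```

Turning the attribute list into a map assumes attribute names are unique within an element; duplicates would be collapsed.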

BTW, thanks for the awesome library!

Eiji7 commented 7 years ago

It would be awesome to have something like this:

%Floki.Leaf.Comment{content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was thinking of something like a browser inspector)
%Floki.Leaf.TextNode{content: "content"}

instead of:

{"p", [], []}
"content"
{:comment, "content"}
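To make the comparison concrete, here is a rough sketch of how the tuple form could be mapped onto structs like the ones proposed. Everything under the `Sketch` namespace is made up for illustration (Floki defines none of these modules), the optional `events` and `styles` fields are left out, and comments are assumed to arrive as `{:comment, text}` tuples.

```elixir
# Hypothetical structs modeled on the proposal; they live under a made-up
# `Sketch` namespace because Floki does not define them.
defmodule Sketch.Leaf.Comment do
  defstruct content: ""
end

defmodule Sketch.Leaf.TextNode do
  defstruct content: ""
end

defmodule Sketch.Leaf.Node do
  defstruct name: nil, attributes: [], children: []
end

defmodule Sketch.Convert do
  # Assumes comments are represented as {:comment, text} tuples.
  def from_tuple({:comment, content}),
    do: %Sketch.Leaf.Comment{content: content}

  # Text nodes are plain binaries in the tuple form.
  def from_tuple(text) when is_binary(text),
    do: %Sketch.Leaf.TextNode{content: text}

  # Elements are {name, attributes, children}; children convert recursively.
  def from_tuple({name, attributes, children}) do
    %Sketch.Leaf.Node{
      name: name,
      attributes: attributes,
      children: Enum.map(children, &from_tuple/1)
    }
  end
end

IO.inspect(Sketch.Convert.from_tuple({"p", [], ["content", {:comment, "content"}]}))
```

With structs, callers can pattern match on the node type (`%Sketch.Leaf.TextNode{}`) instead of on tuple shape, which is the main readability gain of the proposal.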

I was also thinking about:

Floki.DocType.parse() # returns struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs

Features:

[ ] support all CSS3 (CSS4?) selectors
[ ] support XPath
[ ] log warnings when parsing + add option to raise on warning
[ ] add option to strip blank text nodes (default false)
[ ] add option to strip comment content (default true)
[ ] use Stream when possible
[ ] tag names and attribute names are always lowercase, like "my-custom-tag" and "my-custom-data"
[ ] support encoding detection
[ ] allow validation only
[ ] support fetching parent(s) and sibling(s) from a leaf struct ...
[ ] debug logs - for example: "missing title", "missing favicon" ...
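
As an illustration of the two "strip" options in the list above, here is a minimal sketch of a post-processing pass over a `{tag, attributes, children}` tree. `Sketch.Strip` and its option names are hypothetical, not a Floki API, and "strip comment content" is interpreted here as dropping comment nodes entirely, which is an assumption.

```elixir
defmodule Sketch.Strip do
  # Hypothetical pass over a {tag, attrs, children} tree.
  # Option names and defaults follow the wish list; this is not a Floki API.
  def strip(tree, opts \\ []) do
    strip_blank? = Keyword.get(opts, :strip_blank_text, false)
    strip_comments? = Keyword.get(opts, :strip_comments, true)
    walk(tree, strip_blank?, strip_comments?)
  end

  # Elements: filter the children, then recurse into the ones that remain.
  defp walk({tag, attrs, children}, blank?, comments?) do
    kept =
      children
      |> Enum.reject(&drop?(&1, blank?, comments?))
      |> Enum.map(&walk(&1, blank?, comments?))

    {tag, attrs, kept}
  end

  # Text nodes and comments have no children to visit.
  defp walk(other, _blank?, _comments?), do: other

  # Whitespace-only text is dropped only when blank stripping is enabled.
  defp drop?(text, true, _comments?) when is_binary(text), do: String.trim(text) == ""
  # Assumes comments are represented as {:comment, text} tuples.
  defp drop?({:comment, _}, _blank?, true), do: true
  defp drop?(_node, _blank?, _comments?), do: false
end

tree = {"ul", [], ["\n", {"li", [], ["a"]}, {:comment, "x"}, "\n"]}
IO.inspect(Sketch.Strip.strip(tree, strip_blank_text: true))
```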

Optional features:

[ ] method to collect styles for node (with priority, source file, line ...)
[ ] method to collect events for node
[ ] extra jQuery selectors, see docs
[ ] CSS validator with warnings/errors
```
<div style='fontt-color: white;'></div>
```

ghost commented 7 years ago

Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath expressions for you to grab specific tags, which would save me a lot of pattern matching...

As for html5ever, check out https://github.com/hansihe/Rustler

philss commented 7 years ago

@mhsjlw I agree. Please follow this issue for more details: https://github.com/philss/floki/issues/94 (sorry for the delay 😅 ).

philss commented 7 years ago

@gmile I totally forgot to update you, but it is now possible to use Servo's HTML parser with Floki!

Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser
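
For quick reference, the setup comes down to roughly the following. This is a sketch based on the Floki README as I understand it; the version requirements are placeholders, so please check the linked instructions for the exact values.

```elixir
# mix.exs (fragment): add html5ever next to floki.
# The ">= 0.0.0" requirements are placeholders; pin real versions in practice.
defp deps do
  [
    {:floki, ">= 0.0.0"},
    {:html5ever, ">= 0.0.0"}
  ]
end

# config/config.exs (fragment): tell Floki which parser module to use.
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```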

gmile commented 7 years ago

@philss wow, that's awesome! Thanks!

liveresume commented 7 years ago

Rust NIFs anyone?

https://github.com/servo/html5ever

;)

ghost commented 7 years ago

@liveresume this was mentioned, twice, see https://github.com/philss/floki/issues/37#issuecomment-272662395 and https://github.com/philss/floki/issues/37#issuecomment-286318944

f34nk commented 6 years ago

Please have a look at: https://github.com/Overbryd/myhtmlex

Based on Alexander Borisov’s myhtml, this binding gains the properties of being HTML-spec compliant and very fast. https://github.com/lexborisov/myhtml

@Overbryd gave a talk about it in Berlin. I would love to see this come together!

Overbryd commented 6 years ago

@f34nk Happy to help on this one.

I also wrote https://github.com/Overbryd/nodex, which can be used to provide a safe execution (C-)node for the best balance of performance and safety.

I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application requiring it. So maybe providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package.

philss commented 6 years ago

I didn't know we had bindings for myhtml. That's great! Thank you for the work on that, @Overbryd!

We could certainly write an adapter like we did for the html5ever parser. I don't know yet how we would enable the configuration of a C-Node, or whether this is needed for the adapter. We can explore more ideas on that.
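
To give the adapter idea a concrete shape, here is a rough sketch with the two modes of operation mentioned above as interchangeable adapters. All module names are hypothetical, and both adapters are fakes that return a canned tree instead of calling myhtmlex or a C-Node.

```elixir
defmodule Sketch.Parser do
  # Hypothetical behaviour: every adapter turns an HTML string into a tree.
  @callback parse(html :: String.t()) :: {:ok, list()} | {:error, term()}

  # The adapter module is passed in explicitly here; a real library would
  # more likely read it from application config.
  def parse(html, adapter), do: adapter.parse(html)
end

defmodule Sketch.FakeNifAdapter do
  @behaviour Sketch.Parser

  # Stand-in for an adapter that would call the parser in-process via a NIF.
  @impl true
  def parse(html), do: {:ok, [{"fake-nif", [], [html]}]}
end

defmodule Sketch.FakeCNodeAdapter do
  @behaviour Sketch.Parser

  # Stand-in for an adapter that would send the HTML to a separate C-Node.
  @impl true
  def parse(html), do: {:ok, [{"fake-c-node", [], [html]}]}
end

IO.inspect(Sketch.Parser.parse("<p>hi</p>", Sketch.FakeNifAdapter))
IO.inspect(Sketch.Parser.parse("<p>hi</p>", Sketch.FakeCNodeAdapter))
```

Keeping both modes behind one behaviour would let an application choose between speed (NIF) and crash isolation (C-Node) without touching the calling code.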

Thank you for letting us know, @f34nk! Can you open a new issue with the proposal?
