philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License
2.06k stars 156 forks source link

Create a built in HTML parser #37

Open philss opened 9 years ago

philss commented 9 years ago

Floki needs a HTML parser built in, in order to remove the mochiweb dependency. This will enable more flexibility and better control of the parsing step.

The parser goals are:

philss commented 8 years ago

Here is a test case with an example of error that Floki does not support today: https://github.com/henrik/sipper/commit/49a4c09afa8773f9253401608f89c8d1545124cf

Thanks @henrik for the example!

gmile commented 8 years ago

@philss creating an html parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as this one https://github.com/google/gumbo-parser?

philss commented 8 years ago

@gmile yeah, I thought about that, but what I want is to not depend on an external dependency. This came from a bit of frustration with the Nokogiri ruby gem. It uses libxml2 and FFI to make the bridge. It failed so many times to compile with me that I didn't like the experience.

But, this is not discarded. I also think Servo's HTML is a good option.

gmile commented 8 years ago

But, this is not discarded

@philss that said, are you specifically looking forward the Servo's HTML implementation? Otherwise, I could play with gumbo-parser integration and see how it goes.

philss commented 8 years ago

@gmile I'm not looking into this right now. So, please go for it. 👍

baron commented 8 years ago

I was wondering what the expected behavior of a native html parser would be. Right now mochiweb_html.parse always returns empty lists in either the middle or the end (depending on what level of nesting the html has). I'm not sure if this is a bug or feature but it was confusing when I first started using the library because I was hoping for some kind of "to_hash" like function in ruby.

iex(33)> htm = """
...(33)> <ul>
...(33)> <li>fooo</li>
...(33)> <li>bar</li>
...(33)> </ul>
...(33)> """
"<ul>\n<li>fooo</li>\n<li>bar</li>\n</ul>\n"
iex(34)> :mochiweb_html.parse(htm)
{"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}

Would a replacement function recreate this behavior for backwards compatibility or break the api?

BTW, thanks for the awesome library!

Eiji7 commented 7 years ago

It would be awesome to have something like this:

%Floki.Leaf.Comment(content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was think about something like browser inspector)
%Floki.Leaf.TextNode{content: "content"}

instead of:

{"p", [], []}
"content"
{comment: "content"}

I was think also about:

Floki.DocType.parse() # returns struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs

Features:

Optional features:

ghost commented 7 years ago

Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath paths for you to specifically grab tags which would save me a lot of pattern matching...

As far as html5ever, check out https://github.com/hansihe/Rustler

philss commented 7 years ago

@mhsjlw I agree. Please follow this issue for more details: https://github.com/philss/floki/issues/94 (sorry for the delay 😅 ).

philss commented 7 years ago

@gmile I totally forgot to update you, but right now is possible to use Servo's HTML parser with Floki!

Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser

gmile commented 7 years ago

@philss wow, that's awesome! Thanks!

liveresume commented 7 years ago

Rust NIFs anyone?

https://github.com/servo/html5ever

;)

ghost commented 7 years ago

@liveresume this was mentioned, twice, see https://github.com/philss/floki/issues/37#issuecomment-272662395 and https://github.com/philss/floki/issues/37#issuecomment-286318944

f34nk commented 6 years ago

Please have a look at: https://github.com/Overbryd/myhtmlex

Based on Alexander Borisov’s myhtml, this binding gains the properties of being html-spec compliant and very fast. https://github.com/lexborisov/myhtml

@Overbryd gave a talk about it in Berlin I would love to see this coming together!

Overbryd commented 6 years ago

@f34nk Happy to help on this one.

I also wrote https://github.com/Overbryd/nodex that can be used to provide a safe execution (c-)node to give the best in performance/safety.

I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application requiring it. So maybe providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package.

philss commented 6 years ago

I didn't know we had bindings for myhtml. That's great! Thank you for the work on that, @Overbryd!

We could for sure write an adapter like we did for html5ever parser. I don't know yet how we would enable the configuration of a C-Node, or if this is needed for the adapter. We can elaborate more ideas on that.

Thank you for letting us know, @f34nk! Can you open a new issue with the proposal?