Open philss opened 9 years ago
Here is a test case with an example of an error that Floki does not handle today: https://github.com/henrik/sipper/commit/49a4c09afa8773f9253401608f89c8d1545124cf
Thanks @henrik for the example!
@philss creating an html parser from scratch sounds like a huge amount of work. Have you thought about depending on a C library instead, such as this one https://github.com/google/gumbo-parser?
@gmile yeah, I thought about that, but what I want is to not depend on an external dependency. This came from a bit of frustration with the Nokogiri Ruby gem. It uses libxml2 and FFI to make the bridge. It failed to compile for me so many times that I didn't like the experience.
But, this is not discarded. I also think Servo's HTML is a good option.
> But, this is not discarded
@philss that said, are you specifically looking toward Servo's HTML implementation? Otherwise, I could play with gumbo-parser integration and see how it goes.
@gmile I'm not looking into this right now. So, please go for it. 👍
I was wondering what the expected behavior of a native HTML parser would be. Right now mochiweb_html.parse always returns empty lists in either the middle or the end (depending on what level of nesting the HTML has). I'm not sure if this is a bug or a feature, but it was confusing when I first started using the library because I was hoping for some kind of "to_hash"-like function as in Ruby.
iex(33)> htm = """
...(33)> <ul>
...(33)> <li>fooo</li>
...(33)> <li>bar</li>
...(33)> </ul>
...(33)> """
"<ul>\n<li>fooo</li>\n<li>bar</li>\n</ul>\n"
iex(34)> :mochiweb_html.parse(htm)
{"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}
Would a replacement function recreate this behavior for backwards compatibility or break the api?
BTW, thanks for the awesome library!
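For anyone after the "to_hash"-style output mentioned above, here is a minimal sketch of how the tuple tree could be walked into nested maps. `TreeSketch` and its map keys (`:tag`, `:attrs`, `:children`, `:text`, `:comment`) are made up for illustration and are not part of Floki or mochiweb. It assumes element nodes are `{tag, attributes, children}` tuples with binary tags, text nodes are binaries, and comments are `{:comment, text}` tuples.

```elixir
defmodule TreeSketch do
  # Hypothetical helper, not part of Floki or mochiweb.
  # Converts a mochiweb-style tuple tree into nested maps.

  # Element node: {tag, attributes, children}.
  # The empty lists in the parser output are the attribute lists.
  def to_map({tag, attrs, children})
      when is_binary(tag) and is_list(attrs) and is_list(children) do
    %{tag: tag, attrs: Map.new(attrs), children: Enum.map(children, &to_map/1)}
  end

  # Comment node (assumed shape).
  def to_map({:comment, text}), do: %{comment: text}

  # Text node.
  def to_map(text) when is_binary(text), do: %{text: text}
end

tree = {"ul", [], [{"li", [], ["fooo"]}, {"li", [], ["bar"]}]}
TreeSketch.to_map(tree)
```

Attributes become a map here, so duplicate attribute names would collapse to the last one. Keeping the list of pairs would avoid that if it matters.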
It would be awesome to have something like this:
%Floki.Leaf.Comment{content: "comment content"}
%Floki.Leaf.Node{attributes: [], children: [], events: [], name: "p", styles: []}
# events and styles are optional (I was thinking about something like a browser inspector)
%Floki.Leaf.TextNode{content: "content"}
instead of:
{"p", [], []}
"content"
{:comment, "content"}
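To make the comparison concrete, here is a rough sketch of how the current tuples could be mapped onto structs like the ones proposed. None of these modules exist in Floki. The struct names are borrowed from the proposal and defined here only for illustration, `LeafSketch.from_tuple/1` is a hypothetical helper, the optional `events` and `styles` fields are left out, and comments are assumed to arrive as `{:comment, text}` tuples.

```elixir
# Hypothetical structs mirroring the proposal; not part of Floki.
defmodule Floki.Leaf.Node do
  defstruct name: nil, attributes: [], children: []
end

defmodule Floki.Leaf.TextNode do
  defstruct content: ""
end

defmodule Floki.Leaf.Comment do
  defstruct content: ""
end

defmodule LeafSketch do
  alias Floki.Leaf.{Comment, Node, TextNode}

  # Comment tuple (assumed shape) -> Comment struct.
  def from_tuple({:comment, content}), do: %Comment{content: content}

  # Plain binary -> TextNode struct.
  def from_tuple(text) when is_binary(text), do: %TextNode{content: text}

  # {name, attributes, children} -> Node struct, converting children recursively.
  def from_tuple({name, attributes, children}) when is_binary(name) do
    %Node{
      name: name,
      attributes: attributes,
      children: Enum.map(children, &from_tuple/1)
    }
  end
end

LeafSketch.from_tuple({"p", [], ["content"]})
```

One upside of this shape is that callers can pattern match on the struct type instead of guessing from tuple arity.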
I was also thinking about:
Floki.DocType.parse() # returns struct like:
%Floki.Document.HTML5{dom_tree: nil, lang: "en"}
Floki.DocumentParser # protocol for document structs
Features:
Optional features:
<div style='fontt-color: white;'></div>
Yeah, XPath would be awesome, especially when scraping data from a website. Chrome can automatically generate XPath paths for you to specifically grab tags which would save me a lot of pattern matching...
As for html5ever, check out https://github.com/hansihe/Rustler
@mhsjlw I agree. Please follow this issue for more details: https://github.com/philss/floki/issues/94 (sorry for the delay 😅 ).
@gmile I totally forgot to update you, but it is now possible to use Servo's HTML parser with Floki!
Please follow these instructions: https://github.com/philss/floki#optional---using-http5ever-as-the-html-parser
@philss wow, that's awesome! Thanks!
@liveresume this was mentioned, twice, see https://github.com/philss/floki/issues/37#issuecomment-272662395 and https://github.com/philss/floki/issues/37#issuecomment-286318944
Please have a look at: https://github.com/Overbryd/myhtmlex
Based on Alexander Borisov’s myhtml, this binding gains the properties of being html-spec compliant and very fast. https://github.com/lexborisov/myhtml
@Overbryd gave a talk about it in Berlin. I would love to see this coming together!
@f34nk Happy to help on this one.
I also wrote https://github.com/Overbryd/nodex that can be used to provide a safe execution (c-)node to give the best in performance/safety.
I would refrain from using myhtmlex widely as a NIF without explicitly checking the crash-safety requirements of the application requiring it. So maybe providing two modes of operation (NIF and C-Node) might be the best way to go for a widely used package.
I didn't know we had bindings for myhtml. That's great! Thank you for the work on that, @Overbryd!
We could for sure write an adapter like we did for the html5ever parser. I don't know yet how we would enable the configuration of a C-Node, or if this is needed for the adapter. We can elaborate more ideas on that.
Thank you for letting us know, @f34nk! Can you open a new issue with the proposal?
Floki needs a built-in HTML parser in order to remove the mochiweb dependency. This will enable more flexibility and better control of the parsing step.
The parser goals are: