Open yasoob opened 9 months ago
Hi, @yasoob!
This is an interesting idea. As you said, I don't think it is possible with the current data structure, but it could work if we had a more "complete" tree representation like the one we use internally today: see Floki.HTMLTree. We already started to discuss a "wrapper" around the results in #457, and I think that wrapper could be this tree, which could include more information such as the position of a tag.
However, I'm not sure how much work would be required to expose that from the floki_mochi_html tokenizer. I will investigate. But I would say this is feasible, yeah.
Just some additional context: the Mochiweb parser is not the most aligned with the specs 😅 So I'm afraid it could contain wrong data about the position of the elements. That said, I started working on a new parser a long time ago, but it was never finished. I think the correct path, after exposing this in the "wrapper", would be to finish the parser that aims to follow the HTML specs. This could take some time, though.
So as a fun experiment I spent some time yesterday looking into it. I wanted to get the line numbers of the:
I ended up updating the tokenize function like this:
tokenize(B, S = #decoder{offset = O}) ->
    case B of
        %% ... Truncated ...
        <<_:O/binary, "</", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?ADV_COL(S, 2)),
            {S2, _} = find_gt(B, S1),
            {{end_tag, Tag, {line_no, S#decoder.line}}, S2};
        <<_:O/binary, "<", C, _/binary>> when
            ?IS_WHITESPACE(C); not ?IS_LETTER(C)
        ->
            %% This isn't really strict HTML
            {{data, Data, _Whitespace}, S1} = tokenize_data(B, ?INC_COL(S)),
            {{data, <<$<, Data/binary>>, false}, S1};
        <<_:O/binary, "<", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?INC_COL(S)),
            {Attrs, S2} = tokenize_attributes(B, S1),
            {S3, HasSlash} = find_gt(B, S2),
            Singleton = HasSlash orelse is_singleton(Tag),
            {{start_tag, Tag, Attrs, Singleton, {line_no, S#decoder.line}}, S3};
        _ ->
            tokenize_data(B, S)
    end.
I did something similar for the attributes and added line numbers there as well. So if I directly use this new tokenize function like this:
:floki_mochi_html.tokens(doc)
It produces output like this:
{:start_tag, "style", [{"type", "text/css", {:line_no, 13}}], false,
{:line_no, 13}},
{:end_tag, "style", {:line_no, 68}},
I checked the line numbers in the output and they were correct. But as you can imagine, this output can't really be used for any further processing, as all other functions expect a different data structure. This is a long-winded way of saying that it is not only feasible but works correctly as well in the scenarios that I tested.
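To make the idea concrete outside of the Erlang code, here is a minimal sketch of a tokenizer that attaches a line number to each tag token, roughly mirroring the shape of the `{start_tag, ..., {line_no, N}}` tuples above. It is written in Rust with only the standard library; the `Token` enum and `tokenize_tags` function are hypothetical names for illustration, not part of Floki, Mochiweb, or html5ever, and the scanner deliberately ignores attributes, comments, and text.

```rust
/// A tag token annotated with the 1-based line on which its `<` appears.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Token {
    StartTag { name: String, line: usize },
    EndTag { name: String, line: usize },
}

/// Scans `input` for tags and records the line of each one.
/// Attributes, comments, and text nodes are ignored in this sketch.
fn tokenize_tags(input: &str) -> Vec<Token> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut line = 1;
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // Every raw newline advances the line counter.
            b'\n' => line += 1,
            b'<' => {
                let is_end = bytes.get(i + 1) == Some(&b'/');
                let start = if is_end { i + 2 } else { i + 1 };
                // The tag name is the run of ASCII alphanumerics after `<` or `</`.
                let mut end = start;
                while end < bytes.len() && bytes[end].is_ascii_alphanumeric() {
                    end += 1;
                }
                // Skip things like `<!-- ... -->` or a stray `<` with no name.
                if end > start {
                    let name = input[start..end].to_ascii_lowercase();
                    tokens.push(if is_end {
                        Token::EndTag { name, line }
                    } else {
                        Token::StartTag { name, line }
                    });
                }
            }
            _ => {}
        }
        i += 1;
    }
    tokens
}
```

Like the patched tokenize function, this reports only what is present in the input: a missing closing tag simply produces no `EndTag` token.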
As for the mochiweb parser not being according to HTML specs, do you mind sharing a concrete example? This would help me see if it breaks the kind of work I am trying to do. I don't really need the final HTML output to be "correct". As in, I don't want Mochiweb to add a missing tag in the final output to make it compliant. But I do want it to accurately tokenize what is present in the input. I actually want the broken output, where the tags that are missing in the source are also missing in the tokenized output. This would have been much easier to implement if we had a low-level tokenizer in Elixir, but mochiweb is what we have.
I had previously tried to add this support in the html5ever NIF, as it also calls an internal method to update the line number during parsing/tokenizing, according to this issue. I managed to get as far as getting a line number printed in the terminal, but it wasn't super reliable and my Rust is very "rusty". I doubt I can get anywhere with that solution without learning more Rust. Maybe you or someone else who has more Rust experience can look into it.
If we can get this working in Rust NIF, that would be an even bigger win but at this point I am open to whatever solution we can come up with to add this support in Floki itself.
I also wasn't aware of the HTMLTree. I will look into it.
> I checked the line numbers in the output and they were correct [...] it is not only feasible but works correctly as well in the scenarios that I tested.
This is awesome! :D
> As for the mochiweb parser not being according to HTML specs, do you mind sharing a concrete example?
I can say that most of our bugs come from gaps in our current parser. There is one example that could affect your output: multiple whitespace characters are collapsed to just one. So if you have multiple consecutive newlines, I think the line count could be off (I didn't try this with your patch).
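One way to sidestep the collapsing concern, sketched here as an assumption rather than as how Mochiweb or the patch behaves, is to derive the line number from the raw input and a byte offset instead of from the tokens. Consecutive newlines are then always counted, whatever the tokenizer does with whitespace afterwards. `line_at_offset` is a hypothetical helper name, shown in Rust with only the standard library.

```rust
/// Returns the 1-based line number of the byte at `offset` in `source`,
/// counting every `\n` that appears before it in the raw input.
/// Offsets past the end of the input are clamped to the end.
fn line_at_offset(source: &str, offset: usize) -> usize {
    let end = offset.min(source.len());
    1 + source.as_bytes()[..end]
        .iter()
        .filter(|&&b| b == b'\n')
        .count()
}
```

For example, in the input `"<p>\n\n\n<a>"` the `<a>` tag starts at byte offset 6, after three newlines, so it is reported on line 4 even though a whitespace-collapsing tokenizer would see only one whitespace token there.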
> Maybe you or someone else who has more Rust experience can look into it.
> If we can get this working in Rust NIF, that would be an even bigger win but at this point I am open to whatever solution we can come up with to add this support in Floki itself.
I will take a look when I can!
> I also wasn't aware of the HTMLTree.
Thinking about it now, I guess we would need to change the parsing to build the HTMLTree directly, instead of building the tree as structs, as is done today.
I cannot promise to add the feature soon, but I look forward to working on this. Also, if you feel comfortable, don't hesitate to send PRs. They are more than welcome!
So I spent some time on this and was able to get the line number from Html5ever as well with the following changes:
Add a line_no field to the Node struct and take it as an input while creating a new Node:
pub struct Node {
    id: NodeHandle,
    line_no: u64,
    children: PoolOrVec<NodeHandle>,
    parent: Option<NodeHandle>,
    data: NodeData,
}
impl Node {
    fn new(id: usize, line_no: u64, data: NodeData, pool: &Vec<NodeHandle>) -> Self {
        Node {
            id: NodeHandle(id),
            parent: None,
            children: PoolOrVec::new(pool),
            line_no,
            data,
        }
    }
}
Add a current_line field in the FlatSink struct and set it to 1 when creating the FlatSink:
pub struct FlatSink {
    pub root: NodeHandle,
    pub nodes: Vec<Node>,
    pub pool: Vec<NodeHandle>,
    pub current_line: u64,
}
impl FlatSink {
    pub fn new() -> FlatSink {
        let mut sink = FlatSink {
            root: NodeHandle(0),
            nodes: Vec::with_capacity(200),
            pool: Vec::with_capacity(2000),
            current_line: 1,
        };
        // Element 0 is always root
        sink.nodes
            .push(Node::new(0, 1, NodeData::Document, &sink.pool));
        sink
    }
    // ... trunc ...
}
Implement the set_current_line method of the TreeSink. This method is called by html5ever whenever html5ever moves to a new line during parsing:
impl TreeSink for FlatSink {
    // ... trunc ...
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }
}
Update the make_node method of the FlatSink and populate the line_no field of the Node while creating a new Node:
impl FlatSink {
    // .. trunc ...
    pub fn make_node(&mut self, data: NodeData) -> NodeHandle {
        let node = Node::new(self.nodes.len(), self.current_line, data, &self.pool);
        let id = node.id;
        self.nodes.push(node);
        id
    }
}
Encode the line_no field as well for each node in the encode_node function:
// Do this for all Node types:
NodeData::Document => map
    .map_put(atoms::type_().encode(env), atoms::document().encode(env))
    .map_err(to_custom_error)?
    .map_put(atoms::line_no().encode(env), node.line_no.encode(env))
    .map_err(to_custom_error),
Now if I call Html5ever.flat_parse(html), the output will contain the line_no:
%{
  0 => %{id: 0, line_no: 1, parent: nil, type: :document},
  1 => %{
    attrs: [],
    children: [2, 27, 28],
    id: 1,
    line_no: 1,
    name: "html",
    parent: 0,
    type: :element
  },
  2 => %{
    attrs: [],
    children: [3, 4, 5, 7, 8, 9, 26],
    id: 2,
    line_no: 2,
    name: "head",
    parent: 1,
    type: :element
  },
  // ...
}
I did not create a PR for the Html5ever repo because this change will break quite a lot of other things and I don't have enough knowledge/experience to work on fixing it all. But I wanted to give you a head start if/when you decide to implement this. Html5ever does not expose column details. It only exposes line numbers.
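For anyone who wants to see the pattern from the steps above in isolation, here is a minimal standalone sketch: the sink remembers the last line the parser reported and stamps it onto every node it creates. Everything in it is a hypothetical stand-in (`Node`, `LineTrackingSink`, and the toy `parse_lines` driver are not the real FlatSink, TreeSink, or html5ever parser), and it uses only the Rust standard library.

```rust
/// Hypothetical stand-in for the node type; only the fields needed here.
#[derive(Debug)]
struct Node {
    id: usize,
    line_no: u64,
    name: String,
}

/// Hypothetical stand-in for the sink: it remembers the last line reported
/// by the parser and copies it onto each node it creates.
struct LineTrackingSink {
    current_line: u64,
    nodes: Vec<Node>,
}

impl LineTrackingSink {
    fn new() -> Self {
        LineTrackingSink { current_line: 1, nodes: Vec::new() }
    }

    /// Called by the parser when it moves to a new line.
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }

    /// Creates a node stamped with the line that is current right now.
    fn make_node(&mut self, name: &str) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node {
            id,
            line_no: self.current_line,
            name: name.to_string(),
        });
        id
    }
}

/// Toy driver standing in for the parser: it treats each non-blank input
/// line as one element name. A real parser would be far more involved.
fn parse_lines(input: &str) -> LineTrackingSink {
    let mut sink = LineTrackingSink::new();
    for (index, line) in input.lines().enumerate() {
        sink.set_current_line(index as u64 + 1);
        let name = line.trim();
        if !name.is_empty() {
            sink.make_node(name);
        }
    }
    sink
}
```

The design keeps the line state in the sink, so node creation never needs to know where the parser is; the trade-off is that a node's line is only as accurate as the most recent line update the sink received.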
I hope this helps! This was fun as I had to learn some Rust and was able to create a separate NIF for a CSS inliner as well. All in all, a good thing to have worked on :D
Hi!
I am trying to find an HTML tokenizer for Elixir that can also provide me with the line number of the matching tag. I see that floki_mochi_html has a #decoder{offset} record and there are also references to INC_COL in the code base. I could have tried to extract this information on my own but I am not well-versed in Erlang. Do you think it is possible to expose this information from Floki?
This is probably going to require changes to the data structure. Maybe flat_parse could contain this additional information?
Please let me know if this is doable and even if this is not a good fit for Floki, I would love to hear your suggestions of how I could go about implementing this on my own.
Just for some added context, a sample use case for this could be a tool using Floki that extracts all a tags and then lists their line numbers/locations in the HTML document. My own use case is a bit different, but this one is a simpler representative example.