philss / floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.
https://hex.pm/packages/floki
MIT License
2.07k stars, 156 forks

Is it possible to get line/column number of a tag? #532

Open yasoob opened 9 months ago

yasoob commented 9 months ago

Hi!

I am trying to find an HTML tokenizer for Elixir that can also give me the line number of each matching tag. I see that floki_mochi_html has a #decoder{offset} record, and there are also references to INC_COL in the code base. I could have tried to extract this information on my own, but I am not well-versed in Erlang. Do you think it is possible to expose this information from Floki?

This is probably going to require changes to the data structure. Maybe flat_parse could contain this additional information?

Please let me know if this is doable and even if this is not a good fit for Floki, I would love to hear your suggestions of how I could go about implementing this on my own.

Just for some added context, a sample use case for this could be a tool using Floki that extracts all <a> tags and then lists their line numbers/locations in the HTML document. My own use case is a bit different, but this one is a simpler representative example.
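To make that use case concrete, here is a minimal standalone sketch in Rust. It is not Floki, Mochiweb, or html5ever code, and `find_anchor_positions` is a hypothetical name. It is a naive byte scan rather than a real tokenizer, so it ignores comments, scripts, and quoted attribute values, and it counts columns in bytes.

```rust
/// Naive scan that reports the 1-based (line, column) of every `<a` start tag.
/// Assumption: this is an illustration only, not a spec-compliant tokenizer.
/// Columns are counted in bytes, not in characters.
fn find_anchor_positions(html: &str) -> Vec<(usize, usize)> {
    let bytes = html.as_bytes();
    let mut positions = Vec::new();
    let mut line = 1usize;
    let mut col = 1usize;

    for i in 0..bytes.len() {
        // A start tag named `a` is `<`, then `a`/`A`, then a delimiter.
        // The delimiter check keeps `<abbr>` or `<article>` from matching.
        let is_anchor = bytes[i] == b'<'
            && matches!(bytes.get(i + 1).copied(), Some(b'a' | b'A'))
            && matches!(
                bytes.get(i + 2).copied(),
                Some(b' ' | b'\t' | b'\n' | b'\r' | b'>' | b'/')
            );
        if is_anchor {
            positions.push((line, col));
        }

        // Advance the position counters after inspecting the current byte.
        if bytes[i] == b'\n' {
            line += 1;
            col = 1;
        } else {
            col += 1;
        }
    }
    positions
}
```

For the input `"<p>\n  <a href=\"x\">one</a>\n<abbr>no</abbr> <A>two</A>\n"` this returns `[(2, 3), (3, 17)]`: the `<abbr>` tag and the closing tags are skipped.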

philss commented 9 months ago

Hi, @yasoob!

This is an interesting idea. I think it is not possible with the current data structure, as you said, but could work if we have a more "complete" tree representation like we do internally today - see Floki.HTMLTree. We already started to discuss a "wrapper" around the results in #457, and I think that wrapper could be this tree, which could include more information like the position of a tag.

However, I'm not sure how much work it would take to expose that from the floki_mochi_html tokenizer. I will investigate. But I would say this is feasible, yeah.

Just some additional context: the Mochiweb parser is not the most aligned with the specs 😅 So I'm afraid it could contain wrong data about the position of the elements. That said, I started working on a new parser a long time ago, but it was never finished. I think the correct path, after exposing this in the "wrapper", would be to finish the parser that aims to follow the HTML specs. This could take some time, though.

yasoob commented 9 months ago

So as a fun experiment I spent some time yesterday looking into it. I wanted to get the line numbers of the:

  1. Start tags
  2. End tags
  3. Attributes

I ended up updating the tokenize function like this:

tokenize(B, S = #decoder{offset = O}) ->
    case B of
        %% ... Truncated ...
        <<_:O/binary, "</", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?ADV_COL(S, 2)),
            {S2, _} = find_gt(B, S1),
            {{end_tag, Tag, {line_no, S#decoder.line}}, S2};
        <<_:O/binary, "<", C, _/binary>> when
            ?IS_WHITESPACE(C); not ?IS_LETTER(C)
        ->
            %% This isn't really strict HTML
            {{data, Data, _Whitespace}, S1} = tokenize_data(B, ?INC_COL(S)),
            {{data, <<$<, Data/binary>>, false}, S1};
        <<_:O/binary, "<", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?INC_COL(S)),
            {Attrs, S2} = tokenize_attributes(B, S1),
            {S3, HasSlash} = find_gt(B, S2),
            Singleton = HasSlash orelse is_singleton(Tag),
            {{start_tag, Tag, Attrs, Singleton, {line_no, S#decoder.line}}, S3};
        _ ->
            tokenize_data(B, S)
    end.

I did something similar for the attributes and added line numbers there as well. So if I directly use this new tokenize function like this:

:floki_mochi_html.tokens(doc)

It produces output like this:

{:start_tag, "style", [{"type", "text/css", {:line_no, 13}}], false,
   {:line_no, 13}},
  {:end_tag, "style", {:line_no, 68}},

I checked the line numbers in the output and they were correct. But as you can imagine, this output can't really be used for any further processing, as all other functions expect a different data structure. This is a long-winded way of saying that it is not only feasible but also works correctly in the scenarios that I tested.

As for the mochiweb parser not being according to HTML specs, do you mind sharing a concrete example? This would help me see if it breaks the kind of work I am trying to do. I don't really care for the final HTML output to be "correct". As in, I don't want Mochiweb to add a missing tag in the final output to make it compliant. But I do want it to accurately tokenize what is present in the input. I actually want the broken output where the tags that are missing in the source are also missing in the tokenized output. This would have been much easier to implement if we had a low level tokenizer in Elixir but mochiweb is what we have.

I had previously tried to add this support in the html5ever NIF, as it also calls an internal method to update the line number during parsing/tokenizing according to this issue. I managed to get as far as getting a line number printed in the terminal, but it wasn't super reliable and my Rust is very "rusty". I doubt I can get anywhere with that solution without learning more Rust. Maybe you or someone else who has more Rust experience can look into it.

If we can get this working in Rust NIF, that would be an even bigger win but at this point I am open to whatever solution we can come up with to add this support in Floki itself.

I also wasn't aware of the HTMLTree. I will look into it.

philss commented 9 months ago

> I checked the line numbers in the output and they were correct [...] it is not only feasible but works correctly as well in the scenarios that I tested.

This is awesome! :D

> As for the mochiweb parser not being according to HTML specs, do you mind sharing a concrete example?

I can say that most of our bugs come from gaps in our current parser. There is one example that could affect your output: multiple whitespace characters are collapsed into just one. So if you have multiple consecutive newlines, I think it is going to count lines incorrectly (I didn't try this with your patch).
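As a rough illustration of that concern, here is a small standalone Rust sketch. It is not Mochiweb's code, and `collapse_whitespace` and `line_of` are hypothetical helpers. It shows what goes wrong if lines are counted on text whose whitespace runs were already collapsed. Whether the patched tokenizer is actually affected is untested in this thread.

```rust
/// Collapse every run of ASCII whitespace into a single space.
/// Assumption: this only mimics the behaviour described above.
fn collapse_whitespace(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut in_whitespace = false;
    for ch in input.chars() {
        if ch.is_ascii_whitespace() {
            // Emit one space for the first character of a run, skip the rest.
            if !in_whitespace {
                out.push(' ');
            }
            in_whitespace = true;
        } else {
            out.push(ch);
            in_whitespace = false;
        }
    }
    out
}

/// 1-based line number of the byte at `offset`, found by counting the
/// newlines that precede it. Panics if `offset` is past the end of `text`.
fn line_of(text: &str, offset: usize) -> usize {
    1 + text.as_bytes()[..offset]
        .iter()
        .filter(|&&b| b == b'\n')
        .count()
}
```

With the source `"<p>a</p>\n\n\n<a>b</a>"`, `line_of` places the `<a>` tag on line 4 when run on the raw input, but on line 1 when run on the collapsed string `"<p>a</p> <a>b</a>"`. Counting on the raw input before any collapsing avoids the problem.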

> Maybe you or someone else who has more Rust experience can look into it.
>
> If we can get this working in Rust NIF, that would be an even bigger win but at this point I am open to whatever solution we can come up with to add this support in Floki itself.

I will take a look when I can!

> I also wasn't aware of the HTMLTree.

Thinking about it now, I guess we would need to change the parsing to build the HTMLTree directly, instead of building the tree as structs like it does today.

I cannot promise to add the feature soon, but I look forward to working on this. Also, if you feel comfortable, don't hesitate to send PRs. They are more than welcome!

yasoob commented 9 months ago

So I spent some time on this and was able to get the line number from Html5ever as well with the following changes:

  1. Add line_no field to the Node struct and take it as an input while creating a new Node:
pub struct Node {
    id: NodeHandle,
    line_no: u64,
    children: PoolOrVec<NodeHandle>,
    parent: Option<NodeHandle>,
    data: NodeData,
}

impl Node {
    fn new(id: usize, line_no: u64, data: NodeData, pool: &Vec<NodeHandle>) -> Self {
        Node {
            id: NodeHandle(id),
            parent: None,
            children: PoolOrVec::new(pool),
            line_no,
            data,
        }
    }
}
  2. Add a current_line field in the FlatSink struct and set it to 1 when creating the FlatSink:
pub struct FlatSink {
    pub root: NodeHandle,
    pub nodes: Vec<Node>,
    pub pool: Vec<NodeHandle>,
    pub current_line: u64,
}

impl FlatSink {
    pub fn new() -> FlatSink {
        let mut sink = FlatSink {
            root: NodeHandle(0),
            nodes: Vec::with_capacity(200),
            pool: Vec::with_capacity(2000),
            current_line: 1,
        };

        // Element 0 is always root
        sink.nodes
            .push(Node::new(0, 1, NodeData::Document, &sink.pool));

        sink
    }

    // ... trunc ...
}
  3. Keep a current_line pointer during parsing by implementing the set_current_line method of the TreeSink. This method is called by html5ever whenever html5ever moves to a new line during parsing:
impl TreeSink for FlatSink {
    // ... trunc ...
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }
}
  4. Update the make_node method of the FlatSink and populate the line_no field of the Node while creating a new Node:
impl FlatSink {
    // .. trunc ...
    pub fn make_node(&mut self, data: NodeData) -> NodeHandle {
        let node = Node::new(self.nodes.len(), self.current_line, data, &self.pool);
        let id = node.id;
        self.nodes.push(node);
        id
    }
}
  5. Encode the line_no field as well for each node in the encode_node function:
// Do this for all Node types:

NodeData::Document => map
            .map_put(atoms::type_().encode(env), atoms::document().encode(env))
            .map_err(to_custom_error)?
            .map_put(atoms::line_no().encode(env), node.line_no.encode(env))
            .map_err(to_custom_error),

Now if I call Html5ever.flat_parse(html), the output will contain the line_no:

%{
  0 => %{id: 0, line_no: 1, parent: nil, type: :document},
  1 => %{
    attrs: [],
    children: [2, 27, 28],
    id: 1,
    line_no: 1,
    name: "html",
    parent: 0,
    type: :element
  },
  2 => %{
    attrs: [],
    children: [3, 4, 5, 7, 8, 9, 26],
    id: 2,
    line_no: 2,
    name: "head",
    parent: 1,
    type: :element
  },
  # ...
}

I did not create a PR for the Html5ever repo because this change will break quite a lot of other things and I don't have enough knowledge/experience to fix it all. But I wanted to give you a head start if/when you decide to implement this. Note that Html5ever does not expose column details. It only exposes line numbers.
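For anyone who wants to try the idea in isolation, here is a minimal standalone sketch of the pattern from steps 1 to 4: the sink remembers the last line it was told about and stamps it on every node it creates. Everything in it is a hypothetical stand-in. `LineSink` is not html5ever's TreeSink trait, `ToySink` is not the real FlatSink, and `drive` only imitates a parser by treating each input line as at most one tag name.

```rust
#[derive(Debug, PartialEq)]
struct Node {
    id: usize,
    line_no: u64,
    name: String,
}

/// Hypothetical stand-in for the two sink operations this sketch needs.
trait LineSink {
    fn set_current_line(&mut self, line_number: u64);
    fn make_node(&mut self, name: &str) -> usize;
}

struct ToySink {
    nodes: Vec<Node>,
    current_line: u64,
}

impl ToySink {
    fn new() -> Self {
        // Lines are 1-based, so parsing starts on line 1.
        ToySink { nodes: Vec::new(), current_line: 1 }
    }
}

impl LineSink for ToySink {
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }

    fn make_node(&mut self, name: &str) -> usize {
        // Stamp the node with whatever line the driver last reported.
        let id = self.nodes.len();
        self.nodes.push(Node {
            id,
            line_no: self.current_line,
            name: name.to_string(),
        });
        id
    }
}

/// Toy driver standing in for the parser: reports each line number to the
/// sink, then creates a node if the line holds a (non-empty) tag name.
fn drive<S: LineSink>(sink: &mut S, lines: &[&str]) {
    for (index, name) in lines.iter().enumerate() {
        sink.set_current_line(index as u64 + 1);
        if !name.is_empty() {
            sink.make_node(name);
        }
    }
}
```

Driving a `ToySink` with `["html", "head", "", "body"]` creates three nodes on lines 1, 2, and 4. The empty third line creates no node but still advances the line counter.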

I hope this helps! This was fun as I had to learn some Rust and was able to create a separate NIF for a CSS inliner as well. All in all, a good thing to have worked on :D