kostya / lexbor

Fast HTML5 Parser with CSS selectors. This is the successor of myhtml and is expected to be faster and use less memory.
MIT License

Node#tag_text returns text of descending children #24

Closed: z64 closed this issue 2 years ago

z64 commented 2 years ago

When moving some code from myhtml to lexbor, I came across this change in behavior:

example = <<-HTML
<p><b>Some <a>text 1</a></b></p>
<p><b>Some <a>text 2</a></b></p>
HTML

puts("Myhtml".colorize.blue)
html = Myhtml::Parser.new(example)
html.css("p").each do |node|
  node.walk_tree do |inner_node, level|
    print(" " * level * 2)
    print("#{inner_node.tag_name} -> ".colorize.yellow)
    puts(inner_node.tag_text.colorize.blue)
  end
end

# same thing, but with lexbor:
puts("Lexbor".colorize.red)
html = Lexbor::Parser.new(example)
html.css("p").each do |node|
  node.walk_tree do |inner_node, level|
    print(" " * level * 2)
    print("#{inner_node.tag_name} -> ".colorize.yellow)
    puts(inner_node.tag_text.colorize.red)
  end
end

Output:

Myhtml
p -> 
  b -> 
    -text -> Some 
    a -> 
      -text -> text 1
p -> 
  b -> 
    -text -> Some 
    a -> 
      -text -> text 2
Lexbor
p -> Some text 1
  b -> Some text 1
    _text -> Some 
    a -> text 1
      _text -> text 1
p -> Some text 2
  b -> Some text 2
    _text -> Some 
    a -> text 2
      _text -> text 2

It would appear that tag_text no longer returns only the current node's text, but includes the text of all descendants as well, similar to the deep option in other methods.

At a glance, this seems like it could be a missing feature or behavior in lexbor, but I'm not certain. In any case, I figured I would start by opening an issue here for other Crystal users.


The workaround is to explicitly check the node type:

puts("Lexbor".colorize.red)
html = Lexbor::Parser.new(example)
html.css("p").each do |node|
  node.walk_tree do |inner_node, level|
    print(" " * level * 2)
    print("#{inner_node.tag_name} -> ".colorize.yellow)
    if inner_node.tag_sym == :_text
      puts(inner_node.tag_text.colorize.red)
    else
      puts
    end
  end
end

and that will mimic the same behavior as myhtml.

kostya commented 2 years ago

tag_text is more like a private method; it should be called only on nodes where node.textable? is true. I think inner_text is the preferable method to use.

z64 commented 2 years ago

@kostya I found undesirable behavior with inner_text. The text nodes themselves don't yield anything; instead, their text gets pushed up to the parent node:

example = <<-HTML
<p><b>Some <a>text 1</a> after 1</b></p>
<p><b>Some <a>text 2</a> after 2</b></p>
HTML

html = Lexbor::Parser.new(example)
html.css("p").each do |node|
  node.walk_tree do |inner_node, level|
    print(" " * level * 2)
    print("#{inner_node.tag_name} -> ".colorize.yellow)
    puts(inner_node.inner_text(deep: false))
  end
end

Output:

p ->
  b -> Some  after 1    <--- wrong
    _text ->
    a -> text 1
      _text ->
    _text ->
p ->
  b -> Some  after 2    <--- wrong
    _text ->
    a -> text 2
      _text ->
    _text ->

Compared to:

html = Lexbor::Parser.new(example)
html.css("p").each do |node|
  node.walk_tree do |inner_node, level|
    print(" " * level * 2)
    print("#{inner_node.tag_name} -> ".colorize.yellow)
    if inner_node.textable?
      puts(inner_node.tag_text)
    else
      puts
    end
  end
end

Output:

p ->
  b ->
    _text -> Some
    a ->
      _text -> text 1
    _text ->  after 1
p ->
  b ->
    _text -> Some
    a ->
      _text -> text 2
    _text ->  after 2

which allows me to correctly perform the reconstruction I'm doing in the right order.

Is there another way?

kostya commented 2 years ago

Both examples look OK to me. tag_text is just more low-level; if you need it, use it.
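
For readers landing here with the same reconstruction problem, here is a minimal sketch of the approach the thread settles on: walk the tree and collect tag_text only from nodes where textable? is true, so the pieces come out in document order. It uses only calls that already appear in the examples above. The parts array and the shortened one-paragraph input are introduced here for illustration and are not part of lexbor.

```crystal
require "lexbor"

# Shortened version of the example HTML from this thread.
example = "<p><b>Some <a>text 1</a> after 1</b></p>"

html = Lexbor::Parser.new(example)
html.css("p").each do |node|
  # Hypothetical accumulator for the text pieces, in document order.
  parts = [] of String

  node.walk_tree do |inner_node, level|
    # Only text-bearing nodes contribute; element nodes (p, b, a) are skipped,
    # so no text is duplicated from parent nodes.
    parts << inner_node.tag_text if inner_node.textable?
  end

  puts parts.join
end
```

Going by the per-node output shown earlier in the thread, this should print the paragraph text with "Some ", "text 1" and " after 1" joined in their original order.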