serpapi / nokolexbor

High-performance HTML5 parser for Ruby based on Lexbor, with support for both CSS selectors and XPath.

Provide the line number of a node? (particularly an attribute) #8

Closed: jaredcwhite closed this issue 1 year ago

jaredcwhite commented 1 year ago

This may not be straightforward (I'm not familiar with the underlying parser engine), but I would love to be able to get the original line number of a node as it had been parsed. (And in my particular case, I would love the line number of an attribute.) This is useful in cases where content within the source document is being processed programmatically, and if there's an issue the user can be notified where exactly in the source document the problem originates.

For reference, Nokogiri provides this feature: https://www.rubydoc.info/github/sparklemotion/nokogiri/Nokogiri%2FXML%2FNode:line

zyc9012 commented 1 year ago

I don't think Lexbor supports this natively but I'll give it a try.

zyc9012 commented 1 year ago

Hi @jaredcwhite. I have added a new API, source_location, to Node (on master at 56f89c295b6f0796adf5b6b3e5e41124fadbec87). It gives the node's offset into the source, so both the line and the column can be calculated from it. For example:

source = <<-HTML
<div class='a'>
  <a class='b'>
    123
  </a>
</div>
HTML

doc = Nokolexbor::HTML(source)
attr = doc.at_css('a').attribute('class')
loc = attr.source_location

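# Both results are zero-based: the attribute sits on the second line of the source.
line = source[0..loc].count("\n")
# => 1
# rindex returns nil when there is no preceding newline (a node on the first line), hence the fallback.
column = loc - ((source[0..loc].rindex("\n") || -1) + 1)
# => 5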
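
For reporting problems to a user, one-based line and column numbers are usually what editors display. The following is a minimal sketch of a helper that does this conversion. The name line_and_column is hypothetical and not part of Nokolexbor, and the sketch assumes the offset indexes into the source string the same way as in the example above (it has not been checked against sources containing multibyte characters). The offset 21 is the value implied by the results shown above, not something read from the library here.

```ruby
# Hypothetical helper, not part of Nokolexbor: converts a zero-based offset
# into a one-based [line, column] pair.
def line_and_column(source, offset)
  prefix = source[0, offset]           # everything before the offset
  line = prefix.count("\n") + 1
  last_newline = prefix.rindex("\n")   # nil when the offset is on the first line
  line_start = last_newline ? last_newline + 1 : 0
  [line, offset - line_start + 1]
end

source = "<div class='a'>\n  <a class='b'>\n    123\n  </a>\n</div>\n"

# 21 is assumed here; with the real API it would come from attr.source_location.
line, column = line_and_column(source, 21)
puts "class attribute found at line #{line}, column #{column}"
```

Slicing up to, but not including, the offset keeps a newline located exactly at the offset from being counted as part of the previous line's tally.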

The source_location API has not been released yet, as it is only a draft for now. You can try it by downloading the gem from this link, extracting the zip, and installing the .gem manually.

gem uninstall nokolexbor
gem install /path/to/the/gem

Let me know if it's what you want.

jaredcwhite commented 1 year ago

@zyc9012 This is awesome! I would definitely be able to make good use of this.

zyc9012 commented 1 year ago

@jaredcwhite Released 0.4.1

jaredcwhite commented 1 year ago

Thanks @zyc9012! One issue I ran into, however, is that when a node gets cloned, it and all of its child nodes lose their source location (it gets reset to zero). In Nokogiri, a cloned node still retains its original line number, which is what I had been relying on before…

zyc9012 commented 1 year ago

@jaredcwhite Released 0.4.2

jaredcwhite commented 1 year ago

Awesome, works great!