sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.16k stars, 904 forks

[feature] `Node#inner_text` should not capture `<style>` tag contents #2292

Open jaypinho opened 3 years ago

jaypinho commented 3 years ago

Please describe the bug

Per Nokogiri's documentation, the Node#inner_text method (also aliased as text and content) is meant to capture "the plaintext content for this Node." The name inner_text implies that it works similarly to the JavaScript innerText property.

However, the JavaScript property explicitly excludes the contents of any <style> elements inside the given node, while Node#inner_text includes them.

Help us reproduce what you're seeing

Example URL: https://www.binance.com/en/terms (note that you need to curl this link to reproduce the output below rather than examine the page in browser dev tools, because runtime JS changes the underlying DOM structure)

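require 'httparty'
require 'nokogiri'

# Fetch the raw HTML as served, before any client-side JS runs
html = HTTParty.get('https://www.binance.com/en/terms').body
# Extract the text of the <main> element nested under body > div
text = Nokogiri::HTML.parse(html).at_css("body div main").inner_text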

Expected behavior

Expectation: the result should start with Binance Terms of Use...

Actual: it starts with .css-13trade{box-sizing:border-box...

Per the JS docs for innerText,