sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

100% CPU usage when parsing HTML with a lot of inline CSS #2020

Closed rusikf closed 4 years ago

rusikf commented 4 years ago

Describe the bug

Hi, if I use nokogiri to parse a large HTML document where about 90% of the content is inline CSS, it causes 100% CPU usage.

To Reproduce

#! /usr/bin/env ruby

require 'nokogiri'
require 'net/http'

url = URI('https://baliaquaponics.com')
html = Net::HTTP.get_response(url).body

puts "html size", html.size
doc = Nokogiri::HTML.parse(html)

puts 'OK'

Expected behavior

Not to have cpu usage 100%

Environment

# Nokogiri (1.10.9)

warnings: []
nokogiri: 1.10.9
ruby:
  version: 2.6.4
  platform: x86_64-linux
  description: ruby 2.6.4p104 (2019-08-28 revision 67798) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/rusikf/.rvm/gems/ruby-2.6.4/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
  libxslt_path: "/home/rusikf/.rvm/gems/ruby-2.6.4/gems/nokogiri-1.10.9/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  - 0004-libxml2.la-is-in-top_builddir.patch
  - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
  libxslt_patches: []
  compiled: 2.9.10
  loaded: 2.9.10

This output will tell us what version of Ruby you're using, how you installed nokogiri, what versions of the underlying libraries you're using, and what operating system you're using.

Additional context

The problem can be worked around with a hack, removing the inline CSS from the HTML before parsing: html.gsub!(/<style((.|\n|\r)*?)<\/style>/, '')
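For anyone wanting to reuse this workaround, here is a minimal sketch of the same idea as a small helper. The name `strip_style_blocks` and the `STYLE_BLOCK` pattern are my own (hypothetical), not anything from nokogiri. It uses the `m` flag so `.` matches newlines, and the `i` flag so `<STYLE>` is caught too. A regexp is not a real HTML parser, so treat this as a pre-filter for this specific case only.

```ruby
# Hypothetical helper, not part of nokogiri: removes <style>...</style>
# blocks from an HTML string before it is handed to the parser.
#
# /m lets "." match newlines; /i makes the tag match case-insensitive.
# The non-greedy ".*?" stops at the first closing </style> tag.
STYLE_BLOCK = /<style\b[^>]*>.*?<\/style\s*>/mi

def strip_style_blocks(html)
  # gsub (not gsub!) returns a new string and leaves the input untouched.
  html.gsub(STYLE_BLOCK, '')
end

# Usage, assuming the nokogiri gem is installed:
#   require 'nokogiri'
#   doc = Nokogiri::HTML.parse(strip_style_blocks(html))
```

Note that this drops the CSS entirely, so it only makes sense if you don't need the style rules from the parsed document.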

flavorjones commented 4 years ago

Hi @rusikf, thanks for reporting, and sorry you're having trouble. I'll try to take a look shortly.

flavorjones commented 4 years ago

OK, I got some time this morning to look into this.

The summary: you're describing performance characteristics of libxml2 (the underlying parsing library used by Nokogiri), and there's nothing we can easily do to change this behavior.

I've posted a gist with all the code/scripts/profiling so these results can be reproduced: https://gist.github.com/flavorjones/fd27b0f62dd08812d830b82fbe5477f0

First, the baseline: running a simple ruby script using Nokogiri to parse the example document:

$ ruby ./foo.rb
       user     system      total        real
  3.725381   0.003732   3.729113 (  3.729237)
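For readers who don't want to open the gist: the user/system/total/real columns above are the format produced by Ruby's standard `Benchmark` module. A rough sketch of how such a timing might be taken follows. This is not the gist's actual script; the helper name `measure_parse` is hypothetical, and the parser is passed in as a block so the helper itself doesn't depend on nokogiri.

```ruby
require 'benchmark'

# Hypothetical helper: runs the given block once with the HTML string
# and returns a Benchmark::Tms holding user/system/total/real times.
def measure_parse(html)
  Benchmark.measure { yield html }
end

# Usage, assuming the nokogiri gem is installed and `html` holds the page:
#   require 'nokogiri'
#   puts Benchmark::CAPTION
#   puts measure_parse(html) { |h| Nokogiri::HTML.parse(h) }
```

A single run is noisy, so repeating the measurement a few times gives a better picture.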

Next, reproducing this result in C calling libxml2 directly (that is, no Ruby or Nokogiri involved):

$ time ./foo
3808 ms

real    0m3.811s
user    0m3.802s
sys     0m0.008s

Great! This shows that Ruby/Nokogiri isn't significantly slower than calling libxml2 from C directly. Let's see what it's doing by using gperftools against the C executable:

[image: gperftools call graph for the C executable]

flavorjones commented 4 years ago

However, what's interesting is that the above is with the vendored libxml2 v2.9.10; running this same code against libxml2 v2.9.4 (which is my local system's distro version), the code runs in about 1/3 of this time:

$ time ./foo
1010 ms

real    0m1.015s
user    0m1.010s
sys     0m0.004s

And the call graph is different:

[image: gperftools call graph when running against libxml2 v2.9.4]

flavorjones commented 4 years ago

OK, placeholder for further investigation: the ~3x slowdown appears to be correlated with the vendored libraries, not with the version of libxml2.

rusikf commented 4 years ago

OK, cool! I deleted the style tags with a regexp as a quick fix, and it works without high CPU usage.

On Fri, 17 Apr 2020 19:55 Mike Dalessio, notifications@github.com wrote:

OK, placeholder for further investigation: the ~3x slowdown appears to be correlated with the vendored libraries, not with the version of libxml2.

flavorjones commented 4 years ago

OK, so this problem is exacerbated by the one described in new issue #2022: compiler optimization is not turned on when building the vendored libraries.

Closing this for now, since you have a workaround. Another workaround would be to use your distro's system libraries (see nokogiri.org installation docs at https://nokogiri.org/tutorials/installing_nokogiri.html).
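To check which libraries an installed nokogiri is actually using, the same data shown in the Environment section above is available at runtime as `Nokogiri::VERSION_INFO`. A small sketch, assuming the hash has the `libxml` / `source` keys seen in that output; the helper name `vendored_libxml?` is hypothetical:

```ruby
# Hypothetical helper: returns true when the version info reports that
# libxml2 was built from nokogiri's packaged (vendored) sources.
# Expects a hash shaped like the "libxml:" section of `nokogiri -v`.
def vendored_libxml?(version_info)
  version_info.dig("libxml", "source") == "packaged"
end

# Usage, assuming the nokogiri gem is installed:
#   require 'nokogiri'
#   puts vendored_libxml?(Nokogiri::VERSION_INFO)
```

If this returns true, the parse times in this thread for the vendored build are the ones to expect.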

Please watch #2022 for the permanent fix.

ilyazub commented 4 years ago

Another workaround is to set the CFLAGS="-O2" environment variable while installing nokogiri. That has the same effect as the fix tracked in #2022, until it's done.

gem uninstall nokogiri
CFLAGS="-O2" bundle install

It works because CFLAGS are passed here and there in ext/nokogiri/extconf.rb.