Closed rusikf closed 4 years ago
Hi @rusikf, thanks for reporting, and sorry you're having trouble. I'll try to take a look shortly.
OK, I got some time this morning to look into this.
The summary is that you're describing performance characteristics of libxml2 (the underlying parsing library used by Nokogiri), and there's nothing we can easily do to change this behavior.
I've posted a gist with all the code/scripts/profiling so these results can be reproduced: https://gist.github.com/flavorjones/fd27b0f62dd08812d830b82fbe5477f0
First, the baseline: running a simple Ruby script that uses Nokogiri to parse the example document:
$ ruby ./foo.rb
user system total real
3.725381 0.003732 3.729113 ( 3.729237)
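For readers who don't have the gist open, here is a minimal sketch of what a timing script like this could look like. It is not the actual `foo.rb` from the gist: the `css_heavy_html` helper and its `rules:` parameter are hypothetical, and the document is synthetic rather than the reporter's page, so the timings will differ from the ones above.

```ruby
require "benchmark"

# Hypothetical helper (not from the gist): builds an HTML document
# whose size is dominated by a single inline <style> block.
def css_heavy_html(rules: 50_000)
  css = Array.new(rules) { |i| ".c#{i} { color: red; }" }.join("\n")
  "<html><head><style>#{css}</style></head><body><p>hi</p></body></html>"
end

html = css_heavy_html

begin
  require "nokogiri"
  # Print the same user/system/total/real columns as the output above.
  puts Benchmark::CAPTION
  puts Benchmark.measure { Nokogiri::HTML(html) }
rescue LoadError
  warn "nokogiri is not installed; skipping the timing run"
end
```

Swapping the synthetic document for the real page (for example via `File.read`) would give numbers comparable to the ones reported here.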
Next, reproducing this result in C calling libxml2 directly (that is, no Ruby or Nokogiri involved):
$ time ./foo
3808 ms
real 0m3.811s
user 0m3.802s
sys 0m0.008s
Great! This shows that Ruby/Nokogiri isn't significantly slower than calling libxml2 from C directly. Let's see what it's doing by using gperftools against the C executable:
What's interesting, however, is that the above is with the vendored libxml2 v2.9.10; running this same code against libxml2 v2.9.4 (which is my local system's distro version), it finishes in roughly 1/3 of this time:
$ time ./foo
1010 ms
real 0m1.015s
user 0m1.010s
sys 0m0.004s
And the call graph is different:
OK, placeholder for further investigation: the ~3x slowdown appears to be correlated with the vendored libraries, not with the version of libxml2.
OK, cool! I deleted the style tags with a regexp as a quick fix, and it works without high CPU usage.
On Fri, 17 Apr 2020 19:55 Mike Dalessio wrote:
> OK, placeholder for further investigation: the ~3x slowdown appears to be correlated with the vendored libraries, not with the version of libxml2.
OK, so this problem is exacerbated by the one described in new issue #2022: compiler optimization is not turned on when building the vendored libraries.
Closing this for now, since you have a workaround. Another workaround would be to use your distro's system libraries (see nokogiri.org installation docs at https://nokogiri.org/tutorials/installing_nokogiri.html).
Please watch #2022 for the permanent fix.
Describe the bug
Hi, if I use Nokogiri to parse a large HTML document where 90% of the content is inline CSS, it causes 100% CPU usage.
To Reproduce
Expected behavior
CPU usage should not reach 100%.
Environment
`# Nokogiri (1.10.9)`
This output will tell us what version of Ruby you're using, how you installed nokogiri, what versions of the underlying libraries you're using, and what operating system you're using.
Additional context
The problem can be worked around with a hack: removing the inline CSS from the HTML before parsing:
html.gsub!(/<style((.|\n|\r)*?)<\/style>/, '')
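A slightly tidier variant of the same workaround, as a sketch only. The `strip_style_blocks` helper and the `STYLE_BLOCK` constant are hypothetical names, not part of Nokogiri. It assumes the `<style>` elements are well formed and not nested; regex-based HTML editing is fragile in general. The `m` flag lets `.` match newlines, which replaces the `(.|\n|\r)` alternation, and `i` also matches `<STYLE>`.

```ruby
# Hypothetical helper, not part of Nokogiri: removes inline <style> blocks
# from an HTML string before it is handed to the parser.
# m: "." also matches newlines; i: tag name is matched case-insensitively.
STYLE_BLOCK = /<style\b[^>]*>.*?<\/style\s*>/mi

def strip_style_blocks(html)
  html.gsub(STYLE_BLOCK, "")
end

html = "<html><head><style type=\"text/css\">\n.a { color: red; }\n</style></head>" \
       "<body><p>hi</p></body></html>"

puts strip_style_blocks(html)
# prints: <html><head></head><body><p>hi</p></body></html>
```

The cleaned string can then be passed to `Nokogiri::HTML` as before. Note that this discards the styles entirely, so it only fits cases where the CSS is not needed after parsing.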