jaimeiniesta / metainspector

Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...
https://github.com/jaimeiniesta/metainspector
MIT License
1.03k stars 165 forks

MetaInspector hanging for url when timeout >= 20 #273

Open summera opened 4 years ago

summera commented 4 years ago

We've come across a URL (https://www.gmct.com.au/harkness) where MetaInspector will get stuck when the timeout is high enough (like >= 20 seconds). For example, if you do:

> result = MetaInspector.new("https://www.gmct.com.au/harkness", retries: 0, read_timeout: 10, connection_timeout: 10)
MetaInspector::TimeoutError: execution expired

You correctly get a timeout error. However, if you bump that timeout up to 20, it just hangs:

> result = MetaInspector.new("https://www.gmct.com.au/harkness", retries: 0, read_timeout: 20, connection_timeout: 20)

It looks like this site is using Next.js and the size of the page is quite large, but it's hard to tell what's making it hang with a larger timeout. @jaimeiniesta any ideas?

jaimeiniesta commented 4 years ago

Thanks for the report!

I just reproduced that in my dev environment. I didn't even need to bump the timeout to 20; with your first example it took so long to process that I had to interrupt it:

➔ irb
2.7.1 :001 > require 'metainspector'
 => true
2.7.1 :002 > result = MetaInspector.new("https://www.gmct.com.au/harkness", retries: 0, read_timeout: 10, connection_timeout: 10)

^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^CTraceback (most recent call last):
       14: from /Users/jaime/.rvm/rubies/ruby-2.7.1/bin/irb:23:in `<main>'
       13: from /Users/jaime/.rvm/rubies/ruby-2.7.1/bin/irb:23:in `load'
       12: from /Users/jaime/.rvm/rubies/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/irb-1.2.3/exe/irb:11:in `<top (required)>'
       11: from (irb):2
       10: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector.rb:20:in `new'
        9: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector.rb:20:in `new'
        8: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/document.rb:41:in `initialize'
        7: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/document.rb:41:in `new'
        6: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/parser.rb:20:in `initialize'
        5: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/parser.rb:34:in `parsed'
        4: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/html.rb:15:in `HTML'
        3: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/html/document.rb:205:in `parse'
        2: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/html/document.rb:205:in `read_memory'
        1: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/xml/document.rb:77:in `initialize'
jaimeiniesta commented 4 years ago

OK, that URL returns a 15.9 MB HTML file:

➔ curl https://www.gmct.com.au/harkness > harkness.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.9M    0 15.9M    0     0  1394k      0 --:--:--  0:00:11 --:--:-- 2180k

It looks like the timeout is not hit in the request made by Faraday, but later, while Nokogiri processes the HTML. In my case, given enough time it succeeds, but parsing such a large document takes a lot of memory.

So in a nutshell, I don't see it failing: your first TimeoutError is likely a genuine HTTP request timeout, while in the other cases parsing a huge HTML document simply takes Nokogiri a long time.
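One way to confirm this split is to time the fetch and the parse separately. Here's a rough diagnostic sketch; `time_phase` is a hypothetical helper name, and the usage lines assume the faraday and nokogiri gems are available:

```ruby
require 'benchmark'

# Hypothetical diagnostic helper: time each phase separately to see
# whether the slowdown is in the HTTP request or in the HTML parse.
def time_phase(label)
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  puts format('%-6s %.2fs', label, elapsed)
  [result, elapsed]
end

# Usage against the problem URL (requires faraday and nokogiri):
#   html, _ = time_phase('fetch') { Faraday.get('https://www.gmct.com.au/harkness').body }
#   _doc, _ = time_phase('parse') { Nokogiri::HTML(html) }
```

If the fetch finishes quickly but the parse runs for a long time, that would match what I'm seeing here.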

As a workaround, I suggest wrapping your app's calls to MetaInspector in your own timeout blocks; we could also consider introducing a new parse_timeout option with a sane default, to protect against these situations.
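The workaround could be sketched like this, using Ruby's standard-library Timeout module to bound the total time a call may take, including the Nokogiri parse that read_timeout and connection_timeout don't cover. `with_parse_timeout` and the 15-second value are illustrative, not part of the gem:

```ruby
require 'timeout'

# Hypothetical wrapper: caps the *total* time a MetaInspector call may
# take, including HTML parsing, and returns nil if the cap is exceeded.
def with_parse_timeout(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil # caller decides how to handle pages that are too slow to parse
end

# Usage (assumes the metainspector gem is loaded):
#   page = with_parse_timeout(15) do
#     MetaInspector.new('https://www.gmct.com.au/harkness', retries: 0)
#   end
```

Note the usual caveat with Timeout: it interrupts the block with an exception, so it's a blunt guard rather than a graceful cancellation.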

What do you think @jschwindt ?

summera commented 4 years ago

@jaimeiniesta thanks for looking into this! I'll try wrapping our calls to MetaInspector with Timeout::timeout, as you suggested. A parse_timeout option in MetaInspector would be nice.

jschwindt commented 4 years ago

It's not a very common case to have such a big HTML document, but as you said, a parse_timeout could protect against this problem.

jaimeiniesta commented 4 years ago

OK, I think a parse_timeout option would be a nice addition. I'll be happy to review a PR if anyone wants to work on this.