summera opened 4 years ago
Thanks for the report!
I just reproduced that in my dev environment. I didn't need to bump the timeout to 20; with your first example it took a long time to process and I had to interrupt it:
```
➔ irb
2.7.1 :001 > require 'metainspector'
=> true
2.7.1 :002 > result = MetaInspector.new("https://www.gmct.com.au/harkness", retries: 0, read_timeout: 10, connection_timeout: 10)
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^CTraceback (most recent call last):
14: from /Users/jaime/.rvm/rubies/ruby-2.7.1/bin/irb:23:in `<main>'
13: from /Users/jaime/.rvm/rubies/ruby-2.7.1/bin/irb:23:in `load'
12: from /Users/jaime/.rvm/rubies/ruby-2.7.1/lib/ruby/gems/2.7.0/gems/irb-1.2.3/exe/irb:11:in `<top (required)>'
11: from (irb):2
10: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector.rb:20:in `new'
 9: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector.rb:20:in `new'
 8: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/document.rb:41:in `initialize'
 7: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/document.rb:41:in `new'
 6: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/parser.rb:20:in `initialize'
 5: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/metainspector-5.10.1/lib/meta_inspector/parser.rb:34:in `parsed'
 4: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/html.rb:15:in `HTML'
 3: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/html/document.rb:205:in `parse'
 2: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/html/document.rb:205:in `read_memory'
 1: from /Users/jaime/.rvm/gems/ruby-2.7.1/gems/nokogiri-1.10.9/lib/nokogiri/xml/document.rb:77:in `initialize'
```
OK, that URL returns a 15.9 MB HTML file:
```
➔ curl https://www.gmct.com.au/harkness > harkness.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15.9M    0 15.9M    0     0  1394k      0 --:--:--  0:00:11 --:--:-- 2180k
```
It looks like the hang is not in the HTTP request done by Faraday, but later, in the processing of the HTML by Nokogiri. In my case, if I give it enough time, it succeeds, but it takes up a lot of memory to parse such a large document.
So in a nutshell, I don't see it failing: your first TimeoutError might be a proper HTTP request timeout, while in the other cases it's just that parsing a huge HTML document takes a long time in Nokogiri.
As a workaround, I suggest you wrap your calls to MetaInspector in your own timeout blocks in your app; maybe we can think of introducing a new `parse_timeout` option with a sane default, to protect against these situations.
What do you think @jschwindt ?
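For illustration, the workaround could look something like the sketch below. The `with_deadline` helper name is made up for this example, and the slow MetaInspector call is simulated with a `sleep` so the sketch is self-contained:

```ruby
require 'timeout'

# Run a block under a hard deadline, returning nil if it takes too long.
# `with_deadline` is an illustrative helper, not part of MetaInspector.
def with_deadline(seconds)
  Timeout.timeout(seconds) { yield }
rescue Timeout::Error
  nil
end

# In the app the block would be something like
# MetaInspector.new(url, retries: 0, read_timeout: 10);
# here a sleep stands in for the slow fetch + parse.
fast = with_deadline(2) { :parsed }          # finishes within the deadline
slow = with_deadline(1) { sleep 3; :parsed } # interrupted, returns nil
```

This puts an upper bound on the whole fetch-and-parse, regardless of where the time is actually spent.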
@jaimeiniesta thanks for looking into this! I'll try wrapping our calls to MetaInspector with a `Timeout::timeout` like you suggest. A `parse_timeout` option in MetaInspector would be nice.
It's not a very common case to have such a big HTML document, but as you said, a `parse_timeout` could protect against this problem.
OK, I think a `parse_timeout` option would be a nice addition; I'll be happy to review a PR if anyone wants to work on this.
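A hypothetical sketch of what such an option could look like, assuming it wraps the parsing step inside the parser. All names here (`ParserSketch`, `ParseTimeoutError`, the stand-in parse) are illustrative, not the gem's actual API, and the Nokogiri call is stubbed so the sketch has no dependencies:

```ruby
require 'timeout'

# Illustrative error class, not part of MetaInspector.
class ParseTimeoutError < StandardError; end

class ParserSketch
  def initialize(html, parse_timeout: 5)
    @html = html
    @parse_timeout = parse_timeout
  end

  def parsed
    @parsed ||= Timeout.timeout(@parse_timeout) do
      # In the real gem this step would be Nokogiri::HTML(@html);
      # a cheap stand-in (string length) keeps the sketch runnable.
      @html.length
    end
  rescue Timeout::Error
    raise ParseTimeoutError, "parsing exceeded #{@parse_timeout}s"
  end
end

doc = ParserSketch.new('<html></html>', parse_timeout: 2)
```

One caveat for a real implementation: `Timeout` interrupts via `Thread#raise`, which cannot cut short a long-running C call inside Nokogiri until control returns to Ruby, so the effective deadline may be loose.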
We've come across a URL (https://www.gmct.com.au/harkness) where MetaInspector will get stuck when the timeout is high enough (like >= 20 seconds). With a lower timeout you correctly get a timeout error. However, if you bump that timeout up to 20, it just hangs. It looks like this site is using Next.js and the size of the page is quite large, but it's hard to tell what's making it hang with a larger timeout. @jaimeiniesta any ideas?