postmodern / spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
MIT License

path conflicts with opaque (URI::InvalidURIError) #66

Closed: mustiikhalil closed this issue 6 years ago

mustiikhalil commented 6 years ago

I'm trying to crawl Stack Overflow, but the crawler keeps giving me this error. Apparently the problem happens whenever it reaches the following link:

I'm not sure how to fix it, since "subject=Stack%20Overflow%20Question&body=Time%20series%20speed%20forecasting%20using%20regression%20with%20exogenous%20variables%0Ahttps%3a%2f%2fstackoverflow.com%2fq%2f49618734%3fsem%3d2"

    Traceback (most recent call last):
        21: from main.rb:4:in `<main>'
        20: from /Users/mustafakhalil/Projects/Senior/crawler/crawler.rb:20:in `start_crawling'
        19: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/spidr.rb:53:in `site'
        18: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:274:in `site'
        17: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:355:in `start_at'
        16: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:373:in `run'
        15: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:665:in `visit_page'
        14: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:599:in `get_page'
        13: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:788:in `prepare_request'
        12: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:605:in `block in get_page'
        11: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/agent.rb:679:in `block in visit_page'
        10: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:238:in `each_url'
         9: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:188:in `each_link'
         8: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:189:in `each'
         7: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:189:in `upto'
         6: from /usr/local/lib/ruby/gems/2.5.0/gems/nokogiri-1.8.2/lib/nokogiri/xml/node_set.rb:190:in `block in each'
         5: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:189:in `block in each_link'
         4: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:182:in `block in each_link'
         3: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:239:in `block in each_url'
         2: from /usr/local/lib/ruby/gems/2.5.0/gems/spidr-0.6.0/lib/spidr/page/html.rb:283:in `to_absolute'
         1: from /usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:822:in `path='
    /usr/local/Cellar/ruby/2.5.0_2/lib/ruby/2.5.0/uri/generic.rb:766:in `check_path': path conflicts with opaque (URI::InvalidURIError)
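For anyone trying to reproduce this outside the crawler, the last two frames of the traceback can probably be triggered with Ruby's standard `uri` library alone. This is only a sketch under assumptions: the `subject=`/`body=` query suggests the offending link is a `mailto:` link, but the link itself is not shown above, and the address used here is made up. `path_error_for` is a hypothetical helper, not part of Spidr.

```ruby
require 'uri'

# Hypothetical mailto: link of the kind a "share by email" button might emit.
# The real link from the report is not known; this address is a placeholder.
link = URI.parse('mailto:someone@example.com?subject=Hello')

# Non-hierarchical URIs such as mailto: keep their content in #opaque
# instead of being split into host and path components.
opaque_part = link.opaque

# Hypothetical helper: try to assign a path and report the error message, if any.
def path_error_for(uri)
  uri.path = '/'
  nil
rescue URI::InvalidURIError => e
  e.message
end

message = path_error_for(link)
puts message
```

If the assumption holds, assigning a path to any URI that has an opaque part goes through `check_path` in `uri/generic.rb`, which is the method named at the bottom of the traceback.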

dharls36 commented 6 years ago

I'm getting a similar error when crawling a site, along the lines of:

Failure/Error raise InvalidURIError, "path conflicts with opaque"

mustiikhalil commented 6 years ago

You can point the spidr entry in your Gemfile at the master branch, and it works perfectly.
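A Gemfile entry along these lines is one way to do that. It is a sketch, assuming the project is hosted at the GitHub location implied by the repository name `postmodern/spidr` and that its default branch is called `master`; adjust both if they differ.

```ruby
# Gemfile
source 'https://rubygems.org'

# Assumed repository URL and branch name; pulls spidr straight from git
# instead of the released 0.6.0 gem.
gem 'spidr', git: 'https://github.com/postmodern/spidr.git', branch: 'master'
```

Run `bundle install` afterwards so Bundler fetches the git checkout.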

postmodern commented 4 years ago

Finally fixed in 0.6.1.
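For projects that cannot upgrade yet, the general idea of guarding against this class of error can be sketched as below. This is not Spidr's actual code and not necessarily how 0.6.1 fixes the issue; `normalize_path` is a hypothetical helper that simply leaves opaque URIs untouched before doing any path rewriting.

```ruby
require 'uri'

# Hypothetical helper (not from Spidr): only rewrite the path of hierarchical
# URIs, and return opaque ones (mailto: and similar) unchanged.
def normalize_path(uri)
  return uri if uri.opaque

  # Example normalization: give path-less URLs an explicit root path.
  uri.path = '/' if uri.path.empty?
  uri
end

links = [
  'https://stackoverflow.com',
  'mailto:someone@example.com?subject=Hello' # placeholder address
]

normalized = links.map { |link| normalize_path(URI.parse(link)).to_s }
puts normalized
```

Checking `#opaque` first means `path=` is never called on a URI for which `check_path` would raise.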