stewartmckee / cobweb

Web crawler with very flexible crawling options. Can be used standalone or with Resque to perform clustered crawls.
MIT License
227 stars 45 forks

Error raised when there's a valid <base> tag in <head> #61

Open svenaas opened 3 years ago

svenaas commented 3 years ago

After several years of happy operation our Cobweb-dependent crawler ran into a page at https://sso.cas.org/ where the <head> contains this <base> tag:

<base href="https://sso.cas.org/"/>

Our log file was reporting

Error loading http://our.example.com/url: undefined method `present?' for "https://sso.cas.org/":String

and I believe I've traced the problem to a bug in Cobweb's lib/content_link_parser.rb. In the code

    if @doc.at("base[href]")
      @base_url = @doc.at("base[href]").attr("href").to_s if @doc.at("base[href]").attr("href").to_s.present?
    end

I believe the second line is intended to be:

      @base_url = @doc.at("base[href]").attr("href").to_s if @doc.at("base[href]").attr("href").present?

though I haven't been under the hood in Cobweb before and may be misunderstanding what you're trying to do.

stewartmckee commented 3 years ago

I'm sorry, I didn't pick this up for some reason. The to_s is there to make sure the present? function is available. present? is a Rails function, though, so it may be that you're not running within Rails. I'm not sure why a Rails function is in there, as Cobweb shouldn't have a dependency on Rails. As a workaround, you could try requiring the function from ActiveSupport:

require 'active_support/core_ext/string'

which will pull in the present? function I believe. I'll look to remove this dependency shortly.
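For anyone trying to confirm this diagnosis, here is a minimal sketch of the failure in plain Ruby. It assumes ActiveSupport has not been loaded in the process. The helper name `try_present` is hypothetical and exists only for this illustration; it is not part of Cobweb.

```ruby
# Minimal reproduction sketch. Assumes plain Ruby with ActiveSupport NOT loaded.
href = "https://sso.cas.org/"

# Hypothetical helper (not part of Cobweb): calls present? and reports
# what happens instead of letting the exception propagate.
def try_present(value)
  value.present?
rescue NoMethodError => e
  "NoMethodError: #{e.name}"
end

# Core Ruby's String does not define present?, so the call above raises.
result = try_present(href)
puts result
```

If ActiveSupport's extensions were loaded first, the same call would be expected to return true for a non-empty string instead of raising.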

svenaas commented 3 years ago

Thanks, Stewart! Our Cobweb-related code isn't a Rails project, but I use Rails often enough that I didn't think of that explanation. If we need to work around this before you eliminate the dependency we can patch String by loading ActiveSupport's extensions as you suggest.

stewartmckee commented 3 years ago

The code for present? is pretty simple, so I'll probably just write our own version based on the code in Rails.
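A dependency-free version along those lines might look roughly like the sketch below. This is not the change that was made in Cobweb. The names `PresenceCheck` and `pick_base_url` are hypothetical, and the whitespace rule only approximates what ActiveSupport does for strings.

```ruby
# Hypothetical stand-in for ActiveSupport's present?, written without any
# Rails dependency. Not taken from Cobweb's source.
module PresenceCheck
  # True when the value is non-nil and contains at least one
  # non-whitespace character once converted to a String.
  def self.present?(value)
    return false if value.nil?
    !value.to_s.match?(/\A[[:space:]]*\z/)
  end
end

# Hypothetical version of the <base href> guard: use the href when it is
# present, otherwise fall back to the URL the page was fetched from.
def pick_base_url(href, default_url)
  PresenceCheck.present?(href) ? href.to_s : default_url
end

puts pick_base_url("https://sso.cas.org/", "http://our.example.com/url")
```

Keeping the check in a small module rather than reopening String avoids clashing with ActiveSupport's own definition when Cobweb is used inside a Rails application.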
