stewartmckee / cobweb

Web crawler with very flexible crawling options. Can be used standalone or with resque to perform clustered crawls.
MIT License

Feature request: Stop crawl at time #54

Open samnissen opened 7 years ago

samnissen commented 7 years ago

Hello -- this looks like a great crawler, but I need a way to cap crawl times on a per-URL basis.

Because of that I recommend two features:

Actually raise exceptions

This would allow me to decide any arbitrary conditions upon which to stop crawling.

require 'cobweb'
require 'securerandom'

def condition
  true if SecureRandom.hex(10).include?("a") # or whatever condition I deem relevant
end

CobwebCrawler.new(:raise_arbitrary_exceptions => true).crawl("http://pepsico.com") do |page|
  puts "Just crawled #{page[:url]} and got a status of #{page[:status_code]}."
  raise MyCustomError, "message" if condition
end
Just crawled http://www.pepsico.com/ and got a status of 200.
# ... eventually condition is met ...
MyCustomError: message
        from (somewhere):3
# ...

Encode crawl stop options

This would be a higher-level way of enshrining these as features, and would be a lot cleaner overall.

require 'cobweb'

pages = 0
puts Time.now #=> 2017-04-19 13:33:11 +0100 

CobwebCrawler.new(:max_pages => 1000, :max_time => 360).crawl("http://pepsico.com") do |page|
  pages += 1
end
puts "Stopped after #{pages} pages at #{Time.now}"
#=> Stopped after 1000 pages at 2017-04-19 13:36:25 +0100
# (... or some other time that is not more than 360 seconds from start time)

Ideally :max_time would accept DateTime, Time or Integer objects, where the integer would represent seconds.
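
For illustration, a rough sketch of the kind of normalization I have in mind -- resolve_stop_at is a made-up helper name, not anything Cobweb provides today:

require 'date'

# Made-up helper: turn a :max_time value into an absolute stop time.
def resolve_stop_at(max_time, now = Time.now)
  case max_time
  when Integer  then now + max_time      # seconds from now
  when Time     then max_time            # absolute time, used as-is
  when DateTime then max_time.to_time    # convert for comparison against Time.now
  else raise ArgumentError, "unsupported :max_time value: #{max_time.inspect}"
  end
end

resolve_stop_at(360)               #=> Time 360 seconds from now
resolve_stop_at(Time.now + 3600)   #=> that Time, unchanged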

I'm totally new to this project, so feel free to let me know if these are crazy requests. I'm happy to help build this too, if you can give me a pointer as to where to start.

svenaas commented 4 years ago

We would benefit from :max_pages or :max_time options, especially in development and test environments.

stewartmckee commented 3 years ago

If the exception is raised, would you want the whole crawl to stop at that point?

I think you get the same as max_pages by submitting crawl_limit; there is also a crawl_limit_by_page boolean, which I think is false by default. crawl_limit is the max number of URLs, and if crawl_limit_by_page is set to true then crawl_limit only applies to text/html content.
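
e.g. something like this should behave like the max_pages suggestion (values just for illustration, untested):

require 'cobweb'

# crawl_limit caps the number of urls; with crawl_limit_by_page set to true
# only text/html pages count towards the limit.
crawler = CobwebCrawler.new(:crawl_limit => 100, :crawl_limit_by_page => true)
crawler.crawl("http://pepsico.com") do |page|
  puts "Crawled #{page[:url]} (#{page[:status_code]})"
end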

I like the idea of max_time though, hadn't thought of that before. I'm thinking it would set a datetime and include that date in the within_crawl_limits check to see if it has passed, so it could also consume a stop_at datetime. max_time would just do the arithmetic for you.
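
Roughly what I'm picturing, just as a sketch (none of these names exist in the code yet):

# Work out an absolute stop time from the proposed options when the crawl starts.
def stop_time_for(options, started_at)
  options[:stop_at] || (options[:max_time] && started_at + options[:max_time])
end

# within_crawl_limits would then also require that this hasn't passed.
def within_time_limit?(stop_time)
  stop_time.nil? || Time.now < stop_time
end

started_at = Time.now
stop_time  = stop_time_for({ :max_time => 360 }, started_at)
within_time_limit?(stop_time)   #=> true until 360 seconds after started_at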

samnissen commented 3 years ago

Yes, I think raising the error, breaking, or returning should stop the crawl as the default.

Wasn't aware of crawl_limit – I'll check that out, thank you.

As for max_time, I'm thinking that would probably be an integer, whereas something like stop_at could be a datetime.