stewartmckee / cobweb

Web crawler with very flexible crawling options. Can either use standalone or can be used with resque to perform clustered crawls.
MIT License
227 stars 45 forks source link

Code organization #34

Open andrejj opened 9 years ago

andrejj commented 9 years ago

Hi, first I want to say thank you for sharing this crawler and for the work you put in it.

Here is our experience with it and thoughts for improvements. I would be happy to know if you agree and if you would like to get this implemented (we can contribute of course).

We have a repository of code, we use for doing lots of data processing using resque. We tried to use cobweb within our repository and here are our issues:

  1. name conflicts, classes are declared on a global level. Classes declared in cobweb should be name-spaced in a module. Example: Cobweb::Stats
  2. Sinatra loaded by default. We run our code on multiple machines with multiple processes. As I understand sinatra's purpose is to provide a UI for stats. We don't need/want it to be loaded every time on all boxes consuming memory and slowing down the boot time of our app. So this should be optional (example: 'require cobweb-web' or separate gem).
  3. files directive in gemspec. Everything you put in the files directive, can be loaded automatically. This again exposes naming conflicts. For example we use Fozzie that declares Stats module. But when you do 'require stats', you don't know which one is going to be loaded.
  4. sidekick vs resque, could be optional programmers decision and I would avoid auto detection
  5. logging should be configurable and puts statements should not be used. ruby Cobwbeb.logger = Logger.new

In conclusion this is what i have in mind:

require 'cobweb-resque'
# OR
require 'cobweb-sidekick'
require 'cobweb-web' # optional
Cobweb.logger = Logger.new("crawler.log")
stewartmckee commented 9 years ago

Hi Andej,

Agreed. Its kind of grown organically over time. Sounds like its due for a v2. :). Regarding your points,

  1. 110% agreed, should be refactored, there are a number of helper classes too that should just ne part of the main class too.
  2. Agreed on this one, was thinking of spliting to something similar to resque_web. Possibly bundling it as a seperate gem too because all the assets for the gui make the gem too big.
  3. Yes, namespacing and refactoring should solve this hopefully
  4. Yes, could do, thinking about it, explicit declaration of a queue system would be good, especially if your not using one, less dependancies.
  5. Hadn't thought about logging, good suggestion. The debug and quiet options can be removed and output for those can be mapped to a log level.

Cool, so all good suggestions. I've been hesitant to make breaking changes, but will start a v2 branch next time i get a chance. i'll give you a shout once it's there and if you've got any changes fire a pull request to it,

thanks, Stewart.

On 18 Nov 2014, at 08:44, Andrej Jančič notifications@github.com wrote:

Hi, first I want to say thank you for sharing this crawler and for the work you put in it.

Here is our experience with it and thoughts for improvements. I would be happy to know if you agree and if you would like to get this implemented (we can contribute of course).

We have a repository of code, we use for doing lots of data processing using resque. We tried to use cobweb within our repository and here are our issues:

name conflicts, classes are declared on a global level. Classes declared in cobweb should be name-spaced in a module. Example: Cobweb::Stats

Sinatra loaded by default. We run our code on multiple machines with multiple processes. As I understand sinatra's purpose is to provide a UI for stats. We don't need/want it to be loaded every time on all boxes consuming memory and slowing down the boot time of our app. So this should be optional (example: 'require cobweb-web' or separate gem).

files directive in gemspec. Everything you put in the files directive, can be loaded automatically. This again exposes naming conflicts. For example we use Fozzie that declares Stats module. But when you do 'require stats', you don't know which one is going to be loaded.

sidekick vs resque, could be optional programmers decision and I would avoid auto detection

logging should be configurable and puts statements should not be used. ruby Cobwbeb.logger = Logger.new

In conclusion this is what i have in mind:

require 'cobweb-resque'

OR

require 'cobweb-sidekick' require 'cobweb-web' # optional Cobweb.logger = Logger.new("crawler.log") — Reply to this email directly or view it on GitHub.