taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

QueueOverflow refactoring #45

Closed taganaka closed 9 years ago

coveralls commented 10 years ago

Coverage decreased (-0.5%) when pulling 12feef11f2c26f81be5fe0aaf42b967e4909f86b on overflow_items_controller_ref into 0e4b19adbce8d408994ce83b43ea6fe148821a55 on master.

coveralls commented 10 years ago

Coverage decreased (-0.32%) when pulling 5bebd9af26c283b7bec2799f99b066b423728657 on overflow_items_controller_ref into 0e4b19adbce8d408994ce83b43ea6fe148821a55 on master.

tmaier commented 10 years ago

I think the overflow manager would be a perfect match for a plugin. With the current architecture, it would start in on_crawl_start and be stopped (killed) in on_crawl_end.

Second, shouldn't we call thread.exit, as you already did in commit 68d00fafa81e9f2471dc2242925d52df2396c590?

taganaka commented 10 years ago

I'm still not 100% convinced plugins are really needed. Here is an attempt at making the plugin architecture cleaner and more practical:

https://github.com/taganaka/polipus/compare/plugins

Exposing the current plugin hooks as public methods, as we do for other DSL methods, might remove the need for plugins altogether.

At this point plugins are just simple classes: an instance of Polipus is passed to the initializer, and specific blocks of code are then attached to the exposed methods.
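As a rough illustration of that idea, the sketch below uses a stand-in crawler class rather than Polipus itself. `SketchCrawler`, `OverflowManagerPlugin` and the way hooks are stored are assumptions made for this example; only the hook names on_crawl_start and on_crawl_end come from this discussion.

```ruby
# Stand-in for a crawler that exposes its hooks as public, DSL-style methods.
# This is NOT Polipus's actual API; it only borrows the hook names from the thread.
class SketchCrawler
  def initialize
    @hooks = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a block to run before crawling starts.
  def on_crawl_start(&block)
    @hooks[:crawl_start] << block
    self
  end

  # Register a block to run after crawling ends.
  def on_crawl_end(&block)
    @hooks[:crawl_end] << block
    self
  end

  # Run the start hooks, the (hypothetical) crawl body, then the end hooks.
  def takeover
    @hooks[:crawl_start].each { |hook| hook.call(self) }
    yield self if block_given?
  ensure
    @hooks[:crawl_end].each { |hook| hook.call(self) }
  end
end

# Hypothetical plugin: receives the crawler in its initializer and attaches
# blocks to the exposed hook methods.
class OverflowManagerPlugin
  attr_reader :thread

  def initialize(crawler)
    crawler.on_crawl_start do
      # Placeholder loop standing in for the real overflow handling.
      @thread = Thread.new { loop { sleep 0.1 } }
    end
    crawler.on_crawl_end do
      @thread&.kill
      @thread&.join
    end
  end
end
```

With this shape, `OverflowManagerPlugin.new(crawler)` followed by `crawler.takeover { ... }` starts the background thread before the crawl body runs and stops it afterwards, even if the body raises.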

Thread.exit should not be needed here. The thread is not joined in the main thread, so when the main thread terminates, all the other threads are killed as well.
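A small way to check that claim is to run a child Ruby process whose main thread ends while an unjoined thread is still looping. The script text and the `run_child` helper are made up for this sketch; `Open3.capture2` and `RbConfig.ruby` are standard library calls.

```ruby
require 'open3'
require 'rbconfig'

# Child program: starts an endless background thread and never joins it.
CHILD_SCRIPT = <<~RUBY
  Thread.new { loop { sleep 0.1 } }
  puts 'main thread finished'
RUBY

# Runs the child in a separate Ruby process and returns [stdout, exit status].
# If the background thread kept the process alive, this call would never return.
def run_child
  Open3.capture2(RbConfig.ruby, '-e', CHILD_SCRIPT)
end

output, status = run_child
puts output
puts "child exited: #{status.exited?}"
```

This only covers the case where the process itself ends. It says nothing about a process that keeps running after the crawl method has returned.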

tmaier commented 10 years ago

For me there is a strong use case for plugins: in my opinion the options list is too long and the number of methods in PolipusCrawler is too high. I would give a plugin access to the instance of PolipusCrawler and to the current instance of the Worker. Could you open a [WIP] pull request for the plugins branch so that we can discuss it there?

I have to admit, I don't have much experience with threads in Ruby. But just because PolipusCrawler#takeover returns does not mean the main process exits. Even though #takeover is done with its job, the overflow manager would still be running, right?

So in some cases (e.g. rake tasks, a pry/irb console, maybe tests), multiple threads with an overflow manager could still be running. These would be edge cases, of course.

tmaier commented 10 years ago

def do_your_job(name)
  loop do
    puts "#{name} is still working"
    sleep 1
  end
end

def takeover
  # Background thread standing in for the overflow manager; it is never stopped.
  Thread.new { do_your_job("Overflow Manager") }

  workers = 3.times.map do |worker_number|
    Thread.new do
      puts "Worker #{worker_number} starting crawl session..."
      sleep 3
      puts "Worker #{worker_number} finishing crawl session..."
    end
  end
  sleep 10
  puts '10 Seconds are over. Joining.'
  # Array#join would only build a string; join each worker thread instead.
  workers.each(&:join)
end

takeover
takeover
takeover

would result in

Overflow Manager is still working
Worker 0 starting crawl session...
Worker 1 starting crawl session...
Worker 2 starting crawl session...
Overflow Manager is still working
Overflow Manager is still working
Worker 1 finishing crawl session...
Worker 2 finishing crawl session...
Worker 0 finishing crawl session...
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
10 Seconds are over. Joining.
Worker 2 starting crawl session...
Overflow Manager is still working
Worker 0 starting crawl session...
Worker 1 starting crawl session...
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Worker 2 finishing crawl session...
Worker 0 finishing crawl session...
Worker 1 finishing crawl session...
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
10 Seconds are over. Joining.
Worker 2 starting crawl session...
Overflow Manager is still working
Worker 0 starting crawl session...Worker 1 starting crawl session...

Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Worker 0 finishing crawl session...Worker 2 finishing crawl session...
Worker 1 finishing crawl session...

Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
Overflow Manager is still working
10 Seconds are over. Joining.

That is approximately 10 lines of "Overflow Manager is still working" during the first takeover, 20 during the second and 30 during the third, because every call to takeover leaves one more overflow manager thread running.
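One way to avoid the accumulation shown above is to stop the manager thread when takeover finishes. The following is a minimal sketch under assumptions, not the change made in this pull request: `takeover_with_cleanup`, its parameters and the placeholder loop are hypothetical, and the sleeps are shortened so it runs quickly.

```ruby
# Hypothetical variant of the takeover method from the example above.
def takeover_with_cleanup(worker_count: 3, crawl_seconds: 0.1)
  # Placeholder for the overflow manager's work loop.
  manager = Thread.new { loop { sleep 0.05 } }

  workers = Array.new(worker_count) do |worker_number|
    Thread.new do
      sleep crawl_seconds # stands in for a crawl session
      worker_number
    end
  end

  # Thread#value joins each worker and returns its block's result.
  workers.map(&:value)
ensure
  # Stop the manager even if a worker raised, then wait until it is gone.
  manager&.kill
  manager&.join
end

before = Thread.list.size
3.times { takeover_with_cleanup }
puts "extra threads left running: #{Thread.list.size - before}"
```

Because the cleanup sits in an `ensure` clause, repeated calls from a long-lived process such as an irb session or a rake task should not leave manager threads behind.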