omniscrapper / runner

Scrappers runner service for OmniScrapper
MIT License

Parallelize scraping #6

Open Mehonoshin opened 5 years ago

Mehonoshin commented 5 years ago

It should be up to the crawler developer whether or not to parallelize the process. For example, for some social networks it is not a good idea to scrape with many parallel threads.

For now, we need to come up with a proof of concept that allows certain pieces to be parallelized as separate Sidekiq jobs.

For a gallery crawler, it makes sense to scrape each page in parallel to make the crawl faster.

parallelize do |context|
  # some action
end

Passing code to this block should spawn a separate job that receives all the necessary context, for example the page URL, maybe cookies, and so on.
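To make the idea more concrete, here is a minimal sketch of how such a DSL might behave. Everything in it is an assumption for illustration: `InMemoryQueue`, `Crawler`, `GalleryCrawler`, `spawn`, and `perform` are hypothetical names, not part of OmniScrapper, and the in-memory queue only stands in for Sidekiq. One detail worth noting is that a Ruby block cannot be serialized into a job payload, so the sketch registers the block under a name and enqueues only the step name plus a JSON-friendly context.

```ruby
require "json"

# Hypothetical stand-in for a Sidekiq queue: an in-memory list of JSON payloads.
class InMemoryQueue
  attr_reader :payloads

  def initialize
    @payloads = []
  end

  def push(step_name, context)
    # Serialize to JSON to mimic what a background job would receive:
    # only plain data (strings, numbers, hashes, arrays) survives.
    @payloads << JSON.generate("step" => step_name.to_s, "context" => context)
  end
end

# Hypothetical crawler base class; none of these names come from OmniScrapper.
class Crawler
  def self.steps
    @steps ||= {}
  end

  # Register a named block that can later run in a separate job.
  def self.parallelize(name, &block)
    steps[name.to_s] = block
  end

  # What a worker would do: decode the payload and run the registered block.
  def self.perform(payload)
    job = JSON.parse(payload)
    steps.fetch(job["step"]).call(job["context"])
  end

  def initialize(queue)
    @queue = queue
  end

  # Enqueue one job per context instead of running the block inline.
  def spawn(name, context)
    @queue.push(name, context)
  end
end

class GalleryCrawler < Crawler
  parallelize :scrape_page do |context|
    # some action; here we only echo the URL we were given
    "scraped #{context["url"]}"
  end
end

queue = InMemoryQueue.new
crawler = GalleryCrawler.new(queue)
(1..2).each do |page|
  crawler.spawn(:scrape_page, { "url" => "https://example.com/gallery?page=#{page}" })
end

# A worker process would pick these payloads up; here we run them inline.
results = queue.payloads.map { |payload| GalleryCrawler.perform(payload) }
```

With this shape, a crawler that should not be parallelized simply never calls `spawn` and runs its steps inline, which keeps the decision with the crawler developer.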

Mehonoshin commented 4 years ago

The question is: how are we going to collect all the scraped data? We need to come up with some synchronization mechanism, or each worker should report its results separately.
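One possible shape for the synchronization option is sketched below, purely as an assumption. `ResultCollector`, `report`, and `wait` are hypothetical names, and plain threads stand in for Sidekiq workers. Real workers run in separate processes, so the counter and the results would have to live in shared storage (for example Redis) rather than in process memory.

```ruby
# Hypothetical collector: knows how many jobs were spawned and gathers
# their results until all of them have reported.
class ResultCollector
  def initialize(expected)
    @expected = expected
    @results = []
    @mutex = Mutex.new
    @done = ConditionVariable.new
  end

  # Called by each worker when it finishes its piece of work.
  def report(result)
    @mutex.synchronize do
      @results << result
      @done.signal if @results.size >= @expected
    end
  end

  # Blocks until every expected result has been reported, then returns a copy.
  def wait
    @mutex.synchronize do
      @done.wait(@mutex) until @results.size >= @expected
      @results.dup
    end
  end
end

urls = (1..3).map { |n| "https://example.com/gallery?page=#{n}" }
collector = ResultCollector.new(urls.size)

# Threads stand in for separate jobs; each one reports its own result.
workers = urls.map do |url|
  Thread.new { collector.report({ "url" => url, "status" => "scraped" }) }
end

collected = collector.wait
workers.each(&:join)
```

The alternative mentioned above, where each worker reports separately, would drop `wait` entirely: every job would write its own result to the final destination, and nothing would need to know when the whole crawl has finished.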