Open tmaier opened 10 years ago
:+1: I'm really supportive of refactoring and moving code around as long as it helps maintainability and testing. In general I'm not a huge fan of having everything encapsulated/abstracted/delegated (Matryoshka doll style) just because guides and books say so :)
So let's keep the topic open.
What about starting a WISH/TODO list of the features and improvements we envision for the next releases? That would give us better context on what should come next and how to move forward.
Feel free to add your points to this comment or to strike out a line if you're against it.
Reorder the points freely
PolipusCrawler
which does everything.#on_page_downloaded
blocks.pry
.Polipus::HTTP
with excon (@tmaier)
Polipus?
Very good points! Thanks a lot for your thoughts and your help.
I'm going to open a separate issue/thread for each item so that it is easier to keep track of them.
Off the top of my head:
- storage.exists?(page) is invoked very often, and every call turns into a hit on the DB, slowing down the process. A bloom filter added on top of the storage logic might help here.
- storage.add(page) performs a single insert/write each time it is invoked. We might gain something here by supporting bulk writes/inserts wherever the underlying driver allows such operations (https://github.com/mongodb/mongo-ruby-driver/wiki/Bulk-Write-Operations).
- s3_storage
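To make the first two storage ideas above more concrete, here is a minimal sketch of a wrapper that puts a Bloom filter in front of exists? and buffers add calls into batches. Everything in it is an assumption for illustration: FakeStorage, add_many and FilteredBufferedStorage are hypothetical names, pages are represented by plain URL strings, and none of this is the existing Polipus storage API. It also assumes the filter sees every page that is ever added; with a pre-populated store the filter would have to be seeded first.

```ruby
require 'digest'
require 'set'

# Small Bloom filter: up to 8 bit positions per key, taken from one SHA-256 digest.
class BloomFilter
  def initialize(size_bits: 1 << 20, hashes: 4)
    @size = size_bits
    @hashes = hashes # must be <= 8, since SHA-256 yields eight 32-bit words
    @bits = "\0".b * ((size_bits + 7) / 8)
  end

  def add(key)
    positions(key).each do |pos|
      @bits.setbyte(pos >> 3, @bits.getbyte(pos >> 3) | (1 << (pos & 7)))
    end
  end

  # false means "definitely never added"; true means "possibly added".
  def include?(key)
    positions(key).all? { |pos| @bits.getbyte(pos >> 3)[pos & 7] == 1 }
  end

  private

  def positions(key)
    Digest::SHA256.digest(key.to_s).unpack('N*').first(@hashes).map { |w| w % @size }
  end
end

# Hypothetical stand-in for a real backend; it only counts round trips.
class FakeStorage
  attr_reader :lookups, :writes

  def initialize
    @pages = Set.new
    @lookups = 0
    @writes = 0
  end

  def exists?(page)
    @lookups += 1
    @pages.include?(page)
  end

  # Hypothetical bulk insert: one round trip for many pages.
  def add_many(pages)
    @writes += 1
    @pages.merge(pages)
  end
end

# Wrapper: the filter answers most negative lookups, writes go out in batches.
class FilteredBufferedStorage
  def initialize(storage, batch_size: 100)
    @storage = storage
    @filter = BloomFilter.new
    @buffer = []
    @batch_size = batch_size
  end

  def exists?(page)
    return false unless @filter.include?(page) # never added: skip the DB
    return true if @buffer.include?(page)      # added, but not flushed yet
    @storage.exists?(page)                     # possible false positive: confirm
  end

  def add(page)
    @filter.add(page)
    @buffer << page
    flush if @buffer.size >= @batch_size
  end

  def flush
    return if @buffer.empty?
    @storage.add_many(@buffer)
    @buffer = []
  end
end
```

A crawler using such a wrapper would need to call flush once more at shutdown so that the last partial batch is not lost.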
I also don't really know what to do with the current plugin implementation. First of all, a plugin does not have access to the page it is currently processing.
Next, the existing plugins Cleaner and Sleeper: for me it is not really obvious why they are plugins while everything else is an option of PolipusCrawler. See polipus.rb#L23-L80
I would propose giving plugins access to the page and also moving every single configurable feature to the plugin architecture. This way, someone could replace individual features with their own implementation, or simply get a slimmer crawler when they do not need some of the features provided.
I imagine it like the Middleware of Rack or Sidekiq.
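As a rough illustration of what a Rack-style chain could look like for pages, here is a minimal sketch. All names in it (Page, MiddlewareChain, StripWhitespace, DropEmptyPages) are hypothetical and invented for this example; it is not the current Polipus plugin API, only the general pattern where each middleware receives the next step in its constructor and decides whether to pass the page on.

```ruby
# Hypothetical page object; a real crawler page would carry much more.
Page = Struct.new(:url, :body)

# Builds a nested call chain out of the registered middleware classes.
class MiddlewareChain
  def initialize
    @entries = []
  end

  # Register a middleware class plus optional constructor arguments.
  def use(klass, *args)
    @entries << [klass, args]
    self
  end

  # Run the page through every middleware, then through the final block.
  def call(page, &final)
    app = final || ->(p) { p }
    # Wrap from the inside out so the first registered middleware runs first.
    @entries.reverse_each do |klass, args|
      app = klass.new(app, *args)
    end
    app.call(page)
  end
end

# Example middleware: has full access to the page and may change it.
class StripWhitespace
  def initialize(app)
    @app = app
  end

  def call(page)
    page.body = page.body.strip
    @app.call(page)
  end
end

# Example middleware: stops the chain by not calling the next step.
class DropEmptyPages
  def initialize(app)
    @app = app
  end

  def call(page)
    return nil if page.body.empty?
    @app.call(page)
  end
end

chain = MiddlewareChain.new.use(StripWhitespace).use(DropEmptyPages)
chain.call(Page.new('http://example.com/', '  hello  ')) { |page| page.body.upcase }
```

With this shape, replacing a built-in feature would amount to registering a different class in the chain, and leaving a feature out would simply mean not registering it.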
> I also don't really know what to do with the current plugin implementation.
Me neither :) My initial concept was to create an architecture where the user's code could run within Polipus' scope, but I didn't invest much time in it. I'm also fine with dropping the current implementation and exploring a Middleware-like one (that actually seems a very good idea!!!)
As a result of #33, I reconsidered the current structure of PolipusCrawler.
In particular, PolipusCrawler#takeover is a very long method where a lot is going on at the same time. PolipusCrawler itself has lots of methods and is responsible for everything that goes on in Polipus.
I consider this pull request more a proof of concept and a starting point for a discussion.
I would like to move all methods of PolipusCrawler into their own classes or plugins so that every class has a single responsibility. For now, I moved most of PolipusCrawler#takeover to Worker#run and split it again into smaller methods. This would allow more thorough testing of specific features of Polipus without running the full stack.
The delegator used in Worker is more of a temporary solution.
We could allow plugins to hook into should_be_visited? and then add a robots plugin, a follow_links_like plugin and a store_pages plugin.
A statistics plugin would replace incr_pages and incr_errors and hook into on_after_download and on_page_error.
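The hook idea could be sketched roughly as follows. HookRegistry, FollowLinksLike and Statistics are hypothetical classes written only for this example, and the "any pattern matches" rule for follow_links_like is an assumption; only the hook names should_be_visited?, on_after_download and on_page_error come from the discussion above.

```ruby
# Minimal hook registry: plugins subscribe blocks to named events.
class HookRegistry
  def initialize
    @hooks = Hash.new { |hash, event| hash[event] = [] }
  end

  def on(event, &block)
    @hooks[event] << block
    self
  end

  # Notify every subscriber; return values are ignored.
  def emit(event, *args)
    @hooks[event].each { |callback| callback.call(*args) }
    nil
  end

  # True only if every subscriber approves (also true with no subscribers).
  def all?(event, *args)
    @hooks[event].all? { |callback| callback.call(*args) }
  end
end

# Hypothetical plugin: only URLs matching one of the patterns are visited.
class FollowLinksLike
  def initialize(*patterns)
    @patterns = patterns
  end

  def install(hooks)
    hooks.on(:should_be_visited?) { |url| @patterns.any? { |re| url.match?(re) } }
  end
end

# Hypothetical plugin standing in for incr_pages / incr_errors.
class Statistics
  attr_reader :pages, :errors

  def initialize
    @pages = 0
    @errors = 0
  end

  def install(hooks)
    hooks.on(:on_after_download) { |_page| @pages += 1 }
    hooks.on(:on_page_error) { |_page| @errors += 1 }
  end
end

hooks = HookRegistry.new
FollowLinksLike.new(%r{/blog/}).install(hooks)
stats = Statistics.new
stats.install(hooks)

hooks.all?(:should_be_visited?, 'http://example.com/blog/1') # approved by the plugin
hooks.emit(:on_after_download, 'http://example.com/blog/1')  # counted by Statistics
```

A worker would then only need to ask the registry before queueing a URL and emit events after each download or error, which keeps the counting and filtering logic out of the crawler class itself.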