taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

Incremental Crawling #10

Closed: nengine closed this issue 10 years ago

nengine commented 10 years ago

Hi, I have the setup shown below and it works fine the first time: all the pages are crawled as intended. However, when I run it a second time, expecting to pick up new updates, it hangs and logs a message saying the page is "already stored". If I set the cleaner option to true, it wipes out the entire database and starts from scratch, which is exactly what I want to avoid. Obviously the page is already stored, but shouldn't it still check for updates?

Polipus::Plugin.register Polipus::Plugin::Cleaner, reset: false
starting_urls = ["http://www.abc.com/home/"]

[worker #0] Page [http://www.abc.com/home/] already stored.
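
For context, here is a minimal sketch of how those two lines fit into a full crawl, assuming the standard Polipus.crawler block API; the job name "abc" and the on_page_downloaded body are placeholders, not part of the original setup:

require "polipus"

# reset: false keeps previously stored pages between runs instead of
# wiping the storage, queue and bloomfilter at startup.
Polipus::Plugin.register Polipus::Plugin::Cleaner, reset: false

starting_urls = ["http://www.abc.com/home/"]

Polipus.crawler("abc", starting_urls) do |crawler|
  crawler.on_page_downloaded do |page|
    # page.url is the fetched URL; page.doc is the parsed document
    puts "Fetched: #{page.url}"
  end
end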
taganaka commented 10 years ago

Not implemented yet. I'm working on a set of new features where you can specify an expire time for each page. If a page is expired, it will be downloaded again, overwriting the old stored document.

For now, a working alternative could be to clean the redis-bloomfilter data (a sketch of that step follows the override below) and override storage.exists? in your storage adapter so that it returns false even if the page already exists:

https://github.com/taganaka/polipus/blob/master/lib/polipus/storage/mongo_store.rb#L32-L37

module Polipus
  module Storage
    class MongoStore < Base
      # Always report the page as not stored so the crawler fetches it
      # again, even though a copy already exists in MongoDB.
      def exists?(page)
        false
      end
    end
  end
end
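
For the bloomfilter half of that workaround, something along these lines should work, assuming the default Redis-backed URL tracker. The "polipus_bf_*" key pattern is a guess: list the keys first and adjust the pattern to whatever your Polipus version actually uses.

require "redis"

redis = Redis.new

# Inspect which keys the crawler created before deleting anything.
puts redis.keys("polipus*")

# Drop only the bloomfilter key(s) so already-seen URLs get enqueued
# again, leaving the stored documents and the queue alone.
redis.keys("polipus_bf_*").each { |key| redis.del(key) }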
taganaka commented 10 years ago

@neuralnw You can try this branch:

https://github.com/taganaka/polipus/tree/incremental_crawling

Major changes:

# Page TTL: mark a page as expired after ttl_page seconds
# Default: nil
:ttl_page => 60

Going to merge in the next few days
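
For reference, a sketch of how the new option would presumably be used once the branch is merged, assuming :ttl_page travels through the usual crawler options hash like the other Polipus settings (the 24-hour value is only an example):

require "polipus"

options = {
  # Re-download a page once its stored copy is older than 24 hours.
  ttl_page: 60 * 60 * 24
}

Polipus.crawler("abc", ["http://www.abc.com/home/"], options) do |crawler|
  crawler.on_page_downloaded do |page|
    puts "Refreshed: #{page.url}"
  end
end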

nengine commented 10 years ago

Hi, thanks. I tested it and it is working great. Regards.