taganaka / polipus

Polipus: distributed and scalable web-crawler framework
MIT License

Incremental Crawling #10

Closed: nengine closed this issue 10 years ago

nengine commented 10 years ago

Hi, I have the setup shown below and it works fine the first time: all the pages are crawled as intended. However, when I run it a second time, expecting to pick up new updates, it hangs and logs a message saying the page is "already stored". If I set the cleaner option to true, it wipes out the entire database and starts from scratch, which is exactly what I want to avoid. Obviously the page is already stored, but shouldn't it still check for updates?

Polipus::Plugin.register Polipus::Plugin::Cleaner, reset: false
starting_urls = ["http://www.abc.com/home/"]

[worker #0] Page [http://www.abc.com/home/] already stored.
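
For context, here is a minimal sketch of how those two lines fit into a full crawl, assuming the standard Polipus.crawler block API; the job name "abc" and the on_page_downloaded body are placeholders, not part of the original setup:

require "polipus"

# reset: false keeps previously stored pages between runs instead of
# wiping the storage, queue and bloomfilter at startup.
Polipus::Plugin.register Polipus::Plugin::Cleaner, reset: false

starting_urls = ["http://www.abc.com/home/"]

Polipus.crawler("abc", starting_urls) do |crawler|
  crawler.on_page_downloaded do |page|
    # page.url is the fetched URL; page.doc is the parsed document
    puts "Fetched: #{page.url}"
  end
end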
taganaka commented 10 years ago

Not implemented yet. I'm working on a set of new features where you can specify an expire time for each page. If a page is expired, it will be downloaded again, overwriting the old stored document.

For now, a working alternative could be to clean the redis-bloomfilter data (a sketch of that step follows the override below) and override storage.exists? in your storage adapter so that it returns false even if the page already exists:

https://github.com/taganaka/polipus/blob/master/lib/polipus/storage/mongo_store.rb#L32-L37

module Polipus
  module Storage
    class MongoStore < Base
      # Always report the page as not stored so the crawler fetches it
      # again, even though a copy already exists in MongoDB.
      def exists?(page)
        false
      end
    end
  end
end
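
For the bloomfilter half of that workaround, something along these lines should work, assuming the default Redis-backed URL tracker. The "polipus_bf_*" key pattern is a guess: list the keys first and adjust the pattern to whatever your Polipus version actually uses.

require "redis"

redis = Redis.new

# Inspect which keys the crawler created before deleting anything.
puts redis.keys("polipus*")

# Drop only the bloomfilter key(s) so already-seen URLs get enqueued
# again, leaving the stored documents and the queue alone.
redis.keys("polipus_bf_*").each { |key| redis.del(key) }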
taganaka commented 10 years ago

@neuralnw You can try this branch:

https://github.com/taganaka/polipus/tree/incremental_crawling

Major changes:

# Page TTL: mark a page as expired after ttl_page seconds
# Default: nil
:ttl_page => 60

Going to merge in the next few days
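
For reference, a sketch of how the new option would presumably be used once the branch is merged, assuming :ttl_page travels through the usual crawler options hash like the other Polipus settings (the 24-hour value is only an example):

require "polipus"

options = {
  # Re-download a page once its stored copy is older than 24 hours.
  ttl_page: 60 * 60 * 24
}

Polipus.crawler("abc", ["http://www.abc.com/home/"], options) do |crawler|
  crawler.on_page_downloaded do |page|
    puts "Refreshed: #{page.url}"
  end
end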

nengine commented 10 years ago

Hi, thanks. I tested it and it is working great. Regards.