nengine closed this issue 10 years ago
Not implemented yet. I'm working on a set of new features where you can specify an expire time for each page. If the page has expired, it will be downloaded again, overwriting the old stored doc.
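Roughly, the expiry check could look like the sketch below; the `expired?` helper and the `ttl` argument are illustrative names I'm using here, not the actual API:

```ruby
# Illustrative sketch only: decide whether a stored page is stale
# and should be downloaded again.
def expired?(page, ttl)
  return false if ttl.nil?                       # no TTL configured: pages never expire
  return true  if page.fetched_at.nil?           # no timestamp: treat as stale
  (Time.now.to_i - page.fetched_at.to_i) > ttl   # older than the TTL: re-fetch and overwrite
end
```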
For now, a working alternative could be to clear the redis-bloomfilter data and override storage.exists?
in your storage adapter so that it returns false even though the page already exists:
https://github.com/taganaka/polipus/blob/master/lib/polipus/storage/mongo_store.rb#L32-L37
```ruby
module Polipus
  module Storage
    class MongoStore < Base
      # Always report the page as missing so it gets downloaded again
      # and the stored document is overwritten.
      def exists?(page)
        false
      end
    end
  end
end
```
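A sketch of wiring that workaround together might look like the following. The crawler name, URL, and the assumption that the Redis DB is dedicated to the crawler (so flushdb is safe) are placeholders of mine, and the mongo_store factory and option keys should be checked against your Polipus version:

```ruby
require 'polipus'
require 'redis'

# 1) Drop the redis-bloomfilter data by flushing the crawler's Redis DB.
#    Only do this if nothing else lives in that DB.
Redis.new(url: 'redis://localhost:6379/0').flushdb

# 2) Crawl with the patched MongoStore above, so exists? always returns
#    false and already-stored pages are fetched and overwritten.
options = {
  :storage => Polipus::Storage.mongo_store   # default Mongo settings assumed
}

Polipus.crawler('mysite', 'http://example.com/', options) do |crawler|
  crawler.on_page_downloaded do |page|
    puts "Re-fetched #{page.url}"
  end
end
```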
@neuralnw You can try this branch:
https://github.com/taganaka/polipus/tree/incremental_crawling
Major changes:

```ruby
# Page TTL: mark a page as expired after ttl_page seconds
# Default: nil
:ttl_page => 60
```

BSON::ObjectId(_id).generation_time is used as the fetch time only if page.fetched_at
is nil (for legacy data).

Going to merge in the next few days.
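For example, passing the new option to a crawler could look like this sketch; the crawler name, URL, and the TTL value are placeholders, with only :ttl_page coming from the branch:

```ruby
require 'polipus'

options = {
  :ttl_page => 60 * 60 * 24   # treat stored pages as expired after one day
}

Polipus.crawler('mysite', 'http://example.com/', options) do |crawler|
  crawler.on_page_downloaded do |page|
    # Expired pages are downloaded again, overwriting the old stored doc.
    # For documents crawled before this branch, the fetch time falls back
    # to BSON::ObjectId(_id).generation_time.
    puts "#{page.url} fetched at #{page.fetched_at}"
  end
end
```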
Hi, thanks! I tested it and it is working great. Regards.
Hi, I have a setup like below and it works fine the first time: all the pages are crawled as intended. However, when I run it a second time expecting to pick up new updates, it hangs and I get a message indicating the pages are "already stored". When I set the cleaner option to true, it wipes out the entire database and starts from scratch, which is what I'd like to avoid. Obviously the page is already stored, but shouldn't it still look for updates?