shriphani / pegasus

🐎✈️ Pegasus is a scalable, modular, polite web crawler for Clojure
http://getpegasus.io
Eclipse Public License 1.0

Restarts and incremental crawls #25

Open shriphani opened 8 years ago

shriphani commented 8 years ago

See the confusion in: #22

ejschoen commented 8 years ago

For what it's worth, I found two problems that cause restarts to fail.

  1. Factual/durable-queue fails to restore queues whose names contain periods. Since queue names are constructed by keywording the host portion of URLs, this caused a problem: the restored queue name for http://foo.org/Some/path was "org". I resolved it by replacing the default enqueue pipeline component with one that substitutes _ for . in queue names.
  2. The queue workers aren't restored, because the cache says the host has already been visited, so pegasus.queue/setup-queue-worker is never called. I rewrote pegasus.core/start-crawl to take! the first entry from the to-visit queue with a 0 timeout. If it gets something, it constructs a queue worker for the queue name associated with that URL; if it gets nothing, it does the normal seeding.
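A rough sketch of fix 1, assuming a helper not in pegasus itself (`url->queue-name` is an illustrative name, not the project's actual API); it derives the queue name from a URL's host and substitutes `_` for `.` so durable-queue can restore it:

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: durable-queue has trouble restoring queues whose
;; names contain periods, so replace them before enqueueing.
(defn url->queue-name
  [url]
  (-> (java.net.URI. url)
      (.getHost)                      ; "foo.org"
      (string/replace "." "_")        ; "foo_org"
      keyword))                       ; :foo_org

;; (url->queue-name "http://foo.org/Some/path") ;=> :foo_org
```

Fix 2 would then, on startup, attempt a non-blocking `take!` (durable-queue's take with a 0 timeout and a sentinel timeout value) from the restored queue; a hit means there is pending work and a worker should be re-attached for that queue name, while the sentinel means the crawl should be seeded as usual.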
shriphani commented 8 years ago

Solid. Would really appreciate a PR!

ejschoen commented 8 years ago

Will do. I'm working from a fork of a fork that a colleague made; I'll put some effort into folding the changes into defaults and core.