jculvey / roboto

A web crawler/scraper/spider for nodejs

Stop/resume #2

Open f1ames opened 9 years ago

f1ames commented 9 years ago

I think I saw it in the roadmap. It would be nice if you could stop and then resume roboto so it does not start over from the beginning/startUrls. I think it could be achieved via de/serialization, so when you start/stop it loads its previous state.

jculvey commented 9 years ago

Yeah, this is really lacking right now.

I've been a little torn over how to implement this. In the long term I think it would be cool if there was some sort of admin UI where you could view previous crawl results, start and stop new crawls, and maybe even do a little configuration.

That might be a little heavyweight for some people, so having a simple pause/resume from the command line would be nice.
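For the command-line case, something as simple as trapping Ctrl-C might be enough (a rough sketch only; the snapshot contents and the file path are placeholders, not anything roboto does today):

var fs = require('fs');

// Rough sketch of a command-line "pause": trap Ctrl-C, write the current
// crawl state to disk, and exit; restarting the process then resumes.
// `currentState` stands in for a snapshot of the live crawl.
process.on('SIGINT', function() {
  var currentState = { frontier: [], seen: [] }; // live state would go here
  fs.writeFileSync('/var/foo', JSON.stringify(currentState));
  process.exit(0);
});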

How would this change sound:

In the crawler you can configure a queue file:

var crawler = new roboto.Crawler({
  startUrls: [
    "https://news.ycombinator.com/",
  ],  
  queueFile: '/var/foo'
});

Then, the URL frontier and the set of seen URLs will periodically be serialized and flushed out to the file as JSON.
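Something like this, roughly (a minimal sketch; `flushState`, `frontier`, and `seen` are illustrative names, not roboto's actual internals):

var fs = require('fs');

// Hypothetical periodic flush: `frontier` is the array of urls still to
// be visited, `seen` an object used as a set of urls already fetched.
function flushState(queueFile, frontier, seen) {
  var state = {
    frontier: frontier,
    seen: Object.keys(seen)
  };
  // Write to a temp file and rename, so a crash mid-write can't
  // leave a corrupt queue file behind.
  var tmp = queueFile + '.tmp';
  fs.writeFileSync(tmp, JSON.stringify(state));
  fs.renameSync(tmp, queueFile);
}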

f1ames commented 9 years ago

Well, I had a very similar idea. You configure a queue file and the crawler periodically serializes the data necessary for resuming. The flow I was thinking of:

If the crawler is done, it removes the queueFile so that next time it starts from the beginning.
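In code, the flow might look roughly like this (a sketch under the assumptions above; `loadState` and `onCrawlFinished` are hypothetical helper names):

var fs = require('fs');

// Hypothetical resume flow: if a queue file exists, the previous crawl
// was interrupted, so pick up from the saved frontier; otherwise start
// fresh from startUrls.
function loadState(queueFile, startUrls) {
  if (fs.existsSync(queueFile)) {
    return JSON.parse(fs.readFileSync(queueFile, 'utf8'));
  }
  return { frontier: startUrls.slice(), seen: [] };
}

// When a crawl finishes normally, delete the queue file so the next
// run starts over from the beginning.
function onCrawlFinished(queueFile) {
  if (fs.existsSync(queueFile)) {
    fs.unlinkSync(queueFile);
  }
}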