Open f1ames opened 10 years ago
I think I saw this in the roadmap. It would be nice if you could stop and then resume roboto so it does not start over from the beginning/startUrls. I think this could be achieved via de/serialization, so that on stop/start it saves and loads its previous state.
Yeah, this is really lacking right now.
I've been a little torn over how to implement this. In the long term I think it would be cool if there was some sort of admin UI where you could view previous crawl results, start and stop new crawls, and maybe even do a little configuration.
That might be a little heavyweight for some people, so having a simple pause/resume from the command line would be nice.
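For the command-line case, pausing could be as simple as trapping SIGINT and flushing state before exit. A minimal sketch, assuming a hypothetical `saveState` method that is not part of roboto's current API:

```js
// Sketch only: pause a running crawl from the command line.
// `saveState` is a hypothetical method, not an existing roboto API.
process.on('SIGINT', function() {
  console.log('Pausing crawl, flushing state to the queue file...');
  crawler.saveState(function(err) {
    process.exit(err ? 1 : 0);
  });
});
```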
How would this change sound:
In the crawler you can configure a queue file:
```js
var crawler = new roboto.Crawler({
  startUrls: [
    "https://news.ycombinator.com/"
  ],
  queueFile: '/var/foo'
});
```
Then the URL frontier and the set of seen URLs will periodically be serialized and flushed out to that file as JSON.
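Roughly, the flush could look like this. This is just a sketch: the `frontier` and `seen` fields, the `flushState` helper, and the 30-second interval are assumptions about what a resume would minimally need, not roboto's actual internals:

```js
var fs = require('fs');

// Sketch of the periodic flush. The `frontier` and `seen` fields are
// assumed, not documented roboto internals.
function flushState(crawler, queueFile) {
  var state = {
    frontier: crawler.frontier, // urls queued but not yet fetched
    seen: crawler.seen          // urls already visited
  };
  fs.writeFile(queueFile, JSON.stringify(state), function(err) {
    if (err) console.error('Failed to flush crawl state:', err);
  });
}

// Flush every 30 seconds while the crawl is running.
setInterval(function() { flushState(crawler, '/var/foo'); }, 30000);
```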
Well, I have a very similar idea: you configure a queue file and the crawler periodically serializes the data necessary for a resume. The flow I was thinking of:
If the crawler is done, it removes the queueFile so that next time it starts from the beginning.
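A minimal sketch of that flow, assuming the crawler emits a completion event (the 'done' event name and the `frontier`/`seen` fields are assumptions, not documented roboto API):

```js
var fs = require('fs');
var queueFile = '/var/foo';

// On startup: resume from the queue file if one exists, otherwise start fresh.
if (fs.existsSync(queueFile)) {
  var state = JSON.parse(fs.readFileSync(queueFile, 'utf8'));
  crawler.frontier = state.frontier; // assumed fields, as above
  crawler.seen = state.seen;
}

crawler.crawl();

// When the crawl completes, delete the queue file so the next run
// starts over from startUrls. The 'done' event name is an assumption.
crawler.on('done', function() {
  fs.unlink(queueFile, function() {});
});
```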