kalisio / krawler

A minimalist (geospatial) ETL
https://kalisio.github.io/krawler/
MIT License

Integrate better job sequencers #1

Open claustres opened 6 years ago

claustres commented 6 years ago

Kue will add support for failover and concurrency.

worker-farm might also be used, as it is simpler and does not require a side tool like Redis.

agenda also looks great and would allow job scheduling.

claustres commented 6 years ago

A key decision is whether sequencing occurs at the job level or the task level.

At the job level we would have something more serverless-oriented, where each job might be a completely new krawler instance that can be embedded in a lambda. In this case we should at least allow tasks to be multithreaded.
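To make the task-level option concrete, here is a minimal sketch of running the tasks of one job with bounded concurrency. It uses asynchronous concurrency inside a single process rather than real threads, and the names `Task` and `runTasks` are hypothetical, not part of krawler.

```typescript
// Hypothetical task shape for illustration: a function returning a promise.
type Task<T> = () => Promise<T>;

// Runs the tasks with at most `limit` of them in flight at any time.
// Results are returned in the same order as the input tasks.
async function runTasks<T>(tasks: Task<T>[], limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each worker repeatedly claims the next unclaimed task index until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}
```

If a task fails, the returned promise rejects; retry and failover are exactly what a sequencer such as Kue would be expected to add on top of this.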

claustres commented 6 years ago

Started a PoC using kue, just as a new job type, without multithreading/cluster for now. The main issue we face with clustering is how to share stores between workers, because stores are created when running the job. However, the job should only be run by a single worker to avoid duplication, while tasks are dispatched across workers.

Since the Windows port of Redis at https://github.com/MicrosoftArchive/redis has been discontinued, we use https://github.com/tporadowski/redis instead.

claustres commented 6 years ago

The issue with stores also exists in single-thread mode when the job passes the store to a task using the store property. Indeed, task data are serialized into Redis by Kue, causing the store to be lost; e.g. the CLI test does not work with Kue.
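The serialization problem can be reproduced without Redis. The sketch below simulates the queue with a JSON round trip and shows one possible workaround: passing the store name and resolving it in a registry. The `Store`, `MemoryStore`, `throughQueue` and `registry` names are hypothetical illustrations, not krawler's or Kue's actual API.

```typescript
// Hypothetical store shape for illustration; not krawler's actual types.
interface Store {
  name: string;
  write(key: string, data: string): void;
}

class MemoryStore implements Store {
  readonly files = new Map<string, string>();
  constructor(public name: string) {}
  write(key: string, data: string): void {
    this.files.set(key, data);
  }
}

// A Redis-backed queue persists job data as JSON; simulate that with a round trip.
function throughQueue<T>(data: T): T {
  return JSON.parse(JSON.stringify(data)) as T;
}

const store = new MemoryStore("output");

// Passing the live store object: its methods are not serialized, so they are lost.
const broken = throughQueue({ id: "task-1", store: store as Store });
const writeSurvived = typeof (broken.store as any).write === "function";

// Passing only the store name and resolving it in a per-process registry.
const registry = new Map<string, Store>([[store.name, store]]);
const task = throughQueue({ id: "task-1", store: store.name });
const resolved = registry.get(task.store);
if (resolved === undefined) throw new Error(`unknown store: ${task.store}`);
resolved.write(task.id, "payload");
```

Here `writeSurvived` ends up `false`, while the name-based lookup still reaches the original store. A registry like this only helps within one process: with clustering, each worker would still need to create or register the store itself.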

claustres commented 4 years ago

Another interesting sequencer is bull; the following gist shows how to use it with a Feathers app.

claustres commented 3 years ago

breejs might also be a good candidate.