viniciuschiele / flask-apscheduler

Adds APScheduler support to Flask

Many Application Instances: What should I expect? #66

Closed Rikaelus closed 3 years ago

Rikaelus commented 6 years ago

I'm still trying to wrap my head around a lot of Python's and Flask's deeper workings, but some initial prototyping with APScheduler has me concerned about how scheduled tasks will play out in a distributed application environment.

The first encounter was with Flask's debug/reload functionality triggering the scheduler twice. I handled that based on this solution: https://stackoverflow.com/questions/9449101/how-to-stop-flask-from-initialising-twice-in-debug-mode
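For anyone else who hits this, the guard boils down to something like the following (a minimal sketch assuming the scheduler is created alongside the app):

```python
import os

from flask import Flask
from flask_apscheduler import APScheduler

app = Flask(__name__)
scheduler = APScheduler()

# The Werkzeug reloader imports the app twice. WERKZEUG_RUN_MAIN is only
# set to "true" in the child process that actually serves requests, so
# start the scheduler there (or whenever the reloader isn't involved).
if not app.debug or os.environ.get("WERKZEUG_RUN_MAIN") == "true":
    scheduler.init_app(app)
    scheduler.start()
```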

But now I'm coming up on things that are harder to test in my development environment. This isn't meant as a substitute for that level of testing, but more to get an idea of what I should expect to see. I need a baseline.

First is Gunicorn. The application needs to run multiple workers to serve a high volume of requests without them queuing up as they would with the Flask development server. Gunicorn is one of the standard choices, and it forks the Flask application into multiple workers; in our case, 8.

Then, for high availability, we have dual staging/production hosts that are load balanced. That means eight copies of the application on each of two hosts, all pointed at the same MySQL database, which is being used as the job store.

All of our jobs are defined in our configuration file, which can be tailored per host but not per fork. The jobs are either cron-based or interval-based, each with a unique ID, max_instances set to 1, and replace_existing set to True.
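For context, the configuration looks roughly like this (the connection string, job IDs, and module paths here are placeholders, not our real values):

```python
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

class Config:
    # Shared MySQL job store used by every fork on both hosts.
    SCHEDULER_JOBSTORES = {
        "default": SQLAlchemyJobStore(url="mysql+pymysql://user:password@db-host/scheduler")
    }
    SCHEDULER_JOB_DEFAULTS = {"max_instances": 1}
    JOBS = [
        {
            "id": "nightly_cleanup",               # unique across all instances
            "func": "myapp.tasks:nightly_cleanup",
            "trigger": "cron",
            "hour": 3,
            "replace_existing": True,              # last scheduler to start wins
        },
        {
            "id": "poll_feed",
            "func": "myapp.tasks:poll_feed",
            "trigger": "interval",
            "minutes": 5,
            "replace_existing": True,
        },
    ]
```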

I think we'll be safe with those settings, given that the last scheduler to start would simply replace any existing job with the same ID, preventing duplication, but I'm not sure whether there are other foreseeable effects of multiple schedulers working off the same configuration and the same job store.

Any guidance would be appreciated.

Rikaelus commented 6 years ago

Note that this is effectively discussed here as well: https://github.com/agronholm/apscheduler/issues/160

Does flask-apscheduler compensate for that in any way? I would think it would have to, since Flask is supposed to sit behind something like Gunicorn and its built-in single-threaded server is only meant for development use.

viniciuschiele commented 6 years ago

Hi @Rikaelus,

The same issues you are facing with APScheduler, you will also face with Flask-APScheduler.

Flask-APScheduler just gives you an HTTP interface to add/get/remove/update tasks in APScheduler's job store, nothing else.

APScheduler doesn't prevent the same task from running multiple times in a distributed environment.

To achieve that, you need to find a way to start the scheduler in only one of the instances and use the others only to serve the HTTP requests that manage your tasks. That also means a single instance/server has to process all of your tasks, which could be an issue if you have a high workload.
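For example, one way to do it (just a sketch, not something Flask-APScheduler provides) is to let the workers race for an exclusive file lock at startup, so only the winner starts the scheduler. Note that this only elects one scheduler per host; across two load-balanced hosts you would still need something like the setup below.

```python
import fcntl

from flask_apscheduler import APScheduler

def start_scheduler_once(app, lock_path="/tmp/scheduler.lock"):
    """Start the scheduler only in the worker that wins the file lock."""
    lock_file = open(lock_path, "wb")
    try:
        # Non-blocking exclusive lock: exactly one process on this host wins.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None  # another worker already runs the scheduler
    scheduler = APScheduler()
    scheduler.init_app(app)
    scheduler.start()
    # Keep the file handle alive for the life of the process; closing it
    # would release the lock.
    return lock_file
```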

We usually host this kind of application with uWSGI's Legion mode: it keeps the app deployed on multiple machines, but only one is active at a time, and if the master node goes offline, another node takes over automatically. We also enable multiple threads to process the incoming requests.

I hope it helps you.

bendemott commented 6 years ago

@agronholm Curious if you would welcome a potential improvement or feature added to apscheduler. I have a lot of experience with ZooKeeper, Solr, and Cassandra, so I'm no stranger to distributed workloads.

I think this could be fairly easily solved by using the backing store (whatever driver it is) to perform leader election. Any number of separate processes on separate servers could join the apscheduler "cluster", and as long as they shared the same backing store and connection string they would all agree upon a leader. This leader would be in charge of assigning jobs to scheduler workers. If a scheduler worker went offline, another would take over the job for the next iteration. If the leader went offline, a new leader would be elected.
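Roughly, the election could ride on the same backing store. Here is a minimal lease-based sketch of what I have in mind (the leader_lease table, its single pre-seeded row, and all names are hypothetical; none of this exists in APScheduler today):

```python
import os
import socket
import time

from sqlalchemy import create_engine, text

NODE_ID = f"{socket.gethostname()}:{os.getpid()}"
LEASE_SECONDS = 30

def try_become_leader(engine):
    """Return True if this node holds the leader lease for the next period.

    Assumes a single pre-seeded row in leader_lease(holder, expires_at).
    The UPDATE is atomic, so only one contender can take over an expired
    lease (or renew its own) in any given round.
    """
    now = time.time()
    with engine.begin() as conn:
        result = conn.execute(
            text(
                "UPDATE leader_lease "
                "SET holder = :node, expires_at = :expires "
                "WHERE expires_at < :now OR holder = :node"
            ),
            {"node": NODE_ID, "expires": now + LEASE_SECONDS, "now": now},
        )
        return result.rowcount == 1

# Usage sketch: each process would call try_become_leader(engine) on a
# timer shorter than LEASE_SECONDS, e.g. with
# engine = create_engine("mysql+pymysql://user:password@db-host/scheduler"),
# and only the current leader would run (or keep running) its scheduler.
```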

This new mode of operation would have several benefits:

  • simplify the management of apscheduler with forking models like uwsgi, nginx, etc.
  • make apscheduler highly available
  • true workload scalability, as more nodes/processes could be added to job scheduling at any time

Is this a patch/addition you would consider? If so, I'm confident I could write something up fairly quickly to experiment with, and we could discuss it further.
If you don't feel apscheduler is the right place for this feature set, I understand that too.

Thanks!

Rikaelus commented 6 years ago

Ben's reply brought my attention back to this, so I thought I'd toss in a follow-up on my situation.

I didn't want to limit the scheduler to a single instance; that kind of defeats the purpose of having a redundant environment where you could lose a thread, a process, a server, or an entire POP. What I did instead was set up a centralized job-tracking DB. All application instances would see a scheduled process and all would try to start it, but only one would claim the job (sketched below). I put a few-second buffer around the claim to account for server clock differences and a random sleep modifier to ensure a better distribution. I also have another process run every 60 seconds that looks for jobs that have timed out or otherwise failed and initiates a re-run attempt within a specified window.
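The claim itself is just a conditional UPDATE, so the database serializes the contenders. Roughly like this (the table and column names are illustrative, not my exact schema):

```python
import random
import time

from sqlalchemy import text

def try_claim_run(engine, run_id, node_id):
    """Return True if this instance won the race to run the given job."""
    # Small random sleep so instances don't all hit the row at the same
    # instant; this also spreads the claimed work across nodes.
    time.sleep(random.uniform(0, 2))
    with engine.begin() as conn:
        result = conn.execute(
            text(
                "UPDATE job_runs SET state = 'claimed', claimed_by = :node "
                "WHERE run_id = :run AND state = 'pending'"
            ),
            {"node": node_id, "run": run_id},
        )
        # Only the instance whose UPDATE actually changed the row owns it.
        return result.rowcount == 1
```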

It's a complex solution, to be sure, and some of the features APScheduler would normally provide had to move into my own higher-level code. On the plus side, the schedule is now completely database-driven and can be modified through a UI, and all jobs are recorded. An included "check-in" process even lets some of my jobs report back a percentage complete, which I can then present to the admins monitoring the system.

Speaking of Ben's reply, though, his idea sounds great as a potential built-in, lighter-weight solution. It would definitely be easier for a lot of people to swallow than jumping over all the hurdles I did.

agronholm commented 6 years ago

I always felt that something this advanced was outside APScheduler's scope and inside Celery's. But Celery's scheduling capabilities, last I checked, were quite pitiful in comparison to APScheduler's. The change you're suggesting is a major undertaking that could require sweeping changes to APScheduler. That said, if you can whip up a POC with a relatively small amount of effort, I would like to see it.