themotte / rDrama

This code runs https://www.themotte.org. Forked from https://github.com/Aevann1/rDrama
GNU Affero General Public License v3.0
25 stars 31 forks source link

post scheduling #554

Closed justcool393 closed 1 year ago

justcool393 commented 1 year ago

scheduler: add a rudimentary post scheduler (implements #371).

this pull request adds a rudimentary post scheduler which can be used to schedule posts such that they happen repeatedly.

a treatise on the code that i've written to make work follows.

much of the code written for this commit was reorganizing the project such that submissions can be submitted from outside the /submit route handler. much of the code essentially assumed that this was the only way that submissions would ever be created, so the contingency of this might be needed was not considered.

i'll discuss imports, the submission model and routing, and then, and only then, i'll be able to discuss the scheduler.

imports

attempting this exposed some nasty import loops that would find very incredible and interesting ways to crash the entire application. in particular, app, db_session, were in particular prone to getting stuck in such loops.

for example, a constant from the files.helpers.const module was needed (in order to avoid importing routes unnecessarily at startup, something that would cause unneedful queries to stuff like the leaderboard thread).

attempting to load a constant from main was impossible however because const.py required main (in particular the db_session object) because it wanted to open a connection to the database and load emojis from the database. the side effects of loading stuff from a database, at import time, with in a file that primarily houses primitive constants for everything else is... mildly annoying.

to that end, the app configuration has been split off into 4 files:

  1. files/helpers/config/const.py (actual constants)

  2. files/helpers/config/environment.py (environment vars, db urls, etc)

  3. files/helpers/config/regex.py (regexes, mostly used in sanitize.py)

  4. files/helpers/config/stateful.py (stateful config)

importsimportsimports (#467) intends to probably move app.config.SETTINGS here as well, but the scope for this PR was not to make all of the imports as pretty as possible, but to create recurring scheduled posts. this is pretty much near the minimum that I needed to get the app into a bootable state.

the other big example was anything related to cache invalidation. not all of the import loops were fixed "properly" (that is work for another time, see #467) and in fact imports are done inside functions in order to prevent such loops from occurring.

regardless, this is another one of those things that should be somewhat simple that honestly took hours.

this PR includes a lot of changes that were added at #467. this was out of necessity: many of the changes made the workers unable to boot due to tangly import loops that spread across half the project. more work needs to be done on this; this is the point of #467. the changes that were ported were selected so as to have as little impact as possible (i.e. mostly getting it to a bootable state + a little bit).

submission creation

another issue was that much of the code essentially always expected the code to only create a submission in /submit (and if not, the usual perscription tended to be copypasta with slight modifications). it took a lot of work to prepare the project such that we could do a thing.

a package files.helpers.validators and a dataclass ValidatedSubmissionLike was created to assist with this. i did not decide to attempt to factor out all of the common submission logic as that would be a PR in of itself and is honestly a relatively large undertaking.

the middle ground solution (create a dataclass that does most of the route-handler heavy lifting) is used instead with a convenience function to create a validated version from the Flask request.

getting the submission model and routes to a state where it was simultaneously not dependent on booting the entire chonker of a Flask application, but also having enough richness to be able to perform the tasks necessary to submit a post, was a difficult, yet ultimately surmountable task.

a lot of the change in structure with regards to this came from moving most of the submitting and publishing code outside of the route handler and into the aforementioned validators package.

there is still some work to do in this regard (specifically with regards to sanitize and making that less... stateful dependent) but that is, I believe, out of scope for this PR.

scheduler

the approach used by this pull request uses a handler that is custom. this is probably the most objectionable part of the pull request, and so this is an attempt to justify this decision.

why not procrastinate?

everything it seems in python land wants to be a Flask-like. the pattern of @variable.do_thing() with the function in global scope is relatively awful for reasoning about the state of the system at any given time.

this is problematic in the sense that it becomes very simple to do things that while they look simple on the front matter of docs for Flask and their derivatives, make system state much more shockingly complex. and unlike .NET, where you have to essentially try very hard to cause side effects as the result of attribute definition, decorators in Python (especially in these Flask-likes) have side effects (outside of the function they decorate) with incredible frequency.

this is the case with procrastinate and this is quite unforunate. one alternative, if procrastinate is decided to be used as some point is to completely bypass their public API and add Task objects directly (using Blueprint._register_task(task)). the fact that none of these style of application include this thing in their public API is incredibly frustrating.

another problem with procrastinate is that it may not interact with database migrations nicely. migrations must be ran manually and the initial file is generated with schema.sql, a thing that would likely not play very nice with an app that expects the only changes to the database schema to be made through migrations. i suppose it is possible to have a separate schema just for procrastinate, but this makes interaction and any UI for scheduled tasks difficult.

i really want to like procrastinate, i really do. there are a couple of things that i think would make it difficult for us to use at the moment.

why not celery?

the ideal thing about procrastinate is that it uses postgres as a backend. postgres has very many durability guarantees. it is also ideal as a use since all of our database models use SQLAlchemy and generalized, SQL. this makes it easy to query for tasks and provdes durable storage for such tasks.

redis is alright for storing data we don't want to lose and it works as a fine message queue for celery.

celery does have a SQLAlchemy backend but do not find it stable.

why not cron?

this is the approach rdrama continues to use for their scheduled tasks. cron is relatively reliable as a scheduler, and indeed a lot of development time has been put into it. however, it does have some drawbacks.

cron has no synchronization nor prevention of duplication. which is fine for what it is, synchronizing with an external data source is out of scope for cron, but it is in scope for #371.

in addition, the reliability is questionable. the questionable reliability does not come from cron, as discussed earlier I find that it works perfectly well, but rather from the application itself.

when the cron tasks fail, what ends up happening to them is that it logs an exception to a file that, from my time as a dev for them, was basically like never read. we only found out the tasks were often not working because usually the casino or lottery would break. there was essentially no error reporting other than that.

honestly the logs could've gotten piped to /dev/null and no one would have noticed. it was also hard to debug errors as exceptions wouldn't appear pretty much anywhere except this (as discussed) unseen log file.

why not

i only evaluated a few here. if there's an alarmingly better solution, i'm happy to know and modify the PR. this still does a lot of prep with regards to making posts scheuldable.

goals

the goals set out by #371 were to create a scheduler that doesn't double up. essentially the goal is reliability. we want a scheduler that hopefully can guarantee such reliablity.

essentially it shouldn't repeat, should post what it needs to post every X interval, and ideally should not fall into the other traps that other schedulers use and be easily integratable with our app.

the hope is also that submitting a scheduled post is about as easy as submitting a post (so no hardcoded jinja templates, etc).

implementation

enter the files.classes.cron package, the files.commands.cron module, and the files.routes.admin.tasks module.

the cron module is the entry point and manages the concurrent tasks. we use postgres locks to manage the critical paths, specifically when reading the list of tasks, updating the last run and status, and then re-updating the status once a task has completed.

exceptions are caught and logged (although the logger is rudimentary, it does exist, and any errors are visible in the web UI, more work should be done on this in #220).

the lifecycle is as follows:

  1. acquire an exclusive lock on the repeatable task table. this will block other readers and writers, but guarantees that writes won't happen at an inopportune time.
  2. query for tasks that are enabled, are effectively enabled, and aren't currently running.
  3. release the previous lock.

then for each task:

  1. re-lock the table. update the task so that it is in a running state and also set the last run time.
  2. release the lock.
  3. run the task.
  4. log any exceptions using the logging module.
  5. acquire a lock again.
  6. set the task to a waiting mode.

this should provide a reliable system that could theoretically be ran concurrently. tasks are ran with the tasks table unlocked, so that their statuses can be read easily, quickly, and simply. the database table is only locked as long as it needs to be, which should hopefully make lock contention low and throughput high.

the scheduler module (files.classes.scheduler) contains the data structures needed for the scheduler, including the SQLAlchemy models, a few enums for days of the week, type of task, what state it's in, but most of all... the TaskRunContext.

what is a TaskRunContext, you didn't ask? it's an object that houses state. which initially sounds about as easy to understand as telling you that a TransactionAwarePersistenceManagerFactoryProxy is a proxy for a transaction-aware persistence manager factory, but i promise it's much more exciting (it was not).

anyway the idea is that it stores some metadata about the current time, etc, but it also contains references to the Flask app, the database, the mail extension, and redis.

this lets tasks avoid doing from files.__main__ import app, etc and lets them get references to common things tasks would probably need (for example a database session or a reference to our cache).

it also includes a app_context context manager function which creates an app context with proper g variables set as an interface to a lot of the legacy code that requires an app context (in specific, a lot of code uses g.db to access the database session for example).

this enables compatability but also encourages state to be passed around explicitly. i've also done my best to document most of this so that it can be easily understood later.

tasks are one of two things:

  1. a PythonCodeTask which is essentially 2 strings that reference a python package and a callable from that package (this callable is called with the aforementioned TaskRunContext as a parameter, so it has access to all of the state without needing external imports to main)

  2. a ScheduledSubmissionTask which is a submission template that when the task is ran, will make and publish a submission using this template. templates use python strf formatting in their titles so a date can be appended if desired).

there is also a UI for submitting scheduled posts, viewing them, and modifying them after submission. no code changes need to be made to create, modify, or delete a scheduled post.

due to the possibility of an attack vector, PythonCodeTasks have no web UI that would allow modification and would need to be added to the database manually. (there is a possibility that other task types could be added).

future work

there is some future work that could be done on this. currently the scheduler supports a resolution of at most once per day (although at a predefined time!). it's not an insurmountable limitation, but one i have chosen to accept to keep the implementation basic for now.

scheduler resolution is the big one and even potentially some config accessible from the app on a killswitch for scheduled tasks, etc).

there's also potentially some worth in adding some of the missing submission fields (for example thumbnails are not properly generated) and generally bringing both submissions more in line with each other.

(it's worth noting that some fields were left out in anticipation of

470 and its associated PRs, namely #492 and #548)

there's also unifying some of the comment and post objects (although this is mildly out of scope, #359 is related to unification.)

conclusion

honestly this has been a surprisingly complicated task to take, but i am glad that this is getting done, since this seems to be one of the most desired things so far.

let me know if there's any comments, questions, etc.