simonw opened this issue 1 year ago
I realize now that I'm not entirely sure what "run stator in-process" even means for an app that's running using 8 gunicorn workers:
Would there need to be a mechanism to have it run in just one of those workers?
My shallow understanding of how Django actually works when run with `gunicorn -w 8` is betraying me here.
So, funny thing, I tried the whole "running it on request-based serverless" thing and it just doesn't work. You have to keep the process alive with a cron job by calling a URL every 60 seconds that lasts for 60 seconds, and I have found literally no platform where that's both reliable and cheaper than even a moderately-expensive server.
Thus, the whole "call it every X seconds" thing is now trashed, and `runstator` is now a long-running process that I suspect could totally be squished next to gunicorn. Not going to do that yet - I'm just running two processes for now, and I'd hope Fly can do the same.
Fly will happily run a Dockerfile which starts multiple processes - looking forward to reading your Dockerfile to see which pattern you like for doing that!
> So, funny thing, I tried the whole "running it on request-based serverless" thing and it just doesn't work. You have to keep the process alive with a cron job by calling a URL every 60 seconds that lasts for 60 seconds, and I have found literally no platform where that's both reliable and cheaper than even a moderately-expensive server.
Something I don't fully understand is why you need to run every 60s or so.
My incomplete mental model of how ActivityPub works is this:
Based on this, my naive thought was that the only time you need to really do hard work is when you are sending a status. Everything else should hopefully fit into a regular request/response cycle (maybe taking a few seconds longer, but that works fine on stuff like Cloud Run - I've built systems on Cloud Run where an incoming POST request triggered an endpoint that took 30s to finish, and it worked OK).
Other than message delivery, what are the tasks that need to be triggered by workers that run at least once every 60s?
How about if everything happened in regular requests without any cron-style activity at all... and any time a user published a status they got a progress bar showing how many servers it had been delivered to, powered by HTTP polling of their browser to an endpoint?
And if they shut their browser tab too early some more messages would be sent next time a request came in, kind of like how WP-cron works: https://developer.wordpress.org/plugins/cron/
> WP-Cron works by checking, on every page load, a list of scheduled tasks to see what needs to be run. Any tasks due to run will be called during that page load. [...] Scheduling errors could occur if you schedule a task for 2:00PM and no page loads occur until 5:00PM.
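For illustration, a WP-cron-style approach could be sketched as Django-style middleware - the `TASK_QUEUE` stand-in, the batch size, and the time budget below are all hypothetical, not anything in Takahē:

```python
import time
from collections import deque

# Hypothetical sketch of WP-cron-style opportunistic processing.
# TASK_QUEUE stands in for Takahē's database-backed task state.
TASK_QUEUE = deque()

class OpportunisticTaskMiddleware:
    """After each response, drain a small, time-boxed batch of queued work."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        # Spend at most ~0.5s and at most 5 tasks per request,
        # mirroring how WP-Cron piggybacks on page loads.
        deadline = time.monotonic() + 0.5
        processed = 0
        while TASK_QUEUE and processed < 5 and time.monotonic() < deadline:
            task = TASK_QUEUE.popleft()
            task()
            processed += 1
        return response
```

The scheduling caveat from the WP-Cron docs applies unchanged: nothing runs until a request arrives.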
Another option is to provide (documentation for) an all-in-one `gunicorn.conf.py` using the `when_ready()` hook to spawn a stator thread: https://docs.gunicorn.org/en/stable/settings.html#when-ready
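A minimal sketch of what that could look like - assuming Takahē exposes a `runstator` management command and that a daemon thread in the gunicorn master process is acceptable; this is not documented Takahē setup:

```python
# gunicorn.conf.py - hypothetical sketch, not Takahē's documented config.
import threading

def when_ready(server):
    # Called once in the master process just after the server starts,
    # so exactly one stator loop runs regardless of the -w worker count.
    def run_stator():
        import django
        django.setup()
        from django.core.management import call_command
        call_command("runstator")  # assumed management command name

    threading.Thread(target=run_stator, daemon=True).start()
```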
Oh that's really interesting, I hadn't seen those options before.
> Based on this, my naive thought was that the only time you need to really do hard work is when you are sending a status. Everything else should hopefully fit into a regular request/response cycle (maybe taking a few seconds longer, but that works fine on stuff like Cloud Run - I've built systems on Cloud Run where an incoming POST request triggered an endpoint that took 30s to finish, and it worked OK).
> Other than message delivery, what are the tasks that need to be triggered by workers that run at least once every 60s?
Message delivery is a big part - you have to deliver thousands of items, with retries, to servers that may take tens of seconds to respond. Even async makes that hard to do in 30 seconds!
Some other things:
In theory you could try to cram all this into a request-oriented model, but it's going to fall over quite quickly and feel like posts are very slow to appear for everyone. After looking at the options here, I think it's best to go with our tried and trusted friend, the worker process, if you want reliable, easy-to-maintain software (after all, making it all async is already enough of a stretch given the libraries involved).
> you have to deliver thousands of items, with retries, to servers that may take tens of seconds to respond
I forgot about retries! Yeah that totally makes sense that those would need longer running processes - especially since presumably they're really important, since if you don't keep retrying then the people on that server will never see your latest posts (unlike if it was a pull mechanism similar to RSS where it's up to them to backfill stuff they missed later on).
I still think something like the WP-cron model could be interesting, just because ActivityPub seems to be quite forgiving for things that are delayed by a chunk of time. Each time anything hits a server could be an opportunity to process another 5 of the items in the queue.
Not much good for a request that takes 10s, though, I guess - you don't want something like that to delay a response, and my hunch is that on serverless platforms such as Lambda and Cloud Run, trying to keep processing after the response has been returned to the user would get very tightly CPU-throttled, making it not worthwhile.
Yeah, the wp-cron style approach could definitely be done - AP is very permissive and the only reason I need it running constantly is because I use the same queue system to fan-out posts to internal timelines as well as external - but the target list of people who would find that convenient in today's modern deployment world is really small compared to the effort it takes to engineer!
Would it be an option to use s6 to start and monitor both gunicorn and stator? I've never tried it, but I have seen it used in "fat containers".
For the option of having a single docker container which runs both the web server and the stator worker processes in the same container I think an easy option would be to leverage the existing Procfile with something like Honcho.
For example, you could create a Dockerfile, `pip install honcho`, then use `honcho start` as the command, which will start both a web and a worker process (and you can optionally change the number of each with `honcho start -c web=1,worker=2`, for example).
I mean there's many other options but leveraging the existing Procfile seems like the easiest path, and the commands should stay in sync if they change in future releases.
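As a sketch - the base image, paths, and Procfile entry names here are assumptions rather than anything from the Takahē repo - such a Dockerfile might look like:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt honcho
# Honcho reads the Procfile and supervises both processes; if either
# one exits, Honcho shuts the rest down so the orchestrator can
# restart the whole container.
CMD ["honcho", "start", "-c", "web=1,worker=1"]
```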
The Procfile works great for me in Piku, and Honcho is pretty reliable, so I'd vote for that.
Have you considered building a serverless queue option on something like SQS and Lambda? (And similar alternatives outside AWS) SQS doesn't cost anything if the queue is empty, so you'd really only pay for lambda run time while delivering statuses.
No, because Stator is not a queue, it's a reconciliation loop - it runs based on database status and knows how long it's been since things were last attempted. It won't fit into a standard queue datastore.
> No, because Stator is not a queue, it's a reconciliation loop - it runs based on database status and knows how long it's been since things were last attempted. It won't fit into a standard queue datastore.
Gotcha. I was thinking that for a very low-traffic instance you could just trigger it whenever a new message is posted, but you wouldn't actually need the queue for that. It could just trigger the background process/lambda directly.
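To illustrate the distinction: a reconciliation loop keys off stored state and elapsed time rather than queued messages. The state names, retry delays, and record shape below are invented for the sketch, not Stator's actual schema:

```python
import time

# Illustrative reconciliation pass, not Stator's implementation.
# Delay (seconds) before the next attempt, indexed by attempt count.
RETRY_DELAYS = [60, 300, 1800]

def due_for_attempt(record, now):
    """A record is due if it isn't finished and its backoff has elapsed."""
    if record["state"] == "done":
        return False
    delay = RETRY_DELAYS[min(record["tries"], len(RETRY_DELAYS) - 1)]
    return now - record["last_attempt"] >= delay

def reconcile_once(records, attempt, now=None):
    """One pass: scan all records, try to advance whichever are due."""
    now = time.time() if now is None else now
    for record in records:
        if due_for_attempt(record, now):
            record["tries"] += 1
            record["last_attempt"] = now
            if attempt(record):
                record["state"] = "done"
```

Because everything is derived from the stored rows, a crashed or missed pass costs nothing - the next pass recomputes what is due.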
How married is Takahē to gunicorn? Could we leverage Gunicorn's worker handling to automatically (re)start one or more stator workers?
Potentially - I don't think it's married particularly hard, the contract is just about the docker container more than what's specifically in it.
I'm very interested in the Django-based Fediverse server, but I think there is a serious problem if two copies of it need to be running with the same configuration for the application to work properly. It might be worth thinking about how to run a background process independently of any third-party solutions like supervisor, a container engine, or whatever.
Perhaps it would be even better if the web application and the Fediverse server were two different applications with different code bases and their own settings for each.
I wish the project further development and thank you for your efforts to improve it. This is a really cool thing.
I love how this is designed to work well on serverless hosting. I have a slightly different way that I'd like to host this though.
I run a lot of apps as containers on https://fly.io/ these days. Their smallest instance - 256MB of RAM - costs $1.94/month for an always-running container.
My hunch is that this instance would be powerful enough to run Takahē. I plan to try that out!
(Naturally I'm also interested in seeing if Takahē could run on an instance like that using SQLite instead of PostgreSQL, will report back should I get that working.)
But assuming I'm using PostgreSQL, I'm still interested in how hard it would be for Stator to constantly run within that same single Django process in the container - provided the container is expected to stay alive and not get scaled to zero.
Running in the same Python process seems to me like it would be operationally simpler, just because every time I need to run multiple processes in the same Docker container I end up reading about dozens of competing strategies, none of which feel very natural.
People with the luxury of more than a single virtual CPU would obviously want to run a separate process, and more power to them!
So: would it be feasible / a big lift for Stator to have an option to run continually within the default Django process?