simonw opened this issue 1 year ago
I realize now that I'm not entirely sure what "run stator in-process" even means for an app that's running using 8 gunicorn workers:
Would there need to be a mechanism to have it run in just one of those workers?
My shallow understanding of how Django actually works when run with `gunicorn -w 8` is betraying me here.
So, funny thing, I tried the whole "running it on request-based serverless" thing and it just doesn't work. You have to keep the process alive with a cron job by calling a URL every 60 seconds that lasts for 60 seconds, and I have found literally no platform where that's both reliable and cheaper than even a moderately-expensive server.
Thus, the whole "call it every X seconds" thing is now trashed, and `runstator` is now a long-running process that I suspect could totally be squished next to gunicorn. Not going to do that yet - I'm just running two processes for now, and I'd hope Fly can do the same.
Fly will happily run a Dockerfile which starts multiple processes - looking forward to reading your Dockerfile to see which pattern you like for doing that!
> So, funny thing, I tried the whole "running it on request-based serverless" thing and it just doesn't work. You have to keep the process alive with a cron job by calling a URL every 60 seconds that lasts for 60 seconds, and I have found literally no platform where that's both reliable and cheaper than even a moderately-expensive server.
Something I don't fully understand is why you need to run every 60s or so.
My incomplete mental model of how ActivityPub works is this:
Based on this, my naive thought was that the only time you need to really do hard work is when you are sending a status. Everything else should hopefully fit into a regular request/response cycle (maybe taking a few seconds longer, but that works fine on stuff like Cloud Run - I've built systems on Cloud Run where an incoming POST request triggered an endpoint that took 30s to finish, and it worked OK).
Other than message delivery, what are the tasks that need to be triggered by workers that run at least once every 60s?
How about if everything happened in regular requests without any cron-style activity at all... and any time a user published a status they got a progress bar showing how many servers it had been delivered to, powered by HTTP polling of their browser to an endpoint?
And if they shut their browser tab too early some more messages would be sent next time a request came in, kind of like how WP-cron works: https://developer.wordpress.org/plugins/cron/
> WP-Cron works by checking, on every page load, a list of scheduled tasks to see what needs to be run. Any tasks due to run will be called during that page load. [...] Scheduling errors could occur if you schedule a task for 2:00PM and no page loads occur until 5:00PM.
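For illustration, a WP-cron-style approach could be sketched as Django-style middleware - the `TASK_QUEUE` stand-in, the batch size, and the time budget below are all hypothetical, not anything in Takahē:

```python
import time
from collections import deque

# Hypothetical sketch of WP-cron-style opportunistic processing.
# TASK_QUEUE stands in for Takahē's database-backed task state.
TASK_QUEUE = deque()

class OpportunisticTaskMiddleware:
    """After each response, drain a small, time-boxed batch of queued work."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        # Spend at most ~0.5s and at most 5 tasks per request,
        # mirroring how WP-Cron piggybacks on page loads.
        deadline = time.monotonic() + 0.5
        processed = 0
        while TASK_QUEUE and processed < 5 and time.monotonic() < deadline:
            task = TASK_QUEUE.popleft()
            task()
            processed += 1
        return response
```

The scheduling caveat from the WP-Cron docs applies unchanged: nothing runs until a request arrives.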
Another option is to provide (documentation for) an all-in-one `gunicorn.conf.py` using the `when_ready()` hook to spawn a stator thread: https://docs.gunicorn.org/en/stable/settings.html#when-ready
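A minimal sketch of what that could look like - assuming Takahē exposes a `runstator` management command and that a daemon thread in the gunicorn master process is acceptable; this is not documented Takahē setup:

```python
# gunicorn.conf.py - hypothetical sketch, not Takahē's documented config.
import threading

def when_ready(server):
    # Called once in the master process just after the server starts,
    # so exactly one stator loop runs regardless of the -w worker count.
    def run_stator():
        import django
        django.setup()
        from django.core.management import call_command
        call_command("runstator")  # assumed management command name

    threading.Thread(target=run_stator, daemon=True).start()
```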
Oh that's really interesting, I hadn't seen those options before.
> Based on this, my naive thought was that the only time you need to really do hard work is when you are sending a status. Everything else should hopefully fit into a regular request/response cycle (maybe taking a few seconds longer, but that works fine on stuff like Cloud Run - I've built systems on Cloud Run where an incoming POST request triggered an endpoint that took 30s to finish, and it worked OK).
> Other than message delivery, what are the tasks that need to be triggered by workers that run at least once every 60s?
Message delivery is a big part - you have to deliver thousands of items, with retries, to servers that may take tens of seconds to respond. Even async makes that hard to do in 30 seconds!
Some other things:
In theory you could try to cram all this into a request-oriented model, but it's going to fall over quite quickly and feel like posts are very slow to appear for everyone. After looking at the options here, I think it's best to go with our tried and trusted friend, the worker process, if you want reliable, easy-to-maintain software (after all, making it all async is already enough of a stretch given the libraries involved).
> you have to deliver thousands of items, with retries, to servers that may take tens of seconds to respond
I forgot about retries! Yeah that totally makes sense that those would need longer running processes - especially since presumably they're really important, since if you don't keep retrying then the people on that server will never see your latest posts (unlike if it was a pull mechanism similar to RSS where it's up to them to backfill stuff they missed later on).
I still think something like the WP-cron model could be interesting, just because ActivityPub seems to be quite forgiving for things that are delayed by a chunk of time. Each time anything hits a server could be an opportunity to process another 5 of the items in the queue.
Not much good for a request that takes 10s, though, I guess - you don't want something like that to delay a response, and my hunch is that on serverless platforms such as Lambda and Cloud Run, trying to keep processing after the response has been returned to the user would get very tightly CPU-throttled, making it not worthwhile.
Yeah, the wp-cron style approach could definitely be done - AP is very permissive and the only reason I need it running constantly is because I use the same queue system to fan-out posts to internal timelines as well as external - but the target list of people who would find that convenient in today's modern deployment world is really small compared to the effort it takes to engineer!
Would it be an option to use s6 to start and monitor both gunicorn and stator? I've never tried it, but I have seen it used in "fat containers".
For the option of having a single docker container which runs both the web server and the stator worker processes in the same container I think an easy option would be to leverage the existing Procfile with something like Honcho.
For example, you could create a Dockerfile, `pip install honcho`, then use `honcho start` as the command, which will start both a web and a worker process (and you can optionally change the number of each with `honcho start -c web=1,worker=2`, for example).
I mean there's many other options but leveraging the existing Procfile seems like the easiest path, and the commands should stay in sync if they change in future releases.
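As a sketch - the base image, paths, and Procfile entry names here are assumptions rather than anything from the Takahē repo - such a Dockerfile might look like:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt honcho
# Honcho reads the Procfile and supervises both processes; if either
# one exits, Honcho shuts the rest down so the orchestrator can
# restart the whole container.
CMD ["honcho", "start", "-c", "web=1,worker=1"]
```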
The Procfile works great for me in Piku, and Honcho is pretty reliable, so I'd vote for that.
Have you considered building a serverless queue option on something like SQS and Lambda? (And similar alternatives outside AWS) SQS doesn't cost anything if the queue is empty, so you'd really only pay for lambda run time while delivering statuses.
No, because Stator is not a queue, it's a reconciliation loop - it runs based on database status and knows how long it's been since things were last attempted. It won't fit into a standard queue datastore.
> No, because Stator is not a queue, it's a reconciliation loop - it runs based on database status and knows how long it's been since things were last attempted. It won't fit into a standard queue datastore.
Gotcha. I was thinking that for a very low-traffic instance you could just trigger it whenever a new message is posted, but you wouldn't actually need the queue for that. It could just trigger the background process/lambda directly.
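To illustrate the distinction: a reconciliation loop keys off stored state and elapsed time rather than queued messages. The state names, retry delays, and record shape below are invented for the sketch, not Stator's actual schema:

```python
import time

# Illustrative reconciliation pass, not Stator's implementation.
# Delay (seconds) before the next attempt, indexed by attempt count.
RETRY_DELAYS = [60, 300, 1800]

def due_for_attempt(record, now):
    """A record is due if it isn't finished and its backoff has elapsed."""
    if record["state"] == "done":
        return False
    delay = RETRY_DELAYS[min(record["tries"], len(RETRY_DELAYS) - 1)]
    return now - record["last_attempt"] >= delay

def reconcile_once(records, attempt, now=None):
    """One pass: scan all records, try to advance whichever are due."""
    now = time.time() if now is None else now
    for record in records:
        if due_for_attempt(record, now):
            record["tries"] += 1
            record["last_attempt"] = now
            if attempt(record):
                record["state"] = "done"
```

Because everything is derived from the stored rows, a crashed or missed pass costs nothing - the next pass recomputes what is due.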
How married is Takahē to gunicorn? Could we leverage Gunicorn's worker handling to automatically (re)start one or more stator workers?
Potentially - I don't think it's married particularly hard, the contract is just about the docker container more than what's specifically in it.
I'm very interested in the Django-based Fediverse server, but I think there is a serious problem if two copies of it need to be running with the same configuration for the application to work properly. It might be worth thinking about how to run a background process independently of any third-party solutions like supervisor, a container engine, or whatever.
Perhaps it would be even better if the web application and the Fediverse server were two different applications with different code bases and their own settings for each.
I wish the project further development and thank you for your efforts to improve it. This is a really cool thing.
I love how this is designed to work well on serverless hosting. I have a slightly different way that I'd like to host this though.
I run a lot of apps as containers on https://fly.io/ these days. Their smallest instance - 256MB of RAM - costs $1.94/month for an always-running container.
My hunch is that this instance would be powerful enough to run Takahē. I plan to try that out!
(Naturally I'm also interested in seeing if Takahē could run on an instance like that using SQLite instead of PostgreSQL, will report back should I get that working.)
But assuming I'm using PostgreSQL, I'm still interested in how hard it would be for Stator to constantly run within that same single Django process in the container - provided the container is expected to stay alive and not get scaled to zero.
Running in the same Python process seems to me like it would be operationally simpler, just because every time I need to run multiple processes in the same Docker container I end up reading about dozens of competing strategies, none of which feel very natural.
People with the luxury of more than a single virtual CPU would obviously want to run a separate process, and more power to them!
So: would it be feasible / a big lift for Stator to have an option to run continually within the default Django process?