Closed lazyguru closed 10 months ago
While my least preferred option is 5, the list is not ordered by preference (it just so happened that I thought of 5 last)
I would prefer number 1, as it would be the most reliable and stable option for me; for installation we could just provide an install script or a simple docker compose. Number 3 would be loved by small instances: you just pop it in there and it works, and because the instance is small there aren't really performance issues. Number 2 could be managed easily with an infrastructure-as-code script, but it would bump costs up drastically for small instances; for "bigger" ones like lemmy.world it could help with scaling federation, but as said, it would cost a ton more.
I think option 1 makes the most sense. For smaller instances, you can use the same DB server, so there is no additional service needed, and for bigger instances, you have the full power of an RDBMS with you.
Option 5 is how I pictured it working initially. I like 5 or 1.
Would it be possible in Option 1 to cluster the application itself, so that it can keep up with federation across many federated instances?
I would say option 2 seems the best: let queuing software do what it does best. You can run RabbitMQ in Docker, and supporting SQS as well seems really forward-thinking if the app gets LARGE (e.g. 2-3 million users per instance). It should be easy to scale horizontally.
Option 4, Redis, seems to fit more for a caching layer, I'd skip using it for queuing.
Option 3, SQLite, I'd skip as well, when you can run MySQL or PostgreSQL even on a Raspberry Pi. Option 1 is the same in my mind: let queues be queues and DBs be for persistent user data. I guess I'd skip using a DB entirely for federation traffic.
Option 5, I don't think, will scale well when things get REALLY busy. If we keep federation / inter-node communication separate, we can have an easier time with the API back end handling JUST UI / mobile app traffic.
TLDR Supporting RabbitMQ AND SQS would be super neato and let this thing S-C-A-L-E <3
Yeah after investigating RabbitMQ looks really good and scalable. I would change my preference to option 2 then.
Would it be possible in Option 1 to cluster the application itself, so that it can keep up with federation across many federated instances?
@Pdzly Yes. The only option presented here that would have a problem with horizontal scaling (e.g. a cluster) is option 3 (however, if you use a 3rd-party host like Turso, you could cluster even then).
Option 4, Redis, seems to fit more for a caching layer, I'd skip using it for queuing.
@Jelloeater Redis is not just a cache. I know a lot of people only think of it that way; however, it can be used as a message queue as well as a database (a DBMS, not an RDBMS, as it doesn't handle the "R", i.e. relations). At my day job we use Redis as a queue/job-scheduler and it works quite well.
I would say option 2 seems the best: let queuing software do what it does best.
I listed option 2 as I knew it would be suggested. However, it doesn't solve the problem this RFC attempts to solve: namely, how do you support the need for immutable data being returned to the caller of, say, /activity/create/<some-uuid>? In federation, all messages are signed by the creator. A receiver of a federated message will call the origin server to retrieve the activity and verify the signature. If a user edits their post, the /activity/create/<some-uuid> link should still return the original post message. This is why option 5 exists. It is essentially option 2, but actually solving the persistence issue. However, option 5 is not necessarily the only way to use option 2 while solving for immutability.
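The immutability requirement above can be sketched as a tiny append-only activity store. This is a hypothetical illustration (the `ActivityStore` class and route names are made up, not Sublinks code): the original Create activity stays retrievable by its UUID even after the post is edited, because an edit becomes a separate Update activity.

```python
# Hypothetical sketch of the immutability requirement: the original
# Create activity stays retrievable by its UUID even after the post
# is edited (the edit is recorded as a separate Update activity).
import uuid

class ActivityStore:
    def __init__(self):
        self._activities = {}  # activity id -> frozen activity document

    def record(self, kind, payload):
        aid = str(uuid.uuid4())
        # Store a copy so later mutation of `payload` cannot leak in.
        self._activities[aid] = {"type": kind, "object": dict(payload)}
        return aid

    def get(self, aid):
        """What an /activity/<kind>/<uuid> endpoint would serve to remote instances."""
        return self._activities[aid]

store = ActivityStore()
post = {"content": "original text"}
create_id = store.record("Create", post)

post["content"] = "edited text"           # user edits the post...
update_id = store.record("Update", post)  # ...which is a new activity

print(store.get(create_id)["object"]["content"])  # still "original text"
```

Whatever backing store is chosen, the key property is that `record` only ever appends; nothing overwrites an existing activity id.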
TLDR Supporting RabbitMQ AND SQS would be super neato and let this thing S-C-A-L-E <3
The only option presented that has issues with scaling is option 3; any of the others support scaling. There are downsides to using RabbitMQ or SQS, though. The FIFO/LIFO problem, for one. Let's say we have an instance that is down. The messages that should be retried later will go back in the queue. Since the queue does not support jumping out of order, it will mean that as you get more and more instances that are down, the queue will get longer and longer, slowing down the ability to send messages to instances that are reachable/up. Even if you say "ok, run 2 queues: 1 for first-time messages and 1 for retries", you still have the same problem. What if an instance is down for maintenance? Now some messages for that instance will get further delayed when it is back up, because they will be at the back of the line behind all of the messages queued for instances that are down/unreachable. If you further say, "ok, let's have multiple queues to handle something like retry 1-N, N-Y, etc...", then you are creating a nightmare infrastructure for non/low-technical instance owners.
At the risk of derailing this RFC: I'm very interested in trying to use RabbitMQ as well. However, I don't want to base the use of a technology solely on my desire to use it; I want to ensure it also fits well. (I've gone back and forth on using Golang for the federation service vs using Java. Honestly, I still stop once a week and consider if it shouldn't just be done in Java. I can't even guarantee that I won't ultimately decide to switch to Java. I am comfortable in both languages, but lean more towards Go as a personal preference. I know, this is counter to the first sentence in this paragraph, which is why I still stop to reconsider fairly regularly.)
Honestly, I still stop once a week and consider if it shouldn't just be done in Java.
I'm super rusty at Java, but I've worked with GoLang a fair bit. I definitely like the way it multi-threads; super easy compared to most of the other languages I've worked with. I think for anything that needs to process large amounts of data in parallel, GoLang would be more than fast enough, and easy enough to work with for most folks.
But yeah, I get it, trying to find the right tool for the right job is hard! I do like the idea and flexibility of doing this w/ Rabbit.
Since the queue does not support jumping out of order, it will mean that as you get more and more instances that are down the queue will get longer and longer slowing down the ability to send messages to instances that are reachable/up
We might be able to use https://blog.rabbitmq.com/posts/2021/07/rabbitmq-streams-overview/ as well. Some nice libraries for BOTH Java AND Go. I think if we leveraged streams, the message backlog issue would be LESS of an issue, due to being able to catch up more easily than with a normal queue.
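The core difference between a stream and a queue can be sketched in a few lines (a toy model, not the RabbitMQ API): a stream is an append-only log, and each consumer tracks its own offset, so one slow/recovering instance catching up never consumes or blocks messages destined for anyone else.

```python
# Toy model of a stream (append-only log with per-consumer offsets),
# as opposed to a queue where a dequeue removes the message for everyone.
class Stream:
    def __init__(self):
        self.log = []

    def append(self, msg):
        self.log.append(msg)

    def read_from(self, offset):
        # Non-destructive read: other consumers are unaffected.
        return self.log[offset:]

stream = Stream()
for i in range(5):
    stream.append(f"activity-{i}")

fast_offset, slow_offset = 5, 2   # each consumer remembers its own position
backlog = stream.read_from(slow_offset)
print(backlog)  # the slow instance replays only what it missed
print(stream.read_from(fast_offset))  # the caught-up instance reads nothing
```

This is why a backlog for one downed instance wouldn't delay the others: the log is shared, but progress through it is not.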
I know a lot of people only think of it that way; however, it can be used as a message queue as well as a database (a DBMS, not an RDBMS, as it doesn't handle the "R", i.e. relations).
No hate on Redis, it's popular for a reason.
Namely, how do you support the need for immutable data being returned to the caller of say /activity/create/<some-uuid>.
GP on the immutability, hmmm... Yeah, I guess that's an issue with folks editing posts. I guess option 5 would be a fair compromise on queuing vs immutability.
If we needed to scale, we could run multiple worker nodes for federation and multiple API nodes for user requests, and keep the back-end DB as the source of truth. Makes sense to me, and I think it would be easier to implement as a micro-service. GP on letting folks opt out of federation, if they just wanna run a private forum.
@lazyguru if you have any better idea, LMK, I'm all ears. I can try and sketch out a quick mermaid diagram after work.
About rabbitmq:
We might be able to use https://blog.rabbitmq.com/posts/2021/07/rabbitmq-streams-overview/ as well. Some nice libraries for BOTH Java AND Go. I think if we leveraged streams, the message backlog issue would be LESS of an issue, due to being able to catch up more easily than with a normal queue.
It may solve that issue, but the issue of getting a specific message still remains. Unless we get that from Core on request, but then that kind of defeats the purpose of the federation service having its own persistence layer.
RabbitMQ is the least complex. If you did the same on a full RDBMS, it would be the same load or even more for that database.
Rabbitmq can just drop the earlier message to reduce double processing.
I feel RabbitMQ would be the easiest to scale, as there is literally a simple docker-compose to scale a cluster x times. You can have multiple queues / streams, for example one for Comments and one for Posts.
And at that point the service itself, rather than the queue, becomes the bottleneck for federation on big instances.
All of those features add complexity; some of them are easier to scale or to build than others.
And it even has both a Go AND a Java library, so it has the best integration with both possible languages.
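For what a "simple docker-compose" for a RabbitMQ cluster might look like, here's a rough sketch. Everything here is an assumption for illustration (node names, the shared Erlang cookie, the manual join step); a production setup would mount the cookie as a file and use a peer-discovery plugin instead.

```yaml
# Hypothetical two-node RabbitMQ cluster sketch (not a tested config).
# Clustered nodes must share the same Erlang cookie and resolve each
# other's hostnames; the second node then joins the first.
services:
  rabbit1:
    image: rabbitmq:3-management
    hostname: rabbit1
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-secret-cookie"   # assumption: same on all nodes
  rabbit2:
    image: rabbitmq:3-management
    hostname: rabbit2
    environment:
      RABBITMQ_ERLANG_COOKIE: "shared-secret-cookie"
    # After startup, inside rabbit2:
    #   rabbitmqctl stop_app && rabbitmqctl join_cluster rabbit@rabbit1 && rabbitmqctl start_app
```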
I'd also like there to be consideration for small instances and large instances. Perhaps the first iteration is for smaller instances, and we can build a second solution for large instances later.
Everything except 3 would already be good for bigger instances.
As someone who has hosted some single-user fediverse services: too many different dependencies make these hard to manage. But if that's not a concern or relevant to the discussion here, then ignore this point.
That is definitely a concern here.
Rabbitmq can just drop the earlier message to reduce double processing.
@Pdzly I am not sure what you mean here. The only time we would drop a message is if we decide an instance is never coming back up. If we were to set a TTL on a message, it would almost certainly be a fairly large amount of time (e.g. weeks, not days or hours). So there is no dropping of a message to get to others. It would mean something like:
Pass 1 through the queue:
id-123 -> unreachable
id-456 -> sent
id-789 -> unreachable
id-012 -> unreachable
id-345 -> unreachable
id-678 -> unreachable
id-901 -> unreachable
id-234 -> sent
id-567 -> unreachable

Pass 2:
id-123 -> not-time-to-resend-yet
id-789 -> not-time-to-resend-yet
id-012 -> not-time-to-resend-yet
id-345 -> not-time-to-resend-yet
id-678 -> not-time-to-resend-yet
id-901 -> not-time-to-resend-yet
id-567 -> not-time-to-resend-yet
id-abc -> sent
As you can see, the new message "id-abc" has to wait for the worker to process all of the other messages before it can be sent (even though all of those messages are not ready to send because it hasn't been long enough to hit the retry point). This is only a small sample of what it would look like. Over time the list will grow a lot.
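The head-of-line blocking in those passes can be simulated in a few lines (a toy model of the FIFO queue, not any real broker): with seven messages parked in backoff ahead of it, the new message is only sent after the worker has churned through every not-yet-due retry.

```python
# Toy simulation of the head-of-line blocking described above: a single
# FIFO retry queue forces the worker to cycle through every not-yet-due
# retry before it reaches a brand-new message.
from collections import deque

def drain_once(queue, now):
    """One pass over the queue: send what is due, requeue retries
    whose backoff has not expired yet."""
    sent, skipped = [], 0
    for _ in range(len(queue)):
        msg = queue.popleft()
        if msg["next_attempt"] > now:
            skipped += 1        # not time to resend yet -> back of the line
            queue.append(msg)
        else:
            sent.append(msg["id"])
    return sent, skipped

# Seven messages stuck in backoff (retry due at t=100) and one new message.
queue = deque(
    [{"id": f"id-{i}", "next_attempt": 100} for i in range(7)]
    + [{"id": "id-abc", "next_attempt": 0}]
)
sent, skipped = drain_once(queue, now=10)
print(sent, skipped)  # id-abc is sent only after 7 useless dequeues
```

Scale the seven parked messages up to thousands across many downed instances and the wasted work per pass dominates, which is the scaling concern being raised.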
Unless we get that from Core on request, but then that kind of defeats the purpose of the federation service having its own persistence layer.
However, that's part of the question here. Do we decide not to have a persistence layer and require the main app to maintain a table with the immutable data and respond to API calls? If so, then the federation service becomes a postal worker: it just picks up an envelope, reads the "to" address, and delivers it. It also accepts "mail" from other instances and pushes it into a return queue for the main app to open, read, and process.
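That "postal worker" model can be sketched as follows (function and queue names are made up for illustration): the federation service never opens a message, it only routes outbound envelopes by their "to" address and hands inbound mail to the core via a return queue.

```python
# Hypothetical sketch of the "postal worker" model: the federation
# service only routes envelopes; the core app owns all the payloads.
from collections import deque

def run_postal_worker(outbound, inbound, return_queue, send):
    # Deliver outbound envelopes to their destination instances,
    # reading only the "to" address (the body stays opaque/signed).
    while outbound:
        envelope = outbound.popleft()
        send(envelope["to"], envelope["body"])
    # Hand inbound federation mail to the core, unopened.
    while inbound:
        return_queue.append(inbound.popleft())

delivered = []
outbound = deque([{"to": "https://other.instance/inbox",
                   "body": "opaque-signed-activity"}])
inbound = deque(["mail-from-remote"])
return_queue = deque()

run_postal_worker(outbound, inbound, return_queue,
                  lambda to, body: delivered.append(to))
print(delivered, list(return_queue))
```

In this split, the immutable-activity table and its API endpoints live entirely in the main app, so the worker needs no persistence beyond the queues themselves.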
The thing we do at work is just to put it back into the queue and track how many times it was already tried.
Or just put it in a separate queue for the "retries".
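The requeue-and-count pattern described in the last two comments can be sketched like this (a toy model, with `MAX_ATTEMPTS` and the queue names as assumptions): failed messages go back into the main queue with an attempt counter, and only after too many failures are they parked in a separate "retries" queue.

```python
# Toy sketch of the retry pattern above: requeue failed messages with
# an attempt counter, and move them to a separate "retries" queue
# (a dead-letter queue, effectively) after too many attempts.
from collections import deque

MAX_ATTEMPTS = 3  # assumption for illustration

def process(queue, retries, deliver):
    delivered = []
    for _ in range(len(queue)):
        msg = queue.popleft()
        if deliver(msg):
            delivered.append(msg["id"])
        else:
            msg["attempts"] += 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                retries.append(msg)   # park it in the separate retry queue
            else:
                queue.append(msg)     # back into the main queue
    return delivered

queue = deque([{"id": "id-123", "attempts": 0},
               {"id": "id-456", "attempts": 0}])
retries = deque()
deliver = lambda m: m["id"] != "id-456"   # pretend id-456's instance is down

for _ in range(MAX_ATTEMPTS):
    process(queue, retries, deliver)
print([m["id"] for m in retries])  # id-456 ends up in the retry queue
```

This keeps the main queue short, at the cost of needing a second worker (or a delayed redelivery policy) to drain the retries queue.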
After careful consideration, I've made the decision to archive this repo and move all federation work into the core of Sublinks. As much as I want to build it in Go, I just can't justify the extra complexity it would add at this stage. Moving forward in the core application will allow for faster development since things that would be needed from the core by the federation stuff can be handled in the same PR. The goal will be to try to keep the federation code as decoupled as possible so that it could potentially be migrated out at a later time (though we may decide to keep it in the core app indefinitely)
Let's use this as an RFC (Request For Comments).
RFC: Persistence Layer
Topic: Should the Federation service have a persistence layer and if so, which one?
Preface
The list under "Proposed Solutions" is not intended to be taken as the only solutions to be considered. It is absolutely acceptable to suggest other options while this RFC is open. Additionally, please suggest other pros or cons for each solution so we can make a fully informed decision.
Problem
The federation service must keep track of which remote federated instances it has sent messages to as well as which it still needs to send messages to. In addition to this, it needs to store an immutable copy of the message that can be retrieved by any remote federated instance at any time (now or some unknown time in the future). Lemmy handles this using a number of tables in the database:
Proposed Solutions
1. Full RDBMS database
Create a separate database to live on its own and not share the same database as the core (along with a separate user + permissions). Having a separate database is necessary to avoid a situation where a migration in core inadvertently breaks federation (or vice versa).
Pros
Cons
2. Queue service (Rabbit MQ, SQS, etc..)
It is already an existing idea to use a queue service for communicating between the core and the federation services. A separate queue could be set up to handle retries when outbound sending fails to communicate with a host.
Pros
Cons
3. Simple DB (eg SQLite)
Instead of running a full-blown RDBMS, the federation service could make use of a SQLite DB.
Pros
Cons
4. Redis
Redis is a step between a Queue solution and a full RDMS database. It is a key/value store that has high performance. It can be run as a single instance, or can be run in a cluster mode to support fault tolerance and HA (high availability).
Pros
Cons
5. Queue service (Rabbit MQ, SQS, etc..)
No, this is not a repeat. The idea on this one is to reduce the scope of the federation service significantly. Instead of having the federation service responsible for both the transmission (sending/receiving) of messages as well as handling the endpoint requests using Accept: application/ld+json, we would move that logic into the core and have the federation service only manage the sending and receiving of messages. This is sort of similar to Lemmy's concepts of "worker" and "scheduler". The federation service would run in the background and simply process outbound requests using a queue and then pass inbound messages to the core via a queue as well. The only endpoints the federation service would implement would be the inbox and outbox endpoints for site, communities, & users. Core would be responsible for implementing endpoints for all activities (follow, undo, accept, comment, post, community, user, site).
Pros
Cons