kwakwaversal opened this issue 4 years ago
Hey @kwakwaversal thanks for the amazing write up.
the checkboxes in the README for this repo suggest you're already in Beta
Oops. Thanks for the heads-up! We are standardising release statuses across the whole org. I shifted it back to alpha - I think that's a fairer representation of the state of the server, but also of Supabase as a whole. We will move into Beta once we have some benchmarks and, more importantly, once we have hired someone who can support this repo full time (we're a very small team right now).
update or insert records to ES/Solr. Guaranteed delivery is obviously preferred for this.
Very cool. This one will be doable with the webhooks (https://github.com/supabase/realtime/issues/33). Once I iron out the issue with reusing replication slots (https://github.com/supabase/realtime/issues/22), it will guarantee the read from Postgres and the send to the webhook. I'm reluctant to call it guaranteed delivery, because once it sends it to ES it won't wait for an acknowledgement back. If the payload fails to reach ES (for whatever reason), the change event is lost too. This might be something we work on in the future, but it will be a major development (since it requires persistence).
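Roughly, once the webhooks land you would point them at something like this on your side (a sketch only - the payload shape, the /realtime-webhook route and record.id are assumptions, not the server's actual contract):

```ts
// Hypothetical receiver for the proposed realtime webhooks (#33): each change
// event is indexed into Elasticsearch. Payload fields (type/table/record) and the
// route are assumed for illustration; Node 18+ provides the global fetch.
import express from "express";

const app = express();
app.use(express.json());

const ES_URL = process.env.ES_URL ?? "http://localhost:9200";

app.post("/realtime-webhook", async (req, res) => {
  const { type, table, record } = req.body;
  if (type === "INSERT" || type === "UPDATE") {
    // Upsert the document. Note the acknowledgement stops here: the realtime
    // server never learns whether Elasticsearch accepted the event.
    const resp = await fetch(`${ES_URL}/${table}/_doc/${record.id}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(record),
    });
    if (!resp.ok) console.error("ES rejected event", await resp.text());
  }
  res.sendStatus(200);
});

app.listen(8080);
```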
the payload size including the types object, while useful, really adds up over thousands of records (even more if you're receiving the OLD record)
Interesting - also one that I hadn't thought of. I'll dig into the idea of a JSON transformer. The payload values are all strings, so it might be more universally useful to add a boolean flag like PARSE_PAYLOAD, and if true, this server could parse the payload into the correct types, dropping the columns key altogether. This will have an impact on performance, so it's one that we will have to weigh up. You would also lose a lot of detail (is it a smallint or bigint?), but in many cases that won't matter.
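On the client, the equivalent of that flag would look roughly like this (a sketch only - the Postgres type names and the payload shape are assumptions):

```ts
// Rough client-side equivalent of a PARSE_PAYLOAD flag: use the columns metadata
// to coerce the stringly-typed record values, then drop the columns key. The type
// names handled here are assumptions, not an exhaustive mapping.
type Column = { name: string; type: string };
type ChangeEvent = { columns: Column[]; record: Record<string, string | null> };

const INT_TYPES = new Set(["int2", "int4", "int8", "smallint", "integer", "bigint"]);
const FLOAT_TYPES = new Set(["float4", "float8", "numeric"]);

function parseRecord({ columns, record }: ChangeEvent): Record<string, unknown> {
  const parsed: Record<string, unknown> = {};
  for (const { name, type } of columns) {
    const raw = record[name];
    if (raw == null) parsed[name] = raw;
    else if (INT_TYPES.has(type)) parsed[name] = parseInt(raw, 10);
    else if (FLOAT_TYPES.has(type)) parsed[name] = parseFloat(raw);
    else if (type === "bool") parsed[name] = raw === "t" || raw === "true";
    else if (type === "json" || type === "jsonb") parsed[name] = JSON.parse(raw);
    else parsed[name] = raw; // text, uuid, timestamps, … stay as strings
  }
  return parsed; // the smallint vs bigint distinction is lost at this point
}
```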
My initial thoughts on this: storage is cheaper than compute. Therefore, it may not be the best idea if it's only to save space on a database. Any reason you couldn't cleanse the column from the payload after it has landed in ES?
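For example, an ingest pipeline could do that cleansing on the Elasticsearch side (a sketch - the pipeline and index names are made up):

```ts
// Create an ingest pipeline with a `remove` processor that strips the columns key,
// then make it the target index's default pipeline. Names are placeholders.
const ES_URL = process.env.ES_URL ?? "http://localhost:9200";

async function setupCleansingPipeline(): Promise<void> {
  await fetch(`${ES_URL}/_ingest/pipeline/strip-realtime-metadata`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      description: "Drop realtime metadata before indexing",
      processors: [{ remove: { field: "columns", ignore_missing: true } }],
    }),
  });

  // Every document indexed into `emails` now runs through the pipeline first.
  await fetch(`${ES_URL}/emails/_settings`, {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ index: { default_pipeline: "strip-realtime-metadata" } }),
  });
}
```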
I'm assuming the preferred approach to reduce the load on the server and the streaming output from PostgreSQL is to only create a publication for a subset of tables
Yes, that's correct. If there are tables you don't care about streaming, just don't enable the replication. Let me know if that's a problem for your use-case!
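For illustration, a scoped publication looks something like this (table names are placeholders), run here from Node with the pg client:

```ts
// Create a publication that only covers the tables you want streamed, instead of
// FOR ALL TABLES. The table names below are placeholders for this sketch.
import { Client } from "pg";

async function createScopedPublication(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  await client.query(
    "CREATE PUBLICATION supabase_realtime FOR TABLE public.emails, public.documents;"
  );
  await client.end();
}
```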
Subscribing to tables instead of creating lots of NOTIFY triggers (as we currently do in production).
In case you didn't see this, there is an 8000-byte limit for NOTIFY payloads, so be careful that your system isn't silently dropping the events. Postgres actually raises an error, but it's hard to catch. This is the exact reason why we created Realtime last year - I was using trigger/NOTIFY and discovered this the hard way...
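If you do end up back on trigger/NOTIFY, the usual way around the limit (just a sketch, not something Realtime does for you) is to notify with only the primary key and let the listener fetch the full row:

```ts
// Listener-side sketch: the trigger sends pg_notify('email_changes', NEW.id::text),
// which stays far below the 8000-byte limit, and we load the row on demand.
import { Client } from "pg";

async function listenForEmailChanges(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  await client.query("LISTEN email_changes");

  client.on("notification", async (msg) => {
    const id = msg.payload; // just the primary key, never the whole record
    const { rows } = await client.query("SELECT * FROM emails WHERE id = $1", [id]);
    if (rows[0]) {
      // forward rows[0] to ES/Solr (or wherever) here
    }
  });
}
```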
I will explore a JSON transformer with one of our team and let you know. It could take some time but I think it's worth considering as part of our work on https://github.com/supabase/realtime/issues/47
I'm reluctant to call it guaranteed delivery, because once it sends it to ES it won't wait for an acknowledgement back.
I think it would make sense for me to handle the writes to Solr myself rather than create a webhook in the server. I would need to update Solr in FIFO order, but I read in https://github.com/supabase/realtime/issues/33#issuecomment-635010703 that you suggested that for webhooks that don't return a 2XX, a log is written somewhere. That would break the order of the records that update the relevant documents in Solr, which isn't ideal.
As a side note, if records that are not successfully forwarded to a webhook aren't easily queried (and optionally extracted), I think this might be an issue. A log is great obviously, but I would like to see some stats if some webhooks fail for whatever reason. Programmatically polling the server for failed webhooks would be helpful, but then you're changing the simplicity of the server. It does sound more and more like you're going to end up having to write some job queue for the webhooks, which is non-trivial. Or at the very least guarantee writes to somewhere so that events are not lost.
The payload values are all strings so it might be more universally useful to add a boolean flag like PARSE_PAYLOAD, and if true, this server could parse the payload into the correct types, dropping the columns key altogether.
Parsing the payload into the correct types might be useful, but I wouldn't say it's necessary. A JSON transformer would allow someone to cherry-pick properties so that, in a table of 30 columns, it would return just 3, and also rename the columns in the process.
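For example, with jsonata a subscription could carry an expression like this (just a sketch, the field names are invented):

```ts
// Keep 3 of 30 columns and rename them on the way through. The record fields
// (email_subject, sent_at, …) are invented for this example.
import jsonata from "jsonata";

const pickAndRename = jsonata(`{
  "id": record.id,
  "subject": record.email_subject,
  "sentAt": record.sent_at
}`);

async function transform(changeEvent: unknown): Promise<unknown> {
  // The other 27 columns and the types metadata are never serialised to the client.
  return await pickAndRename.evaluate(changeEvent);
}
```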
storage is cheaper than compute
Therefore, it may not be the best idea if it's only to save space on a database. Any reason you couldn't cleanse the column from the payload after it has landed in ES?
When I mentioned the payload size, I wasn't talking about storing it; I was literally talking about the payload size that's being sent across interfaces. The network traffic between hosts.
If I have a DB server and put supabase/realtime on there, then with a consuming websocket server on a different host I'm going to get a lot of data being sent between the hosts. It's quite possible that I am throwing away a lot of that data, so it would be more performant for me to revert back to sending NOTIFY from triggers. But then I lose some of the flexibility that supabase/realtime was giving me originally by not having to keep changing my database migrations to add/remove columns from the NOTIFY payload. As discussed, the JSON transformers would really shine here.
In case you didn't see this, there is an 8000-byte limit for NOTIFY payloads
Hah, yeah. I ran into this when trying to send emails stored in the database over the notifies. I mean, it makes sense to maintain performance, but at the time it was a head-scratcher.
I will explore a JSON transformer with one of our team and let you know. It could take some time but I think it's worth considering as part of our work on #47
Great stuff.
For anyone else that might stumble across this issue looking for a way of syncing PostgreSQL to ES/Solr, https://debezium.io/ looks like a good alternative. I was hoping not to add too much extra to the stack, because this becomes: ES/Solr, Debezium, Zookeeper, Kafka, but I guess that's the price you have to pay for guaranteed delivery in this instance.
I was literally talking about the payload size that's being sent across interfaces. The network traffic between hosts.
Got it, that makes sense.
Debezium
Yes, great system! Definitely heavy, but if you want full functionality I think it is the only thing on the market right now.
Thanks again for the context. I will need to chat to Francesco about this use-case. He's not active right now, so unfortunately it could take a while to see progress on this one. Nonetheless, here are the actions I have:
@kwakwaversal
Addendum
For anyone else that might stumble across this issue looking for a way of syncing PostgreSQL to ES/Solr, https://debezium.io/ looks like a good alternative. I was hoping not to add too much extra to the stack, because this becomes: ES/Solr, Debezium, Zookeeper, Kafka, but I guess that's the price you have to pay for guaranteed delivery in this instance.
I was looking for a Debezium alternative (to avoid the JVM stack) and stumbled upon this issue. Did you succeed in your attempt to use Supabase for Elasticsearch sync? In my case, I have found the Elasticsearch alternative MeiliSearch (crosslinking https://github.com/meilisearch/integration-guides/issues/20).
On another note, for JSON transformation, I think these libraries are better and more popular: https://github.com/jmespath/jmespath.js, https://github.com/jsonata-js/jsonata or https://github.com/wankdanker/node-object-mapper (FWIW, https://www.npmtrends.com/json-query-vs-jsonata-vs-JSONPath-vs-jsonpath-vs-jsonpath-plus-vs-jmespath-vs-object-mapper).
Thanks for the parsers @rrjanbiah - we might need to find something elixir-native, but these are good prior art to work from.
Our current discussion is to update the event parser to be more in line with AWS's Simple Workflows, e.g. https://states-language.net/#choice-state. This is a full state machine, which should make this a very versatile server. It's a large task though, so we'll have to spend time figuring out if it's possible and if we can ship it in parts.
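To give a feel for it, a Choice-style rule for routing change events could look something like this (the comparison fields follow the linked spec; the state names and the $.type / $.table paths are invented for illustration):

```ts
// A Choice state in the style of the Amazon States Language: route INSERTs on the
// emails table to a webhook state and DELETEs to an index-cleanup state.
const routeChangeEvent = {
  Type: "Choice",
  Choices: [
    {
      And: [
        { Variable: "$.table", StringEquals: "emails" },
        { Variable: "$.type", StringEquals: "INSERT" },
      ],
      Next: "ForwardToSearchWebhook",
    },
    { Variable: "$.type", StringEquals: "DELETE", Next: "RemoveFromIndex" },
  ],
  Default: "Ignore",
};
```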
@kiwicopple
Thanks for the parsers @rrjanbiah - we might need to find something elixir-native, but these are good prior art to work from.
It looks unmaintained, but I found this one: https://github.com/stephan83/ex-jmes
Our current discussion is to update the event parser to be more in line with AWS's Simple Workflows, e.g. https://states-language.net/#choice-state. This is a full state machine, which should make this a very versatile server. It's a large task though, so we'll have to spend time figuring out if it's possible and if we can ship it in parts.
I don't quite understand here... but since you're referring to state machines, you may want to check https://github.com/davidkpiano/xstate as that's quite popular.
This is a great idea and with Realtime v2 we have an extensions concept where we could additionally insert your realtime feed into another database.
https://github.com/supabase/realtime/issues/40#issuecomment-680699263
Hi @kiwicopple. I'm currently in an exploratory stage following a requirement to sync data from PostgreSQL to Solr or Elasticsearch, and I'm seeing what's out there. I knew logical replication was the only way to go, but I needed to explore existing solutions. I heard about supabase realtime a year ago, so I've been kicking the tyres to see what it could do.
My thinking was to subscribe to tables of interest and update or insert records to ES/Solr. Guaranteed delivery is obviously preferred for this. I don't like the idea of having to occasionally sync hundreds of thousands of emails that might be out of sync.
While exploring supabase/realtime, however, I see how it might be beneficial for other aspects of a product I'm working on.

• Subscribing to tables instead of creating lots of NOTIFY triggers (as we currently do in production).

One thing that jumped out about this is that the payload size including the types object, while useful, really adds up over thousands of records (even more if you're receiving the OLD record). I think I'd prefer it if, when subscribing, I could request specific data.

This seems to be partly discussed here: https://github.com/supabase/realtime/issues/47 but that seems to just be conditionally filtering data (which is good). What I would find useful would be requesting a subset of the data so that, optionally, only the bare minimum needs to be transmitted between the realtime server and client. A JSON query and transformation language like https://github.com/schibsted/jslt or https://github.com/jsonata-js/jsonata would be ideal, as this could be passed as an argument on subscription and transform the data on the server side. I don't know whether equivalent libraries are available in Elixir.
Question: I'm assuming the preferred approach to reduce the load on the server and the streaming output from PostgreSQL is to only create a publication for a subset of tables rather than CREATE PUBLICATION supabase_realtime FOR ALL TABLES;?

BTW the checkboxes in the README for this repo suggest you're already in Beta (https://github.com/supabase/realtime#status) but the paragraph below suggests otherwise. I didn't read the paragraph below originally, just the checkboxes, so I thought you were in Beta. :)