splitgraph / seafowl

Analytical database for data-driven Web applications 🪶
https://seafowl.io
Apache License 2.0
439 stars 13 forks source link

Subscriptions #337

Open gedw99 opened 1 year ago

gedw99 commented 1 year ago

Is there a way to subscribe to a query and only get changed / new / deleted results ?

gruuya commented 1 year ago

Not at the moment or in the near-term future, but it could be something we'd want to add later on (sorry for a bit of a late response).

gedw99 commented 1 year ago

On thanks .. will close then

gedw99 commented 1 year ago

@gruuya

i Wonder if the new rust delta lane storage changes this ??

https://github.com/delta-io/delta-rs

gruuya commented 1 year ago

Not really; note that the delta-rs migration effectively only changed the storage layer behind DataFusion's TableProvider::scan/ExecutionPlan::execute, and doesn't offer some primitives/hooks for query changed / new / deleted results type of subscriptions.

If you have some particular scenario/problem in mind feel free to add it here, so that it can serve as a guideline if/when we come back to this issue (it may also be that we think of another solution).

gedw99 commented 1 year ago

Well I am thinking about how a users does a mutation into data fusion. One way is use CDC off the original system like postresql or google sheets.

So the mutations goes into that system.

Then we have the reactive data scenarios. If you have a CDC then you can see the changes of the master data source like postresql or google sheets, so then you can get the the middle and correlate what data changes against what your uses are currently looking out that they did a query for a minute again , and so send them a change stream.

This only works though if you can do that correlation.

do you get the concept ?

i already do cdc on all this stuff .

also you guys are not using Flight SQL yet ? Cause this helps with doing the correlation .

gruuya commented 1 year ago

so then you can get the the middle and correlate what data changes against what your uses are currently looking out that they did a query for a minute again , and so send them a change stream.

I'm probably missing something here, but if the underlying issue is simply about making sure the users see the latest data note that this happens automatically once a change has been made. This touches on

how a users does a mutation into data fusion

in Seafowl we support a couple of ways of changing data (all of them batch-oriented):

As soon as some data is written to a table, this results in a new table version. This annuls the cached responses, and all subsequent queries always target the latest version by default.

You can also target previous table versions using the time travel syntax (as well as look at e.g. the diff between successive table versions).

do you get the concept ?

More likely, maybe you're referring to enabling streaming replication from master data sources using CDC files in a particular format, as a new approach to changing/writing data in addition to the ones above?

also you guys are not using Flight SQL yet ?

No, not currently.

gedw99 commented 1 year ago

Thanks @gruuya for the in-depth links. It’s me that’s missing something :)

I did not realise these aspects were done by seafowl for mutations. I am still getting familiar with data fusion aspects.

about Streaming changes via CDC. I currently do this already in other projects using nats message broker. The use case is that when a user updates their google sheets ir DB, they can see it in data fusion system.

I really want to adopt DataFusion and so am brain storming it with you.

The more I think about it the more I see it as being non intrusive. The cdc system just needs a way to talk to data fusion system and transform the data in.

I can track what user is “ watching what query “ out of bounds , and so then just rerun the query on data fusion and send the deltas to the user. I wonder if there is a way to track the version of a query and the data as a hash ? So then because data fusion versions data I can iWork out the diff a d send that to the user. So then I have subscriptions out of bounds of the data fusion system.

If you want to have a chat / video about it I would really like to explore it ..

gedw99 commented 1 year ago

Am reopening . Hoping to discuss this ..

gedw99 commented 1 year ago

(as well as look at e.g. the diff between successive table versions)

Is this available at the API / http level ? Cause then I can do subscriptions and only send what changed. I do all this by wrapping seafowl api with NATS message broker and so know what users are doing what. It also gives my global load balancing and auth / authz.

gruuya commented 1 year ago

Aha no worries, I think we're (getting) on the same page.

I wonder if there is a way to track the version of a query and the data as a hash ?

You could look at the ETag of the response, as this is currently based on the which specific data files are used in a query. Note that this doesn't involve changes to the query plan, hence issue #57. (There's also the system.table_versions metadata table for browsing table versions if that's useful.)

If you want to have a chat / video about it I would really like to explore it ..

Oh sure, I'd love to see if we can converge here—how about discord for starters: https://discord.gg/eFEFRKm?

Is this available at the API / http level ?

Yup, it works as any other query (check out the prior-to-last section of this blog post for an example); we just have the awkward syntax since the AS OF clause isn't supported by sqlparser for now.

mrchypark commented 4 months ago

I'm also interested in this topic.

I'm grateful to be trying Seafowl for log storage. As new data is constantly coming in, the table version keeps increasing, which causes the cache to be invalidated continuously.

As a workaround strategy, I'm creating tables by date or using similar approaches.

However, it would be great if we could achieve the same effect within a single table.

I'm not quite sure how to do this yet!