latitude-dev / latitude

Developer-first embedded analytics
https://latitude.so
GNU Lesser General Public License v3.0

Cloud Embedded Storage layer #369

Open andresgutgon opened 2 weeks ago

andresgutgon commented 2 weeks ago

What is this?

We want to offer users the possibility of replicating data from their databases (sources) into Parquet files stored in an S3 bucket. All subsequent queries are then run with a DuckDB client.

The benefit is that users' databases receive less load, and Parquet/DuckDB queries load very fast.

How it works.

Users mark a query as part of the embedded layer by setting some parameters (TO BE DEFINED) in the query's config. This configuration tells our embedding layer:

  1. That this query has to be stored in S3 as a Parquet file
  2. How often (every how many days/hours) the data is fetched from the original database

Once this config is in place and the query is stored in the embedded storage layer, users can reference this query exactly as they would if it ran against their database. Under the hood, the Latitude app will fetch this info from S3.
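As a minimal sketch of the flow above, a reference to a materialized query could resolve to a deterministic Parquet key in S3. Everything here is hypothetical (the config shape and path layout are not decided yet — the real parameters are still TO BE DEFINED):

```typescript
// Hypothetical config a query could carry to opt into the embedded layer.
interface MaterializedQueryConfig {
  materialize: boolean;         // store this query's result as Parquet in S3
  refreshIntervalHours: number; // how often to re-fetch from the source DB
}

// Hypothetical resolver: given a workspace/app/query path, build the S3 key
// where the materialized Parquet copy would live. Purely illustrative.
function parquetS3Key(workspace: string, app: string, queryPath: string): string {
  const base = queryPath.replace(/\.sql$/, '');
  return `${workspace}/${app}/${base}.parquet`;
}

console.log(parquetS3Key('acme', 'analytics', 'sales/monthly.sql'));
// → acme/analytics/sales/monthly.parquet
```

When the app resolves a query reference, it would check this config and, if `materialize` is set, read the Parquet file at that key with DuckDB instead of hitting the user's database.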

Considerations

Using the embedded layer comes with some considerations:

a) All queries referencing this query have to use DuckDB SQL syntax, not the SQL syntax of the user's own database.

b) We need to expose metadata in these queries saying when the query was last updated from the original source.

c) How do we check for periodic updates? I think we need some kind of cron job that walks

```
workspaces -> apps -> queries-in-those-apps
```

and checks which queries need to be stored in Parquet and which need to be refreshed. I think this system has to be a separate piece from the current Latitude server that runs the queries. If we want to go down this path, that service needs access to queries<-from-app<--from-workspace somehow. We should start storing the apps reference.
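The cron pass described in (c) could boil down to a simple per-query staleness check; a sketch, with field names that are assumptions on my part:

```typescript
// Decide whether a materialized query is due for a refresh.
// lastRefreshedAt is when we last copied source data into Parquet (null if
// never materialized); refreshIntervalHours comes from the query's
// (to-be-defined) config.
function needsRefresh(
  lastRefreshedAt: Date | null,
  refreshIntervalHours: number,
  now: Date = new Date(),
): boolean {
  if (lastRefreshedAt === null) return true; // never materialized yet
  const ageMs = now.getTime() - lastRefreshedAt.getTime();
  return ageMs >= refreshIntervalHours * 60 * 60 * 1000;
}
```

The cron job would walk workspaces -> apps -> queries, call a check like this for each query, and enqueue a re-materialization for every stale one.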

TODO

andresgutgon commented 1 week ago

For Parquet creation/reading, this one looks like the best maintained: https://github.com/LibertyDSNP/parquetjs

Comparison with others

andresgutgon commented 1 week ago

For DuckDB, this is the official TypeScript wrapper around the Node.js client: https://www.npmjs.com/package/duckdb-async

I see DuckDB supports writing to Parquet directly: https://duckdb.org/docs/data/parquet/overview#writing-to-parquet-files. Not sure if we can use this somehow and avoid the parquetjs package.
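If we go that route, the export could be a single `COPY` statement issued through the DuckDB client. A sketch of building that statement (the bucket/path here is made up, and real code would have to escape inputs):

```typescript
// Build a DuckDB COPY statement that writes a query's result straight to
// Parquet, skipping a separate Parquet library. DuckDB also infers the
// format from the .parquet extension, but we pass it explicitly.
function buildCopyToParquet(selectSql: string, s3Path: string): string {
  // NOTE: inputs are trusted here; a real implementation must escape them.
  return `COPY (${selectSql}) TO '${s3Path}' (FORMAT PARQUET);`;
}

console.log(buildCopyToParquet('SELECT * FROM users', 's3://bucket/users.parquet'));
// → COPY (SELECT * FROM users) TO 's3://bucket/users.parquet' (FORMAT PARQUET);
```

The resulting string would be handed to the duckdb-async client; writing to `s3://` paths additionally needs DuckDB's httpfs extension loaded and S3 credentials configured.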

andresgutgon commented 1 week ago

I'm reading about reading from a DB (PostgreSQL/MySQL) and writing directly to Parquet: https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html

```sql
-- connect to the Postgres instance with the given parameters in read-only mode
ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE POSTGRES, READ_ONLY);
COPY ⟨table_name⟩ TO 's3://bucket/file.parquet';
```

The problem with this approach is that DuckDB doesn't have all the connectors we support. Example (Snowflake): https://www.reddit.com/r/DuckDB/comments/1bp1977/connect_duckdb_with_snowflake_via_adbc/

geclos commented 1 week ago

Questions:

andresgutgon commented 1 week ago

This project could be interesting for connecting to sources that support ODBC connections, like Snowflake: https://github.com/rupurt/odbc-scanner-duckdb-extension