latitude-dev / latitude

Developer-first embedded analytics
https://latitude.so
GNU Lesser General Public License v3.0

Cloud Embedded Storage layer #369

Open andresgutgon opened 2 weeks ago

andresgutgon commented 2 weeks ago

What is this?

We want to offer users the possibility of replicating data from their databases (sources) into Parquet files stored in an S3 bucket. All subsequent queries are then run with a DuckDB client.

The benefit is that users' databases receive less load, and Parquet/DuckDB queries load very fast.

How it works.

Users mark a query as part of the embedded layer by setting some parameters (TO BE DEFINED) in the query's config. This configuration tells our embedding layer:

  1. That this query has to be stored in S3 as a Parquet file
  2. How often (every how many days/hours) the data is fetched from the original database

Once this config is in place and the query is stored in the embedded storage layer, users can reference this query exactly as they would if it ran against their database. Under the hood, the Latitude app will fetch this info from S3.
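As a minimal sketch of the flow above, a reference to a materialized query could resolve to a deterministic Parquet key in S3. Everything here is hypothetical (the config shape and path layout are not decided yet — the real parameters are still TO BE DEFINED):

```typescript
// Hypothetical config a query could carry to opt into the embedded layer.
interface MaterializedQueryConfig {
  materialize: boolean;         // store this query's result as Parquet in S3
  refreshIntervalHours: number; // how often to re-fetch from the source DB
}

// Hypothetical resolver: given a workspace/app/query path, build the S3 key
// where the materialized Parquet copy would live. Purely illustrative.
function parquetS3Key(workspace: string, app: string, queryPath: string): string {
  const base = queryPath.replace(/\.sql$/, '');
  return `${workspace}/${app}/${base}.parquet`;
}

console.log(parquetS3Key('acme', 'analytics', 'sales/monthly.sql'));
// → acme/analytics/sales/monthly.parquet
```

When the app resolves a query reference, it would check this config and, if `materialize` is set, read the Parquet file at that key with DuckDB instead of hitting the user's database.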

Considerations

Using the embedded layer comes with some considerations:

a) All queries referencing this query have to use DuckDB SQL syntax, not the SQL syntax of the user's own database.

b) We need to expose metadata in these queries saying when the query was last updated from the original source.

c) How do we check for periodic updates? I think we need some kind of cron job that walks

```
workspaces -> apps -> queries-in-those-apps
```

and checks which queries need to be stored in Parquet and which need to be refreshed. I think this system has to be a separate piece from the current Latitude server that runs the queries. If we want to go down this path, that service needs access to queries<-from-app<--from-workspace somehow. We should start storing the apps reference.
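The cron pass described in (c) could boil down to a simple per-query staleness check; a sketch, with field names that are assumptions on my part:

```typescript
// Decide whether a materialized query is due for a refresh.
// lastRefreshedAt is when we last copied source data into Parquet (null if
// never materialized); refreshIntervalHours comes from the query's
// (to-be-defined) config.
function needsRefresh(
  lastRefreshedAt: Date | null,
  refreshIntervalHours: number,
  now: Date = new Date(),
): boolean {
  if (lastRefreshedAt === null) return true; // never materialized yet
  const ageMs = now.getTime() - lastRefreshedAt.getTime();
  return ageMs >= refreshIntervalHours * 60 * 60 * 1000;
}
```

The cron job would walk workspaces -> apps -> queries, call a check like this for each query, and enqueue a re-materialization for every stale one.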

TODO

andresgutgon commented 1 week ago

For Parquet creation/reading, this one looks like the best maintained: https://github.com/LibertyDSNP/parquetjs

Comparison with others

andresgutgon commented 1 week ago

For DuckDB, this is the official TypeScript wrapper around the Node.js client: https://www.npmjs.com/package/duckdb-async

I see DuckDB supports writing to Parquet directly: https://duckdb.org/docs/data/parquet/overview#writing-to-parquet-files. Not sure if we can use this somehow and avoid the parquetjs package.
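If we go that route, the export could be a single `COPY` statement issued through the DuckDB client. A sketch of building that statement (the bucket/path here is made up, and real code would have to escape inputs):

```typescript
// Build a DuckDB COPY statement that writes a query's result straight to
// Parquet, skipping a separate Parquet library. DuckDB also infers the
// format from the .parquet extension, but we pass it explicitly.
function buildCopyToParquet(selectSql: string, s3Path: string): string {
  // NOTE: inputs are trusted here; a real implementation must escape them.
  return `COPY (${selectSql}) TO '${s3Path}' (FORMAT PARQUET);`;
}

console.log(buildCopyToParquet('SELECT * FROM users', 's3://bucket/users.parquet'));
// → COPY (SELECT * FROM users) TO 's3://bucket/users.parquet' (FORMAT PARQUET);
```

The resulting string would be handed to the duckdb-async client; writing to `s3://` paths additionally needs DuckDB's httpfs extension loaded and S3 credentials configured.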

andresgutgon commented 1 week ago

I'm reading about reading from a DB (PostgreSQL/MySQL) and writing directly to Parquet: https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html

```sql
-- connect to the Postgres instance with the given parameters in read-only mode
ATTACH 'dbname=postgres user=postgres host=127.0.0.1' AS db (TYPE POSTGRES, READ_ONLY);
COPY ⟨table_name⟩ TO 's3://bucket/file.parquet';
```

The problem with this approach is that DuckDB doesn't have all the connectors we support. Example (Snowflake): https://www.reddit.com/r/DuckDB/comments/1bp1977/connect_duckdb_with_snowflake_via_adbc/

geclos commented 1 week ago

Questions:

andresgutgon commented 1 week ago

This project could be interesting for connecting to sources that support ODBC connections, like Snowflake: https://github.com/rupurt/odbc-scanner-duckdb-extension