Existing Duck db file as a data source

DingoEatingFuzz commented 1 year ago

Is your feature request related to a problem? Please describe.

I have existing duckdb files lying around (and also a program that takes a zip and emits a duck) that I would like to analyze with Rill.

Describe the solution you'd like

My ideal (albeit maybe naive) workflow would be:

Add data source
Local file
Select a duckdb file
Internally Rill Attaches the db
Each table is now a data source

Describe alternatives you've considered

As far as I can tell the only alternative is to use CSV/JSON/Parquet files, but this is not a realistic option. The major reason is that the value have going straight from zip to duck is codifying the data modeling of files in the zip--especially complex MAP(VARCHAR, STRUCT(...)) types for deeply-nested JSON.

Additional context

Rill looks amazing! Maybe I'm just holding it wrong with this feature request, but I really want the dashboarding features without necessarily using the data modeling features.

begelundmuller commented 1 year ago

Hey, thanks for the request. The use case definitely makes sense to me.

Firstly, here's a way you might be able to get this working today – use rill start --db my_duckdb.db to connect Rill to your DuckDB database file (instead of the default stage.db file). Your existing tables will not show up in Rill, but you can reference them in model or dashboard definitions (make sure not to create a source or model that overlaps with any of the existing table names since they will then overwrite them).

Secondly, I'm thinking about how we can better support your use case. If we added support for the following, do you think that would address your use case?

Support an init.sql file where you can use ATTACH to attach existing databases
Show all pre-existing tables in the database in the sidebar

Third, do you have a hosting use case for your dashboards? If yes, can you share where you store/host the DuckDB .db files you want to connect to?

DingoEatingFuzz commented 1 year ago

Oooo, I'm going to try your suggesting of using --db today!

Support an init.sql file where you can use ATTACH to attach existing databases

I can see this being useful. It's still fuzzy in my mind how I'd like the overall workflow to go, but this meets the requirement of Person A writes init.sql and Person B clones a repo or whatever and launches Rill without needing to get into the weeds.

Show all pre-existing tables in the database in the sidebar

I would love this. Rill as a database client and also a dashboarding tool <3

Third, do you have a hosting use case for your dashboards? If yes, can you share where you store/host the DuckDB .db files you want to connect to?

Not yet! I'll probably try a few things while prototyping. Right now I think something low-tech would work best. Like a private S3 bucket. I think this would be a sweet spot between local analysis and full-bore data pipelines, and it would be hard for someone to screw up--worst case is they upload an invalid file; that still wouldn't corrupt existing data.

Thank you for your quick response!

DingoEatingFuzz commented 1 year ago

Coming back to say that the --db flag worked like a charm ✨

kmatt commented 1 year ago

@begelundmuller Would the --db flag allow use of a DuckDB file containing views that can be referenced by models?

This allows definition of data sources reading local data files, but with transforms in the database that Rill sees as tables.

(make sure not to create a source or model that overlaps with any of the existing table names since they will then overwrite them)

I'd suggest a change where Rill throws an error rather than overwriting DuckDB objects for this approach.

begelundmuller commented 1 year ago

@begelundmuller Would the --db flag allow use of a DuckDB file containing views that can be referenced buy models?

This allows definition of data sources reading local data files, but with transforms in the database that Rill sees as tables.

Hey @kmatt, yes it would allow that, but if it's a view that queries files directly, you might run into performance issues since Rill issues a lot of queries and therefore benefits from data being ingested into DuckDB's native format (we use CREATE TABLE ... AS (SELECT ...) for sources instead of CREATE VIEW ...).

However, in the last release (0.32), we actually added support for free-form DuckDB SQL sources – I think they should address your use case. The syntax is:

# sources/my_source.yaml
type: duckdb
sql: "SELECT * FROM read_parquet('./data/my_data.parquet')" # can be any valid DuckDB SELECT statement

(make sure not to create a source or model that overlaps with any of the existing table names since they will then overwrite them)

I'd suggest a change where Rill throws an error rather than overwriting DuckDB objects for this approach.

Unfortunately we can't easily do this. Since code file definitions and the underlying data can change in-between Rill sessions, we don't always know if a table in DuckDB was created by a previous version of Rill or by the user. We keep some metadata about ingested data in an internal table (rill.catalog) that could help with that, but there are some error conditions where it will go out of sync (DuckDB transactions could help here, but we have found them to be too unstable, so we avoid them as much as possible for the time being).

What if we instead added support for creating tables in a custom database schema? For example, we could add a flag rill start --duckdb-schema custom, where Rill would create views and tables as custom.my_source instead of main.my_source (main is the default schema in DuckDB). Would that be useful?

k-anshul commented 11 months ago

Hey

We just merged the changes that allows DuckDB data file to be used as a source. This will be available in the next release version but can now be used with the nightly as well (rill upgrade --nightly can be used to install the latest nightly version). Here's a sample source YAML that uses a duckDB file as a data source.

type: duckdb
db: /path/to/src.db
sql: select * from table

rilldata / rill

Existing Duck db file as a data source #2910