jwills / target-duckdb

A Singer.io target for DuckDB

Feature request: Add support for using Parquet as external storage #20

Open aaronsteers opened 1 year ago

aaronsteers commented 1 year ago

Updated issue description (2023-03-30):

There are some great use cases where we'd love to use target-duckdb as an interop layer to write Parquet files.

Today, users sometimes build data flows where they first load with target-parquet and then transform with dbt-duckdb. A more streamlined approach would be to let target-duckdb and dbt-duckdb both operate on the same Parquet-based datastore.

From the comment thread below in https://github.com/jwills/target-duckdb/issues/20#issuecomment-1464283936:

...As we think towards where to invest future efforts, and where to direct community members who want to interop with Spark and/or data lakes, I think target-duckdb might ultimately be a better layer for "table-like" operations in data lake paradigms.

I'm not sure how target-parquet would handle a merge upsert operation, for instance, whereas DuckDB's support for SQL transformations could likely be a better interface for data lake management operations.

Original question

We have some users interested in storing data within Parquet. Can this target be used in combination with DuckDB's support for Parquet datasets?
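For illustration, a minimal sketch of the kind of interop being asked about, using DuckDB's Python API directly; the table and file names are made up, and none of this is existing target-duckdb behavior:

import duckdb

con = duckdb.connect()  # in-memory DuckDB database

# Land some rows (a toy table here) and write them out as a Parquet file.
con.execute("CREATE TABLE users AS SELECT 1 AS id, 'alice' AS name")
con.execute("COPY users TO 'users.parquet' (FORMAT PARQUET)")

# Later, query the Parquet file directly without re-importing it.
rows = con.execute("SELECT * FROM read_parquet('users.parquet')").fetchall()
print(rows)  # [(1, 'alice')]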
jwills commented 1 year ago

Hey AJ-- I think @matsonj just used target-parquet for that in his MDS in a box project b/c of the (current) instability of the DuckDB file format, and then used the support for external sources in dbt-duckdb to do transformations on the resulting data.

I was just going to take a pass over this repo to do some updates for DuckDB 0.7.x-- is there some reason target-parquet wouldn't work for your user, or something that I could improve on it using DuckDB?

aaronsteers commented 1 year ago

Hi, @jwills. Re:

is there some reason target-parquet wouldn't work for your user, or something that I could improve on it using DuckDB?

No reason I know of. I'm totally happy to recommend that model - target-parquet, with dbt-duckdb then consuming from the landed parquet files.

As we think towards where to invest future efforts, and where to direct community members who want to interop with Spark and/or data lakes, I think target-duckdb might ultimately be a better layer for "table-like" operations in data lake paradigms.

I'm not sure how target-parquet would handle a merge upsert operation, for instance, whereas DuckDB's support for SQL transformations could likely be a better interface for data lake management operations.
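To make that concrete, a hedged sketch of what a merge-style upsert over Parquet-backed data could look like via DuckDB's Python API; the table and file names are hypothetical, and this is not something target-duckdb does today:

import duckdb

con = duckdb.connect()

# Pretend an earlier run already landed data in a Parquet file.
con.execute("COPY (SELECT 1 AS id, 'alice' AS name) TO 'users.parquet' (FORMAT PARQUET)")

# Load the existing data into a keyed table, upsert an incoming batch,
# and write the merged result back out (here to a new file).
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name VARCHAR)")
con.execute("INSERT INTO users SELECT * FROM read_parquet('users.parquet')")
con.execute("""
    INSERT INTO users VALUES (1, 'alice-updated'), (2, 'bob')
    ON CONFLICT (id) DO UPDATE SET name = excluded.name
""")
con.execute("COPY users TO 'users_merged.parquet' (FORMAT PARQUET)")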

There's no rush on this, by the way. I just wanted to start this thread to see if what I'm thinking of would make sense.

jwills commented 1 year ago

Yeah, your reasoning there re: upsert operations makes sense and is valid IMO. I'm going to turn my attention back to this project next week once I get some dbt-duckdb stuff I've been working on out the door and I will look hard at making parquet support for this target into a first-class concept.

aaronsteers commented 1 year ago

Yeah, your reasoning there re: upsert operations makes sense and is valid IMO.

Thanks for this validation!

I'm going to turn my attention back to this project next week once I get some dbt-duckdb stuff I've been working on out the door and I will look hard at making parquet support for this target into a first-class concept.

Sounds great. Again, no rush from our side. Nothing per se is broken as of now, and this is more of a long-term strategic investment, I think.

I'll close this issue since the question is answered. Thanks again, and let us know if we can help in any way.

aaronsteers commented 1 year ago

Reopening (with an updated title) because I've been hearing a lot of interest in this.

I've updated the description to be more direct in terms of what I think the next steps may be.

cc @kgpayne

jwills commented 1 year ago

okay, cool-- as you can tell I've done ~ nothing to move this forward; do you want to chat about it somewhere? Meltano Slack?

aaronsteers commented 1 year ago

@jwills - great idea! I created a new channel for this: #-duckdb-warehousing-dev

(Join link for anyone not already in our slack: https://meltano.com/slack)

aaronsteers commented 1 year ago

Looks like @kgpayne has an implementation POC using external storage here:

ReneTC commented 1 year ago

Any updates on this? I'm running into the versioning error, with target-duckdb being older than the dbt-duckdb version. Can the merged feature solve the issue? I'm not sure how to use it; I was looking for some documentation without luck.

jwills commented 1 year ago

@ReneTC I think the move is to use a virtualenv-type solution to align your duckdb, dbt-duckdb, and target-duckdb versions together; I'd recommend:

duckdb==0.8.1
dbt-duckdb==1.5.2
target-duckdb==0.6.0

...but I'm on vacation for a couple of weeks and haven't tried them in combination yet.

ReneTC commented 1 year ago

Thanks, I'll test this tomorrow and report back. Edit: confirming that a Meltano install with whatever tap and

  loaders:
  - name: target-duckdb
    variant: jwills
    pip_url: target-duckdb==0.6.0
  transformers:
  - name: dbt-duckdb
    variant: jwills
    pip_url: dbt-core~=1.5.0 dbt-duckdb==1.5.2

works. Thank you!
