Open aaronsteers opened 1 year ago
Hey AJ-- I think @matsonj just used target-parquet for that in his MDS in a box project b/c of the (current) instability of the DuckDB file format, and then used the support for external sources in dbt-duckdb to do transformations on the resulting data.
I was just going to take a pass over this repo to do some updates for DuckDB 0.7.x-- is there some reason target-parquet wouldn't work for your user, or something that I could improve on it using DuckDB?
Hi, @jwills. Re:
is there some reason target-parquet wouldn't work for your user, or something that I could improve on it using DuckDB?
No reason I know of. I'm totally happy to recommend that model - target-parquet, with dbt-duckdb then consuming from the landed parquet files.
As we think towards where to invest future efforts, and where to direct community members who want to interop with Spark and/or data lakes, I think target-duckdb might ultimately be a better layer for "table-like" operations in data lake paradigms.
I'm not sure how target-parquet would handle a merge upsert operation, for instance, whereas DuckDB's support for SQL transformations could likely be a better interface for data lake management operations.
There's no rush on this, by the way. I just wanted to start this thread to see if what I'm thinking of would make sense.
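To make the upsert point concrete, here is a rough sketch of the SQL a DuckDB-backed target might issue to merge a landed Parquet batch into a keyed table and write the result back out as Parquet. It only builds the statements as strings. The function name, table, key, columns, and file paths are all hypothetical, and nothing here reflects how target-duckdb is actually implemented. It assumes a DuckDB version that supports INSERT OR REPLACE.

```python
# Sketch only: every name and path below is hypothetical, and this is not
# taken from target-duckdb's implementation.

def upsert_statements(table, key, columns, batch_path, out_path):
    """Return DuckDB SQL that merges a Parquet batch into `table` and re-exports it.

    `columns` maps column names to SQL types, in table order.
    """
    col_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    col_list = ", ".join(columns)
    return [
        # A primary key is what lets INSERT OR REPLACE detect conflicting rows.
        f"CREATE TABLE IF NOT EXISTS {table} ({col_defs}, PRIMARY KEY ({key}))",
        # Rows whose key already exists are replaced; rows with new keys are inserted.
        f"INSERT OR REPLACE INTO {table} SELECT {col_list} FROM read_parquet('{batch_path}')",
        # Write the merged table back out as a single Parquet file.
        f"COPY {table} TO '{out_path}' (FORMAT PARQUET)",
    ]


statements = upsert_statements(
    table="users",
    key="id",
    columns={"id": "INTEGER", "name": "VARCHAR"},
    batch_path="landing/users_batch.parquet",
    out_path="lake/users.parquet",
)
for statement in statements:
    print(statement + ";")
```

Each statement could then be passed to `execute()` on a connection from the `duckdb` Python package. A plain Parquet writer has no equivalent of the middle statement, which is the gap described above.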
Yeah, your reasoning there re: upsert operations makes sense and is valid IMO. I'm going to turn my attention back to this project next week once I get some dbt-duckdb stuff I've been working on out the door and I will look hard at making parquet support for this target into a first-class concept.
Yeah, your reasoning there re: upsert operations makes sense and is valid IMO.
Thanks for this validation!
I'm going to turn my attention back to this project next week once I get some dbt-duckdb stuff I've been working on out the door and I will look hard at making parquet support for this target into a first-class concept.
Sounds great. Again, no rush from our side. Nothing per se is broken as of now, and this is more of a long-term strategic investment, I think.
I'll close this issue since the question is answered. Thanks again, and let us know if we can help in any way.
Reopening (with an updated title) because I've been hearing a lot of interest in this.
I've updated the description to be more direct in terms of what I think the next steps may be.
cc @kgpayne
okay, cool-- as you can tell I've done ~ nothing to move this forward; do you want to chat about it somewhere? Meltano Slack?
@jwills - great idea! I created a new channel for this: #-duckdb-warehousing-dev
(Join link for anyone not already in our slack: https://meltano.com/slack)
Looks like @kgpayne has an implementation POC using external storage here:
Any updates on this? I'm running into the versioning error, with target-duckdb being older than the dbt-duckdb version. Can the merged feature solve the issue? I'm not sure how to use it, I was looking for some documentation without luck.
@ReneTC I think the move is to use a virtualenv-type solution to align your duckdb, dbt-duckdb, and target-duckdb versions together; I'd recommend:
duckdb==0.8.1
dbt-duckdb==1.5.2
target-duckdb==0.6.0
...but I'm on vacation for a couple of weeks and haven't tried them in combination yet.
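As a quick way to see whether an environment matches those pins, a small check along the following lines may help. This is a minimal sketch using only the standard library. The helper names are made up for illustration, and it has to be run inside whichever virtualenv you want to inspect (tools that install each plugin into its own environment would need it run once per environment).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Pins suggested in the comment above (not yet tested in combination there).
PINS = {"duckdb": "0.8.1", "dbt-duckdb": "1.5.2", "target-duckdb": "0.6.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(PINS).items():
        print(f"{package}: want {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter exists only so the comparison can be exercised without the packages installed; by default it reads the current environment's package metadata.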
Thanks, I'll test this tomorrow and report back. Edit: confirming that a Meltano install with whatever tap and
loaders:
  - name: target-duckdb
    variant: jwills
    pip_url: target-duckdb==0.6.0
transformers:
  - name: dbt-duckdb
    variant: jwills
    pip_url: dbt-core~=1.5.0 dbt-duckdb==1.5.2
works. Thank you!
Updated issue description (2023-03-30):
There are some great use cases where we'd love to use target-duckdb as an interop layer to write Parquet files.
Today, users sometimes create data flows where they first use target-parquet and then transform with dbt-duckdb, whereas a more streamlined approach would be to let target-duckdb and dbt-duckdb both operate on the same Parquet-based datastore.
From the comment thread below in https://github.com/jwills/target-duckdb/issues/20#issuecomment-1464283936:
Original question
We have some users interested in storing data within Parquet. Can this target be used in combination with DuckDB's support for Parquet datasets?