Open astrojuanlu opened 5 months ago
Ibis has to be central to this
To give a specific example of how this is posing a problem to users: https://linen-slack.kedro.org/t/16366189/tldr-is-there-a-suggested-pattern-for-converting-vanilla-par#23c36a9d-7bea-40f9-a21f-cc6def7e9ccf
User tries to convert a Parquet file to a Delta table with a Kedro pipeline, only to see that DatasetError: DeltaTableDataset is a read only dataset type
. Supposedly there's rationale for this in the original PR from 3 years ago https://github.com/kedro-org/kedro/pull/964 but (1) the conversation is extremely long, and I can't pinpoint the exact moment it was decided to remove _save()
functionality, and (2) this was never documented in any place, so users are left in the dark.
Going through the PR again, I found a comment that spells the problem in detail https://github.com/kedro-org/kedro/pull/964#issuecomment-947761059
Update, Upsert/Merge, Delete
These are not directly consistent with the Kedro Principles & DAG, as
- The filepath is intrinsic to the
DeltaTable
- The
update
,merge
anddelete
methods are methods on theDeltaTable
and are immediately materialised (on call or on subsequentexecute
call on a merge builder)- We still need to inform the Kedro pipeline and DAG that this node has succeeded in a meaningful way
This is the problem we're addressing.
It's unclear how to write Kedro datasets that are amenable to UPSERT (aka MERGE aka "INSERT or UPDATE") operations https://github.com/kedro-org/kedro/pull/964
Upsert is mostly supported by database backends. You could simulate it in data frames using concat with indices, e.g. in pandas or spark, but it's not very clean.
For database backends, it is on the radar for Ibis support.
Regarding ELT, Kedro will probably never be as convenient as the Singer ecosystem and derivatives could theoretically be - however, the practical application of Meltano and Airbyte leaves some gaps, and maybe Kedro could be a satisfying tool for some users. Regarding ETL, I think Kedro could be a perfect framework for this
Why can Kedro not be at least as good at ELT as it is at ETL? As long as you can interact with databases natively using SQL under the hood, I think it can be a great option for the people who are going to use Python anyway (or prefer to).
Also, I'm not recent enough on this perhaps, but has there been any momentum back towards people wanting to do ETL? If ELT is still where "modern data engineering" is at, doing ETL well isn't that exciting.
Finally, we should execute on messaging/value proposition updates based on the conclusions of our investigation, and probably generate appropriate educational material in the form of documentation, blog posts, and videos.
:100:
If Kedro is a tool that supports both data pipelines and ML pipelines, it makes sense that people are educated on how to write each, and don't use the same approach for the disparate problems.
Also, I'm not recent enough on this perhaps, but has there been any momentum back towards people wanting to do ETL? If ELT is still where "modern data engineering" is at, doing ETL well isn't that exciting.
We could discuss whether the Modern Data Stack was a real industry trend or only happened on Data Twitter - but I'll only do so over beer 😄
Why can Kedro not be at least as good at ELT as it is at ETL?
I'm not denying this. What I'm saying that, in theory[^1],
$ pip install meltano
$ meltano add extractor tap-postgres
$ meltano add loader target-snowflake
$ meltano run
is the optimal open-source, CLI-based, EL experience, and I don't think Kedro can match this at the moment or in the near future (very happy to be proven wrong).
Edit: Meltano would be EL, then for example dbt would be T, or as Lauren Balik jokingly says, TTTTTTT
[^1]: In theory there is no difference between theory and practice, while in practice there is.
We've implemented an in-house upsert
functionality into one of our Arrow datasets using a method @deepyaman alludes to
You could simulate it in data frames using concat with indices, e.g. in pandas or spark, but it's not very clean.
The write_mode
is just a save_arg
for us. This definitely breaks "reproducibility" though and goes towards idempotency like @astrojuanlu pointed out
Why can Kedro not be at least as good at ELT as it is at ETL?
I'm not denying this. What I'm saying that, in theory
$ pip install meltano $ meltano add extractor tap-postgres $ meltano add loader target-snowflake $ meltano run
is the optimal open-source, CLI-based, ELT experience, and I don't think Kedro can match this at the moment or in the near future
I've never used meltano, but this covers only EL in ELT, right? Kedro+ibis could slot in very nicely into the T, and also provide more-than-good-enough performance for the EL side, though it does seem hard to beat specialized tools like meltano.
Oh, correct. I meant "the optimal [...] EL experience".
I would love to see Kedro fully support the "T", standing as an alternative for dbt for engine base transformation but with a python API. This could bring a huge value for some Data teams that need to juggle between two (or more) differents Technologies/frameworks and throws their works over the wall for others teams, dependencing on the stages of their Data pipelines (DE, DS/ML).
This could significally enlarge Kedro user base, as there is much more volume of work in Data & Analytics engineering than Data science & ML.
As for the "E" and "L" part Kedro could be just good enough.
To achieve this I really believe we should go all in on Ibis as a first class citizen / prefered approach in Kedro. One syntax for broadly the backends we care about enabling the interdisciplinary collaboration @takikadiri mentions.
(From phone) To clarify, I don't think T is the problem, but rather E & L. I suspect some changes in philosophy of even API might be required that go beyond adopting Ibis, the task here is to investigate.
Although T might also require some improvements in how we approach upserts.
At PyData London I spoke to 2 different users about how they were using Kedro for their ETL pipelines and they both have challenges:
Also, while discussing this in person with @deepyaman, I realised that both EL and T data pipelines need upserts anyway, so probably my comments above were somewhat misguided.
Inspiration: "Incremental loads should be replayable by design" (source)
I wonder if we could make start_date
and end_date
first class CLI arguments and part of the session constructor?
Sometimes it would be date, sometimes it would be id... Don't think we can anticipate all possible pagination options. But regardless, I think this is more or less achievable already thanks to runtime parameters, right? The difficult thing is the upsert logic. Not from a technical perspective but from a product philosophy perspective, shifting from a focus on reproducibility to a focus on idempotency (data pipelines and machine learning pipelines might require different approaches)
I think you can generalise upserts into the need for a conditional node...
BTW about Ibis and upserts https://github.com/ibis-project/ibis/issues/5391 cc @deepyaman
Intro and context
Kedro for data pipelines
We have evidence of users using Kedro for authoring data pipelines https://linen-slack.kedro.org/t/16312377/hi-everyone-here-luca-ds-from-italy-happy-kedro-user-for-3-y#2d666fee-5385-45d2-b2f8-4282ef22c2f9
However, there are also some signs that hint that there's margin for improvement for Kedro to be suitable for creating data pipelines:
UPSERT
(akaMERGE
aka "INSERT
orUPDATE
") operations https://github.com/kedro-org/kedro/pull/964The fact that Kedro is not mentioned in any industry survey we have found https://github.com/kedro-org/kedro-devrel/issues/94 is probably a symptom, rather than a cause, of all the above.
There's two sides of this problem:
Next steps
Part of this intersects with https://github.com/kedro-org/kedro/issues/1778, https://github.com/kedro-org/kedro/issues/1936 cc @merelcht
From a product perspective it's worth asking whether we want to pursue making Kedro a suitable tool for ETL/ELT pipelines at all. Regarding ELT, Kedro will probably never be as convenient as the Singer ecosystem and derivatives could theoretically be - however, the practical application of Meltano and Airbyte leaves some gaps, and maybe Kedro could be a satisfying tool for some users. Regarding ETL, I think Kedro could be a perfect framework for this, provided that the datasets, the crucial bits that perform the I/O, are up to the task or at least we provide clear guidance of what is the "Kedronic" way of achieving idempotent data ingestion pipelines that can support cold starts, behave well under changes of the source schema, and any other desirable properties of data pipelines.
From a technical perspective, on the other hand, we need to develop an understanding of how Kedro can be used nowadays for ELT/ETL pipelines following modern data architecture patterns, and evaluate to what extent the pains described above are real or are just a matter of having better docs.
On a related note, discussion in https://github.com/kedro-org/kedro-plugins/issues/471 surfaced that we might have to make some of the Kedro principles more explicit.
Finally, we should execute on messaging/value proposition updates based on the conclusions of our investigation, and probably generate appropriate educational material in the form of documentation, blog posts, and videos.