
[DRAFT] Separate file format from processing engine in datasets #273

Open deepyaman opened 1 year ago

deepyaman commented 1 year ago

Context

  1. There is currently no clear contract for what a dataset does: it loads (or, in cases like Spark, connects to) data in some format, and the node consuming the dataset must match that format. This means you can never truly separate a node from a dataset and swap one out without changing the other, unless the two are "compatible" by unenforceable rules.
  2. If you want to support a new file format (e.g. Delta), you need to write a connector for each engine. In many cases that makes sense (there may be nothing to reuse between the way Spark loads Delta and the way pandas loads Delta). In other cases, it should perhaps be possible to avoid defining the loader in each place, especially with dataframe interchange protocols coming into the picture (see the sketch after this list).
  3. The current design of kedro-datasets makes datasets entirely independent, so you can't reuse logic from one dataset in another. This is great in many ways (separation of dependencies), but it also makes it impossible (I think?) to share loading code.
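To make the interchange-protocol point concrete, here is a minimal sketch (not kedro-datasets code; it assumes pandas >= 1.5 and pyarrow >= 11, both of which implement the dataframe interchange protocol): any object exposing `__dataframe__` can be consumed by another engine without a format-specific connector for each pair.

```python
# Minimal sketch of the dataframe interchange protocol
# (assumes pandas >= 1.5 and pyarrow >= 11, which both implement it).
import pandas as pd
import pyarrow.interchange as pai

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Any object that exposes __dataframe__ can be converted by pyarrow
# without a pandas-specific connector:
table = pai.from_dataframe(df)
print(table.schema)
```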

Inspired by:

> > By adding this into kedro-datasets, there will be 3 possible ways of handling delta table:
> >
> > 1. Apache Spark
> > 2. delta-rs, a non-Spark approach (this PR)
> > 3. Databricks Unity Catalog
>
> Agree. Although DeltaTable is often associated with Spark, it's actually just a file format, and you can read it via pandas or maybe other libraries later.
>
> I think the current space is still dominated by Parquet for data processing; Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important, and they deserve wider adoption outside of the Spark ecosystem.
>
> I have no idea how compaction and re-partitioning would work with a non-Spark implementation. This feels like the responsibility of some kind of DB or data processing engine; it's probably too much for the Dataset abstraction. WDYT?

Originally posted by @noklam in https://github.com/kedro-org/kedro-plugins/issues/243#issuecomment-1639992786
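For reference, the non-Spark path mentioned above is already quite short with delta-rs (this uses the real `deltalake` Python package; the table path is a placeholder):

```python
# Reading a Delta table without Spark, via delta-rs (pip install deltalake).
# "data/01_raw/my_table" is a placeholder path.
from deltalake import DeltaTable

dt = DeltaTable("data/01_raw/my_table")
df = dt.to_pandas()        # load the latest version as a pandas DataFrame

dt_v0 = DeltaTable("data/01_raw/my_table", version=0)
df_v0 = dt_v0.to_pandas()  # time travel: load a specific table version
```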

Possible Implementation

The purpose of this issue (thus far) is to raise some potential problems; I don't have a good solution in mind. I'm also not 100% sure this is solvable, or that Kedro wants to solve it.

One half-baked thought is to make the "engine" a parameter of a dataset's load/save. It would then be the dataset's responsibility to decide when to materialize the data more concretely.
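As a very rough sketch of what that could look like (a hypothetical `DeltaDataset`, not the class that ships in kedro-datasets; the `engine` dispatch and method signature are assumptions):

```python
# Hypothetical sketch: one dataset per *format*, with the processing
# engine chosen at load time instead of baked into the dataset class.
from typing import Any


class DeltaDataset:  # illustrative name, not the kedro-datasets class
    def __init__(self, filepath: str):
        self._filepath = filepath

    def load(self, engine: str = "pandas") -> Any:
        if engine == "pandas":
            # delta-rs handles the non-Spark path
            from deltalake import DeltaTable

            return DeltaTable(self._filepath).to_pandas()
        if engine == "spark":
            # assumes a Spark session configured with delta-spark
            from pyspark.sql import SparkSession

            spark = SparkSession.builder.getOrCreate()
            return spark.read.format("delta").load(self._filepath)
        raise ValueError(f"Unsupported engine: {engine!r}")
```

The format-specific logic then lives in one place, and the caller decides which in-memory representation it needs.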

astrojuanlu commented 1 year ago

I essentially agree with all your points.

xref https://github.com/kedro-org/kedro-plugins/issues/200, https://github.com/kedro-org/kedro/issues/1936, https://github.com/kedro-org/kedro/issues/1981, https://github.com/kedro-org/kedro/issues/1778, and to some extent https://github.com/kedro-org/kedro/issues/2536

It's clear that at some point we need to sit down and see if we can come up with a better design.