deepyaman opened this issue 1 year ago
I essentially agree with all your points.
xref https://github.com/kedro-org/kedro-plugins/issues/200, https://github.com/kedro-org/kedro/issues/1936, https://github.com/kedro-org/kedro/issues/1981, https://github.com/kedro-org/kedro/issues/1778, and to some extent https://github.com/kedro-org/kedro/issues/2536
It's clear that at some point we need to sit down and see whether we can come up with a better design.
## Context
`kedro-datasets` makes datasets entirely independent, so you can't reuse logic from one dataset in another. This is great in many ways (separation of dependencies), but it also makes it impossible (I think?) to share loading code.
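For illustration only (this is not actual `kedro-datasets` code; the class below is invented, and the abstract base class and method names vary across Kedro versions), a pandas-backed dataset today looks roughly like this, with filesystem plumbing that every other dataset has to repeat rather than reuse:

```python
# Hypothetical sketch, not real kedro-datasets code.
import fsspec
import pandas as pd
from kedro.io import AbstractDataset


class PandasCSVDataset(AbstractDataset):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pd.DataFrame:
        # fsspec/credentials plumbing like this is reimplemented inside
        # each dataset, because datasets don't share loading code.
        with fsspec.open(self._filepath, mode="rb") as f:
            return pd.read_csv(f)

    def _save(self, data: pd.DataFrame) -> None:
        with fsspec.open(self._filepath, mode="wb") as f:
            data.to_csv(f, index=False)

    def _describe(self) -> dict:
        return {"filepath": self._filepath}
```

A Polars- or Spark-backed CSV dataset would duplicate nearly all of this, differing only in the final read/write calls.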
Inspired by:

> Agree, although `DeltaTable` is often associated with Spark, it's actually just a file format, and you can read it via pandas (or perhaps other libraries later). I think the space is still dominated by Parquet for data processing; Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important, and Delta deserves wider adoption outside the Spark ecosystem.
>
> I have no idea how compaction and re-partitioning would work with a non-Spark implementation. This feels like the responsibility of some kind of DB or data processing engine; it's probably too much for the dataset abstraction. WDYT?
Originally posted by @noklam in https://github.com/kedro-org/kedro-plugins/issues/243#issuecomment-1639992786
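As a concrete illustration of the "Delta without Spark" point: the `deltalake` (delta-rs) Python bindings can already read a Delta table into pandas, and recent releases expose maintenance operations such as compaction (the path below is hypothetical, and exact APIs vary by version):

```python
from deltalake import DeltaTable

dt = DeltaTable("data/01_raw/my_table")  # path to a Delta table directory
df = dt.to_pandas()                      # read the current snapshot via pandas

# The version history enables time travel, no Spark involved:
df_v3 = DeltaTable("data/01_raw/my_table", version=3).to_pandas()

# Newer deltalake releases also expose maintenance operations,
# e.g. small-file compaction, answering part of the question above:
dt.optimize.compact()
```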
## Possible Implementation
The purpose of this issue (thus far) is to raise some potential problems; I don't have a good solution in mind. I'm also not 100% sure this is solvable, or that Kedro wants to solve it.
One half-baked thought is to make the "engine" on datasets a parameter of `load`/`save`. Then it becomes the dataset's responsibility to decide when to more concretely manifest the data; a rough sketch follows.
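To make that half-baked thought slightly more concrete, here is one possible shape for an engine parameter; nothing below exists in Kedro, and all names are invented for illustration:

```python
# Purely hypothetical sketch of the "engine" idea; nothing here exists in
# Kedro, and all names are invented.
from deltalake import DeltaTable


class MultiEngineDeltaDataset:
    def __init__(self, filepath: str):
        self._filepath = filepath

    def load(self, engine: str = "pandas"):
        """Materialize the data only as concretely as the caller asks for."""
        if engine == "pandas":
            return DeltaTable(self._filepath).to_pandas()
        if engine == "pyarrow":
            # Hand back an Arrow table; no pandas dependency required.
            return DeltaTable(self._filepath).to_pyarrow_table()
        if engine == "spark":
            # Defer to Spark only when asked, importing it lazily so the
            # other engines don't need it installed.
            from pyspark.sql import SparkSession

            spark = SparkSession.builder.getOrCreate()
            return spark.read.format("delta").load(self._filepath)
        raise ValueError(f"Unknown engine: {engine!r}")
```

Open questions this doesn't answer include how a pipeline node would declare which engine it expects, and whether `save` should mirror the same parameter.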