Open lancechua opened 1 year ago
Hi @lancechua thank you for raising this! We have discussed this as a team and have had a good conversation regarding ways we can proceed.
You will see that this issue has now been referenced in the kedro-org/kedro#1778 super issue which is tracking all of the ways we plan to improve the DataCatalog
at a low level.
Your last point:
Adding xarray support https://github.com/kedro-org/kedro-plugins/issues/165 officially requires implementing nuances like cloud file storage, partitioned datasets, lazy loading, etc.
Is the most interesting, a lot of the paint-points to mention have emerged organically. Versioning, fsspec etc. have all arrived in Kedro way later than the original AbstractDataSet
class definition, implementing a new DataSet today requires (probably too much) copy and pasting.
The team are going to have a go at prototyping what this could look like, so you may be interested in subscribing to parent issue to contribute to these design decisions. The initial thinking reflected on the rather flat structure we have today where everything inherits from AbstractDataSet
(or AbstractVersionedDataSet
) and we could utilise OO concepts much better and make implementation/testing/overriding much simpler.
Is the most interesting, a lot of the paint-points to mention have emerged organically. Versioning, fsspec etc. have all arrived in Kedro way later than the original
AbstractDataSet class
definition, implementing a new DataSet today requires (probably too much) copy and pasting.
I feel like too much copy-pasting has been an issue for a really long time. 😅
I would be keen to break down the points of duplication (e.g. default argument handling, fsspec
-related boilerplate, etc.) and add mixins for each (as in kedro-org/kedro#15), because that way we can make progress on resolving these pain points without having to limit ourselves to a universal solution that works for every possible dataset.
Collecting some thoughts from our latest Tech Design session about this topic:
pandas.GenericDataSet
https://docs.kedro.org/en/latest/kedro_datasets.pandas.GenericDataSet.html but it cannot be easily achieved with everythingAbstractDataSet
base class contains lots of logic, can't be easily turned into a protocol https://github.com/kedro-org/kedro/issues/2409#issuecomment-1490596283pandas.CSVDataSet
. There are other problems with doing the “expensive” data download in a load function, e.g. that we’re loading from a remote source each time when we really should just have it in a downloaded source."A user that quite likely got confused when defining their own dataset, most likely because they aren't copy-pasting enough https://www.linen.dev/s/kedro/t/12646523/hi-folks-i-m-trying-to-create-a-custom-kedro-dataset-inherit#ac22d4cb-dd72-4072-905a-f9c5cb352205
I'm also just dropping data from some of our surveys here, the question is "What is the hardest part of working with the Kedro framework?" (Kedro - User Research Study)
Interesting use case: a user trying to perform data access with a highly customized wget
command https://linen-slack.kedro.org/t/14156294/i-m-migrating-to-kedro-for-a-bunch-of-my-projects-for-one-of#a3d857e1-7e31-48f9-98de-439a936786c8
This is not too different from my KaggleDataset
attempt. Maybe there should be a "generic dataset" that allows people to just type a subprocess that will take care of leaving some files in a directory.
Channelling my inner @idanov this does harm reproducibility - personally I think it's a good thing, same for conditional nodes, another things we've held off doing for the same reason.
This is a bit of a tangent, but I think it's worth addressing.
The reality is that people who need "weird" things will do them, whether are supported by Kedro or not. This not only applies to datasets: I've seen people do very weird things with Kedro.
Speaking of datasets, the options our users have are:
There are 2 stances we can take:
I'm firmly in camp (2).
Completely aligned on #2 I think it's related to the wider rigidity questions we have in the framework vs lib debate.
FYI - how prefect do conditionals here looks neat https://github.com/PrefectHQ/prefect/discussions/3545#discussioncomment-106575
Another user who wants to read a TSV file contained in a .tar.gz
https://stackoverflow.com/q/76994906/554319
pandas does not support compressed archives with more than one file, so this requires an intermediate step.
After implementing some new datasets that will hopefully be open sourced soon, I'm adding some more reflections here.
I think that the difficult part lies in three specific points:
fsspec
boilerplate, and(1) was hinted by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes load
and save
public methods. Moreover, I'm advocating to favor composition over inheritance.
(2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned - and these ones are actually quite easy to implement. As such, adding the path
to load
as @lancechua suggested might be a too strong assumption.
Another reason why this is hard is the naming itself: some things we name "datasets" in Kedro are not really a /ˈdā-tə-ˌset/ in the strict sense of the word, but more like "artifacts" or "connectors". I think there is an opportunity here to collect some of the "weird" use cases we already have and come up with better terminology.
(1) was hinted by @lancechua's original comment at the top of this issue, and I agree that we should move towards an architecture that makes
load
andsave
public methods. Moreover, I'm advocating to favor composition over inheritance.
Can't remember who linked it in the issues here somewhere, but after reading the Python subclassing redux article, I also think composition is the better way to go by injecting load
and save
callables as dependencies.
(2) and (3) are trickier though. What's interesting is that one can envision datasets that don't refer to specific files on disk and that are not versioned - and these ones are actually quite easy to implement. As such, adding the
path
toload
as @lancechua suggested might be a too strong assumption.
Also agree. The callables can be fully specified by load_kwargs
(and potentially load_args
) so there isn't a hard requirement for the function signature.
From what I remember about the implementation of versioning, the filename becomes a directory, with timestamp sub-directories. A separate versioner
dependency can be injected to redirect the load
and save
callables to the path of the correct version. This would make the path
parameter a hard requirement, but we can always just pass None
for "pathless" datasets.
The no versioning case would just return the path as is.
A use case that would be good, but potentially annoying to implement, is a cache
for expensive "load" operations like web pulls or API requests. Instead of repeating the expensive request, the flow with subsequent reads would be:
expensive first request -> in memory -> save to disk -> read cache from disk -> in memory -> ...
The cache
can be a dispatcher that takes both the original load
callable, and a load
cache callable.
cache
itself must still be a valid load
callable.
Perhaps we namespace original and cache load params or create a separate load_cache_kwargs
parameter?
The load
callable dispatched would depend on:
use_cache=True
) The no caching case would just always dispatch to the original load
callable.
P.S. Haven't worked on kedro datasets for a while now, but hopefully the above still makes sense.
Thanks a lot for getting back @lancechua! Yes, Hynek has influenced me a lot on this as well. Hope we can prioritize this in 2024.
More evidence of different artifact types: Kubeflow https://www.kubeflow.org/docs/components/pipelines/v2/data-types/artifacts/#artifact-types (via @deepyaman)
Another problem with the current design: when doing post-mortem debugging on datasets, one gets thrown to AbstractDataset.load
:
Now Python 3.7 is EOL is there any advantage to using Protocol types as an alternative/addition to the abstract classes?
I proposed it in another issue and @antonymilne made some good remarks https://github.com/kedro-org/kedro/issues/2409#issuecomment-1490989279 I could be wrong here and in any case I defer the implementation details to the engineering team, but I don't think this can be done without "breaking" the current catalog + abstractdataset model. This would essentially be a "catalog 2.0", which is exciting but also kind of scary.
A user confused about what AbstractDataset
does https://linen-slack.kedro.org/t/16006502/hey-team-i-have-query-regarding-custom-datasets-we-are-imple#6bdb0d80-0f60-41a1-a104-f11fd3877f08
On that last part - we're well overdue a native PyTorch dataset
Another thing to note: some of our underlying datasets have their own mechanisms of reading remote data that don't rely on fsspec https://github.com/pola-rs/polars/pull/13044 plus sometimes our "copy paste" fsspec logic is really obscure and easy to get wrong https://github.com/kedro-org/kedro-plugins/pull/360#issuecomment-1834634344
Making our "fsspec adapter" (not a mixin please) more explicit will make life easier for custom dataset authors and probably for us.
Progress towards making load
and save
public:
Edit: Actually, thinking about it more, should just make
load
andsave
abstract instead of_load
and_save
, to make the new interface clear. Even if just_load
and_save
is being implemented, you mentioned it's going into__init_subclass__
, andload
andsave
will get created there.
Originally posted by @deepyaman in https://github.com/kedro-org/kedro/issues/3920#issuecomment-2191616499
Also, interesting perspective on object stores vs filesystems: https://docs.rs/object_store/latest/object_store/#why-not-a-filesystem-interface
This design provides the following advantages:
- All operations are atomic, and readers cannot observe partial and/or failed writes
- Methods map directly to object store APIs, providing both efficiency and predictability
- Abstracts away filesystem and operating system specific quirks, ensuring portability
- Allows for functionality not native to filesystems, such as operation preconditions and atomic multipart uploads
Originally shared by @MatthiasRoels in https://github.com/kedro-org/kedro-plugins/issues/625#issuecomment-2195327128
Description
IO read write functions typically follow this signature:
Creating custom datasets should ideally be as easy as supplying
load
/save
function(s) that follow this function signature.Context
kedro
supports an extensive range of datasets, but it is not exhaustive. Popular libraries used by relatively niche communities likexarray
andarviz
aren't currently supported.Beyond these examples, unofficially adding support for more obscure datasets would be easier.
Initially, I was looking to implement something like this and asked in the Discord chat if this pattern made sense. Then, @datajoely suggested I open a feature request.
Possible Implementation
We can consider a
Dataset
class factory. MaybeGenericDataset
with a class method.create_custom_dataset(name: str, load: callable, save: callable)
.Usage would look something like
xarrayNetCDF = GenericDataset("xarrayNetCDF", xarray.open_dataset, lambda x, path, **kwargs: x.to_netcdf(path, **kwargs))
. Entries can be added to the data catalog yaml just as with any other custom dataset implementation.Possible Alternatives
LambdaDataset
is very similar but theload
, andsave
are hard coded in the implementation, and cannot be parameterized in the data catalog, as far as I'm awareAbstractDataset
is an option, but this feature request seeks to reduce boilerplate when defining new datasets