Open ElenaKhaustova opened 2 weeks ago
> Non-serializable objects, or objects that require additional logic implemented at the dataset level to save/load them:

Wouldn't it be possible to force datasets to only have static, primitive properties in the `__init__` method so that serialising them is trivial?

For example, rather than having
```python
class GBQQueryDataset:
    def __init__(self, ...):
        ...
        self._credentials = google.oauth2.credentials.Credentials(**credentials)
        self._client = google.cloud.bigquery.Client(credentials=self._credentials)

    def _exists(self) -> bool:
        table_ref = self._client...
```
we do
```python
class GBQQueryDataset(pydantic.BaseModel):
    credentials: dict[str, str]

    def _get_client(self) -> google.cloud.bigquery.Client:
        return bigquery.Client(credentials=google.oauth2.credentials.Credentials(**self.credentials))

    def _exists(self) -> bool:
        table_ref = self._get_client()...
```
?
(I picked Pydantic here given that there's prior art but dataclasses would work similarly)
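For illustration, here is a minimal sketch of the dataclasses variant mentioned above. It mirrors the Pydantic snippet rather than the actual `GBQQueryDataset` implementation, so the field set and client-construction logic are assumptions:

```python
from dataclasses import dataclass, field

from google.cloud import bigquery
from google.oauth2.credentials import Credentials


@dataclass
class GBQQueryDataset:
    # Only static, primitive properties live on the instance, so turning the
    # dataset back into configuration is trivial.
    credentials: dict[str, str] = field(default_factory=dict)

    def _get_client(self) -> bigquery.Client:
        # The client is rebuilt on demand instead of being stored in __init__.
        return bigquery.Client(credentials=Credentials(**self.credentials))

    def _exists(self) -> bool:
        client = self._get_client()
        ...  # query BigQuery as before
```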
> Non-serializable objects, or objects that require additional logic implemented at the dataset level to save/load them:
>
> Wouldn't it be possible to force datasets to only have static, primitive properties in the `__init__` method so that serialising them is trivial?
That would be an ideal option, as a common solution would work out of the box without corner cases. However, it would require more significant changes on the datasets' side.

As a temporary solution without breaking changes, we can try extending the parent `AbstractDataset.to_config()` at the dataset level for those datasets and serializing such objects there (roughly as sketched below). However, I cannot guarantee that we'll be able to cover all the cases.
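As an illustration of that workaround, here is a hypothetical sketch of a dataset-level `to_config()` override that re-emits only primitive values. The class name, the stored `_raw_init_args` attribute and the returned dict shape are assumptions, not actual kedro-datasets code:

```python
from typing import Any

from kedro_datasets.pandas import GBQQueryDataset


class SerializableGBQQueryDataset(GBQQueryDataset):
    """Hypothetical wrapper that re-emits its own primitive configuration."""

    def __init__(self, sql: str, credentials: dict[str, Any] | None = None, **kwargs: Any):
        # Keep the raw arguments before the parent turns the credentials dict
        # into google.oauth2 / bigquery client objects.
        self._raw_init_args = {"sql": sql, "credentials": credentials or {}, **kwargs}
        super().__init__(sql=sql, credentials=credentials, **kwargs)

    def to_config(self) -> dict[str, Any]:
        # A from_config()-compatible entry built from primitives only.
        return {"type": "pandas.GBQQueryDataset", **self._raw_init_args}
```

The script below exercises the catalog-level round trip (`to_config()` followed by `from_config()`) end to end: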
```python
from kedro.io import KedroDataCatalog, Version
from kedro_datasets.pandas import ExcelDataset

config = {
    "cached_ds": {
        "type": "CachedDataset",
        "versioned": "true",
        "dataset": {
            "type": "pandas.CSVDataset",
            "filepath": "data/01_raw/reviews.csv",
            "credentials": "cached_ds_credentials",
        },
        "metadata": [1, 2, 3],
    },
    "cars": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/reviews.csv",
    },
    "{dataset_name}": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/{dataset_name}.csv",
    },
    "boats": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/companies.csv",
        "credentials": "boats_credentials",
        "save_args": {"index": False},
    },
    "cars_ibis": {
        "type": "ibis.FileDataset",
        "filepath": "data/01_raw/reviews.csv",
        "file_format": "csv",
        "table_name": "cars",
        "connection": {"backend": "duckdb", "database": "company.db"},
        "load_args": {"sep": ",", "nullstr": "#NA"},
        "save_args": {"sep": ",", "nullstr": "#NA"},
    },
}

credentials = {
    "boats_credentials": {
        "client_kwargs": {
            "aws_access_key_id": "<your key id>",
            "aws_secret_access_key": "<your secret>",
        }
    },
    "cached_ds_credentials": {"test_key": "test_val"},
}

version = Version(
    load="fake_load_version.csv",  # load exact version
    save=None,  # save to exact version
)

versioned_dataset = ExcelDataset(
    filepath="data/01_raw/shuttles.xlsx", load_args={"engine": "openpyxl"}, version=version
)


def main():
    catalog = KedroDataCatalog.from_config(config, credentials)

    # Resolve a dataset from the "{dataset_name}" factory pattern
    _ = catalog["reviews"]
    # Add materialized datasets
    catalog["versioned_dataset"] = versioned_dataset
    catalog["memory_dataset"] = "123"

    print("-" * 20, "Catalog", "-" * 20)
    print(catalog, "\n")

    print("-" * 20, "Catalog to config", "-" * 20)
    _config, _credentials, _load_version, _save_version = catalog.to_config()
    print(_config, "\n")
    print(_credentials, "\n")
    print(_load_version, "\n")
    print(_save_version, "\n")

    print("-" * 20, "Catalog from config", "-" * 20)
    _catalog = KedroDataCatalog.from_config(_config, _credentials, _load_version, _save_version)
    # Materialize datasets
    for ds in _catalog.values():
        pass
    print(_catalog, "\n")

    print("-" * 20, "Catalog from config to config", "-" * 20)
    _config, _credentials, _load_version, _save_version = _catalog.to_config()
    print(_config, "\n")
    print(_credentials, "\n")
    print(_load_version, "\n")
    print(_save_version, "\n")


if __name__ == "__main__":
    main()
```
Description

Implement the `KedroDataCatalog.to_config()` method as part of the catalog serialization/deserialization feature https://github.com/kedro-org/kedro/issues/3932

Context

Requirements:

- The catalog must be loadable back with the existing `from_config`, so `KedroDataCatalog.to_config()` has to output configuration that can be further used with the existing `KedroDataCatalog.from_config()` method (https://github.com/kedro-org/kedro/blob/9464dc716c987ac0bcadba49aa97a4fa1ae18248/kedro/io/kedro_data_catalog.py#L268) to load it.

Implementation
Solution description

We consider 3 different ways of loading datasets:

1. Lazy datasets, where the catalog only stores the dataset configuration and the dataset is not materialized yet.
2. Datasets materialized with the `dataset.from_config()` method, which calls the underlying dataset constructor.
3. Datasets materialized directly with the dataset constructor.

Case 1 can be solved at the catalog level; cases 2 and 3 require retrieving the dataset configuration from the instantiated dataset object (the sketch below illustrates the three cases).
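For illustration, the three situations could look roughly like this (the dataset types and file paths are placeholders, not part of the proposal):

```python
from kedro.io import KedroDataCatalog
from kedro_datasets.pandas import CSVDataset

# 1. Lazy dataset: only the configuration is known to the catalog,
#    nothing is materialized until first access.
catalog = KedroDataCatalog.from_config(
    {"cars": {"type": "pandas.CSVDataset", "filepath": "data/01_raw/cars.csv"}}
)

# 2. Dataset materialized via from_config(), which calls the constructor under the hood.
reviews = CSVDataset.from_config(
    "reviews", {"type": "pandas.CSVDataset", "filepath": "data/01_raw/reviews.csv"}
)

# 3. Dataset materialized directly with its constructor.
boats = CSVDataset(filepath="data/01_raw/boats.csv")
```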
Solution for 2 and 3 avoiding modifications to existing datasets (as per the requirements)

- Use `AbstractDataset.__init_subclass__`, which allows changing the behavior of subclasses from inside `AbstractDataset`: https://docs.python.org/3/reference/datamodel.html#customizing-class-creation
- Store the constructor call arguments on the `AbstractDataset` instance in the `_init_args` field.
- Implement `AbstractDataset.to_config()` to retrieve configuration from the instantiated dataset object based on the object's `_init_args` (sketched below).
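A rough sketch of the idea, using a simplified stand-in for `AbstractDataset`; the exact `_init_args` and `to_config()` shapes are assumptions, not the final implementation:

```python
import functools
import inspect
from typing import Any


class AbstractDataset:
    """Simplified stand-in for kedro.io.AbstractDataset (sketch only)."""

    def __init_subclass__(cls, **kwargs: Any) -> None:
        super().__init_subclass__(**kwargs)
        init = cls.__init__

        @functools.wraps(init)
        def new_init(self, *args: Any, **init_kwargs: Any) -> None:
            # Capture the constructor call arguments before running __init__
            # so they can later be re-emitted as configuration.
            bound = inspect.signature(init).bind(self, *args, **init_kwargs)
            bound.apply_defaults()
            self._init_args = {k: v for k, v in bound.arguments.items() if k != "self"}
            init(self, *args, **init_kwargs)

        cls.__init__ = new_init

    def to_config(self) -> dict[str, Any]:
        # Rebuild a from_config()-style entry from the stored call arguments.
        return {"type": f"{type(self).__module__}.{type(self).__name__}", **self._init_args}


class MyDataset(AbstractDataset):
    def __init__(self, filepath: str, load_args: dict | None = None):
        self._filepath = filepath
        self._load_args = load_args or {}


ds = MyDataset("data/01_raw/cars.csv")
print(ds.to_config())
# {'type': '__main__.MyDataset', 'filepath': 'data/01_raw/cars.csv', 'load_args': None}
```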
Implement `KedroDataCatalog.to_config`
Once 2 and 3 are solved, we can implement a common solution at the catalog level. For that, we need to consider the cases when we work with lazy and materialized datasets and retrieve the configuration either from the catalog or using `AbstractDataset.to_config()`; a simplified sketch follows.
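Conceptually, the catalog-level method could gather configuration along these lines. This is a sketch only: attribute names such as `_lazy_datasets` and `_datasets` are hypothetical, and the real method also returns credentials and versions, which are omitted here:

```python
from typing import Any


def to_config(self) -> dict[str, dict[str, Any]]:
    """Collect a from_config()-compatible mapping for all datasets (simplified)."""
    config: dict[str, dict[str, Any]] = {}
    # Lazy datasets: the catalog still holds their raw configuration.
    for ds_name, ds_config in self._lazy_datasets.items():  # hypothetical attribute
        config[ds_name] = dict(ds_config)
    # Materialized datasets: ask the dataset object itself for its configuration.
    for ds_name, dataset in self._datasets.items():  # hypothetical attribute
        config[ds_name] = dataset.to_config()
    return config
```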
After the configuration is retrieved, we need to "unresolve" the credentials and keep them in a separate dictionary, as we did when instantiating the catalog. For that, a `CatalogConfigResolver.unresolve_config_credentials()` method can be implemented to undo the result of `CatalogConfigResolver._resolve_config_credentials()` (see the sketch below).
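The "unresolving" step could look roughly like this; a hypothetical sketch of the inverse operation, where the `{ds_name}_credentials` naming convention and the flat (non-nested) config layout are assumptions:

```python
from typing import Any


def unresolve_config_credentials(
    config: dict[str, dict[str, Any]]
) -> tuple[dict[str, dict[str, Any]], dict[str, dict[str, Any]]]:
    """Pull inlined credentials out of dataset configs into a separate mapping."""
    stripped_config: dict[str, dict[str, Any]] = {}
    credentials: dict[str, dict[str, Any]] = {}
    for ds_name, ds_config in config.items():
        ds_config = dict(ds_config)
        if isinstance(ds_config.get("credentials"), dict):
            creds_key = f"{ds_name}_credentials"  # assumed naming convention
            credentials[creds_key] = ds_config["credentials"]
            ds_config["credentials"] = creds_key  # keep only the named reference
        stripped_config[ds_name] = ds_config
    return stripped_config, credentials
```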
Excluding parameters and `MemoryDataset`s

We need to exclude `MemoryDataset`s as well as `parameters` from the resulting configuration, for example with a filter like the one sketched below.
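A possible shape for that filter (a sketch; the exact checks in the real implementation may differ):

```python
from kedro.io import MemoryDataset


def _is_excluded(ds_name: str, dataset: object) -> bool:
    # Parameters and in-memory data cannot be meaningfully re-created
    # from a YAML/dict configuration, so they are skipped.
    is_params = ds_name == "parameters" or ds_name.startswith("params:")
    return is_params or isinstance(dataset, MemoryDataset)
```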
Not covered cases

- Datasets whose constructors hold non-serializable objects, for example connection/credentials objects (`from google.oauth2.credentials import Credentials`) - https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/_modules/kedro_datasets/pandas/gbq_dataset.html#GBQQueryDataset - or `type[AbstractDataset]` - https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-5.1.0/_modules/kedro_datasets/partitions/incremental_dataset.html#IncrementalDataset. These require extending `AbstractDataset.to_config()` at the dataset level to serialize those objects and can be addressed one by one in separate PRs.
- `LambdaDataset` - not the case anymore since https://github.com/kedro-org/kedro/issues/4292
- `SharedMemoryDataset` - not expected to be saved and loaded.

Issues blocking further implementation
- Setting `VERSIONED_FLAG_KEY` if `version` is provided.
- Which `save_version` should we save and load back: https://github.com/kedro-org/kedro/issues/4327 - needs a discussion.

Tested with
`CachedDataset`, `PartitionedDataset`, `IncrementalDataset`, `MemoryDataset`, and various other Kedro datasets.

How to test
https://github.com/kedro-org/kedro/issues/4329#issuecomment-2488586906