Open noklam opened 10 months ago
Could we have something like `catalog.resolve(pipeline: Optional[str]).list()`?
This Viz issue is related: https://github.com/kedro-org/kedro-viz/issues/1480
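For concreteness, here is a toy sketch of what the proposed API could look like. Nothing here is a real Kedro API: `ResolvableCatalog` is invented, and where the proposal takes a pipeline name, this standalone version takes the pipeline's dataset names directly so it runs anywhere.

```python
import re

# Hypothetical sketch of the proposed catalog.resolve(...).list() API.
# ResolvableCatalog and its arguments are invented for illustration only.
class ResolvableCatalog:
    def __init__(self, entries, patterns):
        self._entries = set(entries)      # concrete catalog.yml entries
        self._patterns = list(patterns)   # factory patterns, e.g. "{experiment}.model"

    def list(self):
        return sorted(self._entries)

    def resolve(self, pipeline_datasets=None):
        """Return a new catalog that also contains every pipeline dataset
        name matching a factory pattern."""
        if pipeline_datasets is None:
            return self
        resolved = set(self._entries)
        for pattern in self._patterns:
            # Turn "{experiment}.model" into a regex like "[^.]+\.model".
            escaped = re.escape(pattern)
            rx = re.compile("^" + re.sub(r"\\\{(\w+)\\\}", r"[^.]+", escaped) + "$")
            resolved |= {name for name in pipeline_datasets if rx.match(name)}
        return ResolvableCatalog(resolved, self._patterns)

catalog = ResolvableCatalog({"companies"}, ["{experiment}.model"])
print(catalog.list())                                       # ['companies']
print(catalog.resolve({"exp1.model", "companies"}).list())  # ['companies', 'exp1.model']
```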
> could we have something like `catalog.resolve(pipeline: Optional[str]).list()`?
That would be perfect! We would need such a thing
@MarcelBeining Can you explain a bit more why you need this? I am thinking about this again because I am trying to build a plugin for Kedro, and this would come in handy for compiling a static version of the configuration.
@noklam We try to find Kedro datasets for which we have not written a data test, hence we iterate over `catalog.list()`. However, if we use dataset factories, the datasets captured by a factory are not listed in `catalog.list()`.
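The use case can be sketched in plain Python; `catalog_datasets` stands in for the output of `catalog.list()` and `tested` for the datasets that already have a data test (both names invented for illustration):

```python
def find_untested(catalog_datasets, tested):
    """Return dataset names that appear in the catalog but have no data test."""
    return sorted(set(catalog_datasets) - set(tested))

# Toy data; in a real project these would come from catalog.list()
# and from whatever registry tracks the existing data tests.
catalog_datasets = ["companies", "shuttles", "model_input_table"]
tested = {"companies", "shuttles"}
print(find_untested(catalog_datasets, tested))  # ['model_input_table']
```

The gap described above is exactly that factory-resolved names never show up in `catalog_datasets`, so they can never be flagged as untested.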
@MarcelBeining Did I understand this question correctly as: "Find which datasets are not written in catalog.yml, including dataset factory resolves, yet"? I have some WIP in https://github.com/noklam/kedro-inspect which explores this idea, but I haven't finished it. Does `kedro catalog resolve` or `kedro catalog list` help you? If not, what is missing?
@noklam "Find which datasets are not written in catalog.yml, including dataset factory resolves, yet": yes.
`kedro catalog resolve` shows what I need, but it is a CLI command and I need it within Python (of course one could use `os.system` etc., but a simple extension of `catalog.list()` should not be that hard).
@MarcelBeining Are you integrating this with some extra functionality? How do you consume this information, if that is OK to share?
@noklam Adding on from our discussion on Slack: `kedro catalog resolve` does what I'd want. But I'd also like that information to be easily consumable in a notebook (for example).
So if my catalog stores models like:

```yaml
"{experiment}.model":
  type: pickle.PickleDataset
  filepath: data/06_models/{experiment}/model.pickle
  versioned: true
```
I would want to be able to (somehow) do something like:

```python
models = {}
for model_dataset in [d for d in catalog.list(*~*magic*~*) if ".model" in d]:
    models[model_dataset] = catalog.load(model_dataset)
```
It's a small thing, but I was kind of surprised not to see my `{experiment}.model` entries listed at all in `catalog.list()`.
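A runnable toy version of the loop sketched above, with a fake catalog standing in for Kedro's. `FakeCatalog` is invented; only `list(regex_search=...)` mirrors the real `DataCatalog.list` signature, and we pretend the factory entries were already resolved into concrete names:

```python
import re

# Toy illustration of the notebook loop above; FakeCatalog is invented,
# only list(regex_search=...) mirrors the real DataCatalog.list signature.
class FakeCatalog:
    def __init__(self, names):
        self._names = list(names)

    def list(self, regex_search=None):
        if regex_search is None:
            return list(self._names)
        return [n for n in self._names if re.search(regex_search, n)]

    def load(self, name):
        # Stand-in for actually loading the pickle from disk.
        return f"<model loaded from {name}>"

# Pretend the "{experiment}.model" factory entries were already resolved:
catalog = FakeCatalog(["exp1.model", "exp2.model", "companies"])
models = {name: catalog.load(name) for name in catalog.list(r"\.model$")}
print(sorted(models))  # ['exp1.model', 'exp2.model']
```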
Another one, bumped to high priority as discussed in Slack: https://linen-slack.kedro.org/t/18841749/hi-i-have-a-dataset-factory-specified-and-when-i-do-catalog-#eb609fb2-fce6-434d-a652-ffb62eb41e7b
What if `DataCatalog` were iterable?

```python
for dataset in data_catalog:
    ...
```
I think it's neat @noklam, but I don't know if it's discoverable. To me, `DataCatalog.list()` feels more powerful in the IDE than `list(DataCatalog)`...
why_not_both.gif
```python
def list(self):
    ...

def __iter__(self):
    # __iter__ must return an iterator, not a list
    return iter(self.list())
```
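A minimal stand-in class showing the two access styles side by side (a hypothetical `MiniCatalog`, not Kedro code). Note that `__iter__` has to delegate via `iter(self.list())`:

```python
class MiniCatalog:
    """Toy stand-in for DataCatalog, showing list() and iteration together."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def list(self):
        return list(self._datasets)

    def __iter__(self):
        # __iter__ must return an iterator; returning self.list() directly
        # would raise "TypeError: iter() returned non-iterator".
        return iter(self.list())

catalog = MiniCatalog({"companies": object(), "shuttles": object()})
print(catalog.list())              # ['companies', 'shuttles']
print([name for name in catalog])  # ['companies', 'shuttles']
```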
I've also wanted to be able to iterate through the datasets for a while, but it raises some unanswered questions:
- Should we iterate over `catalog` (maybe more intuitive) or over `catalog.datasets` as described in https://github.com/kedro-org/kedro/issues/3916 (maybe more accurate, especially in regard to the "resolving" issue discussed below)?
- Replace `catalog.list()` by `[dataset.name for dataset in catalog]`
- Replace `catalog.search` (which does not exist but is suggested in #3917) by `[dataset.name for dataset in catalog if re.match(regex, dataset.name)]`

But we always face the same issue: we would need to "resolve" the dataset factory first, relative to a pipeline. It would eventually give `[dataset.name for dataset in catalog.resolve(pipeline)]`, but is it really a better / more intuitive syntax? I personally find it quite neat, but arguably beginners would prefer a "native" method.
The real advantage of doing so is that we would not need to create a search method supporting every type of search (by extension, by regex... as suggested in the corresponding issue), because iteration is easily customizable, so it's less maintenance burden in the end.
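To illustrate the comprehension-based search on a hypothetical iterable catalog (toy objects with a `name` attribute; this is not a real Kedro API):

```python
import re
from types import SimpleNamespace

# Toy illustration of the comprehension-based search discussed above,
# assuming iteration yields objects with a `name` attribute (hypothetical API).
catalog = [
    SimpleNamespace(name="exp1.model"),
    SimpleNamespace(name="exp2.model"),
    SimpleNamespace(name="companies"),
]

matches = [dataset.name for dataset in catalog if re.match(r".*\.model$", dataset.name)]
print(matches)  # ['exp1.model', 'exp2.model']
```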
`Catalog.list` already supports regex; isn't that identical to what you suggest as `catalog.search`?
@noklam You can only search by name; namespaces aren't really supported, and you can't search by attribute.
A namespace is just a prefix string, so it works pretty well. I do believe there are benefits to improving it, but I think we should at least add an example for the existing feature, since @Galileo-Galilei told me he was not aware of it and most likely very few users are.
Description
Background: https://linen-slack.kedro.org/t/16064885/when-i-say-catalog-list-in-a-kedro-jupter-lab-instance-it-do#ad3bb4aa-f6f9-44c6-bb84-b25163bfe85c
With dataset factories, the "definition" of a dataset is not known until the pipeline is run. When a user is in a Jupyter notebook, they expect to see the full list of datasets with `catalog.list`. The current workaround to see the datasets for the `__default__` pipeline looks like this:
Context
When using the CLI commands, e.g. `kedro catalog list`, we do matching to figure out which factory patterns in the catalog match the datasets used in the pipeline, but when going through the interactive flow no such checking has been done yet.
Possible Implementation
We could check dataset existence when the session is created; we need to verify whether that has any unexpected side effects.
This ticket is still open in scope and we don't have a specific implementation in mind. Whoever picks it up can evaluate different approaches, considering side effects and avoiding coupling with other components.
Possible Alternatives
- `catalog.list(pipeline=<name>)`: not a good solution, because the catalog wouldn't have access to a pipeline
- Reuse the matching that happens when `kedro catalog list` is called