Closed hmgaudecker closed 6 months ago
Hi @hmgaudecker, I know nesting data catalogs was part of an early version, but I decided to remove it again since it seems like an odd case, which is better handled by having multiple data catalogs next to each other.
Maybe we can raise an error. Let us see.
Oh, I quite liked it.
What seems definitely necessary in large, complex projects is some sort of nesting structure. E.g. I would loop over model specifications, sample restrictions, etc -- seems much clearer to nest that rather than having very long names at one level of the hierarchy, which would need to be parsed all the time.
E.g., after converting only the data cleaning of a larger project so far, my data catalog directory looks as follows:
.
|____baseline_linear
| |____pooled
| | |____figures
| | |____analysis
| | |____tables
|____clean_data
| |____21186aab5c349047df58b3459bc81927a6beefbce940c09856357dabe82a0df9.pkl
| |____21186aab5c349047df58b3459bc81927a6beefbce940c09856357dabe82a0df9-node.pkl
| |____2bca653468d28a857ee3a3dd70e02f32a9c473666a7c65008097ea4c58992550.pkl
| |____a8c0fb6507c0e87f768015609c4c8d9d67b411ff8cef519359732dd10e23676b-node.pkl
| |____c36bfa1e446cf2f67ea0f82b6e991cae2078d07ad24238d1ab9e24819ad5cf65-node.pkl
| |____a8c0fb6507c0e87f768015609c4c8d9d67b411ff8cef519359732dd10e23676b.pkl
| |____2bca653468d28a857ee3a3dd70e02f32a9c473666a7c65008097ea4c58992550-node.pkl
| |____c36bfa1e446cf2f67ea0f82b6e991cae2078d07ad24238d1ab9e24819ad5cf65.pkl
I really wouldn't want to have to be using
.
|____baseline_linear-pooled-figures
|____baseline_linear-pooled-analysis
|____baseline_linear-pooled-tables
|____clean_data
instead.
(of course it does not matter much what the directory looks like, the important bit is accessing via data_catalog["baseline_linear"]["pooled"]["figures"]
etc.)
FWIW, it still seems allowed in the latest version at least according to API docs. Also works in the latest released version.
Sorry for thinking out loud a bit. So would the preferred way to achieve the above be to just specify a dictionary of data catalogs?
If so, should DataCatalog(name="baseline_linear/pooled/analysis")
work or better specify the path? (No Windows support necessary)
On a related note and not sure whether docs or bug: If defining an item of a data catalog implicitly, e.g.,
def task_silly(
logs: Annotated[Path, Product] = data["logs"]
):
the default node type is used. When adding to the catalog upfront:
data.add("logs", Path(f"{logs}.db"))
the above code leads to a PathNode being used.
If that is the expected behaviour, it would be good to add it to the docs.
This is also crucial for the section Developing with the data catalog. That is, the nodes that don't have the default type need to have this changed type in the version that I import. Makes perfect sense, of course, but cost me a bit of time until I realised it...
_(of course it does not matter much what the directory looks like, the important bit is accessing via
data_catalog["baseline_linear"]["pooled"]["figures"]
etc.)_FWIW, it still seems allowed in the latest version at least according to API docs. Also works in the latest released version.
What is documented is still a leftover. It should not be there. If it works, it works by accident.
Since users need to create each data catalog anyways, I thought users can do
data_catalogs = {
"baseline_linear": {
"pooled": {
"analysis": DataCatalog()
"figures": DataCatalog()
"tables": DataCatalog()
}
}
}
It has the same effect and I do not need to support it in a special way.
If that is the expected behaviour, it would be good to add it to the docs.
It is expected behavior. I will update the docs and make it more clear.
Hope it's okay to add thoughts here as I am gaining experience.
A few things that could become clearer IMO:
add
method or via entries
initially) and when to use implicit definitionPathNode
.Not quite related, but the example on hashing PythonNodes is broken because value
is keyword-only.
Location of the documentation
API ref Tutorial How-To Guide
Documentation problem
I tried to start from the basic tutorial, setting up a DataCatalog and then adding another via
.add()
to get a nested structure. Took me forever to realize that I can't do that, but need to specify the structure right with the initial entries.Would be nice to add an example and to add a warning that DataCatalogs can't be added via
DataCatalog.add()