pytask-dev / pytask

pytask is a workflow management system that facilitates reproducible data analyses.
https://pytask-dev.readthedocs.io/en/stable
Other
114 stars 10 forks source link

DOC: Details for data catalog #572

Closed hmgaudecker closed 6 months ago

hmgaudecker commented 8 months ago

Location of the documentation

API ref Tutorial How-To Guide

Documentation problem

I tried to start from the basic tutorial, setting up a DataCatalog and then adding another via .add() to get a nested structure. Took me forever to realize that I can't do that, but need to specify the structure right with the initial entries.

Would be nice to add an example and to add a warning that DataCatalogs can't be added via DataCatalog.add()

tobiasraabe commented 8 months ago

Hi @hmgaudecker, I know nesting data catalogs was part of an early version, but I decided to remove it again since it seems like an odd case, which is better handled by having multiple data catalogs next to each other.

Maybe we can raise an error. Let us see.

hmgaudecker commented 8 months ago

Oh, I quite liked it.

What seems definitely necessary in large, complex projects is some sort of nesting structure. E.g. I would loop over model specifications, sample restrictions, etc -- seems much clearer to nest that rather than having very long names at one level of the hierarchy, which would need to be parsed all the time.

E.g., after converting only the data cleaning of a larger project so far, my data catalog directory looks as follows:

.
|____baseline_linear
| |____pooled
| | |____figures
| | |____analysis
| | |____tables
|____clean_data
| |____21186aab5c349047df58b3459bc81927a6beefbce940c09856357dabe82a0df9.pkl
| |____21186aab5c349047df58b3459bc81927a6beefbce940c09856357dabe82a0df9-node.pkl
| |____2bca653468d28a857ee3a3dd70e02f32a9c473666a7c65008097ea4c58992550.pkl
| |____a8c0fb6507c0e87f768015609c4c8d9d67b411ff8cef519359732dd10e23676b-node.pkl
| |____c36bfa1e446cf2f67ea0f82b6e991cae2078d07ad24238d1ab9e24819ad5cf65-node.pkl
| |____a8c0fb6507c0e87f768015609c4c8d9d67b411ff8cef519359732dd10e23676b.pkl
| |____2bca653468d28a857ee3a3dd70e02f32a9c473666a7c65008097ea4c58992550-node.pkl
| |____c36bfa1e446cf2f67ea0f82b6e991cae2078d07ad24238d1ab9e24819ad5cf65.pkl

I really wouldn't want to have to be using

.
|____baseline_linear-pooled-figures
|____baseline_linear-pooled-analysis
|____baseline_linear-pooled-tables
|____clean_data

instead.

(of course it does not matter much what the directory looks like, the important bit is accessing via data_catalog["baseline_linear"]["pooled"]["figures"] etc.)

FWIW, it still seems allowed in the latest version at least according to API docs. Also works in the latest released version.

hmgaudecker commented 8 months ago

Sorry for thinking out loud a bit. So would the preferred way to achieve the above be to just specify a dictionary of data catalogs?

If so, should DataCatalog(name="baseline_linear/pooled/analysis") work or better specify the path? (No Windows support necessary)

hmgaudecker commented 8 months ago

On a related note and not sure whether docs or bug: If defining an item of a data catalog implicitly, e.g.,

def task_silly(
        logs: Annotated[Path, Product] = data["logs"]
):

the default node type is used. When adding to the catalog upfront:

data.add("logs", Path(f"{logs}.db"))

the above code leads to a PathNode being used.

If that is the expected behaviour, it would be good to add it to the docs.

hmgaudecker commented 8 months ago

This is also crucial for the section Developing with the data catalog. That is, the nodes that don't have the default type need to have this changed type in the version that I import. Makes perfect sense, of course, but cost me a bit of time until I realised it...

tobiasraabe commented 8 months ago

_(of course it does not matter much what the directory looks like, the important bit is accessing via data_catalog["baseline_linear"]["pooled"]["figures"] etc.)_

FWIW, it still seems allowed in the latest version at least according to API docs. Also works in the latest released version.

What is documented is still a leftover. It should not be there. If it works, it works by accident.

Since users need to create each data catalog anyways, I thought users can do

data_catalogs = {
    "baseline_linear": {
        "pooled": {
            "analysis": DataCatalog()
            "figures": DataCatalog()
            "tables": DataCatalog()
        }
    }
}

It has the same effect and I do not need to support it in a special way.

If that is the expected behaviour, it would be good to add it to the docs.

It is expected behavior. I will update the docs and make it more clear.

hmgaudecker commented 8 months ago

Hope it's okay to add thoughts here as I am gaining experience.

A few things that could become clearer IMO:

hmgaudecker commented 8 months ago

Not quite related, but the example on hashing PythonNodes is broken because value is keyword-only.