mckinsey / vizro

Vizro is a toolkit for creating modular data visualization applications.
https://vizro.readthedocs.io/en/stable/
Apache License 2.0
2.46k stars 109 forks source link

[Tidy] Remove component to data mapping from data manager #451

Closed antonymilne closed 1 month ago

antonymilne commented 2 months ago

Description

Small refactor as part of https://github.com/McK-Internal/vizro-internal/issues/714 in preparation for more exciting changes such as https://github.com/McK-Internal/vizro-internal/issues/753. Commit history also contains some POCs for other parts of that ticket such as making parametrised calls to dynamic data. Getting this PR merged first will lay the foundation for that work.

Now that we have a "proper" scheme for caching, there's no need to store links between component ID and data source name in the data manager. Instead, the captured callable itself contains the data source name that is used to fetch data as data_manager[data_source_name].load(). This simplifies things quite a bit.

Tested by running hatch run example features and hatch run example features/yaml_version (after modifying to include some dynamic data examples), and all "figure" components (Graph, AgGrid, Table) work still.

Reviewers

@petar-qb please check the resulting diff and make sure you understand it, but no need to run any examples since not much has really changed here. I'll talk you through the intermediate commits and work with you on https://github.com/McK-Internal/vizro-internal/issues/753 next week.

@huong-li-nguyen or @maxschulz-COL should be easy to skim through and give a ✅ if all looks good.

Notice

antonymilne commented 1 month ago

Regarding this commit and considering all pros/cons I fully agree that _DynamicData.pd_DataFrameCallable should be a pure function, not a CaptureCallable. 👍

Regardless that CaptureCallable is better for reusing across multiple data sources, and regardless nested function arguments cannot be updated with this approach, does it seem possible to add: _DynamicData.pd_DataFrameCallable as CaptureCallable as an optional approach in the future? (just to allow advanced users to reuse load data function and nesting capabilities)

Thank you for the review @petar-qb! I was waiting until https://github.com/mckinsey/vizro/pull/479 had been finalised before I answered this, because the answer kept on changing...

As it stands, I think this will definitely be possible (even without adding __name__ and __qualname__ to CapturedCallable, although it might be a good idea to do so anyway) because I always make sure that the cache key includes the data source name. This means that you could do:

# @capture("data") doesn't exist but pretend it does
def load_csv(filename):
     return pd.read_csv(filename)

# These are all dynamic data
data_manager["data_x"] = load_csv("data_x.csv")
data_manager["data_y"] = load_csv("data_y.csv")
data_manager["data_z"] = load_csv("data_x.csv") # even the same value should work indepedently

Probably this will work already actually if you do CapturedCallable(load_data) without the correct @capture decorator existing, because everything just relies on duck typing rather than strict checks.

Let's not promote this as a possible approach until we have a really good reason to do so though. Thinking about it again, even the reusability can now be handled using a partial function ok so the only real reason to use CapturedCallable would be that you prefer the syntax or have the nested keys, which seems pretty unlikely. So you'd already do the above as:

def load_csv(filename):
     return pd.read_csv(filename)

# These are all dynamic data
data_manager["data_x"] = partial(load_csv, "data_x.csv")
data_manager["data_y"] = partial(load_csv, "data_y.csv")
data_manager["data_z"] = partial(load_csv, "data_x.csv")