pydiverse / pydiverse.pipedag

A data pipeline orchestration library for rapid iterative development with automatic cache invalidation allowing users to focus writing their tasks in pandas, polars, sqlalchemy, ibis, and alike.
https://pydiversepipedag.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
15 stars 2 forks source link

Fix `deep_map` #138

Open nicolasmueller opened 6 months ago

nicolasmueller commented 6 months ago

Currently deep_map is not aware of dict_values. This resulted in input_tables not being populated in MaterializationWrapper.__call__().

Checklist

NMAC427 commented 6 months ago

I'm not sure if this is the correct approach to solve this issue. Either convert dict_values to a list / tuple before passing it to deep_map, or we devise a more general solution. Specifically, this issue applies to a variety of other iterable objects as well:

I don't know what the optimal solution for this problem would be.


Some thing I would like to note:

windiana42 commented 6 months ago

How about converting Iterables != tuple to list and Mapping to dict? Of course we need to be careful about the order in which we check for isinstance of those general types.

    if cls == tuple:
        y = _deep_map_tuple(x, fn, memo)
    elif isinstance(cls, Mapping):
        y = _deep_map_dict(x, fn, memo)
    elif isinstance(cls, Iterable):
        y = _deep_map_list(x, fn, memo)
    else:
        y = fn(x)
windiana42 commented 5 months ago

@NMAC427 what do you think about my proposal to make any Mapping a dictionary and any Iterable != tuple a list?

NMAC427 commented 5 months ago

@windiana42 it depends on how exactly you identify mappings and iterables. For example, pd.DataFrame inherits from the Iterable ABC, but converting it to a list would be fatal, because we use deep_map for the auto table feature. And even if we decide to convert iterables, I think it would be preferential to convert them to tuples (instead of lists) to keep them immutable.

NMAC427 commented 5 months ago

Theoretically any user code could quite easily extend deep_map for any arbitrary iterable by providing the appropriate mapping function. For example:

def f(x):
    if isinstance(x, SomeContainer):
        return deep_map(f, list(x))
    return str(x)

c = SomeContainer(...)
deep_map(f, c)

However, the inverse (preventing deep_map from deep mapping an iterable) is impossible. Therefore we should be quite careful to ensure we don't accidentally make the problem worse.

windiana42 commented 5 months ago

I think dict().values() should be converted to list or tuple. I think list, tuple, and dict deserve extra status in the sense that we want to keep them in-tact. We should make sure nothing bad happens to pipedag.Table() objects and sources of auto-table conversion should also be kept as is until they are converted to pipedag.Table().