rstudio / pins-python

https://rstudio.github.io/pins-python/
MIT License

Dynamically load drivers based on object, or metadata #18

Open machow opened 2 years ago

machow commented 2 years ago

TODO: run example code

In pins-python, adaptors are needed to load and save data objects to and from disk. For example, below are two that might be implemented: one for a pandas DataFrame, and one for anything joblib can serialize:

# libraries used in the interface and in concrete driver implementations
from typing import Protocol

import pandas as pd
import joblib

# Driver interface and creation =====================

class IDriver(Protocol):
    def __init__(self, obj):
        # object to be saved; not needed when only the static load() is used
        self.obj = obj

    @staticmethod
    def load(fname, **kwargs):
        """Return loaded object."""
        ...

    def save(self, fname, **kwargs):
        """Save self.obj to disk."""
        ...

def create_saver(obj):
    """Return the right driver for the given object."""

    if isinstance(obj, pd.DataFrame):
        return DriverPandas(obj)
    else:
        return DriverJoblib(obj)

def create_loader(fname):
    """Return the right driver class with a .load() method."""
    # choose each loader based on filename
    # not a good idea, but shows the rough concept!
    if fname.endswith("csv"):
        return DriverPandas
    elif fname.endswith("joblib"):
        return DriverJoblib

    raise NotImplementedError()

# Concrete Drivers ============================    

# Pandas Driver ----

class DriverPandas(IDriver):
    @staticmethod
    def load(fname):
        return pd.read_csv(fname)

    def save(self, fname):
        # save the object passed to __init__, matching create_saver(obj)
        self.obj.to_csv(fname)

# Joblib Driver ----

class DriverJoblib(IDriver):
    @staticmethod
    def load(fname):
        # joblib.load accepts a path, so no file handle is left open
        return joblib.load(fname)

    def save(self, fname):
        # save the object passed to __init__, matching create_saver(obj)
        joblib.dump(self.obj, fname)

There are two big challenges with these adaptors:

Handling dependencies

For saving, pins can use information associated with an object (e.g. its library name) to dynamically import the correct driver. This happens in siuba, where the name of a database dialect determines which file in its dialects folder gets imported (this happens in the get_dialect_translator function, which uses the importlib module).
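A rough sketch of the saving side, assuming drivers live in one module per library. The `pins_drivers.{lib}` layout and both function names are made up for illustration; they are not the real pins or siuba layout:

```python
import importlib
from fractions import Fraction


def library_name(obj):
    """Return the top-level package that defines obj's type, e.g. 'pandas'."""
    return type(obj).__module__.split(".")[0]


def import_driver_module(obj, template="pins_drivers.{lib}"):
    """Import the driver module for obj only when it is first needed.

    The 'pins_drivers.{lib}' template is hypothetical.
    """
    return importlib.import_module(template.format(lib=library_name(obj)))


# Demo with a stdlib object. The template is overridden so the lookup
# resolves to a module that exists on every Python install.
demo_module = import_driver_module(Fraction(1, 2), template="{lib}")
print(library_name(Fraction(1, 2)), demo_module.__name__)
```

The point is that nothing driver-specific is imported until an object from that library actually shows up.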

For loading, pins could use information in the pin metadata (e.g. metadata.type) to decide which driver to import.
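The loading side could look something like this. It is only a sketch: the `LOADERS` table and `resolve_loader` are hypothetical, and stdlib loaders stand in for real drivers so it runs without pandas or joblib installed:

```python
import importlib

# Hypothetical registry: pin metadata type -> "module:attribute" of a loader.
# Stdlib targets are stand-ins for real driver classes.
LOADERS = {
    "json": "json:load",
    "pickle": "pickle:load",
}


def resolve_loader(meta_type):
    """Import and return the loader registered for a metadata type."""
    try:
        target = LOADERS[meta_type]
    except KeyError:
        raise NotImplementedError(
            f"No driver registered for type: {meta_type!r}"
        ) from None
    module_name, attr = target.split(":")
    # The module is imported here, not when the registry is defined.
    return getattr(importlib.import_module(module_name), attr)


loader = resolve_loader("json")
print(loader.__module__, loader.__name__)
```

Because the registry only holds strings, defining it costs nothing; the import happens on the first lookup for that type.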

Extension

A common approach for extension is to use entry points (see this sqlalchemy doc). With this approach, libraries can basically say "I will handle objects from library X", or "import and check in with me for library X".

In this case, drivers could have an accepts() -> bool method. So when pins sees an object from sklearn, it would import all drivers that register a pins.sklearn entry point, then go down the line calling their accepts() methods to see which one can handle the object.
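A sketch of the accepts() idea. `load_candidates` and `pick_driver` are hypothetical names, and the two toy drivers stand in for classes that real packages would register. `load_candidates` relies on `importlib.metadata.entry_points(group=...)`, which needs Python 3.10+, and is not called in the demo since no package registers such a group here:

```python
from importlib.metadata import entry_points


def load_candidates(group):
    """Import every driver class registered under an entry point group.

    Uses the selectable entry_points(group=...) API (Python 3.10+).
    """
    return [ep.load() for ep in entry_points(group=group)]


def pick_driver(obj, candidates):
    """Return the first candidate whose accepts() says it can handle obj."""
    for driver_cls in candidates:
        if driver_cls.accepts(obj):
            return driver_cls
    raise NotImplementedError(f"No driver accepts {type(obj).__name__}")


# Toy drivers standing in for entry-point-provided classes.
class DictDriver:
    @staticmethod
    def accepts(obj) -> bool:
        return isinstance(obj, dict)


class ListDriver:
    @staticmethod
    def accepts(obj) -> bool:
        return isinstance(obj, list)


chosen = pick_driver([1, 2, 3], [DictDriver, ListDriver])
print(chosen.__name__)
```

In a real setup the second argument would come from something like `load_candidates("pins.sklearn")`, so the order in which entry points are returned would decide ties.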

machow commented 2 years ago

Toying with this to get around the problem of importing a million optional libraries:

https://github.com/machow/databackend
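For context, the general idea of checking an object's type without importing its library can be sketched with the stdlib alone. This is not databackend's API, just an illustration of the concept; `is_instance_by_name` is a made-up helper:

```python
from collections import OrderedDict


def is_instance_by_name(obj, module_name, class_name):
    """Check obj's class hierarchy by name, without importing module_name.

    Compares the top-level package and class name of every class in the MRO,
    so subclasses match too.
    """
    return any(
        cls.__module__.split(".")[0] == module_name and cls.__name__ == class_name
        for cls in type(obj).__mro__
    )


class MyOrderedDict(OrderedDict):
    """Subclass used to show that the check walks the MRO."""


print(is_instance_by_name(MyOrderedDict(), "collections", "OrderedDict"))
print(is_instance_by_name({}, "collections", "OrderedDict"))
```

If the object really is, say, a pandas DataFrame, pandas is already imported by whoever created it, so a name-based check never has to trigger the import itself.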