quiltdata / quilt

Quilt is a data mesh for connecting people with actionable data
https://quiltdata.com
Apache License 2.0
1.33k stars 90 forks source link

Add a way to pre-fill metadata on push #2963

Open flying-sheep opened 2 years ago

flying-sheep commented 2 years ago

Details

Hi, the current solution @Cellarity wants to replace with Quilt automatically derives metadata from the current environment. We need that capability in Quilt for parity.

Specifically, we use a jupyterlab extension to insert a UUID into .ipynb file metadata. Our current solution determines the notebook from which push is run, reads out the UUID, and injects it into the metadata.

Design

I can think of two possible design strategies, only one of which I think is feasible.

Alternative Design

We could also make the jupyterlab extension modify the workflow to add the current notebook’s UUID to the list of allowed values. If workflow validation allows for a custom message, we could add one to inform the user of code they’re supposed to use, e.g. ```py import quilt3 as q3 import cyq3 q3.push(..., metadata=dict(**cyq3.meta(), more='keys')) ``` However, @akarve said that frequently modifying a workflow is not ideal, and this increases the amount of code that has to be used, increasing the risk of users copy&pasting code they don’t understand. ----

I think the ideal design would leverage Python’s built in mechanism for extension entry points. Quilt could e.g. discover plugins using the entry_points/EntryPoint API and use it to pre-fill metadata similarly to this:

from importlib.metadata import entry_points

def push(..., metadata=MappingProxyType({})):
    ...
    providers = entry_points(group='quilt_metadata')
    metadata = {  # TODO: deep merge, maybe report conflicts in providers?
        **{p.name: p.load()() for p in providers},
        **metadata,
    }

All a plugin has to do is provide package metadata:

[project.entry-points.quilt_metadata]
'notebook.uuid' = 'quilt_nb_hook:get_nb_uuid'

To round off that feature, a message could tell users to install the pre-fill package.

@akarve, you should have access to https://github.com/Cellarity/quilt-nb-hook. If Quilt had the capabilities as specified above, installing that package would allow us to automatically infer UUID metadata.

akarve commented 2 years ago

This is quite reasonable and simple, too. I would recommend that the entrypoint values and names follow some kind of "before" or "inner" method naming convention. So for instance:

EntryPoint(name='push', value='before', group='quilt3.plugins')

It's very likely that plugins are the cleanest option but, for completeness, though I wonder if we could do something similar and even simpler with a @wraps decorator around push.

flying-sheep commented 2 years ago

The simplest solution for users is to import quilt3 and have it work as expected. I don’t understand where a decorator would come in.

EntryPoint(name='push', value='before', group='quilt3.plugins')

I think you misunderstand. The value needs to be either a path to a module (mod.submod), to an object in a module (mod.submod:funcname) or an attribute of an object in a module (mod.submod:objectname.attrname). The group has to be the same for all plugins of a type. We could either create multiple groups for different types of quilt plugins or just do one and then dispatch based on the object (more on that later). The parts we can control are

  1. the name. As nothing good ever comes out of encoding multiple pieces of metadata into names, it should be a single simple identifier that might have to conform to some validation rule, but otherwise be treated opaquely. E.g. the console_scripts entrypoint group treats the name as command name. It therefore has to be a valid filename on all systems (no : or so), and inserting spaces is a bad idea.
  2. The object itself, which can be anything.

So I can see three different designs:

  1. group == 'quilt3.plugins' and object has to conform to some dispatcher signature. E.g.

    One entrypoint per plugin package pointing to a dispatcher: EntryPoint(name=???, value='mypkg.submod:make_plugin', group='quilt3.plugins')

    # Literal['metadata'], Literal['push', 'pull'], Literal['before', 'after']
    from quilt.plugins import PluginType, EventType, EventModifier
    
    def make_plugin(plugin_type: PluginType, event: EventType, when: EventModifier) -> Any:
       if plugin_type == 'metadata':
           return add_uuid  # just return the function, don’t call it
       raise NotImplementedError()
    
    def add_uuid(*args, metadata: MutableMapping[str, Any], **kw) -> None:
       metadata['uuid'] = infer_uuid()

    Pro: maximum flexibility for plugin authors Con: easy for plugin authors to misuse and accidentally leave out conditions, making their code run on events they weren’t designed for

  2. group.startswith('quilt3.plugins.') and object depends on plugin type, e.g:

    Each entrypoint points to a single function: EntryPoint(name=???, value='mypkg.submod:add_uuid', group='quilt3.plugins.metadata').

    One or more quilt helper methods that validate and attach metadata to the entry point functions:

    from quilt.plugins import metadata as md
    
    @md.hook('before', 'push')
    @md.hook('after', 'pull')
    def add_uuid(*args, metadata: MutableMapping[str, Any], **kw) -> None:
       metadata['uuid'] = infer_uuid()

    Pro: simple to use for plugin authors Con: validation code is now on quilt side, needs to check diligently if combinations of events and modifiers is correct. repetitive to use for plugin authors when defining a bunch of hooks (which have to be defined as both entrypoints and functions)

  3. group == 'quilt3.plugins' and object depends on plugin type, e.g:

    Each entrypoint points to a module and collects hooks in there: EntryPoint(name=???, value='mypkg.submod', group='quilt3.plugins')

    The hooks get metadata attached like above.

    from quilt.plugins import hook
    
    @hook('metadata', 'before', 'push')
    @hook('metadata', 'after', 'pull')
    def add_uuid(*args, metadata: MutableMapping[str, Any], **kw) -> None:
       metadata['uuid'] = infer_uuid()

    Pro: simple to use for plugin authors, they can add whole groups of hooks by specifying a module entrypoint Con: more strings involved, so needs even more validation code that a plugin author specified them correctly. most complex solution to implement: quilt code needs to loop over objects in the module and detect which ones are hooks (instead of just looping entrypoints)

I think the cons for the third option are relatively small ones if the quilt side code is robust, has nice typing, covers cases correctly, and has nice error messages in case of mistakes.

sir-sigurd commented 1 year ago

@flying-sheep Hi!

Your proposal looks like a working option to solve a problem, but seems a bit complicated to me. Maybe I don't entirely understand the problem, but could you instead use a wrapper for Package.push() that injects needed metadata?

flying-sheep commented 1 year ago

Sure, but any checks that can be circumvented by not using the wrapper are less effective of course.

I no longer work at Cellarity, so I don’t know what their current needs are anyway.