meltano / hub

The single source of truth for all Meltano plugins, including all available Singer Taps and Targets: https://hub.meltano.com

Add data-diff utility plugin #669

Open danthelion opened 2 years ago

danthelion commented 2 years ago

https://github.com/datafold/data-diff

This would be pretty neat for quickly verifying loader jobs. I would love to implement this, but I'm not sure where to start with utility plugins. In discovery.yml there are entries pointing to GitLab repos that have been deprecated in favor of the GitHub versions, but those are not in this config file. What's the process for adding a utility like this?

pnadolny13 commented 2 years ago

@danthelion that's a great idea! Thanks for opening this issue.

I haven't had a chance to use data-diff yet, but it looks very cool from what I've been hearing.

The plugins that are discoverable by Meltano are sourced from the hub, so a simple implementation could be to define the package, its settings, relevant commands, and optional docs for the hub plugin page. Check out https://github.com/meltano/hub/blob/main/_data/meltano/utilities/datahub/datahub-project.yml, which is a relatively recent utility that someone contributed. You can just open a PR with that plugin definition YAML and a logo image.
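
As a rough illustration of what such a definition carries, here is a small Python sketch that assembles the pieces mentioned above (package, settings, commands) as a dict and prints them as JSON, which YAML parsers also accept. The key names, the `pip_url` value, the `source_uri` setting, and the `REQUIRED_KEYS` list are all assumptions made for illustration, not the hub's actual schema; the datahub example linked above is the real reference for the layout.

```python
import json

# Hypothetical plugin definition. Key names and values are assumptions for
# illustration and are not taken from the hub's schema.
plugin_definition = {
    "name": "data-diff",
    "variant": "datafold",
    "repo": "https://github.com/datafold/data-diff",
    "pip_url": "data-diff",
    "settings": [
        {
            "name": "source_uri",  # hypothetical setting name
            "description": "Connection URI of the database to compare from.",
        },
    ],
    "commands": {
        "help": {"args": "--help", "description": "Show the CLI help text."},
    },
}

# Assumed minimum set of top-level keys for this sketch.
REQUIRED_KEYS = ("name", "variant", "repo", "pip_url")


def missing_keys(definition):
    """Return the required top-level keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not definition.get(key)]


if __name__ == "__main__":
    print(json.dumps(plugin_definition, indent=2))
    print("missing:", missing_keys(plugin_definition))
```

A check like `missing_keys` is only a convenience for catching an incomplete draft before opening the PR.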

Other helpful context:

The current state is that utility plugins are assumed to be pretty simple, so defining settings/commands is sufficient. However, we're finding that it would be helpful to enable plugin developers to build "glue code" for the utility (see the Python-based Plugin Architecture issue): things like pre and post hooks for prep and cleanup work, or other things that simplify the integration between plugins in your Meltano project/stack. A few good examples are Airflow and Superset, which are implemented in Meltano core right now and do things like compile config files and init databases, then clean up after themselves on exit.

There's work slated for next week to start pulling Superset/Airflow's glue code out of Meltano core and into standalone Python packages, which should start to iron out a framework and pave the way for other plugins.
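
To make the pre/post hook idea concrete, here is a minimal sketch of that kind of glue code: a context manager writes a temporary config file before the utility runs and removes it afterwards, even if the run fails. The names `compiled_config` and `run_with_hooks` are hypothetical, and this is not how Meltano core or the planned plugin framework implements it.

```python
import os
import subprocess
import sys
import tempfile
from contextlib import contextmanager


@contextmanager
def compiled_config(contents):
    """Pre hook: write a temporary config file. Post hook: delete it."""
    fd, path = tempfile.mkstemp(suffix=".cfg")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(contents)
        yield path
    finally:
        # Cleanup runs whether the wrapped command succeeded or raised.
        os.remove(path)


def run_with_hooks(command, contents):
    """Run `command` with the config path appended; return its exit code."""
    with compiled_config(contents) as path:
        return subprocess.run([*command, path]).returncode


if __name__ == "__main__":
    # Stand-in for a real utility: a Python one-liner that prints the config.
    stand_in = [sys.executable, "-c", "import sys; print(open(sys.argv[1]).read())"]
    sys.exit(run_with_hooks(stand_in, "example = true"))
```

Keeping the prep and cleanup in one context manager means the cleanup cannot be forgotten by the caller.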

I'm giving the extra context in case you have some ideas, while working with data-diff, about simplifying the startup/integration work needed to use the tool beyond just settings/commands. Things like auto-configuring data-diff based on the Meltano extractor/loader, or something like that 🤷. We should have a way to support those integrations soon!
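
One way the auto-configuration idea could look, purely as a sketch: derive connection URIs from settings the project already holds and assemble the data-diff command line from them. The setting names (`user`, `host`, `port`, `dbname`) and both helper functions are hypothetical, and the positional argument order (source URI, table, target URI, table) is an assumption about the data-diff CLI that should be checked against its documentation.

```python
def postgres_uri(settings):
    """Build a connection URI from a dict of hypothetical plugin settings."""
    return "postgresql://{user}@{host}:{port}/{dbname}".format(**settings)


def build_data_diff_args(source_settings, target_settings, table):
    """Assemble an argument list comparing the same table in two databases.

    Assumes the CLI takes: source URI, table, target URI, table.
    """
    return [
        "data-diff",
        postgres_uri(source_settings),
        table,
        postgres_uri(target_settings),
        table,
    ]


if __name__ == "__main__":
    # Example values only; a real integration would read these from the
    # extractor/loader configuration in the Meltano project.
    source = {"user": "app", "host": "source.example", "port": 5432, "dbname": "prod"}
    target = {"user": "app", "host": "target.example", "port": 5432, "dbname": "warehouse"}
    print(" ".join(build_data_diff_args(source, target, "orders")))
```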