meltano / meltano

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
https://meltano.com/
MIT License
Support "near-instant" plugin installation #7004

Open aaronsteers opened 1 year ago

aaronsteers commented 1 year ago

Assumptions

  1. We're running on Docker with explicit control of the base image.
  2. Meltano is already installed.
  3. The container may or may not have a .meltano folder created as of yet.
  4. We have cached one or more venv folders that were built on the exact same Docker image (meaning: also the exact same versions of pip and Python).
  5. We are controlling for changes in pip_url via a hash or other equality constraint.

Requirements

  1. Assuming all of the above have been controlled for, allow 'near-instant' plugin installation via xcopy into the correct location. OR:
  2. Allow Meltano to read from a different already-existing directory when searching for a specific plugin's venv.
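
A minimal sketch of the first requirement, under the assumptions listed above. It assumes plugin venvs live at `.meltano/<plugin type>/<plugin name>/venv` and that a cache directory holds one venv per pip_url hash. The function name `restore_cached_venv` and the cache layout are hypothetical and not part of Meltano.

```python
import hashlib
import shutil
from pathlib import Path


def restore_cached_venv(project_root, plugin_type, plugin_name, pip_url, cache_root):
    """Copy a cached venv into the project; return True if a venv is in place afterwards.

    Assumed (hypothetical) cache layout: <cache_root>/<plugin_name>/<sha256 of pip_url>/venv
    Assumed venv location:               <project_root>/.meltano/<plugin_type>/<plugin_name>/venv
    """
    key = hashlib.sha256(pip_url.encode("utf-8")).hexdigest()
    cached = Path(cache_root) / plugin_name / key / "venv"
    target = Path(project_root) / ".meltano" / plugin_type / plugin_name / "venv"

    if target.exists():
        return True  # something is already installed; leave it untouched
    if not cached.is_dir():
        return False  # cache miss: the caller falls back to a normal install

    target.parent.mkdir(parents=True, exist_ok=True)
    # symlinks=True keeps the interpreter symlinks inside the venv as symlinks.
    shutil.copytree(cached, target, symlinks=True)
    return True
```

Scripts inside a venv embed absolute interpreter paths, so a copy like this is only expected to work when the project sits at the same path on the same image where the venv was built, which is what assumptions 1 and 4 are meant to guarantee.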

Alternatives

Rather than tackling this per plugin venv, we could instead tackle it at the project level: either grafting in the whole collection of venv folders for plugins that match the project, or grafting in something like the entire .meltano folder, cached from a prior run, once all plugin definitions and pip URLs are confirmed to be unchanged.

This comparison could be performed by comparing the cumulative pip_url hash-of-hashes for all installed plugins, for instance. If no plugin pip_urls have changed, then we can reuse the full cache.
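
One way the hash-of-hashes comparison could look, as a rough sketch. The function names and the choice of SHA-256 are assumptions for illustration; nothing here is existing Meltano behavior.

```python
import hashlib


def pip_url_hash(pip_url: str) -> str:
    """Hash a single plugin's pip_url."""
    return hashlib.sha256(pip_url.strip().encode("utf-8")).hexdigest()


def project_cache_key(pip_urls: dict[str, str]) -> str:
    """Combine the per-plugin hashes into one digest.

    Plugins are visited in sorted order so the key does not depend on the
    order in which they appear in the project file.
    """
    outer = hashlib.sha256()
    for name in sorted(pip_urls):
        outer.update(f"{name}={pip_url_hash(pip_urls[name])}\n".encode("utf-8"))
    return outer.hexdigest()


# Hypothetical project with two plugins.
before = project_cache_key({"tap-github": "tap-github", "target-duckdb": "target-duckdb~=0.4"})
after = project_cache_key({"target-duckdb": "target-duckdb~=0.4", "tap-github": "tap-github"})
reuse_full_cache = before == after  # same pip_urls, so the cached folder can be reused
```

If any single pip_url changes, the combined key changes and the full cache would be treated as stale.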

If we took the path of caching at the .meltano level, we could optionally run a 'clean' step prior to caching, which would remove things like the system database and other artifacts that are unrelated to plugin installation (e.g. dbt logs and other plugin-specific temp artifacts).

Related

cc @WillDaSilva, @kgpayne

aaronsteers commented 1 year ago

@z3z1ma - I think you mentioned you have a working implementation via pex? Is that right?

https://pex.readthedocs.io/en/v2.1.123/

z3z1ma commented 1 year ago

@aaronsteers yes I do. I actually plan to drop the project on GitHub soon so you can dig in yourself!

aaronsteers commented 1 year ago

Very timely: Dagster is also moving to PEX for some of their Python installs. There are several interesting lessons in their post on this topic.

https://dagster.io/blog/fast-deploys-with-pex-and-docker

What is PEX?

Short for Python Executable, pex is a tool that bundles Python packages into files called pex files. These are executable files that carry Python packages and some bootstrap code within them...

Using pex in combination with S3 for storing the pex files, we built a system where the fast path avoids the overhead of building and launching Docker images.

...

Trade-offs and issues

...pex can only build pex files for Linux for packages that provide wheels.

GitHub workflows and pex

...We used to package our GitHub action code into a Docker image and used the Docker container action. Instead, we now package our action code as a pex file, which we check into our action repository and run directly on the GitHub runner. This eliminates the time spent downloading and launching the Docker action image, while still allowing us to package all dependencies.

aaronsteers commented 1 year ago

I got this working today with shiv, another alternative to pex.

Prereqs: a valid Meltano project, with meltano and pipx installed

Steps to reproduce

Step 1: Build the pyz file.

Just so the later code can be generic and easier to read:

export PLUGIN_NAME=target-duckdb
export PIP_URL=target-duckdb
pipx run shiv -c $PLUGIN_NAME -o ./$PLUGIN_NAME.pyz $PIP_URL

Which resolves to:

pipx run shiv -c target-duckdb -o ./target-duckdb.pyz target-duckdb

Note: Using pipx run, we don't actually need to preinstall shiv. Alternatively, you could pip install it and then invoke it directly.
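
For scripting the build across several plugins, the same command can be assembled programmatically. This is only a sketch: `shiv_build_command` is a hypothetical helper, and it uses only the `-c` and `-o` flags shown above.

```python
import shlex


def shiv_build_command(plugin_name: str, pip_url: str) -> list[str]:
    """Return the argv that builds ./<plugin_name>.pyz with shiv via pipx."""
    return [
        "pipx", "run", "shiv",
        "-c", plugin_name,             # console script to use as the entry point
        "-o", f"./{plugin_name}.pyz",  # output file
        *shlex.split(pip_url),         # a pip_url may hold several space-separated requirements
    ]


command = shiv_build_command("target-duckdb", "target-duckdb")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to perform the build.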

Step 2: Test the executable

aj@ajs-macbook-pro jaffle-shop-template % ./target-duckdb.pyz --help    
usage: target-duckdb [-h] [-c CONFIG]

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Config file

Step 3: Update meltano.yml

In meltano.yml, add a new executable entry with the absolute path:

# ...
  loaders:
  - name: target-duckdb
    variant: jwills
    pip_url: target-duckdb~=0.4
    executable: /Users/aj/Source/jaffle-shop-template/target-duckdb.pyz
# ...

Step 4: Invoke with Meltano

aj@ajs-macbook-pro jaffle-shop-template %  meltano invoke target-duckdb --help
2023-03-12T22:08:16.433582Z [info     ] Environment 'dev' is active
usage: target-duckdb [-h] [-c CONFIG]

options:
  -h, --help            show this help message and exit
  -c CONFIG, --config CONFIG
                        Config file

This works even if you've not yet executed meltano install. 🚀

Notes / Takeaways

  1. .pyz has a PEP associated with it (PEP 441), and is the recommended extension for Python zip applications.
  2. There was no way I could find to nullify the pip_url to make it skip the plugin during meltano install. The best I could do was to replace with a pip_url: noop to install a package that's very small.
  3. I only got this working using absolute paths.
  4. The file extension doesn't matter. You can also create the file with no extension, and it still works.
  5. File size was 15MB for target-duckdb (installed from PyPI) and 43MB for target-jaffle-shop (installed from source).
  6. I noticed a ~2 second slowdown on first invocation (probably during initial decompression) and then execution times seemed identical.
  7. In theory, we could commit these back to the repo - or store within S3 or similar and then copy in the .pyz file at runtime.
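
Notes 1 and 4 can be illustrated with the standard-library `zipapp` module, which produces the same kind of archive that shiv builds on. This sketch uses a trivial one-file application rather than real shiv output, and `build_and_run` is a hypothetical helper name.

```python
import subprocess
import sys
import tempfile
import zipapp
from pathlib import Path


def build_and_run(archive_name: str) -> str:
    """Build a tiny zip application under the given file name and return its stdout."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        src.mkdir()
        # A zip application only needs a __main__.py at the archive root.
        (src / "__main__.py").write_text('print("hello from zipapp")\n')

        target = Path(tmp) / archive_name
        zipapp.create_archive(src, target=target)

        # Run the archive with the current interpreter.
        result = subprocess.run(
            [sys.executable, str(target)],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()


with_extension = build_and_run("demo.pyz")
without_extension = build_and_run("demo")
```

Both calls run the same archive contents; the interpreter recognizes the zip by its content, not by its file name.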

Challenges

(Updated)

  1. Overriding executable at runtime. Ideally we'd want to be able to override executable dynamically without changing the meltano.yml file contents. One way to do so would be with env vars. In theory, I hoped <PLUGIN_NAME>__EXECUTABLE might have worked for this, but I could not get it running in my tests.
    • Mitigation: Investigate if <PLUGIN_NAME>__EXECUTABLE can be parsed at runtime along with other plugin-level settings.
  2. No functional spec to override pip_url if it's set in the plugin lock file. Omitting it doesn't work, since the lock file exists to make it optional anyway.
    • Mitigation (unsetting pip_url): Add logic to interpret pip_url: NONE, or pip_url: null, or pip_url: '~' as an override, functionally unsetting the value that would otherwise be used from the lock file.
    • Mitigation (configurability): Combine this with the env var parsing technique suggested for executable above, so that export <PLUGIN_NAME>__PIP_URL='~' would cause meltano to skip installation when not needed.
  3. Inability to declare multiple executables or entry points. The .pyz paradigm works well for taps and targets and other plugins that only have a single executable to call. It's unclear how we'd implement this for something like dbt-ext, where some commands call the extension and some call dbt directly.
    • Mitigation (EDK-based plugins): Refactor EDK-based plugins to call all commands from the extension CLI as the entrypoint. This is possible because the extension has a passthrough command already, so anything that can be sent to the wrapped executable can also be passed to the extension.
    • Mitigation (python interpreter approach): We could alternatively consider redefining core paradigms. For instance, Meltano could treat the zip/pyz file as a custom venv itself, and build custom handling into Meltano to call arbitrary executables from within the context of that zip. However, this is a lot more work and complexity.
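
A rough sketch of how the env var mitigations for challenges 1 and 2 might behave. Everything here is hypothetical: the function names, the sentinel set, and the assumption that the plugin name is upper-cased with dashes turned into underscores. It does not describe how Meltano currently resolves settings.

```python
import os

# Values that would mean "no pip_url, skip installation" (proposed, not implemented).
UNSET_SENTINELS = {"~", "null", "none"}


def env_prefix(plugin_name: str) -> str:
    """target-duckdb -> TARGET_DUCKDB (assumed naming convention)."""
    return plugin_name.upper().replace("-", "_")


def resolve_executable(plugin_name, configured, env=None):
    """Prefer <PLUGIN_NAME>__EXECUTABLE over the value from meltano.yml."""
    env = os.environ if env is None else env
    return env.get(f"{env_prefix(plugin_name)}__EXECUTABLE") or configured


def resolve_pip_url(plugin_name, locked_pip_url, env=None):
    """Return the pip_url to install, or None when installation should be skipped."""
    env = os.environ if env is None else env
    value = env.get(f"{env_prefix(plugin_name)}__PIP_URL", locked_pip_url)
    if value is None or value.strip().lower() in UNSET_SENTINELS:
        return None
    return value


# Example: a container that already ships a prebuilt .pyz file (hypothetical path).
env = {
    "TARGET_DUCKDB__EXECUTABLE": "/opt/plugins/target-duckdb.pyz",
    "TARGET_DUCKDB__PIP_URL": "~",
}
executable = resolve_executable("target-duckdb", "target-duckdb", env)
pip_url = resolve_pip_url("target-duckdb", "target-duckdb~=0.4", env)
```

With both variables set, the plugin would run from the prebuilt file and be skipped during installation, without any change to meltano.yml.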

WillDaSilva commented 1 year ago

Comparison of Shiv and pex: https://shiv.readthedocs.io/en/latest/history.html#motivation-comparisons

visch commented 1 year ago

@aaronsteers, here is a writeup about pip's plans for an alternative zipapp deployment method: https://discuss.python.org/t/pip-plans-to-introduce-an-alternative-zipapp-deployment-method/17431/2

@WillDaSilva found this https://discuss.python.org/t/allow-uploading-pyz-zipapp-files-to-pypi/19263

pnadolny13 commented 1 year ago

I brought this up in office hours, but it might be helpful to split our pip_url into an explicit list instead of a space-separated string. dbt declares its packages in a similar way, see https://docs.getdbt.com/docs/build/packages:

packages:
  - package: dbt-labs/snowplow
    version: 0.7.0

  - git: "https://github.com/dbt-labs/dbt-utils.git"
    revision: 0.9.2

  - local: /opt/dbt/redshift

This could help with the issue that AJ brought up, where you have a mix of packages: some pex files, some git URLs, and so on. It could also help with setting version ranges in the future.
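
To make the idea concrete, here is a sketch of how an existing space-separated pip_url might be split into explicit entries. The entry keys (`package`, `git`, `local`, `executable`) and the classification rules are hypothetical, loosely modeled on the dbt example above, and not a proposed spec.

```python
import shlex


def split_pip_url(pip_url: str) -> list[dict[str, str]]:
    """Turn a space-separated pip_url string into a list of typed entries."""
    entries = []
    for token in shlex.split(pip_url):
        if token.startswith("git+"):
            entries.append({"git": token})
        elif token.endswith((".pyz", ".pex")):
            entries.append({"executable": token})  # prebuilt zip application
        elif token.startswith(("/", "./", "../")):
            entries.append({"local": token})
        else:
            entries.append({"package": token})  # plain requirement, e.g. name~=1.0
    return entries


entries = split_pip_url(
    "target-duckdb~=0.4 git+https://github.com/dbt-labs/dbt-utils.git ./plugins/my-tap"
)
```

Each entry could then be installed, copied, or skipped according to its type, rather than handing the whole string to pip.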
