aaronsteers opened this issue 1 year ago
@z3z1ma - I think you mentioned you have a working implementation via pex? Is that right?
@aaronsteers yes I do. I actually plan to drop the project on github soon so you can dig in yourself!
Very timely: Dagster is also moving to PEX for some of their Python installs. There are several interesting lessons in their post on this topic.
https://dagster.io/blog/fast-deploys-with-pex-and-docker
What is PEX?
Short for Python Executable, pex is a tool that bundles Python packages into files called pex files. These are executable files that carry Python packages and some bootstrap code within them...
Using pex in combination with S3 for storing the pex files, we built a system where the fast path avoids the overhead of building and launching Docker images.
...
Trade-offs and issues
...pex can only build pex files for Linux for packages that provide wheels.
GitHub workflows and pex
...We used to package our GitHub action code into a Docker image and used the Docker container action. Instead, we now package our action code as a pex file, which we check into our action repository and run directly on the GitHub runner. This eliminates the time spent downloading and launching the Docker action image, while still allowing us to package all dependencies.
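To make that last point concrete, the pattern boils down to executing a committed pex file in a plain run step instead of a Docker container action. A rough sketch (the file name and step layout here are hypothetical, not taken from their post):

```yaml
# Hypothetical GitHub workflow excerpt: the pex file is checked into the repo
# and executed directly on the runner, so there is no action image to pull.
steps:
  - uses: actions/checkout@v4
  - name: Run packaged action code
    run: ./dist/my-action.pex   # my-action.pex is a made-up name for illustration
```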
I got this working today with shiv, another alternative to pex.
Prereqs: a valid Meltano project, with meltano and pipx installed.

First, build the pyz file. Just so the later code can be generic and easier to read:
export PLUGIN_NAME=target-duckdb
export PIP_URL=target-duckdb
pipx run shiv -c $PLUGIN_NAME -o ./$PLUGIN_NAME.pyz $PIP_URL
Which resolves to:
pipx run shiv -c target-duckdb -o ./target-duckdb.pyz target-duckdb
Note: Using pipx run, we don't actually need to preinstall shiv. Alternatively, you could pip install it and then invoke it directly.
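For completeness, the preinstalled variant would look roughly like this (same shiv flags as above, just without pipx):

```shell
# Install shiv once, then invoke it directly
pip install shiv
shiv -c target-duckdb -o ./target-duckdb.pyz target-duckdb
```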
aj@ajs-macbook-pro jaffle-shop-template % ./target-duckdb.pyz --help
usage: target-duckdb [-h] [-c CONFIG]
options:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Config file
Next, in meltano.yml, add a new executable entry with the absolute path:
# ...
loaders:
- name: target-duckdb
  variant: jwills
  pip_url: target-duckdb~=0.4
  executable: /Users/aj/Source/jaffle-shop-template/target-duckdb.pyz
# ...
aj@ajs-macbook-pro jaffle-shop-template % meltano invoke target-duckdb --help
2023-03-12T22:08:16.433582Z [info ] Environment 'dev' is active
usage: target-duckdb [-h] [-c CONFIG]
options:
-h, --help show this help message and exit
-c CONFIG, --config CONFIG
Config file
This works even if you've not yet executed meltano install. 🚀
Notes:
- .pyz has a PEP associated with it (PEP 441), and is a recommended extension for Python executables.
- I could not find a way to unset pip_url to make it skip the plugin during meltano install. The best I could do was to replace it with pip_url: noop to install a package that's very small. (See the sketch after this list.)
- File size was … for target-duckdb (installed from PyPI) and 43MB for target-jaffle-shop (installed from source).
- (Updated) I also could not find a way to override executable at runtime. Ideally we'd want to be able to override executable dynamically without changing the meltano.yml file contents. One way to do so would be with env vars. In theory, I hoped <PLUGIN_NAME>__EXECUTABLE might have worked for this, but I could not get it running in my tests.
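Combining the pip_url and executable notes above, a plugin entry that works today might look roughly like this (a sketch: noop is just the tiny placeholder package mentioned in the notes, and the path is from the earlier example):

```yaml
loaders:
- name: target-duckdb
  variant: jwills
  pip_url: noop  # placeholder so `meltano install` has something trivial to install
  executable: /Users/aj/Source/jaffle-shop-template/target-duckdb.pyz
```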
Possible follow-ups / feature requests:
- Make sure <PLUGIN_NAME>__EXECUTABLE can be parsed at runtime along with other plugin-level settings.
- Some way to override pip_url if it's set in the plugin lock file. Omitting it doesn't work, since the lock file exists to make it optional anyway.
- (Related to pip_url): Add logic to interpret pip_url: NONE, or pip_url: null, or pip_url: '~' as an override, functionally unsetting the value that would otherwise be used from the lock file.
- Give pip_url the same env var treatment as executable above, so that export <PLUGIN_NAME>__PIP_URL='~' would cause meltano to skip installation when not needed. (Both env var overrides are sketched below.)
- The .pyz paradigm works well for taps and targets and other plugins that only have a single executable to call. It's unclear how we'd implement this for something like dbt-ext, where some commands call the extension and some call dbt directly.
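To illustrate the env var ideas above: if both overrides existed, skipping installation and pointing at the pyz might look like this (proposed behavior only — the EXECUTABLE override did not work in my tests, and the PIP_URL override does not exist today; the variable names assume the usual <PLUGIN_NAME> uppercasing):

```shell
# Hypothetical overrides, expanded for target-duckdb
export TARGET_DUCKDB__EXECUTABLE=/Users/aj/Source/jaffle-shop-template/target-duckdb.pyz
export TARGET_DUCKDB__PIP_URL='~'   # would tell meltano there is nothing to pip-install
meltano invoke target-duckdb --help
```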
Comparison of Shiv and pex: https://shiv.readthedocs.io/en/latest/history.html#motivation-comparisons
A writeup about pip's plans for an alternative zipapp deployment method, cc @aaronsteers: https://discuss.python.org/t/pip-plans-to-introduce-an-alternative-zipapp-deployment-method/17431/2
@WillDaSilva found this https://discuss.python.org/t/allow-uploading-pyz-zipapp-files-to-pypi/19263
I brought this up in office hours, but it might be helpful to split our pip_url into an explicit list instead of a space-separated string. dbt implements their packages config in a similar way; see https://docs.getdbt.com/docs/build/packages:
packages:
  - package: dbt-labs/snowplow
    version: 0.7.0
  - git: "https://github.com/dbt-labs/dbt-utils.git"
    revision: 0.9.2
  - local: /opt/dbt/redshift
This could help with the issue that AJ brought up where you have a mix of packages: some pex, some git URLs, etc. It could also help with setting version ranges in the future.
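For illustration only, a structured pip_url along those lines might look something like this in meltano.yml (hypothetical syntax, not something Meltano supports today; the git URL is made up):

```yaml
loaders:
- name: target-duckdb
  variant: jwills
  # Hypothetical structured alternative to today's space-separated pip_url string
  pip_url:
    - package: target-duckdb
      version: "~=0.4"
    - git: "https://github.com/example-org/some-extra-dependency.git"
      revision: main
```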
This has been marked as stale because it is unassigned, and has not had recent activity. It will be closed after 21 days if no further activity occurs. If this should never go stale, please add the evergreen label, or request that it be added.
Assumptions
- A fresh environment, with no .meltano folder created as of yet.
- Each plugin's pip_url can be compared via a hash or other equality constraint.
Requirements
- The ability to cache and restore each plugin's venv.
Alternatives
Rather than tackling this on a per-plugin-venv basis, we could instead tackle it at the 'total' level: either grafting in the entire folder collection of venvs for plugins that match the project, or else grafting in something like the entire .meltano folder, cached from a prior run when all plugin definitions and pip URLs are confirmed to have not changed.

This comparison could be performed by comparing the cumulative pip_url hash-of-hashes for all installed plugins, for instance. If no plugin pip_urls have changed, then we can reuse the full cache.

If we took the path of caching at the .meltano level, we could optionally run a 'clean' step prior to caching, which would remove things like the systemdb or other plugins' artifacts that are unrelated to plugin installations. (E.g. dbt logs, and other plugin-specific temp artifacts.)
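As a rough sketch of that hash-of-hashes comparison (assuming plugin lock files live under ./plugins/, and ignoring details like Python version changes):

```shell
# Compute one cache key over everything that pins plugin installs; if it matches
# a previous run, restore the cached .meltano folder and skip `meltano install`.
CACHE_KEY=$(cat meltano.yml plugins/*/*.lock 2>/dev/null | sha256sum | cut -d' ' -f1)
echo "Plugin install cache key: $CACHE_KEY"
```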
Related

The python-setup action proves this is possible, with near-instant bootup of projects and no installation needed when the precheck of the installation definition hashes matches an existing cache.

cc @WillDaSilva, @kgpayne