harrystech / arthur-redshift-etl

ELT Code for your Data Warehouse
MIT License

Docker setup fails with permissions error #237

Open bhtucker opened 3 years ago

bhtucker commented 3 years ago

Summary

After a fresh install, attempted to run arthur:

./bin/run_arthur.sh 
++ pwd
+ docker run --rm --interactive --tty {volumes omitted} --env DATA_WAREHOUSE_CONFIG=/opt/data-warehouse/warehouse_config --env ARTHUR_DEFAULT_PREFIX=bhtucker arthur-redshift-etl:latest
+ cd /opt/src/arthur-redshift-etl
+ python3 setup.py --quiet develop
error: could not create 'python/redshift_etl.egg-info': Permission denied

Details

Prior steps were only git clone and ./bin/build_arthur.sh

Propose label: Bug, maybe documentation bug?

tvogels01 commented 3 years ago

Here's what I ran:

docker image rm arthur-redshift-etl
docker system prune
rm -rf arthur-redshift-etl
git clone git@github.com:harrystech/arthur-redshift-etl.git
cd arthur-redshift-etl
bin/build_arthur.sh
bin/run_arthur.sh

This sequence of commands puts me into a Docker container running Arthur.

Here's the state of next:

6265949 (HEAD -> next, origin/next, origin/HEAD) Merge branch 'master' into next
9e1568c Merge pull request #236 from harrystech/flake8-fixes
d32b550 (tag: v1.28.0, origin/master) Merge remote-tracking branch 'origin/next'

(The additional commit on next just changed some comments, not related to our Docker setup.)

Please be sure to be on the latest version as shown above. "permission denied" may have popped up during development on next when the user arthur inside the Docker container wasn't the owner of /opt/src.

To debug, please re-run the docker command and add: --entrypoint bash right before the image tag. In the shell, please take a look at:

ls -la /opt/src/arthur-redshift-etl/
cd /opt/src/arthur-redshift-etl/
touch hello
python setup.py develop

and let me know what error messages show up.

bhtucker commented 3 years ago

Will do, thanks @tvogels01 !

bhtucker commented 3 years ago

Some more info:

First, the 'debug probe' commands on the fresh image, no volumes linked:

(venv) (aws:, prefix:) $ 
(venv) (aws:, prefix:) $ ls -la /opt/src/arthur-redshift-etl/
total 116
drwxr-xr-x 1 arthur arthur  4096 Aug 28 16:46 .
drwxr-xr-x 1 arthur arthur  4096 Aug 28 16:45 ..
drwxrwxr-x 4 arthur arthur  4096 Aug 27 22:14 .arthurenv
-rw-rw-r-- 1 arthur arthur   430 Aug 27 22:06 .dockerignore
-rw-rw-r-- 1 arthur arthur   459 Aug 27 22:06 .editorconfig
-rw-rw-r-- 1 arthur arthur  2728 Aug 27 22:06 Dockerfile
-rw-rw-r-- 1 arthur arthur  5731 Aug 27 22:06 INSTALL.md
-rw-rw-r-- 1 arthur arthur  1070 Aug 27 22:06 LICENSE
-rw-rw-r-- 1 arthur arthur 19208 Aug 27 22:06 README.md
-rw-rw-r-- 1 arthur arthur   440 Aug 27 22:06 TODO.md
drwxrwxr-x 2 arthur arthur  4096 Aug 27 22:06 bin
drwxrwxr-x 2 arthur arthur  4096 Aug 27 22:06 cloudformation
drwxrwxr-x 2 arthur arthur  4096 Aug 27 22:06 etc
drwxrwxr-x 2 arthur arthur  4096 Aug 27 22:06 githooks
drwxrwxr-x 3 arthur arthur  4096 Aug 27 22:06 log_processing
drwxrwxr-x 1 arthur arthur  4096 Aug 27 22:15 python
-rw-rw-r-- 1 arthur arthur  2469 Aug 27 22:06 readme_release.md
-rw-rw-r-- 1 arthur arthur   149 Aug 27 22:06 requirements-dev.txt
-rw-rw-r-- 1 arthur arthur   131 Aug 27 22:06 requirements-linters.txt
-rw-rw-r-- 1 arthur arthur   218 Aug 27 22:06 requirements.txt
drwxrwxr-x 2 arthur arthur  4096 Aug 27 22:52 schemas
-rw-rw-r-- 1 arthur arthur  1543 Aug 28 16:44 setup.cfg
-rw-rw-r-- 1 arthur arthur  1565 Aug 27 22:06 setup.py
drwxrwxr-x 2 arthur arthur  4096 Aug 27 22:06 sql
(venv) (aws:, prefix:) $ cd /opt/src/arthur-redshift-etl/
(venv) (aws:, prefix:) $ touch hello
(venv) (aws:, prefix:) $ python setup.py develop
running develop
running egg_info
writing entry points to python/redshift_etl.egg-info/entry_points.txt
writing dependency_links to python/redshift_etl.egg-info/dependency_links.txt
writing top-level names to python/redshift_etl.egg-info/top_level.txt
writing python/redshift_etl.egg-info/PKG-INFO
reading manifest file 'python/redshift_etl.egg-info/SOURCES.txt'
writing manifest file 'python/redshift_etl.egg-info/SOURCES.txt'
running build_ext
Creating /opt/local/redshift_etl/venv/lib/python3.5/site-packages/redshift-etl.egg-link (link to python)
Removing redshift-etl 1.28.0 from easy-install.pth file
Adding redshift-etl 1.28.0 to easy-install.pth file
Installing run_tests.py script to /opt/local/redshift_etl/venv/bin
Installing arthur.py script to /opt/local/redshift_etl/venv/bin
Installing compare_events.py script to /opt/local/redshift_etl/venv/bin
Installing install_extraction_pipeline.sh script to /opt/local/redshift_etl/venv/bin
Installing install_pizza_load_pipeline.sh script to /opt/local/redshift_etl/venv/bin
Installing install_rebuild_pipeline.sh script to /opt/local/redshift_etl/venv/bin
Installing install_refresh_pipeline.sh script to /opt/local/redshift_etl/venv/bin
Installing install_upgrade_pipeline.sh script to /opt/local/redshift_etl/venv/bin
Installing install_validation_pipeline.sh script to /opt/local/redshift_etl/venv/bin
Installing launch_ec2_instance.sh script to /opt/local/redshift_etl/venv/bin
Installing launch_emr_cluster.sh script to /opt/local/redshift_etl/venv/bin
Installing re_run_partial_pipeline.py script to /opt/local/redshift_etl/venv/bin
Installing sns_subscribe.sh script to /opt/local/redshift_etl/venv/bin
Installing submit_arthur.sh script to /opt/local/redshift_etl/venv/bin
Installing terminate_emr_cluster.sh script to /opt/local/redshift_etl/venv/bin

Installed /opt/src/arthur-redshift-etl/python
Processing dependencies for redshift-etl==1.28.0
Finished processing dependencies for redshift-etl==1.28.0
(venv) (aws:, prefix:) $ exit

Works as expected.

Then, the run_arthur.sh test:

./bin/run_arthur.sh
You must set DATA_WAREHOUSE_CONFIG when not specifying the config directory.

Ok, fair enough, I'll set one (to an existing directory on my machine):

export DATA_WAREHOUSE_CONFIG=/home/bhtucker/third_party/warehouse_config/
$ ./bin/run_arthur.sh
++ pwd
+ docker run --rm --interactive --tty --volume /home/bhtucker/third_party:/opt/data-warehouse --volume /home/bhtucker/third_party/arthur-redshift-etl:/opt/src/arthur-redshift-etl --volume /home/bhtucker/.aws:/home/arthur/.aws --volume /home/bhtucker/.ssh:/home/arthur/.ssh:ro --env DATA_WAREHOUSE_CONFIG=/opt/data-warehouse/warehouse_config --env ARTHUR_DEFAULT_PREFIX=bhtucker arthur-redshift-etl:latest
+ cd /opt/src/arthur-redshift-etl
+ python3 setup.py --quiet develop
error: [Errno 13] Permission denied

With 'probes':

docker run --rm --interactive --tty --volume /home/bhtucker/third_party:/opt/data-warehouse --volume /home/bhtucker/third_party/arthur-redshift-etl:/opt/src/arthur-redshift-etl --volume /home/bhtucker/.aws:/home/arthur/.aws --volume /home/bhtucker/.ssh:/home/arthur/.ssh:ro --env DATA_WAREHOUSE_CONFIG=/opt/data-warehouse/warehouse_config --env ARTHUR_DEFAULT_PREFIX=bhtucker --entrypoint bash arthur-redshift-etl:latest
(venv) (aws:, prefix:bhtucker) $ ls -la /opt/src/arthur-redshift-etl/
total 516
drwxrwxr-x 15   1002   1005   4096 Aug 28 16:44 .
drwxr-xr-x  1 arthur arthur   4096 Aug 28 16:45 ..
drwxrwxr-x  4   1002   1005   4096 Aug 27 22:14 .arthurenv
-rw-rw-r--  1   1002   1005    430 Aug 27 22:06 .dockerignore
-rw-rw-r--  1   1002   1005    459 Aug 27 22:06 .editorconfig
drwxrwxr-x  8   1002   1005   4096 Aug 28 16:44 .git
drwxrwxr-x  3   1002   1005   4096 Aug 27 22:06 .github
-rw-rw-r--  1   1002   1005    406 Aug 27 22:06 .gitignore
-rw-rw-r--  1   1002   1005   2728 Aug 27 22:06 Dockerfile
-rw-rw-r--  1   1002   1005   5731 Aug 27 22:06 INSTALL.md
-rw-rw-r--  1   1002   1005   1070 Aug 27 22:06 LICENSE
-rw-rw-r--  1   1002   1005  19208 Aug 27 22:06 README.md
-rw-rw-r--  1   1002   1005    440 Aug 27 22:06 TODO.md
-rw-rw-r--  1   1002   1005 381244 Aug 27 22:51 arthur.log
drwxrwxr-x  2   1002   1005   4096 Aug 27 22:06 bin
drwxrwxr-x  2   1002   1005   4096 Aug 27 22:06 cloudformation
drwxrwxr-x  2   1002   1005   4096 Aug 27 23:07 dist
drwxrwxr-x  2   1002   1005   4096 Aug 27 22:06 etc
drwxrwxr-x  2   1002   1005   4096 Aug 27 22:06 githooks
drwxrwxr-x  3   1002   1005   4096 Aug 27 22:06 log_processing
drwxrwxr-x  5   1002   1005   4096 Aug 27 22:15 python
-rw-rw-r--  1   1002   1005   2469 Aug 27 22:06 readme_release.md
-rw-rw-r--  1   1002   1005    149 Aug 27 22:06 requirements-dev.txt
-rw-rw-r--  1   1002   1005    131 Aug 27 22:06 requirements-linters.txt
-rw-rw-r--  1   1002   1005    218 Aug 27 22:06 requirements.txt
drwxrwxr-x  2   1002   1005   4096 Aug 27 22:52 schemas
-rw-rw-r--  1   1002   1005   1543 Aug 28 16:44 setup.cfg
-rw-rw-r--  1   1002   1005   1565 Aug 27 22:06 setup.py
drwxrwxr-x  2   1002   1005   4096 Aug 27 22:06 sql
drwxrwxr-x  2   1002   1005   4096 Aug 27 22:06 wiki
(venv) (aws:, prefix:bhtucker) $ cd /opt/src/arthur-redshift-etl/
(venv) (aws:, prefix:bhtucker) $ touch hello
touch: cannot touch 'hello': Permission denied
(venv) (aws:, prefix:bhtucker) $ python setup.py develop
running develop
running egg_info
error: [Errno 13] Permission denied

So I suppose arthur the container user isn't allowed to talk back out to my src dir. I confess I don't use volumes for anything but read-only config files so don't know how this should work.
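
The second listing hints at a likely cause: the mounted directory shows up with the numeric owner 1002:1005 and mode drwxrwxr-x, so a container user with a different UID and GID would only get the read and execute bits for "other". A minimal sketch for checking this from inside the container, assuming GNU stat is available there (report_write_access is a made-up helper name, not part of Arthur):

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of arthur-redshift-etl): show who owns a
# directory and whether the current user may write to it.
report_write_access() {
    local dir="${1:-.}"
    local owner
    # GNU stat: %u and %g print the numeric owner and group IDs.
    owner="$(stat -c '%u:%g' "$dir")" || return 2
    echo "directory owner (uid:gid): $owner"
    echo "current user (uid:gid): $(id -u):$(id -g)"
    if [ -w "$dir" ]; then
        echo "writable: yes"
    else
        echo "writable: no"
        return 1
    fi
}

# Example: report_write_access /opt/src/arthur-redshift-etl
```

If the two uid:gid pairs differ and the result is "writable: no", the mismatch between the host user and the container user is the probable culprit.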

tvogels01 commented 3 years ago

I'll have to dig into Docker volumes to see what might cause this issue. For me, the owner of /opt/src stays arthur.

To unblock you, I'd suggest that you switch into your warehouse directory before starting arthur. The reason is that this will put the image into "standalone" mode which means it won't try to run python setup.py develop.

Here's what happens on my laptop:

~/repos/harrystech/arthur-redshift-etl/bin/run_arthur.sh 
Did not find source path (looked for setup.py) -- switching to standalone mode.
Changes to code in /opt/src/arthur-redshift-etl will not be preservd between runs.
However, changes to your schemas or config will be reflected in your local filesystem.
+ docker run --rm --interactive --tty ... --env DATA_WAREHOUSE_CONFIG=/opt/data-warehouse/config_data_development --env ARTHUR_DEFAULT_PREFIX=tom ... arthur-redshift-etl:latest
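
The message above suggests that the launcher looks for setup.py in the current directory to pick between the two modes. A rough sketch of that kind of check follows; the function name detect_mode and the mode labels are illustrative assumptions, not code taken from bin/run_arthur.sh:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: choose a mode based on whether setup.py is present.
detect_mode() {
    local dir="${1:-$PWD}"
    if [ -f "$dir/setup.py" ]; then
        # Looks like the Arthur source tree: mount it and develop in place.
        echo "development"
    else
        # No source tree here: rely on the code baked into the image.
        echo "standalone"
    fi
}

# Example: detect_mode "$PWD"
```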

Also, I just realized that directories get mounted multiple times in your setup. Maybe that's part of the issue? Please create a new directory above the config so that it's parallel to this repo.

--volume /home/bhtucker/third_party:/opt/data-warehouse --volume /home/bhtucker/third_party/arthur-redshift-etl:/opt/src/arthur-redshift-etl

Notice how third_party appears in both places. The goal is something like this:

--volume /home/bhtucker/third_party/warehouse_repo:/opt/data-warehouse --volume /home/bhtucker/third_party/arthur-redshift-etl:/opt/src/arthur-redshift-etl

This assumes that the configuration directory is now in /home/bhtucker/third_party/warehouse_repo/warehouse_config
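
To check whether two host directories that are about to be mounted overlap, something like the following could be used. This is only a sketch: is_nested is a hypothetical helper, it compares path strings without resolving symlinks, and whether nested mounts actually cause the permission error is unconfirmed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the second path is the first path
# or lies somewhere underneath it.
is_nested() {
    # Strip one trailing slash so "/a/b/" and "/a/b" compare the same.
    local parent="${1%/}" child="${2%/}"
    case "$child/" in
        "$parent"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   is_nested /home/bhtucker/third_party /home/bhtucker/third_party/arthur-redshift-etl
```

With the original mounts the check succeeds (the source directory sits inside the directory mounted as /opt/data-warehouse); with the suggested warehouse_repo layout it fails, which is the intended outcome.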

bhtucker commented 3 years ago

Running from the 'warehouse' directory and not from arthur source directory seems wise.

I also forgot the sibling setup is 'repo' then 'config' adjacent to e.g. 'sources'. Added that layer.

In fact I did a pip install -e . before running in Docker (muscle memory). So I cleared it out and tried again without that; the error simply changes slightly to error: could not create 'python/redshift_etl.egg-info': Permission denied.

Perhaps I have some global Docker settings or version info I don't know about. Does the classic virtualenv setup still work? That's my preference anyway :)

tvogels01 commented 3 years ago

Using a virtual env might work but I haven't tested in a while. I don't see why it wouldn't.

Here's what the permissions should look like:

$ ls -lad /opt/ /opt/data-warehouse/ /opt/local/ /opt/src/ /opt/src/arthur-redshift-etl/python/redshift_etl.egg-info/
drwxr-xr-x  1 root   root   4096 Aug 28 11:40 /opt/
drwxr-xr-x 37 arthur arthur 1184 Aug 28 16:55 /opt/data-warehouse/
drwxr-xr-x  1 arthur arthur 4096 Aug 28 11:40 /opt/local/
drwxr-xr-x  1 arthur arthur 4096 Aug 28 11:40 /opt/src/
drwxr-xr-x  8 arthur arthur  256 Aug 28 17:32 /opt/src/arthur-redshift-etl/python/redshift_etl.egg-info/

Note how the user stays arthur.

In the end, this is the directory structure that you're aiming for:

top/warehouse/config/
top/warehouse/schemas/
top/arthur-redshift-etl/bin/
top/arthur-redshift-etl/etc/
top/arthur-redshift-etl/python/
...
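
One way to create the warehouse side of this layout from scratch, as a sketch. The name of the top-level directory is arbitrary, make_layout is a hypothetical helper, and the clone step is left as a comment because it needs network access:

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the warehouse directories next to where the
# arthur-redshift-etl checkout will live.
make_layout() {
    local top="${1:?usage: make_layout TOP_DIR}"
    mkdir -p "$top/warehouse/config" "$top/warehouse/schemas"
    # The ETL code sits next to the warehouse directory, for example:
    #   git clone git@github.com:harrystech/arthur-redshift-etl.git "$top/arthur-redshift-etl"
}

# Example: make_layout "$HOME/top"
```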

For now, what happens if you simply comment out the line python setup.py develop in bin/entrypoint.sh? It's not needed unless you develop code inside the Docker container. The other lines (setting PATH and activating the virtual env) are needed.

tvogels01 commented 3 years ago

Please try out: Allow read-only source directory #242

When the Docker image is created, an install step already installs the ETL source. So running python setup.py develop is nice to have but not necessary for a working container.
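
The idea of tolerating a read-only source directory could look roughly like the guard below in an entrypoint script. This is a sketch under my own assumptions and not the change made in #242; can_develop is a hypothetical name, and the check on the python subdirectory is based on the egg-info path in the error message.

```shell
#!/usr/bin/env bash
# Hypothetical guard: only attempt a develop install when the source tree
# is present and the current user can write where egg-info would be created.
can_develop() {
    local src="${1:-.}"
    [ -f "$src/setup.py" ] && [ -w "$src" ] && [ -w "$src/python" ]
}

# Possible use in an entrypoint script:
#   if can_develop /opt/src/arthur-redshift-etl; then
#       (cd /opt/src/arthur-redshift-etl && python3 setup.py --quiet develop)
#   else
#       echo "Source directory is not writable, skipping 'setup.py develop'." >&2
#   fi
```

With such a guard the container would fall back to the copy of the ETL code installed at image build time instead of failing at startup.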