Shell detection failure inside docker image on ARM Macbook

ltalirz commented 3 years ago

We have a docker container, for which we're running into a shell detection failure:

Traceback (most recent call last):
  File "/opt/conda/bin/verdi", line 8, in <module>
    sys.exit(verdi())
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 777, in main
    _bashcomplete(self, prog_name, complete_var)
  File "/opt/conda/lib/python3.7/site-packages/click_completion/patch.py", line 122, in _shellcomplete
    echo(get_code(prog_name=prog_name, env_name=complete_var))
  File "/opt/conda/lib/python3.7/site-packages/click_completion/core.py", line 299, in get_code
    shell = get_auto_shell()
  File "/opt/conda/lib/python3.7/site-packages/click_completion/lib.py", line 125, in get_auto_shell
    return shellingham.detect_shell()[0]
  File "/opt/conda/lib/python3.7/site-packages/shellingham/__init__.py", line 24, in detect_shell
    raise ShellDetectionFailure()
shellingham._core.ShellDetectionFailure

For some reason, this only happens when running the docker container on M1 Macbooks (on Intel Macbooks, the error in the container does not occur). Observations:

Shellingham version in container is 1.4.0
Shellingham 1.4.0 in a local conda environment on my machine detects the shell just fine (tested with both python 3.7 and 3.9)
Both inside container and on my machine, the detected os.name is posix

For some reason, the shell detection for posix returns None: https://github.com/sarugaku/shellingham/blob/325c643e89877eb325adf44bc62547251e87acef/src/shellingham/posix/__init__.py#L82-L90

Steps to reproduce:

Own a Macbook M1 (sorry!)

docker run -d -it --name aiida-core aiidateam/aiida-core:1.6.5
docker exec -it --user aiida aiida-core bash

ltalirz commented 3 years ago

If you have an idea what could be the reason, I'd be happy to look deeper inside get_shell to see where the problem lies

ltalirz commented 3 years ago

Further info by @ramirezfranciscof in https://github.com/aiidalab/aiidalab-docker-stack/issues/202#issuecomment-937847268

Is it possible that shellingham somehow gets confused when the architecture the docker image was built for does not match the architecture of the host OS?

ramirezfranciscof commented 3 years ago

I managed to get a "more minimal" example, if that helps. This is the Dockerfile:

# syntax=docker/dockerfile:1
FROM python:3-slim
RUN pip3 install shellingham
ENTRYPOINT ["tail", "-f", "/dev/null"]

Then I build it with docker build --platform linux/amd64 -t "baseimage_test" ., run the container, and log in (docker exec -it <container_name> /bin/bash) to execute the following:

root@baseimage_test:/# python3
Python 3.10.0 (default, Oct  5 2021, 23:49:26) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shellingham
>>> shellingham.detect_shell()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/shellingham/__init__.py", line 24, in detect_shell
    raise ShellDetectionFailure()
shellingham._core.ShellDetectionFailure

If I build without the --platform linux/amd64 then this doesn't happen:

root@baseimage_test:/# python3
Python 3.10.0 (default, Oct  6 2021, 00:09:42) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shellingham
>>> shellingham.detect_shell()
('bash', '/bin/bash')

uranusjr commented 3 years ago

I don't have an ARM machine to test this out, so you'll probably need to debug this mostly on your own. Note that macOS and Linux are likely using different implementations (macOS uses the ps implementation, and Linux the /proc-based one) and will need to be debugged separately (although I do kind of suspect the root cause is the same).

I'd probably start with doing something like

>>> import shellingham.posix.proc
>>> print(shellingham.posix.proc.get_process_mapping())

and see if there's anything like a shell in there. If not, I'd manually break the loop apart and see where the parsing code went wrong. The fact that this does not happen if you use a native container seems to also indicate that this is something related to the cross-arch translation; maybe a process of Python built against Intel can't map its pid correctly to native ARM? No idea to be honest.
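One way to "break the loop apart" by hand is to walk the parent chain over the mapping and print every step. The sketch below assumes the mapping is a dict keyed by pid strings whose values expose args, pid and ppid; the Process tuple and the walk_parents helper are illustrative stand-ins written for this comment, not part of shellingham.

```python
import collections

# Stand-in with the fields the mapping entries are assumed to have.
Process = collections.namedtuple("Process", "args pid ppid")


def walk_parents(mapping, pid, max_depth=10):
    """Return the processes from `pid` upwards, as far as the mapping allows."""
    chain = []
    pid = str(pid)
    for _ in range(max_depth):
        proc = mapping.get(pid)
        if proc is None:
            break  # the parent is missing from the mapping, so the walk ends
        chain.append(proc)
        pid = proc.ppid
    return chain


if __name__ == "__main__":
    # Hypothetical mapping: python started from bash, bash started from init.
    demo = {
        "1": Process(args=("/sbin/init",), pid="1", ppid="0"),
        "42": Process(args=("/bin/bash",), pid="42", ppid="1"),
        "77": Process(args=("python3",), pid="77", ppid="42"),
    }
    for proc in walk_parents(demo, 77):
        print(proc.pid, proc.ppid, proc.args)
```

With the real library this would be called as walk_parents(shellingham.posix.proc.get_process_mapping(), os.getpid()). If the chain stops after the Python process itself, the parent was never collected into the mapping, which narrows the problem down to how the mapping is built rather than how shell names are matched.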

ltalirz commented 3 years ago

Thanks for the hints @uranusjr !

Indeed, there is nothing that looks like a shell in the process mapping

(base) aiida@b92ecb60a87f:/$ python
Python 3.7.9 (default, Aug 31 2020, 12:42:55)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shellingham.posix.proc
>>> from pprint import pprint
>>> pprint(shellingham.posix.proc.get_process_mapping())
{'1191': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/erlang/erts-9.2/bin/beam.smp', '-W', 'w', '-A', '64', '-P', '1048576', '-t', '5000000', '-stbt', 'db', '-zdbbl', '32000', '-K', 'true', '-B', 'i', '--', '-root', '/usr/lib/erlang', '-progname', 'erl', '--', '-home', '/var/lib/rabbitmq', '--', '-pa', '/usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/ebin', '-noshell', '-noinput', '-s', 'rabbit', 'boot', '-sname', 'rabbit@localhost', '-boot', 'start_sasl', '-kernel', 'inet_default_connect_options', '[{nodelay,true}]', '-sasl', 'errlog_type', 'error', '-sasl', 'sasl_error_logger', 'false', '-rabbit', 'error_logger', '{file,"/home/aiida/.rabbitmq/log/rabbit@localhost.log"}', '-rabbit', 'sasl_error_logger', '{file,"/home/aiida/.rabbitmq/log/rabbit@localhost-sasl.log"}', '-rabbit', 'enabled_plugins_file', '"/etc/rabbitmq/enabled_plugins"', '-rabbit', 'plugins_dir', '"/usr/lib/rabbitmq/plugins:/usr/lib/rabbitmq/lib/rabbitmq_server-3.6.10/plugins"', '-rabbit', 'plugins_expand_dir', '"/home/aiida/.rabbitmq/rabbit@localhost-plugins-expand"', '-os_mon', 'start_cpu_sup', 'false', '-os_mon', 'start_disksup', 'false', '-os_mon', 'start_memsup', 'false', '-mnesia', 'dir', '"/home/aiida/.rabbitmq/rabbit@localhost"', '-kernel', 'inet_dist_listen_min', '25672', '-kernel', 'inet_dist_listen_max', '25672'), pid='1191', ppid='829'),
 '1393': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/erlang/erts-9.2/bin/erl_child_setup', '1048576'), pid='1393', ppid='1191'),
 '1449': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/erlang/erts-9.2/bin/inet_gethost', '4'), pid='1449', ppid='1393'),
 '1453': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/erlang/erts-9.2/bin/inet_gethost', '4'), pid='1453', ppid='1449'),
 '1607': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/postgresql/10/bin/postgres', '-D', '/home/aiida/.postgresql'), pid='1607', ppid='1'),
 '1953': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/postgresql/10/bin/postgres', '-D', '/home/aiida/.postgresql'), pid='1953', ppid='1607'),
 '1955': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/postgresql/10/bin/postgres', '-D', '/home/aiida/.postgresql'), pid='1955', ppid='1607'),
 '1957': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/postgresql/10/bin/postgres', '-D', '/home/aiida/.postgresql'), pid='1957', ppid='1607'),
 '1958': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/postgresql/10/bin/postgres', '-D', '/home/aiida/.postgresql'), pid='1958', ppid='1607'),
 '1959': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/postgresql/10/bin/postgres', '-D', '/home/aiida/.postgresql'), pid='1959', ppid='1607'),
 '1960': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/postgresql/10/bin/postgres', '-D', '/home/aiida/.postgresql'), pid='1960', ppid='1607'),
 '2053': Process(args=('/usr/bin/qemu-x86_64', '/usr/bin/runsv', 'cron'), pid='2053', ppid='2050'),
 '2055': Process(args=('/usr/bin/qemu-x86_64', '/usr/bin/runsv', 'sshd'), pid='2055', ppid='2050'),
 '2059': Process(args=('/usr/bin/qemu-x86_64', '/usr/sbin/cron', '-f'), pid='2059', ppid='2053'),
 '2081': Process(args=('/opt/conda/bin/python',), pid='2081', ppid='0'),
 '598': Process(args=('/usr/bin/qemu-x86_64', '/usr/bin/ssh-agent'), pid='598', ppid='1'),
 '624': Process(args=('/usr/bin/qemu-x86_64', '/usr/lib/erlang/erts-9.2/bin/epmd', '-daemon'), pid='624', ppid='1'),
 '715': Process(args=('/usr/bin/qemu-x86_64', '/bin/sh', '/usr/sbin/rabbitmq-server'), pid='715', ppid='1'),
 '829': Process(args=('/usr/bin/qemu-x86_64', '/bin/sh', '/usr/lib/rabbitmq/bin/rabbitmq-server'), pid='829', ppid='715')}

I should note that there is also no shell in the output of ps (is this expected)?

(base) aiida@b92ecb60a87f:/$ ps
  PID TTY          TIME CMD
  598 ?        00:00:00 ssh-agent
 1607 ?        00:00:00 postgres
 1953 ?        00:00:00 postgres
 1955 ?        00:00:00 postgres
 1957 ?        00:00:00 postgres
 1958 ?        00:00:00 postgres
 1959 ?        00:00:00 postgres
 1960 ?        00:00:00 postgres
 2864 ?        00:00:00 ps

Finally, here is an example of a stat file and the result of the parsing. Not sure whether this is intended behavior?

(base) aiida@b92ecb60a87f:/$ cat /proc/1/stat
1 (my_init) S 0 1 1 34816 1 4194560 6753 806908 0 11299 26 2 4432 469 20 0 2 0 21127 263921664 7921 18446744073709551615 1 1 0 0 0 0 0 16781312 1988161279 0 0 0 17 1 0 0 0 0 0 0 0 0 0 0 0 0 0
(base) aiida@b92ecb60a87f:/$ cat /proc/1607/stat
1607 (postgres) S 1 1502 1502 0 -1 4194560 4609 49891 5296 2087 11 4 226 43 20 0 2 0 23504 855425024 9612 18446744073709551615 4194304 7388541 281474097834016 0 0 0 0 19935232 1988218623 0 0 0 17 2 0 0 0 0 0 7455192 7778992 275849216 281474097838868 281474097838952 281474097838952 281474097840084 0
(base) aiida@b92ecb60a87f:/$ python
Python 3.7.9 (default, Aug 31 2020, 12:42:55)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import shellingham.posix.proc as p
>>> p.detect_proc()
'stat'
>>> p._get_stat(1,'stat')
('34816', '0')
>>> p._get_stat(1607,'stat')
('0', '1')
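For reading those stat lines by hand, here is a minimal sketch of a parser for the leading fields, following the field order documented in proc(5) (pid, comm, state, ppid, pgrp, session, tty_nr). The names StatInfo, parse_stat and decode_tty are made up for this example and are not shellingham functions. Judging by the values, the tuples returned by _get_stat above appear to be (tty_nr, ppid), which would match the files.

```python
import collections

StatInfo = collections.namedtuple("StatInfo", "pid comm state ppid tty_nr")


def parse_stat(line):
    """Parse the leading fields of a Linux /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    # After comm the order is: state, ppid, pgrp, session, tty_nr, ...
    return StatInfo(
        pid=int(head),
        comm=comm,
        state=fields[0],
        ppid=int(fields[1]),
        tty_nr=int(fields[4]),
    )


def decode_tty(tty_nr):
    """Split tty_nr into (major, minor) using the bit layout from proc(5)."""
    major = (tty_nr >> 8) & 0xFF
    minor = (tty_nr & 0xFF) | ((tty_nr >> 12) & 0xFFF00)
    return major, minor


if __name__ == "__main__":
    # Leading fields copied from the two stat files above.
    for sample in (
        "1 (my_init) S 0 1 1 34816 1 4194560",
        "1607 (postgres) S 1 1502 1502 0 -1 4194560",
    ):
        info = parse_stat(sample)
        print(info, decode_tty(info.tty_nr))
```

Read this way, pid 1 has ppid 0 and a controlling terminal (tty_nr 34816, i.e. major 136, minor 0, which on Linux is normally a pseudo-terminal), while the postgres process has ppid 1 and tty_nr 0, meaning no controlling terminal.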

P.S. This is just for future reference: it turns out that even on my machine, using the same container, the shell detection error is not always raised. I was recently able to launch the docker container and enter it without the error occurring. I removed the container (docker rm -f), created a new one, and the error was still gone. Then I deactivated my conda environment on the host, launched a new container, and the problem reappeared.

After this, however, activating the conda environment again did not remove the error; it now persisted. It is not clear to me what was going on here or how to reliably reproduce it.

uranusjr commented 3 years ago

I should note that there is also no shell in the output of ps (is this expected)?

Maybe. I don't really understand this behaviour either; it is probably due to some kind of magic in Docker or the Linux kernel. In any case, if the OS is not reporting the existence of a shell, there's really nothing we can do… The application using shellingham is supposed to provide a reasonable default because these kinds of oddities do happen, and shell detection can only do so much. This can probably be better explained by a Docker, OCI, or Linux-on-ARM expert, and I am really none of those.
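As a rough illustration of what "provide a reasonable default" can look like on the application side, the sketch below catches ShellDetectionFailure and falls back to the environment. The helper names provide_default and detect_shell_path, and the choice of /bin/sh and cmd.exe as last-resort values, are assumptions made for this example; an application may well prefer something else.

```python
import os

try:
    import shellingham
except ImportError:  # lets the sketch run where shellingham is not installed
    shellingham = None


def provide_default():
    """Pick a fallback shell from the environment (an application-level choice)."""
    if os.name == "posix":
        return os.environ.get("SHELL", "/bin/sh")
    if os.name == "nt":
        return os.environ.get("COMSPEC", "cmd.exe")
    raise NotImplementedError(f"OS {os.name!r} support not available")


def detect_shell_path():
    """Return the detected shell path, or the fallback if detection fails."""
    if shellingham is None:
        return provide_default()
    try:
        _name, path = shellingham.detect_shell()
    except shellingham.ShellDetectionFailure:
        return provide_default()
    return path


if __name__ == "__main__":
    print(detect_shell_path())
```

With a wrapper like this, a failed detection degrades to whatever SHELL (or COMSPEC) says instead of surfacing as a traceback.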

brandonros commented 2 years ago

git clone git@github.com:HenryFBP/trading-bot.git
cd trading-bot/
DOCKER_DEFAULT_PLATFORM=linux/amd64 docker run -it -v $(pwd):/mnt python:3.7-slim bash
# from inside the running Docker shell
cd /mnt
pip install pipenv
pipenv --python /usr/local/bin/python install
pipenv shell

I get this even when trying to "abstract" away the arm64 and stick to good ol' x86_64 (amd64) through emulation:

# pipenv shell
Traceback (most recent call last):
  File "/usr/local/bin/pipenv", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.7/site-packages/pipenv/vendor/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pipenv/cli/options.py", line 56, in main
    return super().main(*args, **kwargs, windows_expand_args=False)
  File "/usr/local/lib/python3.7/site-packages/pipenv/vendor/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/pipenv/vendor/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/pipenv/vendor/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/pipenv/vendor/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pipenv/vendor/click/decorators.py", line 84, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pipenv/vendor/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/pipenv/cli/command.py", line 429, in shell
    pypi_mirror=state.pypi_mirror,
  File "/usr/local/lib/python3.7/site-packages/pipenv/core.py", line 2442, in do_shell
    shell = choose_shell(project)
  File "/usr/local/lib/python3.7/site-packages/pipenv/shells.py", line 239, in choose_shell
    type_, command = detect_info(project)
  File "/usr/local/lib/python3.7/site-packages/pipenv/shells.py", line 29, in detect_info
    raise ShellDetectionFailure
pipenv.vendor.shellingham._core.ShellDetectionFailure

brandonros commented 2 years ago

# PIPENV_SHELL='/bin/bash' pipenv shell
Launching subshell in virtual environment...
 . /root/.local/share/virtualenvs/mnt-MaCywDhH/bin/activate
root@65c4a982e2d5:/mnt#  . /root/.local/share/virtualenvs/mnt-MaCywDhH/bin/activate
(mnt) root@65c4a982e2d5:/mnt#

That fixes it for some reason?

uranusjr commented 2 years ago

Shellingham is never called if you set PIPENV_SHELL, since the variable forces pipenv shell to use that instead of doing any detection. So yeah, you could use that to work around whatever the problem is here.

uranusjr commented 2 years ago

Alright, I finally had a chance to look into this. For this specific environment combination (an Intel image running in Docker on an ARM Mac), the parent process is, interestingly, attached to a different TTY from the Python process itself. I suspect this is due to some emulation implementation detail that leaked into the container. This is “easily” fixable by removing the TTY check, but I’m hesitant to just do that since it makes the proc implementation quite a bit slower.

An alternative approach would be to do some refactoring and make process lookup lazier, so we only access process IDs that are related to the current process. I’m not particularly motivated to do this myself (especially considering that setting PIPENV_SHELL is a pretty adequate workaround), but contributions would be very welcome.
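To make the "lazier lookup" idea concrete, here is a rough sketch, not shellingham's implementation: a generator that starts at one pid and reads only the stat files of its ancestors, with no TTY filtering. The names read_ppid_and_tty and iter_parents, and the proc_root and max_depth parameters, are hypothetical and exist only for this illustration.

```python
import os


def read_ppid_and_tty(pid, proc_root="/proc"):
    """Read (ppid, tty_nr) for one pid from <proc_root>/<pid>/stat."""
    path = os.path.join(proc_root, str(pid), "stat")
    with open(path, encoding="utf-8", errors="replace") as f:
        data = f.read()
    # Skip past the parenthesised command name, which may contain spaces.
    fields = data.rpartition(")")[2].split()
    return int(fields[1]), int(fields[4])


def iter_parents(pid=None, proc_root="/proc", max_depth=10):
    """Lazily yield (pid, ppid, tty_nr) for `pid` and then each ancestor."""
    pid = os.getpid() if pid is None else pid
    for _ in range(max_depth):
        if pid <= 0:
            return
        try:
            ppid, tty_nr = read_ppid_and_tty(pid, proc_root)
        except (OSError, IndexError, ValueError):
            return  # process vanished or the file is not in the expected format
        yield pid, ppid, tty_nr
        pid = ppid


if __name__ == "__main__":
    for pid, ppid, tty_nr in iter_parents():
        print(pid, ppid, tty_nr)
```

Because only the direct ancestry is read, the cost is bounded by the depth of the process tree rather than by the number of processes on the system, so dropping the TTY check would no longer mean scanning everything under /proc.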

sarugaku / shellingham

Shell detection failure inside docker image on ARM Macbook #55
