pypa / virtualenv

Virtual Python Environment builder
https://virtualenv.pypa.io
MIT License
4.83k stars 1.03k forks source link

Created virtual env does not honour sys.base_prefix correctly #2770

Open pelson opened 1 month ago

pelson commented 1 month ago

Issue

When creating a virtual environment, the base prefix of the virtual environment should be the same as the base prefix of the Python that is creating the virtual environment - there should be no additional resolve steps with regards to symlinks.

This behaviour is seen in venv, and was implemented as expected until https://github.com/pypa/virtualenv/pull/2686, where symlinks were resolved.

Why is this important?

In big (scientific) institutions, it is common to have a network mounted filesystem which can be access from all managed machines. This could be to mount a homespace, or to mount some data etc. (I've seen both). To scale this up, it is necessary to have multiple filesystem servers all using the same underlying storage. Machines are then clustered to point to different servers, but the user doesn't know which server a machine is talking to. When combining this with something such as autofs, you find that the server being contacted is in the path (e.g. /nfs/some-machine/), and to smoothen this out accross machines, the managed machines get a canonical symlink (e.g. /project/{my-project} which symlinks to the specific machine mountpoint). Essentially:

machine1:

ls -ltr /project/my-project -> /nfs/nfs-server1

machine2:

ls -ltr /project/my-project -> /nfs/nfs-server2

With this setup, you can create a virtual environment in /project/my-project which works on both machines IFF the symlink is not resolved. This is the behaviour of CPython, and also the behaviour of venv and virtualenv<20.25.1.

Reproducer

Written in the form of a pytest (w validation against venv also):

import os
import pathlib
import sys
import subprocess
import typing

import pytest

def read_venv_config(venv_prefix: pathlib.Path) -> typing.Dict[str, str]:
    venv_cfg_path = venv_prefix / 'pyvenv.cfg'

    cfg = {}
    with venv_cfg_path.open('rt') as cfg_fh:
        for line in cfg_fh:
            name, _, value = line.partition('=')
            cfg[name.strip()] = value.strip()
    return cfg

@pytest.mark.parametrize("venv_impl", ["venv", "virtualenv"])
def test_symlink_python(tmp_path: pathlib.Path, venv_impl: str) -> None:
    py_link = tmp_path / "some-other-prefix"
    pathlib.Path(py_link).symlink_to(sys.base_prefix)

    dest_venv = tmp_path / "some-venv"

    py_bin = py_link / 'bin' / 'python'
    subprocess.run([py_bin, '-m', venv_impl, str(dest_venv)], check=True)

    cfg = read_venv_config(dest_venv)

    py_base_prefix = subprocess.check_output([py_bin, '-c', 'import sys; print(sys.base_prefix)'], text=True).strip()

    # Get the link of the py bin. Don't recursively resolve this (like pathlib.resolve would do)
    py_bin_link = os.readlink(dest_venv / 'bin' / 'python')
    assert py_bin_link == str(py_link / 'bin' / 'python')

    assert str(py_link) == py_base_prefix
    assert py_base_prefix + '/bin' == cfg['home']

The result is a pass in 20.25.0 and a fail since:

>       assert py_base_prefix + '/bin' == cfg['home']
E       AssertionError: assert '/tmp/pytest-of-pelson/pytest-401/test_symlink_python_virtualenv0/some-other-prefix/bin' == '/path/to/my/environment/bin'
E         - /path/to/my/environment/bin
E         + /tmp/pytest-of-pelson/pytest-401/test_symlink_python_virtualenv0/some-other-prefix/bin

It is worth noting that the implementation of https://github.com/pypa/virtualenv/pull/2686 has a bug in the fact that the symlink at dest_venv / 'bin' / 'python' points to sys.base_prefix / 'bin' / 'python', and not the resolved symlink location of sys.base_prefix. Therefore the home value is inconsistent with the symlink that is created by virtualenv. The test validates this.

Implications

There are two issues which https://github.com/pypa/virtualenv/pull/2686 closed:

To be honest, I'm not sure what https://github.com/pypa/virtualenv/issues/2682 is asking for. Perhaps is is a request for behaviour that is different for venv and also the std library (wrt. sys.base_prefix) @mayeut.

For https://github.com/pypa/virtualenv/issues/2684, I also don't fully understand the reason for this being a virtualenv issue (this is my problem, not a problem with the issue itself) - and somebody who knows what the correct behaviour should be would need to chime in (perhaps @ofek?).

pfmoore commented 1 month ago

It seems to me there are two separate use cases related to symlinks (as well as #2684, which is related to canonicalisation of filenames, rather than symlinks).

In #2682, the use case is running /foo/bar/bin/python virtualenv.py env_name, where /foo/bar/bin/python is a symlink to /base_python/bin/python (in particular, note that there's no Python environment in /foo/bar, there's just the symlink in bin). In that case, you have to resolve the symlink to find the actual base environment, where the lib directory and other parts of the environment are situated.

In the case here, though, you are running /foo/bar/bin/python virtualenv.py env_name, where /foo/bar is itself a symlink to a full Python environment. In that case, the environment can be referenced either by its base name, or via the symlink /foo/bar/{bin,lib,...}. For your use case, you want the symlink to be retained, to allow portability when the symlink is constant, but the linked-to environment changes.

To be honest, I don't think there's a clear "one size fits all" solution here. We could say that if the python executable itself is a symlink, then resolve it, but if a component is a symlink, then don't. But that seems fairly hacky, and prone to errors if there are situations we haven't considered.

My personal experience with symlinks is on Windows, where if an exe loads a DLL saved alongside it, a symlink won't run as the OS doesn't follow the symlink to find the DLL. So by analogy with that, I'd have been inclined to say that #2682 wasn't a bug, but was simply a consequence of how symlinks work when trying to find dependent files relative to the executable. But I'm clearly in a minority here, as #2682 was accepted and fixed.

I don't have a good answer here, except to say that we should probably follow the behaviour of the core venv module, simply because in the absence of an obviously correct rule, consistency is key. Then, if anyone wants a change to the rules, they should start with venv rather than here.

ofek commented 1 month ago

I don't have a good answer here, except to say that we should probably follow the behaviour of the core venv module, simply because in the absence of an obviously correct rule, consistency is key. Then, if anyone wants a change to the rules, they should start with venv rather than here.

This is interesting, I thought it was other way around and we do novel stuff here which venv may or may not choose to incorporate. One such example is virtualenv being way faster.

pfmoore commented 1 month ago

I'd say that behaviour should be consistent, but quality of life details (like performance, additional shell integrations) are what distinguishes virtualenv. So venv defines the base mechanism, we make it more user friendly.

mayeut commented 1 month ago

I thought it was other way around and we do novel stuff here which venv may or may not choose to incorporate

I thought the same.

I don't have a good answer here, except to say that we should probably follow the behaviour of the core venv module, simply because in the absence of an obviously correct rule, consistency is key. Then, if anyone wants a change to the rules, they should start with venv rather than here.

Issues were filed in both cpython & virtualenv (with a link to the cpython issue). While the cpython issue is still opened, the virtualenv one was "fixed" introducing a regression in other use-cases.

If the aim of virtualenv is to be bug to bug compatible with venv (i.e. consistent behavior), which might differ depending on the python minor version or the python implementation, then so be it. In that case, the bug report template in virtualenv should probably be updated to reflect just that.

To be honest, I don't think there's a clear "one size fits all" solution here. We could say that if the python executable itself is a symlink, then resolve it, but if a component is a symlink, then don't. But that seems fairly hacky, and prone to errors if there are situations we haven't considered.

As of now, there are 3 identified use-cases related to symlinks that might be addressed by the hack mentioned (I don't know enough about the Windows one to be sure nothing else would be required). IMHO, living with the status-quo of one or the other being broken does not seem beneficial to the ecosystem as a whole. At least those 3 are identified and can be unit-tested (but, per consistent behavior with venv, this might have to wait on the cpython issue being addressed first).