python / cpython

The Python programming language
https://www.python.org

Make pyvenv style virtual environments easier to configure when embedding Python #66409

Open fe491b48-23c2-4033-aaa2-1a6613895466 opened 10 years ago

fe491b48-23c2-4033-aaa2-1a6613895466 commented 10 years ago
BPO 22213
Nosy @ncoghlan, @pitrou, @vstinner, @methane, @ericsnowcurrently, @zooba, @ndjensen, @LeslieGerman, @M-Kerr, @abrunner73
Dependencies
  • bpo-22257: PEP 432 (PEP 587): Redesign the interpreter startup sequence
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature', '3.8']
    title = 'Make pyvenv style virtual environments easier to configure when embedding Python'
    updated_at =
    user = 'https://bugs.python.org/grahamd'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'ndjensen'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation =
    creator = 'grahamd'
    dependencies = ['22257']
    files = []
    hgrepos = []
    issue_num = 22213
    keywords = []
    message_count = 31.0
    messages = ['225434', '225436', '225437', '225739', '225742', '225771', '225774', '225890', '334926', '334948', '335015', '335468', '335470', '335479', '335484', '335648', '335650', '335688', '335692', '335749', '336793', '343636', '352905', '354856', '354857', '354858', '361600', '361869', '362260', '366570', '384496']
    nosy_count = 13.0
    nosy_names = ['ncoghlan', 'pitrou', 'vstinner', 'pyscripter', 'grahamd', 'methane', 'eric.snow', 'steve.dower', 'Henning.von.Bargen', 'ndjensen', 'Leslie', 'M.Kerr', 'abrunner73']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22213'
    versions = ['Python 3.8']
    ```

    fe491b48-23c2-4033-aaa2-1a6613895466 commented 10 years ago

    In an embedded system the 'python' executable itself is not run; the Python interpreter is initialised in process explicitly using Py_Initialize(). To find the location of the Python installation, an elaborate sequence of checks is run, as implemented in calculate_path() of Modules/getpath.c.

    The primary mechanism is usually to search for a 'python' executable on PATH and use that as a starting point. From that it then back tracks up the file system from the bin directory to arrive at what would be the perceived equivalent of PYTHONHOME. The lib/pythonX.Y directory under that for the matching version X.Y of Python being initialised would then be used.
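
    The back-tracking described above can be sketched in Python as follows. This is a simplified illustration and not the logic of Modules/getpath.c itself: the function name find_prefix is made up for the example, and lib/pythonX.Y/os.py is assumed to be the landmark file that identifies an installation.

```python
import os
import shutil


def find_prefix(executable, version="3.4", landmark="os.py"):
    """Walk up from the directory holding `executable` until a directory
    containing lib/python<version>/<landmark> is found.

    Returns the matching directory, or None if the filesystem root is
    reached without finding the landmark.
    """
    current = os.path.dirname(os.path.abspath(executable))
    while True:
        candidate = os.path.join(current, "lib", "python" + version, landmark)
        if os.path.isfile(candidate):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return None
        current = parent


if __name__ == "__main__":
    # Start from whichever 'python3' happens to be first on PATH. The result
    # depends only on where that executable lives, not on which shared
    # library the embedding process actually loaded.
    exe = shutil.which("python3")
    if exe is not None:
        print(exe, "->", find_prefix(exe, version="3.4"))
```

    Because the search starts from the executable found on PATH, the answer can disagree with the library that was linked, which is the failure mode described next.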

    Problems can often occur with the way this search is done though.

    For example, suppose someone is not using the system Python installation but has installed a different version of Python under /usr/local. At run time, the correct Python shared library would be loaded from /usr/local/lib, but because the 'python' executable is found in /usr/bin, /usr is used as sys.prefix instead of /usr/local.

    This can cause two distinct problems.

    The first is that there is no Python installation at all under /usr corresponding to the Python version which was embedded, so the interpreter cannot import the 'site' module and therefore fails.

    The second is that there is a Python installation of the same major/minor but potentially a different patch revision, or compiled with different binary API flags or different Unicode character width. The Python interpreter in this case may well be able to start up, but the mismatch in the Python modules or extension modules and the core Python library that was actually linked can cause odd errors or crashes to occur.

    Anyway, that is the background.

    For an embedded system the way this problem was overcome was for it to use Py_SetPythonHome() to forcibly override what should be used for PYTHONHOME so that the correct installation was found and used at runtime.

    Now this would work quite happily even for Python virtual environments constructed using 'virtualenv' allowing the embedded system to be run in that separate virtual environment distinct from the main Python installation it was created from.

    Although this works for Python virtual environments created using 'virtualenv', it doesn't work if the virtual environment was created using pyvenv.

    One can easily illustrate the problem without even using an embedded system.

    $ which python3.4
    /Library/Frameworks/Python.framework/Versions/3.4/bin/python3.4
    
    $ pyvenv-3.4 py34-pyvenv
    
    $ py34-pyvenv/bin/python
    Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.prefix
    '/private/tmp/py34-pyvenv'
    >>> sys.path
    ['', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin', '/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload', '/private/tmp/py34-pyvenv/lib/python3.4/site-packages']
    
    $ PYTHONHOME=/tmp/py34-pyvenv python3.4
    Fatal Python error: Py_Initialize: unable to load the file system codec
    ImportError: No module named 'encodings'
    Abort trap: 6

    The basic problem is that in a pyvenv virtual environment, there is no duplication of stuff in lib/pythonX.Y, with the only thing in there being the site-packages directory.

    When you start up the 'python' executable direct from the pyvenv virtual environment, the startup sequence checks know this and consult the pyvenv.cfg to extract the:

    home = /Library/Frameworks/Python.framework/Versions/3.4/bin

    setting and from that derive where the actual run time files are.

    When PYTHONHOME or Py_SetPythonHome() is used, then the getpath.c checks blindly believe that is the authoritative value:

        /* If PYTHONHOME is set, we believe it unconditionally */
        if (home) {
            wchar_t *delim;
            wcsncpy(prefix, home, MAXPATHLEN);
            prefix[MAXPATHLEN] = L'\0';
            delim = wcschr(prefix, DELIM);
            if (delim)
                *delim = L'\0';
            joinpath(prefix, lib_python);
            joinpath(prefix, LANDMARK);
            return 1;
        }
    Because of this, the problem above occurs as the proper runtime directories for files aren't included in sys.path. As a result, the 'encodings' module cannot even be found.

    What I believe should occur is that PYTHONHOME should not be believed unconditionally. Instead there should be a check to see if that directory contains a pyvenv.cfg file and if there is one, realise it is a pyvenv style virtual environment and do the same sort of adjustments which would be made based on looking at what that pyvenv.cfg file contains.
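
    A minimal sketch of the check proposed above, written in Python for brevity rather than in the C of getpath.c. The helper names read_venv_home and effective_home are hypothetical, the parsing assumes simple "key = value" lines like the pyvenv.cfg entry shown earlier, and the base prefix is assumed to be the parent of the bin directory named by 'home' (a POSIX layout).

```python
import os


def read_venv_home(directory):
    """Return the 'home' value from <directory>/pyvenv.cfg, or None."""
    cfg = os.path.join(directory, "pyvenv.cfg")
    try:
        with open(cfg, "r", encoding="utf-8") as f:
            for line in f:
                key, sep, value = line.partition("=")
                if sep and key.strip() == "home":
                    return value.strip()
    except FileNotFoundError:
        pass
    return None


def effective_home(python_home):
    """Map a PYTHONHOME value to the prefix that holds the standard library.

    If python_home is a pyvenv-style environment, 'home' names the bin
    directory of the base installation, so its parent is taken as the
    base prefix (an assumption of this sketch). Otherwise python_home is
    believed as-is, matching the current behaviour.
    """
    venv_home = read_venv_home(python_home)
    if venv_home is None:
        return python_home
    return os.path.dirname(venv_home.rstrip(os.sep))
```

    With the pyvenv.cfg from the example above, effective_home('/tmp/py34-pyvenv') would yield /Library/Frameworks/Python.framework/Versions/3.4, which is where the lib/python3.4 entries in sys.path live.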

    For the record this issue is affecting Apache/mod_wsgi and right now the only workaround I have is to tell people, in addition to setting the configuration setting corresponding to PYTHONHOME, to use configuration settings that have the same effect as doing:

    PYTHONPATH=/Library/Frameworks/Python.framework/Versions/3.4/lib/python34.zip:/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4:/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/plat-darwin:/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/lib-dynload

    so that the correct runtime files are found.
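
    For illustration, the PYTHONPATH value above could be assembled from the base prefix as follows. The function name stdlib_pythonpath is hypothetical, and the list of subdirectories is simply copied from the sys.path output shown earlier for Python 3.4 on OS X; other versions and platforms may need different entries.

```python
def stdlib_pythonpath(base_prefix, major=3, minor=4, platform_dir="plat-darwin"):
    """Build a colon-separated PYTHONPATH covering the standard library
    directories of the base installation at base_prefix."""
    lib = "{}/lib".format(base_prefix)
    versioned = "{}/python{}.{}".format(lib, major, minor)
    entries = [
        "{}/python{}{}.zip".format(lib, major, minor),  # zipped stdlib
        versioned,                                       # pure Python modules
        "{}/{}".format(versioned, platform_dir),         # platform-specific modules
        "{}/lib-dynload".format(versioned),              # extension modules
    ]
    return ":".join(entries)


print(stdlib_pythonpath("/Library/Frameworks/Python.framework/Versions/3.4"))
```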

    I am still trying to work out a more permanent workaround I can add to the mod_wsgi code itself, since I can't rely on a fix for existing Python versions with pyvenv support.

    Only other option is to tell people not to use pyvenv and use virtualenv instead.

    Right now I can offer no actual patch, as that getpath.c code is scary enough that I'm not even sure at this point where the check should be incorporated or how.

    The only thing I can surmise is that, because the current check for pyvenv.cfg comes before the search for the prefix, it isn't consulted.

    /* Search for an environment configuration file, first in the
       executable's directory and then in the parent directory.
       If found, open it for use when searching for prefixes.
    */
    
    {
        wchar_t tmpbuffer[MAXPATHLEN+1];
        wchar_t *env_cfg = L"pyvenv.cfg";
        FILE * env_file = NULL;

        wcscpy(tmpbuffer, argv0_path);
        joinpath(tmpbuffer, env_cfg);
        env_file = _Py_wfopen(tmpbuffer, L"r");
        if (env_file == NULL) {
            errno = 0;
            reduce(tmpbuffer);
            reduce(tmpbuffer);
            joinpath(tmpbuffer, env_cfg);
            env_file = _Py_wfopen(tmpbuffer, L"r");
            if (env_file == NULL) {
                errno = 0;
            }
        }
        if (env_file != NULL) {
            /* Look for a 'home' variable and set argv0_path to it, if found */
            if (find_env_config_value(env_file, L"home", tmpbuffer)) {
                wcscpy(argv0_path, tmpbuffer);
            }
            fclose(env_file);
            env_file = NULL;
        }
    }

    pfound = search_for_prefix(argv0_path, home, _prefix, lib_python);

    ncoghlan commented 10 years ago

    Yeah, PEP-432 (my proposal to redesign the startup sequence) could just as well be subtitled "getpath.c hurts my brain" :P

    One tricky part here is going to be figuring out how to test this - perhaps adding a new test option to _testembed and then running it both inside and outside a venv.

    ncoghlan commented 10 years ago

    Graham pointed out that setting PYTHONHOME ends up triggering the same control flow through getpath.c as calling Py_SetPythonHome, so this can be tested just with pyvenv and a suitably configured environment.

    It may still be a little tricky though, since we normally run the pyvenv tests in isolated mode to avoid spurious failures due to bad environment settings...

    ncoghlan commented 10 years ago

    Some more experiments, comparing an installed vs uninstalled Python. One failure mode is that setting PYTHONHOME just plain breaks running from a source checkout (setting PYTHONHOME to the checkout directory also fails):

    $ ./python -m venv --without-pip /tmp/issue22213-py35
    
    $ /tmp/issue22213-py35/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /usr/local
    
    $ PYTHONHOME=/usr/local /tmp/issue22213-py35/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    Fatal Python error: Py_Initialize: Unable to get the locale encoding
    ImportError: No module named 'encodings'
    Aborted (core dumped)

    Trying after running "make altinstall" (which I had previously done for 3.4) is a bit more enlightening:

    $ python3.4 -m venv --without-pip /tmp/issue22213-py34
    
    $ /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /usr/local
    
    $ PYTHONHOME=/usr/local /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /usr/local
    
    $ PYTHONHOME=/tmp/issue22213-py34 /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    Fatal Python error: Py_Initialize: Unable to get the locale encoding
    ImportError: No module named 'encodings'
    Aborted (core dumped)
    
    $ PYTHONHOME=/tmp/issue22213-py34:/usr/local /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    Fatal Python error: Py_Initialize: Unable to get the locale encoding
    ImportError: No module named 'encodings'
    Aborted (core dumped)
    
    $ PYTHONHOME=/usr/local:/tmp/issue22213-py34/bin /tmp/issue22213-py34/bin/python -c "import sys; print(sys.base_prefix, sys.base_exec_prefix)"
    /usr/local /tmp/issue22213-py34/bin

    I think what this is actually showing is that there's a fundamental conflict between mod_wsgi's expectation of being able to set PYTHONHOME to point to the virtual environment, and the way PEP-405 virtual environments actually work.

    With PEP-405, all the operations in getpath.c expect to execute while pointing to the *base* environment: where the standard library lives. It is then up to site.py to later adjust the prefix location to point at the virtual environment, as can be demonstrated by the fact that pyvenv.cfg isn't processed if processing the site module is disabled:

    $ /tmp/issue22213-py34/bin/python -c "import sys; print(sys.prefix, sys.exec_prefix)"
    /tmp/issue22213-py34 /tmp/issue22213-py34
    $ /tmp/issue22213-py34/bin/python -S -c "import sys; print(sys.prefix, sys.exec_prefix)"
    /usr/local /usr/local
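
    The same distinction can be observed from inside a running interpreter. A small sketch, assuming Python 3.3 or later where sys.base_prefix exists; the function name in_virtual_environment is made up for this example:

```python
import sys


def in_virtual_environment():
    """Return True when sys.prefix has been redirected to a venv.

    In a PEP 405 environment sys.prefix points at the venv while
    sys.base_prefix still points at the installation that holds the
    standard library. In the 3.4 sessions shown above, running with -S
    leaves the two equal because site.py never gets to adjust them.
    """
    return sys.prefix != sys.base_prefix


print(sys.prefix, sys.base_prefix, in_virtual_environment())
```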

    At this point in time, there isn't an easy way for an embedding application to say "here's the standard library, here's the virtual environment with user packages" - it's necessary to just override the path calculations entirely.

    Allowing that kind of more granular configuration is one of the design goals of PEP-432, so adding that as a dependency here.

    fe491b48-23c2-4033-aaa2-1a6613895466 commented 10 years ago

    It is actually very easy for me to work around and I released a new mod_wsgi version today which works.

    When I get a Python home option, instead of calling Py_SetPythonHome() with it, I append '/bin/python' to it and call Py_SetProgramName() instead.

    ncoghlan commented 10 years ago

    Excellent! If I recall correctly, that works because we resolve the symlink when looking for the standard library, but not when looking for the venv configuration file.
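
    To make that recollection concrete, here is a toy model in Python. It is not getpath.c's actual logic: the functions locate and program_name_for are hypothetical, and the two lookups are assumptions chosen to mirror the description above, where the venv configuration file is looked up next to the unresolved program name while the base installation is found from the symlink's target.

```python
import os


def program_name_for(python_home):
    """Mirror the mod_wsgi workaround: derive a program name from the
    Python home option by appending '/bin/python'."""
    return python_home + "/bin/python"


def locate(program_name):
    """Return (venv_cfg, base_bin_dir) for a program name such as
    <venv>/bin/python.

    venv_cfg is searched for relative to the path as given (symlink not
    resolved); base_bin_dir comes from the resolved symlink target.
    """
    bin_dir = os.path.dirname(program_name)
    venv_cfg = None
    # Check the executable's directory, then its parent, for pyvenv.cfg.
    for directory in (bin_dir, os.path.dirname(bin_dir)):
        candidate = os.path.join(directory, "pyvenv.cfg")
        if os.path.isfile(candidate):
            venv_cfg = candidate
            break
    base_bin_dir = os.path.dirname(os.path.realpath(program_name))
    return venv_cfg, base_bin_dir
```

    In this model, passing the venv's bin/python yields both the venv's pyvenv.cfg and the base installation's bin directory, whereas a bare home directory gives the startup code only one of the two locations.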

    I also suspect this is all thoroughly broken on Windows - there are so many configuration operations and platform specific considerations that need to be accounted for in getpath.c these days that it has become close to incomprehensible :(

    One of my main goals with PEP-432 is actually to make it possible to rewrite the path configuration code in a more maintainable way - my unofficial subtitle for that PEP is "getpath.c must die!" :)

    fe491b48-23c2-4033-aaa2-1a6613895466 commented 10 years ago

    I only make the change to Py_SetProgramName() on UNIX and not Windows. This is because back in mod_wsgi 1.0 I actually did use Py_SetProgramName(), but it didn't seem to work in a sane way on Windows, so I changed to Py_SetPythonHome(), which worked on both Windows and UNIX. The latest versions of mod_wsgi haven't been updated yet to even build on Windows, so I'm not caring about Windows right now.

    pitrou commented 10 years ago

    That workaround would definitely deserve being wrapped in a higher-level API invokable by embedding applications, IMHO.

    ncoghlan commented 5 years ago

    (Added Victor, Eric, and Steve to the nosy list here, as I'd actually forgotten about this until issue bpo-35706 reminded me)

    Core of the problem: the embedding APIs don't currently offer a Windows-compatible way of setting up "use this base Python and this venv site-packages", and the way of getting it to work on other platforms is pretty obscure.

    zooba commented 5 years ago

    Victor may be thinking about it from time to time (or perhaps it's time to make the rest of the configuration changes plans concrete so we can all help out?), but I'd like to see this as either:

    In the latter case, the main python.exe also gets to define its behavior. So for the most part, we should be able to remove getpath[p].c and move it into the site module, then make that our Python initialization step.

    This would also mean that if you are embedding Python but not allowing imports (e.g. as only a calculation engine), you don't have to do the dance of _denying_ all lookups, you simply don't initialize them.

    But as far as I know, we don't have a concrete vision for "how will consumers embed Python in their apps" that can translate into work - we're still all individually pulling in slightly different directions. Sorting that out is most important - having someone willing to do the customer engagement work to define an actual set of requirements and desirables would be fantastic.

    ncoghlan commented 5 years ago

    Yeah, I mainly cc'ed Victor and Eric since making this easier ties into one of the original design goals for PEP-432 (even though I haven't managed to persuade either of them to become co-authors of that PEP yet).

    vstinner commented 5 years ago

    PEP-432 will allow fine control over the parameters used to initialize Python. Sadly, I failed to agree with Nick Coghlan and Eric Snow on the API. The current implementation (_PyCoreConfig and _PyMainInterpreterConfig) has some flaws (it doesn't clearly separate the early initialization and Unicode-ready state, the interpreter contains main and core config whereas some options are duplicated in both configs, etc.).

    See also bpo-35706.

    zooba commented 5 years ago

    I just closed 35706 as a duplicate of this one (the titles are basically identical, which feels like a good hint ;) )

    It seems that the disagreement about the design is fundamentally a disagreement between a "quick, painful but complete fix" and "slow, careful improvements with a transition period". Both are valid approaches, and since Victor is putting actual effort in right now he gets to "win", but I do think we can afford to move faster.

    It seems the main people who will suffer from the pain here are embedders (who are already suffering pain) and the core developers (who explicitly signed up for pain!). But without knowing the end goal, we can't accelerate.

    Currently PEP-432 is the best description we have, and it looks like Victor has been heading in that direction too (deliberately? I don't know :) ). But it seems like a good time to review it, replace the "here's the current state of things" with "here's an imaginary ideal state of things" and fill the rest with "here are the steps to get there without breaking the world".

    By necessity, it touches a lot of people's contributions to Python, but it also has the potential to seriously improve even more people's ability to _use_ Python (for example, I know an app that you all would recognize the name of who is working on embedding Python right now and would _love_ certain parts of this side of things to be improved).

    Nick - has the steering council been thinking about ways to promote collaborative development of ideas like this? I'm thinking an Etherpad style environment for the brainstorm period (in lieu of an in-person whiteboard session) that's easy for us all to add our concerns to, that can then be turned into something more formal.

    Nick, Victor, Eric, (others?) - are you interested in having a virtual whiteboard session to brainstorm how the "perfect" initialization looks? And probably a follow-up to brainstorm how to get there without breaking the world? I don't think we're going to get to be in the same room anytime before the language summit, and it would be awesome to have something concrete to discuss there.

    vstinner commented 5 years ago

    It seems that the disagreement about the design is fundamentally a disagreement between a "quick, painful but complete fix" and "slow, careful improvements with a transition period". Both are valid approaches, and since Victor is putting actual effort in right now he gets to "win", but I do think we can afford to move faster.

    Technically, the API already exists and is exposed as a private API:

    I'm not really sure of the benefit compared to the current initialization API using Py_xxx global configuration variables (ex: Py_IgnoreEnvironmentFlag) and Py_Initialize().

    _PyCoreConfig basically exposes *all* input parameters used to initialize Python, much more than just the global configuration variables and the few functions that can be called before Py_Initialize(): https://docs.python.org/dev/c-api/init.html

    Currently PEP-432 is the best description we have, and it looks like Victor has been heading in that direction too (deliberately? I don't know :) ).

    Well, it's a strange story. At the beginning, I had a very simple use case... it took me more or less one year to implement it :-) My use case was to add... a new -X utf8 command line option:

    If the utf8 mode is enabled (PEP-540), the encoding must be set to UTF-8, all configuration must be removed and the whole configuration (env vars, cmdline, etc.) must be read again from scratch :-)

    To be able to do that, I had to collect *every single* thing which has an impact on the Python initialization: all things that I moved into _PyCoreConfig.

    ... but I didn't want to break the backward compatibility, so I had to keep support for Py_xxx global configuration variables... and also the few initialization functions like Py_SetPath() or Py_SetStandardStreamEncoding().

    Later it becomes very dark, my goal became very unclear and I looked at the PEP-432 :-)

    Well, I wanted to expose _PyCoreConfig somehow, so I looked at the PEP-432 to see how it can be exposed.

    By necessity, it touches a lot of people's contributions to Python, but it also has the potential to seriously improve even more people's ability to _use_ Python (for example, I know an app that you all would recognize the name of who is working on embedding Python right now and would _love_ certain parts of this side of things to be improved).

    _PyCoreConfig "API" makes some things way simpler. Maybe it was already possible to do them previously but it was really hard, or maybe it was just not possible.

    If a _PyCoreConfig field is set, it has priority over any other way to initialize the field. _PyCoreConfig has the highest priority.

    For example, _PyCoreConfig allows completely bypassing the code which computes sys.path (and related variables) by directly setting the "path configuration":

    The code which initializes these fields is really complex. Without _PyCoreConfig, it's hard to make sure that these fields are properly initialized as an embedder would like.

    Nick, Victor, Eric, (others?) - are you interested in having a virtual whiteboard session to brainstorm how the "perfect" initialization looks? And probably a follow-up to brainstorm how to get there without breaking the world? I don't think we're going to get to be in the same room anytime before the language summit, and it would be awesome to have something concrete to discuss there.

    Sorry, I'm not sure of the API / structures, but when I discussed with Eric Snow at the latest sprint, we identified different steps in the Python initialization:

    --

    Once I experimented to reorganize _PyCoreConfig and _PyMainInterpreterConfig to avoid redundancy: add a _PyPreConfig which contains only fields which are needed before _PyMainInterpreterConfig. With that change, _PyMainInterpreterConfig (and _PyPreConfig) *contained* _PyCoreConfig.

    But the change became very large and I wasn't sure that it was a good idea, so I abandoned it.

    --

    Ok, something else. _PyCoreConfig (and _PyMainInterpreterConfig) contain memory allocated on the heap. Problem: Python initialization changes the memory allocator. Code using _PyCoreConfig requires some "tricks" to ensure that the memory is *freed* with the same allocator used to *allocate* memory.

    I created bpo-35265 "Internal C API: pass the memory allocator in a context" to pass a "context" to a lot of functions, context which contains the memory allocator but can contain more things later.

    The idea of "a context" came during the discussion about a new C API: stop relying on any global variable or "shared state", but *explicitly* pass a context to all functions. With that, it becomes possible to imagine having two interpreters running in the same thread "at the same time".

    Honestly, I'm not really sure that it's fully possible to implement this idea... Python has *so many* "shared states", like *everywhere*. It's really a giant project to move these shared states into structures and pass pointers to these structures.

    So again, I abandoned my experimental change: https://github.com/python/cpython/pull/10574

    --

    Memory allocator, context, different structures for configuration... it's really not an easy topic :-( There are so many constraints put into a single API!

    The conservative option at this point is to keep the API private.

    ... Maybe we can explain how to use the private API but very explicitly warn that this API is experimental and can be broken anytime... And I plan to break it, to avoid redundancy between core and main configuration for example.

    ... I hope that these explanations give you a better idea of the big picture and the challenges :-)

    zooba commented 5 years ago

    Thanks, Victor, that's great information.

    Memory allocator, context, different structures for configuration... it's really not an easy topic :-( There are so many constraints put into a single API!

    This is why I'm keen to design the ideal *user* API first (that is, write the examples of how you would use it) and then figure out how we can make it fit. It's kind of the opposite approach from what you've been doing to adapt the existing code to suit particular needs.

    For example, imagine instead of all the PySet*() functions followed by Py_Initialize() you could do this:

        PyObject *runtime = PyRuntime_Create();
        /* optional calls */
        PyRuntime_SetAllocators(runtime, &my_malloc, &my_realloc, &my_free);
        PyRuntime_SetHashSeed(runtime, 12345);

        /* sets this as the current runtime via a thread local */
        auto old_runtime = PyRuntime_Activate(runtime);
        assert(old_runtime == NULL);

        /* pretend triple quoting works in C for a minute ;) */
        const char *init_code = """
        import os.path
        import sys

        sys.executable = argv0
        sys.prefix = os.path.dirname(argv0)
        sys.path = [os.getcwd(), sys.prefix, os.path.join(sys.prefix, "Lib")]

        pyvenv = os.path.join(sys.prefix, "pyvenv.cfg")
        try:
            with open(pyvenv, "r", encoding="utf-8") as f:  # *only* utf-8 support at this stage
                for line in f:
                    if line.startswith("home"):
                        sys.path.append(line.partition("=")[2].strip())
                        break
        except FileNotFoundError:
            pass

        if sys.platform == "win32":
            sys.stdout = open("CONOUT$", "w", encoding="utf-8")
        else:
            # no idea if this is right, but you get the idea
            sys.stdout = open("/dev/tty", "w", encoding="utf-8")
        """;

        PyObject *globals = PyDict_New();
        /* only UTF-8 support at this stage */
        PyDict_SetItemString(globals, "argv0", PyUnicode_FromString(argv[0]));
        PyRuntime_Initialize(runtime, init_code, globals);
        Py_DECREF(globals);

        /* now we've initialised, loading codecs will succeed if we can find them or fail if not,
         * so we'd have to do cleanup to avoid depending on them without the user being able to
         * avoid it... */
        PyEval_EvalString("open('file.txt', 'w', encoding='gb18030').close()");

        /* may as well reuse DECREF for consistency */
        Py_DECREF(runtime);

    Maybe it's a terrible idea? Honestly I'd be inclined to do other big changes at the same time (make PyObject opaque and interface driven, for example).

    My point is that if the goal is to "move the existing internals around" then that's all we'll ever achieve. If we can say "the goal is to make this example work" then we'll be able to do much more.

    ericsnowcurrently commented 5 years ago

    On Wed, Feb 13, 2019 at 10:56 AM Steve Dower <report@bugs.python.org> wrote:

    Nick, Victor, Eric, (others?) - are you interested in having a virtual whiteboard session to brainstorm how the "perfect" initialization looks? And probably a follow-up to brainstorm how to get there without breaking the world? I don't think we're going to get to be in the same room anytime before the language summit, and it would be awesome to have something concrete to discuss there.

    Count me in. This is a pretty important topic and doing this would help accelerate our efforts by giving us a clearer common understanding and objective. FWIW, I plan on spending at least 5 minutes of my 25 minute PyCon talk on our efforts to fix up the C-API, and this runtime initialization stuff is an important piece.

    ericsnowcurrently commented 5 years ago

    On Wed, Feb 13, 2019 at 5:09 PM Steve Dower <report@bugs.python.org> wrote:

    This is why I'm keen to design the ideal *user* API first (that is, write the examples of how you would use it) and then figure out how we can make it fit. It's kind of the opposite approach from what you've been doing to adapt the existing code to suit particular needs.

    That makes sense. :)

    For example, imagine instead of all the PySet*() functions followed by Py_Initialize() you could do this:

    PyObject *runtime = PyRuntime_Create();

    FYI, we already have a _PyRuntimeState struct (see Include/internal/pycore_pystate.h) which is where I pulled in a lot of the static globals last year. Now there is one process-global _PyRuntime (created in Python/pylifecycle.c) in place of all those globals. Note that _PyRuntimeState is in parallel with PyInterpreterState, so not a PyObject.

    /* optional calls */
    PyRuntime_SetAllocators(runtime, &my_malloc, &my_realloc, &my_free);
    PyRuntime_SetHashSeed(runtime, 12345);

    Note that one motivation behind PEP-432 (and its config structs) is to keep all the config together. Having the one struct means you always clearly see what your options are. Another motivation is to keep the config (dense with public fields) separate from the actual run state (opaque). Having a bunch of config functions (and global variables in the status quo) means a lot more surface area to deal with when embedding, as opposed to 2 config structs + a few initialization functions (and a couple of helpers) like in PEP-432.

    I don't know that you consciously intended to move away from the dense config struct route, so I figured I'd be clear. :)

    /* sets this as the current runtime via a thread local */
    auto old_runtime = PyRuntime_Activate(runtime);
    assert(old_runtime == NULL);

    Hmm, there are two ways we could go with this: keep using TLS (or static global in the case of _PyRuntime) or update the C-API to require explicitly passing the context (e.g. runtime, interp, tstate, or some wrapper) into all the functions that need it. Of course, changing that would definitely need some kind of compatibility shim to avoid requiring massive changes to every extension out there, which would mean effectively 2 C-APIs mirroring each other. So sticking with TLS is simpler. Personally, I'd prefer going the explicit argument route.
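
    The two options can be contrasted with a small Python model. Everything here is hypothetical (Runtime, activate, run_implicit, run_explicit); it only illustrates the shape of the two calling conventions and is not an existing or proposed C-API.

```python
import threading


class Runtime:
    """Stand-in for an interpreter runtime; it just records what it ran."""

    def __init__(self, name):
        self.name = name
        self.executed = []


# Option 1: the current runtime lives in thread-local storage.
_tls = threading.local()


def activate(runtime):
    """Make runtime current for this thread and return the previous one."""
    previous = getattr(_tls, "runtime", None)
    _tls.runtime = runtime
    return previous


def run_implicit(code):
    """Implicit-context call: the runtime is looked up behind the scenes."""
    runtime = getattr(_tls, "runtime", None)
    if runtime is None:
        raise RuntimeError("no runtime is active on this thread")
    runtime.executed.append(code)


# Option 2: every call names the runtime it operates on.
def run_explicit(runtime, code):
    """Explicit-context call: no hidden state, so two runtimes can be
    used from the same thread without switching."""
    runtime.executed.append(code)


a, b = Runtime("a"), Runtime("b")
assert activate(a) is None      # nothing was active before
run_implicit("x = 1")           # recorded on a, the active runtime
run_explicit(b, "y = 2")        # recorded on b while a stays active
print(a.executed, b.executed)
```

    The implicit form keeps existing call signatures unchanged, while the explicit form changes every signature, which is the compatibility cost mentioned above.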

    /* pretend triple quoting works in C for a minute ;) */
    const char *init_code = """

    [snip] """;

    PyObject *globals = PyDict_New();
    /* only UTF-8 support at this stage */
    PyDict_SetItemString(globals, "argv0", PyUnicode_FromString(argv[0]));
    PyRuntime_Initialize(runtime, init_code, globals);

    Nice. I like that this keeps the init code right by where it's used, while also making it much more concise and easier to follow (since it's Python code).

    PyEval_EvalString("open('file.txt', 'w', encoding='gb18030').close()");

    I definitely like the approach of directly embedding the Python code like this. :) Are there any downsides?

    Maybe it's a terrible idea?

    Nah, we definitely want to maximize simplicity and your example offers a good shift in that direction. :)

    Honestly I'd be inclined to do other big changes at the same time (make PyObject opaque and interface driven, for example).

    Definitely! Those aren't big blockers on cleaning up initialization though, are they?

    My point is that if the goal is to "move the existing internals around" then that's all we'll ever achieve. If we can say "the goal is to make this example work" then we'll be able to do much more.

    Yep. I suppose part of the problem is that the embedding use cases aren't understood (or even recognized) well enough.

    ncoghlan commented 5 years ago

    Steve, you're describing the goals of PEP-432 - design the desired API, then write the code to implement it. So while Victor's goal was specifically to get PEP-540 implemented, mine was just to make it so working on the startup sequence was less awful (and in particular, to make it possible to rewrite getpath.c in Python at some point).

    Unfortunately, it turns out that redesigning a going-on-thirty-year-old startup sequence takes a while, as we first have to discover what all the global settings actually *are* :)

    https://www.python.org/dev/peps/pep-0432/#invocation-of-phases describes an older iteration of the draft API design that was reasonably accurate at the point where Eric merged the in-development refactoring as a private API (see https://bugs.python.org/issue22257 and https://www.python.org/dev/peps/pep-0432/#implementation-strategy for details).

    However, that initial change was basically just a skeleton - we didn't migrate many of the settings over to the new system at that point (although we did successfully split the import system initialization into two parts, so you can enable builtin and frozen imports without necessarily enabling external ones).

    The significant contribution that Victor then made was to actually start migrating settings into the new structure, adapting it as needed based on the goals of PEP-540.

    Eric updated quite a few more internal APIs as he worked on improving the subinterpreter support.

    Between us, we also made a number of improvements to https://docs.python.org/3/c-api/init.html based on what we learned in the process of making those changes.

    So I'm completely open to changing the details of the API that PEP-432 is proposing, but the main lesson we've learned from what we've done so far is that CPython's long history of embedding support *does* constrain what we can do in practice, so it's necessary to account for feasibility of implementation when considering what we want to offer.

    Ideally, the next step would be to update PEP-432 with a status report on what was learned in the development of Python 3.7 with the new configuration structures, and what the internal startup APIs actually look like now. Even though I reviewed quite a few of Victor and Eric's PRs, even I don't have a clear overall picture of where we are now, and I suspect Victor and Eric are in a similar situation.

    ncoghlan commented 5 years ago

    Note also that Eric and I haven't failed to agree with Victor on an API, as Victor hasn't actually written a concrete proposal *for* a public API (neither as a PR updating PEP-432, nor as a separate PEP).

    The current implementation does NOT follow the PEP as written, because _Py_CoreConfig ended up with all the settings in it, when it's supposed to be just the bare minimum needed to get the interpreter to a point where it can run Python code that only accesses builtin and frozen modules.

    ncoghlan commented 5 years ago

    Since I haven't really written them down anywhere else, noting some items I'm aware of from the Python 3.7 internals work that haven't made their way back into the PEP-432 public API proposal yet:

    vstinner commented 5 years ago

    I created bpo-36142: "Add a new _PyPreConfig step to Python initialization to setup memory allocator and encodings".

    vstinner commented 5 years ago

    I wrote PEP-587 "Python Initialization Configuration", which has been accepted. It allows the "Path Configuration" to be completely overridden. I'm not sure that it fully implements what is requested here, but it should now be easier to tune the Path Configuration. See: https://www.python.org/dev/peps/pep-0587/#multi-phase-initialization-private-provisional-api

    I implemented PEP-587 in bpo-36763.

    ac970517-7943-4610-bdab-4045a31a9505 commented 5 years ago

    To Victor: So how does the implementation of PEP-587 help configure embedded Python with a venv? It would be a great help if you could provide some minimal instructions.

    ac970517-7943-4610-bdab-4045a31a9505 commented 4 years ago

    Just in case this will be of help to anyone, I found a way to use venvs in embedded python.

    import sys
    sys.executable = r"Path to the python executable inside the venv"
    path = sys.path
    for i in range(len(path)-1, -1, -1):
        if path[i].find("site-packages") > 0:
            path.pop(i)
    import site
    site.main()
    del sys, path, i, site

    zooba commented 4 years ago

    If you just want to be able to import modules from the venv, and you know the path to it, it's simpler to just do:

        import sys
        sys.path.append(r"path to venv\Lib\site-packages")

    Updating sys.executable is only necessary if you're going to use libraries that try to re-launch themselves, but any embedding application is going to have to do that anyway.

    ac970517-7943-4610-bdab-4045a31a9505 commented 4 years ago

    To Steve:

    I want the embedded venv to have the same sys.path as if you were running the venv python interpreter. So my method takes into account for instance the include-system-site-packages option in pyvenv.cfg. Also my method sets sys.prefix in the same way as the venv python interpreter.
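    For readers unfamiliar with the file mentioned above: pyvenv.cfg is a plain text file of "key = value" lines. The following is a minimal sketch of reading it and checking the include-system-site-packages option. The helper names read_pyvenv_cfg and includes_system_site_packages are hypothetical (they are not part of CPython), and the parsing is deliberately simpler than what the interpreter does at startup.

```python
from pathlib import Path


def read_pyvenv_cfg(cfg_path):
    """Parse a pyvenv.cfg file into a dict (hypothetical helper, not a CPython API)."""
    settings = {}
    for line in Path(cfg_path).read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that are not of the form "key = value"
            settings[key.strip().lower()] = value.strip()
    return settings


def includes_system_site_packages(settings):
    """Return True when the venv was created with system site-packages enabled."""
    return settings.get("include-system-site-packages", "false").lower() == "true"
```

    An embedding application could use something like this to decide whether the base runtime's site-packages directory should stay on sys.path.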

    348d154b-9387-4632-ad74-398fd999ff6e commented 4 years ago

    I just can say that sorting this issue (and PEP-0432) out would be great! I run into this issue when embedding CPython in a Windows app, and want to use some pre-installed Python, which is not part of the install package... So beside pyenv venvs, please keep Windows devs in mind, too! :)

    zooba commented 4 years ago

    I run into this issue when embedding CPython in a Windows app, and want to use some pre-installed Python, which is not part of the install package...

    You'll run into many more issues if you keep doing this...

    The way to use a pre-installed Python on Windows is to follow PEP-514 to find and run "python.exe" (or force your users to learn how to configure PATH, which is pretty hostile IMHO, but plenty of apps do it anyway).

    If you really need to embed, then add the embeddable package (available from our downloads page) into your distribution and refer to that. Then you can also bundle whatever libraries you need and set up sys.path using the ._pth file.
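    As a rough illustration of the ._pth approach: the file lists one search path per line (relative paths are resolved against the file's own directory), and an "import site" line re-enables the site module. The sketch below only generates the file's text. The function name build_pth_file and the example entries are assumptions for illustration, so check the file shipped in your embeddable package for the real defaults.

```python
def build_pth_file(search_dirs, enable_site=False):
    """Return the text of a ._pth file (hypothetical helper for illustration).

    Each entry in search_dirs becomes one line of the file. When enable_site
    is true, an "import site" line is appended so that site.main() runs.
    """
    lines = list(search_dirs)
    if enable_site:
        lines.append("import site")
    return "\n".join(lines) + "\n"


# Example entries are assumptions: a zipped stdlib, the application directory,
# and a bundled site-packages directory.
example = build_pth_file(["python311.zip", ".", "Lib/site-packages"], enable_site=True)
```

    The resulting text would be written next to the runtime under a name such as python311._pth (the exact name depends on the bundled Python version).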

    d0148cd8-789b-42e3-8dea-c06aad871cbd commented 4 years ago

    As a side-note: In my case I am embedding Python in a C program for several reasons:

    I'm using virtual environments only in the Linux version, the Windows version uses the embeddable ZIP distribution.

    The Linux version was working with Python 2.7 and "virtualenv". Now I'm updating to Python 3.6 and "venv" and running into this issue.

    It seems like virtualenv can handle the situation, but venv can't. Maybe it is worth looking at what virtualenv does differently?

    fe491b48-23c2-4033-aaa2-1a6613895466 commented 4 years ago

    For the record: since virtualenv 20.0.0 (or thereabouts) switched to the python -m venv style virtual environment structure, the C API for embedding when using a virtual environment is now completely broken on Windows. The same workaround used on UNIX doesn't work on Windows.

    The only known workaround is in the initial Python code you load, to add:

    import site
    site.addsitedir('C:/some/path/to/pythonX.Y/Lib/site-packages')

    to at least force it to use the site-packages directory from the virtual environment.

    As for mod_wsgi, this means that on Windows the WSGIPythonHome directive no longer works, and I have to suggest that workaround instead.
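    The site.addsitedir() workaround needs the venv's site-packages directory, whose location differs by platform. A minimal sketch of computing it, assuming the standard venv layout (Lib\site-packages on Windows, lib/pythonX.Y/site-packages elsewhere); the helper name venv_site_packages is hypothetical:

```python
import os
import sys
from pathlib import Path


def venv_site_packages(venv_root, version_info=sys.version_info, windows=(os.name == "nt")):
    """Return the site-packages directory for a standard venv layout.

    Hypothetical helper: assumes the default layout created by "python -m venv".
    """
    root = Path(venv_root)
    if windows:
        return root / "Lib" / "site-packages"
    return root / "lib" / f"python{version_info[0]}.{version_info[1]}" / "site-packages"
```

    The result can then be passed to site.addsitedir(str(path)), which also processes any .pth files found in that directory.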

    vstinner commented 3 years ago

    See also "Configure Python initialization (PyConfig) in Python" https://mail.python.org/archives/list/python-dev@python.org/thread/HQNFTXOCDD5ROIQTDXPVMA74LMCDZUKH/#X45X2K4PICTDJQYK3YPRPR22IGT2CDXB

    And bpo-42260: [C API] Add PyInterpreterState_SetConfig(): reconfigure an interpreter.

    superchromix commented 1 year ago

    Is there some update on this issue? I'm also encountering problems when trying to embed a specific Python environment into a C++ application (running on Windows). It seems like no tutorial exists for how to properly set up the various paths, etc. within the PyConfig object before calling Py_InitializeFromConfig.

    ncoghlan commented 2 months ago

    Reading the code in https://github.com/python/cpython/blob/3.11/Modules/getpath.py suggests that setting executable in the config or PYTHONEXECUTABLE in the environment should lead to the search for pyvenv.cfg happening in the desired location (at least in 3.11+, when the path config migrated to common frozen Python code rather than the old mess of platform specific C code). (Tangent: the code also suggests that the use_environment CLI flag isn't being respected as far as this envvar and the __PYVENV_LAUNCHER__ envvar are concerned)

    The documentation for PYTHONEXECUTABLE is really misleading on this front though, since it makes no reference to its role in sys.path initialisation, only its role in overriding argv[0] on macOS.

    (I currently have need of this functionality, so I'll report back soonish on whether this actually works the way reading the code suggests it should)
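    To make the lookup described above concrete, here is a simplified sketch of the search as I understand it from reading getpath.py: pyvenv.cfg is looked for next to the configured executable, then one directory up. This is an illustration only, not the actual getpath.py code, and the function name find_pyvenv_cfg is hypothetical. Symlinks are deliberately not resolved, since a venv's python is often a symlink to the base runtime.

```python
from pathlib import Path


def find_pyvenv_cfg(executable):
    """Return the pyvenv.cfg that applies to `executable`, or None.

    Simplified illustration: checks the executable's directory, then its parent.
    """
    exe_dir = Path(executable).parent  # no symlink resolution on purpose
    for candidate_dir in (exe_dir, exe_dir.parent):
        candidate = candidate_dir / "pyvenv.cfg"
        if candidate.is_file():
            return candidate
    return None
```

    For a typical POSIX venv, find_pyvenv_cfg("/path/to/venv/bin/python") would return /path/to/venv/pyvenv.cfg, which is why pointing the executable setting at the venv's binary is enough to trigger venv discovery.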

    ncoghlan commented 2 months ago

    From initial investigation, it looks like this works to set up an embedded Python with an active virtual environment (example code uses pybind11 because that's what I'm using in the project where I need this functionality):

    #include <iostream>
    #include <string>
    
    #include <pybind11/embed.h> // everything needed for embedding
    namespace py = pybind11;
    
    int run_with_config(PyConfig &config) {
        // Run with given config, no CLI args, and without adding the program dir to sys.path
        py::scoped_interpreter guard{&config, 0, nullptr, false};
    
        py::print("Hello from the Python runtime embedding demo app!");
        py::exec("import json, sys; print('sys.path: ', json.dumps(sys.path, indent=2))");
        return 0;
    }
    
    // Handling unicode paths correctly needs a different main function invocation on Windows
    static const std::string CLI_ARGS_ERROR = 
        "Error: Must specify path to Python binary in virtual environment";
    
    #ifdef MS_WINDOWS
    int
    wmain(int argc, wchar_t **argv)
    {
        if (argc < 2) {
            std::cerr << CLI_ARGS_ERROR << std::endl;
            return -1;
        }
        PyConfig config;
        PyConfig_InitIsolatedConfig(&config);
        config.executable = _wcsdup(argv[1]);
        return run_with_config(config);
    }
    #else
    int
    main(int argc, char **argv)
    {
        if (argc < 2) {
            std::cerr << CLI_ARGS_ERROR << std::endl;
            return -1;
        }
        std::cout << "CLI args: " << argv[1] << std::endl;
        PyConfig config;
        PyConfig_InitIsolatedConfig(&config);
        PyConfig_SetBytesString(&config, &config.executable, argv[1]);
        std::wcout << L"Python config: " << config.executable << std::endl;
        return run_with_config(config);
    }
    #endif

    Caveats on this:

    Edit: added missing _wcsdup call following successful Windows testing, updated caveats based on that testing.

    ncoghlan commented 2 months ago

    Hmm, actually, that Unicode issue could also be related to the path to the base Python runtime in pyvenv.cfg (in the case where it throws an unhandled encoding related OSError from getpath_readlines, the base runtime is in a path containing a character from outside the base multilingual plane, whereas my demo env only has such a character in the venv path - the base runtime is in a system directory).

    I realised after a bit more thought that this speculation has to be wrong, since the pyvenv.cfg file works fine when using the venv normally via the symlinked CPython binary instead of attempting to use it via the embedding app (which is currently linked against the wrong base runtime).

    zooba commented 2 months ago

    Honestly, your best bet is probably to shell out to {argv[1]} -c "import sys; print(*sys.path, sep='\n')" and then read the output into config.module_search_paths.

    You'll need to handle sys.executable and sys._base_executable configuration carefully in a host app anyway (and TBH I'm not sure it's possible), but at least you can ensure the search paths are correct, even if creating a nested venv or detecting a build directory doesn't work (not sure why you'd want those, but the existence of this issue suggests someone does...)
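    The "shell out and read back sys.path" suggestion could look roughly like the sketch below. It is written in Python for brevity; an embedding host would do the equivalent in its own language before filling config.module_search_paths. The function name query_sys_path is hypothetical, and forcing UTF-8 output through PYTHONIOENCODING is an assumption made here so that non-ASCII paths survive the pipe.

```python
import os
import subprocess


def query_sys_path(python_executable):
    """Ask another interpreter for its sys.path, one entry per output line."""
    result = subprocess.run(
        [python_executable, "-c", "import sys; print(*sys.path, sep='\\n')"],
        capture_output=True,
        encoding="utf-8",
        env={**os.environ, "PYTHONIOENCODING": "utf-8"},
        check=True,  # raise if the child interpreter fails to start or run
    )
    # Drop the empty entry that "-c" may place at the front of sys.path.
    return [line for line in result.stdout.splitlines() if line]
```

    Note that this sketch cannot represent paths containing newlines, and it inherits whatever environment variables the host process has set.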

    ncoghlan commented 2 months ago

    In this use case, there is a genuine full Python venv available, so setting sys.executable to point at that binary is genuinely the right thing to do anyway. If it also happens to make everything else work as desired... well, I'll be rather happy if that's how it works out :)

    (further investigation will happen Friday AEST, since this project is a part-time gig)

    ncoghlan commented 2 months ago

    While more testing with Python 3.12 could be worthwhile, setting sys.executable is definitely not enough to make things work with the python-build-standalone 3.11 builds (it was a Fedora 3.12 system build that gave me the initial impression that simply setting sys.executable was enough to pick up the venv correctly)

    This was a testing error on my part (I hadn't recreated the test venv with the right version of Python after fixing the linking in the embedding app, so it was still pointing back to the system Python install and hence the original version mismatch still existed, just in the opposite direction).

    With the linking and venv creation fixed to be consistent, setting config.executable (as in the pybind11 code above) genuinely looks to be enough to make venv discovery work properly.

    ncoghlan commented 2 months ago

    On Windows, it appears setting only config.executable segfaults the interpreter during initialisation. I'm still working to figure exactly where it is dying, but I have confirmed that setting all the path related config values so the dynamic calculation is never triggered makes the failure go away (faulthandler isn't able to pinpoint the location of the failure, as there's no Python runtime for it to use, so it just reports that there's no Python frame available).

    FFY00 commented 2 months ago

    Hey Alyssa, Steve, personally I feel that instead of trying to hijack the executable field here, a better approach would be allowing users to specify a pyvenv.cfg path when embedding. This would only require a minimal change to getpath, so that it would check if a pyvenv.cfg path is specified, and fallback to the regular discovery if not. What do you think?

    I have development time from $dayjob that I can use to implement this (or other CPython changes btw, feel free to ping me on other issues).

    ncoghlan commented 2 months ago

    @FFY00 The issue with allowing the path inference to run in the other direction is that it adds complexity to the implementation without adding expressiveness to the configuration scheme (presumably if the embedding app knows where the venv's config file is, it also knows where the Python executable is to be found).

    A venv-specific config setting would also only become available in Python 3.14 at the earliest, whereas once we track down the source of the 3.11 segfault on Windows, we'll be able to provide guidance that works as far back as Python 3.11.

    As far as the 3.11 segfault goes, as near as I can tell, all the known segfault fixes in getpath.c since 3.11 was released made it into 3.11.9, so those aren't the problem. I'll try running the embedding application up under the MSVC debugger to see if that can narrow down the problem.

    ncoghlan commented 2 months ago

    Turning off the sys.path calculation by setting config.module_search_paths_set = 1; produced some useful debugging output (due to failing to import the encodings module) rather than a segfault:

    Python path configuration:
      PYTHONHOME = (not set)
      PYTHONPATH = (not set)
      program name = 'python'
      isolated = 1
      environment = 0
      user site = 0
      safe_path = 1
      import site = 1
      is in build tree = 0
      stdlib dir = 'S:\path\to\work\project\_build_win64\ud83d\udc38\cpython@3.11\Lib'
      sys._base_executable = 'S:\\path\\to\\work\\project\\_build_win64\U0001f438\\cpython@3.11\\python.exe'
      sys.base_prefix = 'S:\\path\\to\\work\\project\\_build_win64\U0001f438\\cpython@3.11'
      sys.base_exec_prefix = 'S:\\path\\to\\work\\project\\mf3-prototype\\_build_win64\U0001f438\\cpython@3.11'
      sys.platlibdir = 'DLLs'
      sys.executable = 'S:\\path\\to\\work\\project\\_build_win64\U0001f438\\project_venv\\Scripts\\python.exe'
      sys.prefix = 'S:\\path\\to\\work\\project\\_build_win64\U0001f438\\cpython@3.11'
      sys.exec_prefix = 'S:\\path\\to\\work\\project\\_build_win64\U0001f438\\cpython@3.11'
      sys.path = [
      ]

    The unescaped backslashes in the "stdlib dir" output are an artifact of dumping a wchar_t string as ASCII rather than dumping an actual unicode object (which is the way the other paths are printed).

    The encoding of "\U0001f438" as "\ud83d\udc38" looks genuinely wrong however, and brings to mind this change @vstinner made to avoid an incorrect attempt at encoding to UTF-8 during the path calculations (it can't be that specific issue however, as that fix was backported to 3.11).
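    For reference, "\ud83d\udc38" is the UTF-16 surrogate pair for U+1F438, so the two spellings identify the same character at the code unit level. The sketch below only demonstrates that equivalence. It does not explain why the path dump prints the surrogate form, and the helper names are hypothetical.

```python
def to_surrogate_pair(char):
    """Return the (high, low) UTF-16 code units for a non-BMP character."""
    data = char.encode("utf-16-be")
    return int.from_bytes(data[:2], "big"), int.from_bytes(data[2:], "big")


def join_surrogates(text):
    """Recombine surrogate pairs in a str into the characters they encode."""
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


frog = "\U0001f438"
high, low = to_surrogate_pair(frog)
assert (high, low) == (0xD83D, 0xDC38)
assert join_surrogates("\ud83d\udc38") == frog
```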

    ncoghlan commented 2 months ago

    Finally got the Windows demo running under the MSVC debugger, and the issue turned out to be really trivial: pybind11 automatically clears the config after initialising the interpreter, and I hadn't called _wcsdup when adding a CLI argument to the config, so PyConfig_Clear died when trying to release the pointer to a CLI args entry.

    This was only a problem on Windows because the bytes CLI necessarily had to use PyConfig_SetBytesString rather than passing a pointer to the CLI arg directly.

    I'm still not sure what's going on with the weird representation of the Unicode value in the paths, but it works (presumably it's related to the reason why Windows operates under "utf-8/surrogatepass" semantics for operating system interfaces).

    ncoghlan commented 2 months ago

    I updated my example code above to include the missing _wcsdup call in the Windows code.

    I've also reworked the labels to indicate that this is now a documentation and test coverage issue: to embed a venv, set executable (or PYTHONEXECUTABLE) to point to the Python binary in that environment, do NOT set PYTHONHOME (or its config equivalent).

    The version changed notes in the updated docs should reference Python 3.11, since that refactoring to implement the path config process more consistently across platforms is when it started working.

    benh commented 2 months ago

    @ncoghlan is there a known way to accomplish this pre 3.11? Specifically in 3.10? And specifically on macOS?

    zooba commented 1 month ago

    I feel that instead of trying to hijack the executable field here, a better approach would be allowing users to specify a pyvenv.cfg path when embedding

    The better approach is to just set the search paths you want to have, and leave the python.c-specific functionality to python.c.[^1] venv initialization is not supported in any manner other than the activate script and running the regular Python executable, and I'd prefer to keep it that way.

    There are more than enough configuration options to initialise exactly the runtime you want.

    [^1]: I know it doesn't live in python.c, but it should. Poor architectural decisions in the past don't necessarily have to force us to grow the feature - we can still avoid that creep.

    ncoghlan commented 1 month ago

    There are more than enough configuration options to initialise exactly the runtime you want.

    @zooba In my use case, the runtime I want to embed is available as a virtual environment set up with standard package installation tooling, so having to reimplement the virtual environment sys.path derivation logic externally (where any deviation from CPython's behaviour would be considered a bug in the embedding application) would be annoying compared to instead telling the runtime:

    1. Use the same sys.path config that this Python executable would use
    2. Use this Python executable when running Python subprocesses (i.e. set it as sys.executable)

    The fact that setting (the init config equivalent of) PYTHONEXECUTABLE works feels like a genuinely good fit for my problem (primarily defining virtual environments for executing Python AI-related code via CPython, but also supporting direct embedding of those virtual environments rather than invoking their Python executable as a subprocess).

    Is there a known way to accomplish this pre 3.11? Specifically in 3.10? And specifically on macOS?

    @benh PYTHONEXECUTABLE was originally macOS-only and became cross-platform in Python 3.11, so it's worth a try. If it doesn't work, then I'd say the answer to your question is "No, Python 3.11 will be needed to get this behaviour".

    zooba commented 1 month ago

    In my use case, the runtime I want to embed is available as a virtual environment set up with standard package installation tooling

    I think we're differing on a few terms/concepts here:

    I really don't want to encourage embedding of "whatever version of Python I found on disk". The only time that's going to lead to a good experience is when it's the system interpreter, which isn't "whatever" version. But taking an arbitrary install and doing anything other than launching python[x.y] as a child process is not going to serve you well long-term. It's 100x easier to support arbitrary versions in a Python script with IPC than by embedding - it's even 10x easier to support arbitrary versions that load a native module to move your work into the Python process. Embedding is hard enough without letting users decide to break your app for you.

    If you can ensure that your users are using the base runtime you expect, then if you really need the same search paths, I'd really suggest running the interpreter with -c print(sys.path) and reading that back into your embedding config. Cloning getpath.py is pretty much a non-starter (CPython isn't even consistent with itself, let alone external copies of that logic!), and "things that python.exe does" are not a supported part of the embedding interface (including venv calculations, but also parsing argv and the process environment for settings - these aren't even specified for python.exe apart from implementation).

    ncoghlan commented 1 month ago

    I know which runtime I am embedding (the primary project is creating and shipping a fully integrated set of Python base runtimes and then virtual environments that use those runtimes).

    Further allowing for embedding applications that use the environments directly via the CPython C API is an alternative we're considering to running components in the environments up as subprocesses and communicating with them via FastAPI (essentially, we'd be adapting the existing C++ application server that is used for Typescript component embedding rather than shipping a separate Python-only application server implementation solely for the Python components and having to maintain two separate implementations of the local client application authentication and authorisation bits).

    At the very least, those embedding apps need to set sys.executable properly for the benefit of Python subprocess invocations, and the right executable path to use for that purpose is the one in the virtual environment being implicitly activated.

    While it's undocumented, that has turned out to also have the effect of getting sys.path set the same way as it would be when running Python in that virtual environment, including loading *.pth files and importing sitecustomize, which is exactly the outcome I want.

    I do think it would be a good idea to document a way to check for runtime compatibility before attempting to start the embedded interpreter when relying on setting PYTHONEXECUTABLE in this way. For example, run the following snippet in the nominally compatible executable and compare the resulting string to the string form of Py_VERSION_HEX & 0xFFFFFF00:

    $ python3 -c "import sys; major, minor, micro, *_ = sys.version_info; print(f'{major:#04x}{minor:02x}{micro:02x}00')"
    0x030c0300

    (the arcane failures you get when mixing and matching 3.11 venvs with a Python 3.12 runtime, and vice-versa, are cryptic enough that I agree we don't want to risk giving the impression that folks can point an embedding config at an arbitrary virtual environment and expect it to magically work even when the specified interpreter's ABI is inconsistent with what the embedding application expects)
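    A sketch of how that compatibility check might be automated. In a real embedding application the expected value would come from Py_VERSION_HEX at compile time; here sys.hexversion of the running interpreter stands in for it. The function names are hypothetical.

```python
import subprocess
import sys

# The same snippet as in the shell example above.
VERSION_SNIPPET = (
    "import sys; major, minor, micro, *_ = sys.version_info; "
    "print(f'{major:#04x}{minor:02x}{micro:02x}00')"
)


def reported_version_hex(python_executable):
    """Run the snippet in the target interpreter and return its output."""
    result = subprocess.run(
        [python_executable, "-c", VERSION_SNIPPET],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def expected_version_hex(hexversion=sys.hexversion):
    """Format a hex version with the release level and serial masked off."""
    return f"{hexversion & 0xFFFFFF00:#010x}"


def is_compatible(python_executable, hexversion=sys.hexversion):
    """True when the target interpreter matches the expected major.minor.micro."""
    return reported_version_hex(python_executable) == expected_version_hex(hexversion)
```

    An embedding app could run is_compatible() against the venv's executable and refuse to initialise the interpreter on a mismatch, rather than surfacing the cryptic failures described above.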

    zooba commented 1 month ago

    At the very least, those embedding apps need to set sys.executable properly for the benefit of Python subprocess invocations

    Yeah, this is the intended use of setting the executable. It's expecting/relying on it also doing the pyvenv.cfg lookup that I don't want to commit to. Specifying the search path normally should also resolve *.pth files, unless you suppress import site. (I guess there's a reasonable argument that venv setup should be in site.main() too, though we definitely need to resolve the base runtime earlier so there's no real getting around it).

    I do think it would be a good idea to document a way to check for runtime compatibility before attempting to start the embedded interpreter

    Agreed. Do we not have a pre-init function that returns the hex version already? (I guess I've never worried too much because Windows doesn't have a versionless DLL that gives you non-stable ABI.)