pybind / pybind11_bazel

Bazel wrapper around the pybind11 repository

How to set python interpreter within cc_binary #77

Open ptr-br opened 4 months ago

ptr-br commented 4 months ago

I'm trying to use pybind11_bazel to execute some python code within C++. When I build my python targets with rules_python, I'm able to install packages that can be used by the interpreter. However, when I add these targets to the data= attribute of cc_binary and use pybind11 as a dependency, the default interpreter at /usr/bin/python3 is used instead of the one from the python targets.

Is there any elegant way of telling bazel to use the same interpreter as for the python part?

If further explanation is needed, I could come up with a minimal example; please let me know. I'm rather new to bazel, so thanks for your help!

Thanks.

junyer commented 4 months ago

Yeah, could you please put together a minimal example (in its own repository) that reproduces the problem? I'm struggling to understand whether you mean that you are trying to "embed" Python or that the shebang lines are wrong for some reason or that something else is happening. Thanks!

ptr-br commented 4 months ago

@junyer, I created a toy example here. My main problem is creating/selecting an interpreter for cc_binary like py_binary does...

junyer commented 4 months ago

This sounds like a problem that @rickeylev would know how to solve... What happens if you do this to pin the Python version for the np_wrapper_lib target?

ptr-br commented 4 months ago

I'm actually already pinning it here. When I change the version from my system version (3.10) to some other version (e.g. 3.11), I get the following error when running bazel run //cc:my_cc_binary:

INFO: Analyzed target //cc:my_cc_binary (75 packages loaded, 1482 targets configured).
INFO: Found 1 target...
Target //cc:my_cc_binary up-to-date:
  bazel-bin/cc/my_cc_binary
INFO: Elapsed time: 0.450s, Critical Path: 0.01s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/cc/my_cc_binary
Could not find platform independent libraries <prefix>
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = (not set)
  program name = 'python3'
  isolated = 0
  environment = 1
  user site = 1
  safe_path = 0
  import site = 1
  is in build tree = 0
  stdlib dir = '/install/lib/python3.11'
  sys._base_executable = '/usr/bin/python3'
  sys.base_prefix = '/install'
  sys.base_exec_prefix = '/usr'
  sys.platlibdir = 'lib'
  sys.executable = '/usr/bin/python3'
  sys.prefix = '/install'
  sys.exec_prefix = '/usr'
  sys.path = [
    '/install/lib/python311.zip',
    '/install/lib/python3.11',
    '/usr/lib/python3.11/lib-dynload',
  ]
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to get the Python codec of the filesystem encoding
Aborted (core dumped)

junyer commented 4 months ago

You aren't doing anything with @python_versions? As per the documentation, pinning the Python version would mean doing something like load("@python_versions//3.11:defs.bzl", "py_binary") in the BUILD file.
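
For reference, a sketch of what that pinning looks like in a BUILD file (the target, file, and pip hub names here are illustrative placeholders, not taken from the example repo):

load("@python_versions//3.11:defs.bzl", "py_binary")

py_binary(
    name = "np_wrapper",            # illustrative target name
    srcs = ["np_wrapper.py"],       # illustrative source file
    deps = ["@my_pip//numpy:pkg"],  # illustrative pip dependency
)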

ptr-br commented 4 months ago

I played around with the version and switching actually works. E.g., the python version used in my_cc_binary is 3.9 instead of my default 3.10 system python when specifying:

# MODULE.bazel
...
python.toolchain(
    python_version = "3.9",
    is_default = True,
)

use_repo(python, "python_versions")

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "my_pip",
    python_version = "3.9",
    requirements_lock = "//cc:requirements.txt",
)
...

# BUILD.bazel
load("@python_versions//3.9:defs.bzl","py_binary")
...

I get the PYTHONPATH values:

/usr/lib/python3.9
/usr/lib/python3.9/lib-dynload
/usr/local/lib/python3.9/dist-packages
/usr/lib/python3/dist-packages
Python version: 3.9.18 (main, Oct  3 2023, 01:30:02)

Could the problem be that py_binary adds all the paths of the required directories to the PYTHONPATH, and that those are not added when simply specifying the data attribute of cc_binary?

What would be the best way to solve this then?

junyer commented 4 months ago

Given that your example didn't work with load("@rules_python//python:defs.bzl", "py_binary") and now does work with load("@python_versions//3.9:defs.bzl", "py_binary"), the problem is likely to be on the rules_python side. I'm reminded of https://github.com/bazelbuild/rules_python/issues/1069#issuecomment-1942053014 in particular: this may be a subtle bug in the Starlark implementation. Could you please file a bug against rules_python?

ptr-br commented 4 months ago

Sorry for the confusion, but the problem was not resolved by switching to load("@python_versions//3.9:defs.bzl", "py_binary"). I only verified that pybind is using another interpreter. So switching the interpreter works, but installing and using dependencies is not resolved the way it is with py_binary...

junyer commented 4 months ago

Oh. :(

Also, I see now that I misspoke earlier:

What happens if you do this to pin the Python version for the np_wrapper_lib target?

I should have said "the np_wrapper target", not "the np_wrapper_lib target", because the py_binary() rule can be pinned whereas the py_library() rule can't. Would data = ["//python:np_wrapper"] work with the my_cc_binary target?
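
In other words, something like this in the toy example's cc/BUILD.bazel (a sketch; only the data attribute is the point here, the remaining attributes are elided):

cc_binary(
    name = "my_cc_binary",
    # srcs/deps as in the existing example
    data = ["//python:np_wrapper"],  # the pinned py_binary instead of the py_library
)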

rickeylev commented 4 months ago

I think what's happening is the embedded interpreter is trying to take settings from the local environment. I saw this when I was trying to construct a runnable test linking with the hermetic python libraries: it kept trying to "escape" and use things from the local system. Eventually I traced it back to the Py_Initialize() call trying to automatically fill in various details based on the environment settings.

The docs for how to initialize an embedded interpreter are here: https://docs.python.org/3/extending/embedding.html

I think the two key things that have to be set up are:

  1. Where the Python runtime is installed (so it can find the stdlib et al)
  2. The import search paths (so it can find pip-installed dependencies)

For (1), I think this can be derived based on the location of e.g. the header files. You basically need the runfiles-relative path to where the stdlib etc are in the runfiles.

For (2), this information comes from PyInfo.imports.

What we probably want to do is generate a cc file with those values in it. Maybe something like this:

def _py_cc_init_info_impl(ctx):
  toolchain = ctx.toolchains["@rules_python//python/cc:toolchain_type"]
  # Placeholder: derive the runfiles-relative runtime location, e.g. from the
  # File.short_path of toolchain.headers or toolchain.libs.
  runtime_dir = "<runfiles-relative runtime dir>"

  path_entries = []
  for info in [t[PyInfo] for t in ctx.attr.deps]:
    path_entries.extend(info.imports.to_list())

  sys_path = ":".join(path_entries)

  header = ctx.actions.declare_file("info.h")
  ctx.actions.write(header, 'string runtime_dir = "{}"; string sys_path = "{}";'.format(
    runtime_dir, sys_path))

  return [DefaultInfo(files = depset([header]))]

py_cc_init_info = rule(
  implementation = _py_cc_init_info_impl,
  attrs = {"deps": attr.label_list()},
  toolchains = ["@rules_python//python/cc:toolchain_type"],
)

At the least, it probably makes sense to add (1) to the py_cc_toolchain info as e.g. a runtime_install_location (equivalent of PYTHONHOME?) attribute or something, to avoid having to try and unpack so much.

axbycc-mark commented 3 months ago

@rickeylev I faced this issue myself and hacked together a solution almost as you describe. However, I discovered a conceptual ambiguity. In your rule you use the py_cc_info. As implemented, py_cc_info is about compile-time dependencies, since it exposes only the information related to compiling and linking a cc_library that depends on libpython.

A cc_binary which embeds Python, however, also needs to express a runtime/data dependency on a certain collection of files (Lib/, DLLs/, ...) currently listed under the "files" filegroup of the instantiated python toolchain repository (and also a data dependency on any third-party .py files). The "files" filegroup is exposed through the py_runtime rule of the @bazel_tools//tools/python:toolchain_type toolchain.

In my hacks, I made use of the @bazel_tools//tools/python:toolchain_type toolchain (py_runtime instead of py_cc_info) to prepare such metadata for embedding runtime python dependency files. I successfully built a binary with an embedded python interpreter.

Here is my implementation

# .bzl 

def _py_embedded_libs_impl(ctx):
    deps = ctx.attr.deps
    toolchain = ctx.toolchains["@bazel_tools//tools/python:toolchain_type"]
    py3_runtime = toolchain.py3_runtime

    # addresses that need to be added to python sys.path
    all_imports = []        
    for lib in deps:
        all_imports.append(lib[PyInfo].imports)
    imports_txt = "\n".join(depset(transitive = all_imports).to_list())
    imports_file = ctx.actions.declare_file(ctx.attr.name + ".imports")
    ctx.actions.write(imports_file, imports_txt)

    python_home_txt = str(py3_runtime.interpreter.dirname)
    python_home_file = ctx.actions.declare_file(ctx.attr.name + ".python_home")
    ctx.actions.write(python_home_file, python_home_txt)

    py3_runfiles = ctx.runfiles(files = py3_runtime.files.to_list())
    dep_runfiles = [py3_runfiles]
    for lib in deps:
        lib_runfiles = ctx.runfiles(files = lib[PyInfo].transitive_sources.to_list())
        dep_runfiles.append(lib_runfiles)
        dep_runfiles.append(lib[DefaultInfo].default_runfiles)

    runfiles = ctx.runfiles().merge_all(dep_runfiles)

    return [DefaultInfo(files=depset([imports_file, python_home_file]),
                        runfiles=runfiles
                        )]

# collect paths to all files of a python library and generate .imports file and .python_home file
py_embedded_libs = rule(
    implementation = _py_embedded_libs_impl,
    attrs = {
        "deps": attr.label_list(
            providers = [PyInfo], 
        ),
    },
    toolchains = [
        str(Label("@bazel_tools//tools/python:toolchain_type")),
    ],        
)
# BUILD

py_embedded_libs(
    name = "embed_paths",
    deps = [
        "@pip//scipy:pkg"
    ])

cc_binary(
    name = "embed",
    srcs = ["embed.cpp"],
    deps = [
        "//:current_libpython_unstable", # hacks around issue #1823 , cannot use current_py_cc_libs yet
        "@bazel_tools//tools/cpp/runfiles", # needed to resolve python sys.path additions, and python home location
    ],
    data = [":embed_paths"]
)

Then again I'm pretty unskilled with Bazel and you can probably figure out a better way to do this. I just thought it might help to post my learnings here.

rickeylev commented 3 months ago

a cc binary also needs py_runtime.files

Ahhh yes, excellent point. This seems obvious once you said it. So really, we don't need a runtime_install_dir value, but a depset[File] (or, actually, maybe a runfiles, since they are runtime files) of what the runtime needs. Or, actually, maybe both (a locally installed runtime can just point to that directory instead). Good food for thought, thanks.

The bzl code you posted looks pretty correct. You probably want .short_path instead of .dirname (the latter isn't a runfiles path, iirc). There are some minor optimizations you could make (e.g. avoiding to_list() calls; write() can be passed an Args object using Args.add_all(map_each=...), which can be used to defer depset flattening to the execution phase and still allow writing mostly-arbitrary lines to a file).
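
For illustration, a minimal sketch of that deferral, assuming a deps attribute like the one in the rule above (the rule name and output file name are placeholders):

def _imports_file_impl(ctx):
    imports = depset(transitive = [dep[PyInfo].imports for dep in ctx.attr.deps])
    out = ctx.actions.declare_file(ctx.attr.name + ".imports")
    args = ctx.actions.args()
    args.set_param_file_format("multiline")  # one entry per line, no shell quoting
    args.add_all(imports)  # the depset is flattened at execution time, not analysis time
    ctx.actions.write(out, args)
    return [DefaultInfo(files = depset([out]))]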

ahojnnes commented 1 month ago

@axbycc-mark Could you share an example of how you use the imports_file and the python_home_file in a cc_binary/cc_test target to appropriately set the PYTHONPATH and PYTHONHOME? When depending on numpy using your suggested approach above, I am consistently getting: ModuleNotFoundError: No module named 'numpy'. Thank you!

axbycc-mark commented 1 month ago

@ahojnnes Continuing my example from above, here is the code I had in my actual .cc file.

#include <filesystem>
#include <fstream>
#include <iostream>
#include <memory>
#include <print>
#include <string>
#include <vector>

#include <Python.h>
#include "tools/cpp/runfiles/runfiles.h"

// Note: the LOG(FATAL) and CHECK() macros used below come from a logging
// dependency (e.g. glog or Abseil logging) that is not shown in the BUILD
// snippet above.

using bazel::tools::cpp::runfiles::Runfiles;

void InitializePythonEnvironment(const std::string& pythonHome, const std::vector<std::string>& additionalPaths) {

    PyStatus status;
    PyConfig config;
    PyConfig_InitPythonConfig(&config);

    // Set PYTHONHOME
    wchar_t* pythonHomeW = Py_DecodeLocale(pythonHome.c_str(), nullptr);
    status = PyConfig_SetString(&config, &config.home, pythonHomeW);
    if (PyStatus_Exception(status)) {
        PyConfig_Clear(&config);
        Py_ExitStatusException(status);
    }       

    config.isolated = 1;

    // Initialize the interpreter with the given configuration
    status = Py_InitializeFromConfig(&config);
    if (PyStatus_Exception(status)) {
        Py_ExitStatusException(status);
    }

    // The PyConfig structure should be released after initialization
    PyConfig_Clear(&config);        

    // Check if initialization was successful
    if (!Py_IsInitialized()) {
        LOG(FATAL) << "Failed to initialize Python interpreter.";
    }

    // Import the sys module.
    PyObject* sysModule = PyImport_ImportModule("sys");
    if (!sysModule) {
        LOG(FATAL) << "Failed to import 'sys' module." << std::endl;
    }

    // Get the sys.path list.
    PyObject* sysPath = PyObject_GetAttrString(sysModule, "path");
    if (!sysPath) {
        LOG(FATAL) << "Failed to get 'sys.path'." << std::endl;
    }

    // Add each path in additionalPaths to sys.path.
    for (const auto& path : additionalPaths) {
        PyObject* pyPath = PyUnicode_FromString(path.c_str());
        if (!pyPath) {
            std::cerr << "Failed to create Python string from path." << std::endl;
            continue;
        }
        if (PyList_Append(sysPath, pyPath) != 0) {
            std::cerr << "Failed to append path to 'sys.path'." << std::endl;
        }
        Py_DECREF(pyPath);
    }

    // Clean up references.
    Py_DECREF(sysPath);
    Py_DECREF(sysModule);
    PyMem_RawFree(pythonHomeW); // Free the string allocated by Py_DecodeLocale.

    // At this point, the Python interpreter is initialized, PYTHONHOME is set,
    // and additional paths have been added to sys.path. You can now proceed to
    // execute Python scripts or finalize the interpreter as needed.
}

void PrintSysPath() {
    // Import the sys module.
    PyObject* sysModule = PyImport_ImportModule("sys");
    if (!sysModule) {
        PyErr_Print(); // Print any error if occurred
        std::cerr << "Failed to import 'sys' module." << std::endl;
        return;
    }

    // Get the sys.path list.
    PyObject* sysPath = PyObject_GetAttrString(sysModule, "path");
    if (!sysPath || !PyList_Check(sysPath)) {
        PyErr_Print(); // Print any error if occurred
        std::cerr << "Failed to access 'sys.path'." << std::endl;
        Py_XDECREF(sysModule); // Py_XDECREF safely decrements the ref count if the object is not NULL
        return;
    }

    // Get the size of sys.path list to iterate over it
    Py_ssize_t size = PyList_Size(sysPath);
    for (Py_ssize_t i = 0; i < size; i++) {
        PyObject* path = PyList_GetItem(sysPath, i); // Borrowed reference, no need to DECREF
        if (path) {
            const char* pathStr = PyUnicode_AsUTF8(path);
            if (pathStr) {
                std::println("\t{}", pathStr);
            } else {
                PyErr_Print(); // Print any error if occurred
            }
        }
    }

    // Clean up: DECREF objects created via PyImport_ImportModule and PyObject_GetAttrString
    Py_DECREF(sysPath);
    Py_DECREF(sysModule);
}

void ImportAndPrintVersion(const std::string& python_module_name) {
    // Import the named module.
    PyObject* pyModule = PyImport_ImportModule(python_module_name.c_str());
    if (!pyModule) {
        PyErr_Print(); // Print the error to stderr.
        std::cerr << "Failed to import module" << python_module_name << std::endl;
        return;
    }

    // Access the __version__ attribute of the module.
    PyObject* version = PyObject_GetAttrString(pyModule, "__version__");
    if (!version) {
        PyErr_Print(); // Print the error to stderr.
        std::cerr << "Failed to get '__version__'." << std::endl;
        Py_DECREF(pyModule);
        return;
    }

    // Convert the version PyObject to a C string.
    const char* versionStr = PyUnicode_AsUTF8(version);
    if (!versionStr) {
        PyErr_Print(); // Print the error to stderr.
        std::cerr << "Failed to convert '__version__' to C string." << std::endl;
    } else {
        // Print the version string.
        std::println("\t{} version: {}", python_module_name, versionStr);
    }

    // Clean up references.
    Py_DECREF(version);
    Py_DECREF(pyModule);
}

std::vector<std::string> read_lines(const std::string& path) {
    std::ifstream file(path);

    CHECK(file.is_open()) << "Could not open file " << path;

    std::string line;
    std::vector<std::string> lines;
    while (std::getline(file, line)) {
        lines.push_back(line);
    }

    return lines;
}

int main(int argc, char* argv[]) {

    std::string error;
    std::unique_ptr<Runfiles> runfiles(
        Runfiles::Create(argv[0], BAZEL_CURRENT_REPOSITORY, &error));
    CHECK(runfiles) << "Could not create runfiles";

    std::string dot_python_home_path = runfiles->Rlocation("_main/python/experimental/embed_paths.python_home");
    std::string python_home_path = read_lines(dot_python_home_path).front();
    std::string python_home_path_absolute = runfiles->Rlocation("_main/" + python_home_path);
    auto external_dir = std::filesystem::path(python_home_path_absolute).parent_path();

    std::string dot_imports_path = runfiles->Rlocation("_main/python/experimental/embed_paths.imports");
    std::vector<std::string> imports = read_lines(dot_imports_path);
    std::vector<std::string> absolute_imports;
    for (const std::string& relative_import : imports) {
        const auto absolute_import = runfiles->Rlocation(relative_import);
        absolute_imports.push_back(absolute_import);
    }

    InitializePythonEnvironment(python_home_path_absolute, absolute_imports);
    std::cout << "Initialized. Dumping sys path." << "\n";

    PrintSysPath();

    std::cout << "Testing module import" << "\n";
    ImportAndPrintVersion("numpy");
    ImportAndPrintVersion("scipy");

    Py_Finalize();

    return 0;
}

Can you see if this little program works for you?

ahojnnes commented 1 month ago

@axbycc-mark Thank you very much. This is very helpful.

ahojnnes commented 1 month ago

After hacking at this for a bit, I came up with the following rule/macro combination that doesn't require any custom C++ code:

def _cc_py_runtime_impl(ctx):
    toolchain = ctx.toolchains["@bazel_tools//tools/python:toolchain_type"]
    py3_runtime = toolchain.py3_runtime
    imports = []
    for dep in ctx.attr.deps:
        imports.append(dep[PyInfo].imports)
    python_path = ""
    for path in depset(transitive = imports).to_list():
        python_path += "external/" + path + ":"

    py3_runfiles = ctx.runfiles(files = py3_runtime.files.to_list())
    runfiles = [py3_runfiles]
    for dep in ctx.attr.deps:
        dep_runfiles = ctx.runfiles(files = dep[PyInfo].transitive_sources.to_list())
        runfiles.append(dep_runfiles)
        runfiles.append(dep[DefaultInfo].default_runfiles)

    runfiles = ctx.runfiles().merge_all(runfiles)

    return [
        DefaultInfo(runfiles = runfiles),
        platform_common.TemplateVariableInfo({
            "PYTHON3": str(py3_runtime.interpreter.path),
            "PYTHONPATH": python_path,
        }),
    ]

_cc_py_runtime = rule(
    implementation = _cc_py_runtime_impl,
    attrs = {
        "deps": attr.label_list(providers = [PyInfo]),
    },
    toolchains = [
        str(Label("@bazel_tools//tools/python:toolchain_type")),
    ],
)

def cc_py_test(name, py_deps = [], **kwargs):
    py_runtime_target = name + "_py_runtime"
    _cc_py_runtime(
        name = py_runtime_target,
        deps = py_deps,
    )

    kwargs.update({
        "data": kwargs.get("data", []) + [":" + py_runtime_target],
        "env": {"__PYVENV_LAUNCHER__": "$(PYTHON3)", "PYTHONPATH": "$(PYTHONPATH)"},
        "toolchains": kwargs.get("toolchains", []) + [":" + py_runtime_target],
    })

    native.cc_test(
        name = name,
        **kwargs
    )

def cc_py_binary(name, py_deps = [], **kwargs):
    py_runtime_target = name + "_py_runtime"
    _cc_py_runtime(
        name = py_runtime_target,
        deps = py_deps,
    )

    kwargs.update({
        "data": kwargs.get("data", []) + [":" + py_runtime_target],
        "env": {"__PYVENV_LAUNCHER__": "$(PYTHON3)", "PYTHONPATH": "$(PYTHONPATH)"},
        "toolchains": kwargs.get("toolchains", []) + [":" + py_runtime_target],
    })

    native.cc_binary(
        name = name,
        **kwargs
    )

which can be used as follows:

cc_py_test(
    name = "pybind_embed_test",
    srcs = ["pybind_embed_test.cc"],
    py_deps = ["//some/py:target", "@pypi//numpy:pkg"],
    deps = ["//some/cc:target"],
)

jared2501 commented 1 month ago

@ahojnnes - I literally was looking at this issue earlier in the week, and checked back and you had written exactly what I was trying to write! Thank you so much for posting it.

The only downside to the approach you describe is that if your cc_py_test brings in a transitive dependency, you have to make sure to include all the required python packages in the py_deps field. What I kinda want is to be able to construct a cc_py_library that has cc code & declares the required py deps, and then be able to include that cc_py_library in the deps field of a cc_py_test or cc_py_binary. Thoughts?

ahojnnes commented 1 month ago

@jared2501 I don't think that my rules above have this limitation. You only need to list the imports made in the cc_py_{test,binary} sources. Any transitive dependencies should be automatically added to the runfiles and imports. At least, it worked in some minimal tests for me.

keith commented 3 weeks ago

@ahojnnes does your solution avoid the:

terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to get the Python codec of the filesystem encoding

error (with python 3.10) as described above? I think this comes from PYTHONPATH not including the default lib/python3.10 dir? Even for a single dependency, the PYTHONPATH comes through as external/rules_python~~pip~pip_deps_310_numpy/site-packages, which potentially doesn't exist relative to the cwd of the binary?

keith commented 3 weeks ago

OK, actually I think the issue for me is bzlmod-related: instead of prefixing external/ I needed to prefix ../
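
For anyone hitting the same thing, a sketch of that tweak applied to the _cc_py_runtime_impl rule above (assuming the binary runs out of its own runfiles tree, where external repos under bzlmod sit next to the main repo rather than under external/):

    # Under bzlmod, prefix "../" instead of "external/" so the entries resolve
    # relative to the main repo's runfiles directory.
    python_path = ""
    for path in depset(transitive = imports).to_list():
        python_path += "../" + path + ":"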

JBPennington commented 2 weeks ago

Could someone in this thread cobble together a minimal example of using pybind11's embedding functionality to plot from matplotlib, similar to this but using pybind11_bazel with hermetic python? This has been a key blocker for me in switching to bazel, and I can't figure it out.

JBPennington commented 2 weeks ago

I've actually started an example repo here to get two embedded examples to run, but I'm still having significant issues.