jagerman commented 7 years ago

679 initially added a `detail/` directory to the header includes. It was withdrawn to be discussed later in a separate issue/PR; this is meant to be the start of discussion. Also read the following not as a done deal, but as a request for comments where I'm mostly thinking out loud to start the discussion.

I think we should make the move for most (or even all?) of our detail namespace code. Basically, all the things in class_support.h are an obvious first start to move to detail/class.h; but there are also many other potential additions:

Detail headers

the various template meta-code in common.h (for things like satisfied_all_of, would fit nicely into a detail/meta.h. It's relatively centralized now (in common.h), but that hasn't always been the case and has led to me, at least, occassionally duplicating meta template functionality that was already implemented (but implemented elsewhere, i.e. immediately before it was used/needed).
The same sort of moving of detail namespace bits into detail-namespace companions would be a nice cleanup to the non-detail code. E.g. pybind11.h's generic_type and init classes, keep alive implementation, etc. are all already in the detail namespace; moving them into a detail directory header would be appropriate.

Type casters

Along a similar line, we could move type caster implementations into their own namespace, pybind11::cast, and directory, pybind11/cast. For now, for backwards compatibility, cast would be a namespace alias of detail, but the plan would be to eventually drop the alias so that all type casters live as pybind11::cast::type_caster<T>, thus having external type casters put themselves in the detail namespace.

Thus you would include pybind11/cast/stl.h for stl casters, pybdin11/cast/eigen.h for eigen casters, etc. (The current pybind11/stl.h, pybind11/eigen.h would just one-line compatibility headers that just include the new location).

Many of the built-in casters would be usefully moved as well:

the arithmetic caster in cast/arithmetic.h
std::string and char* casters to cast/string.h
std::pair/std::tuple into cast/tuple.h
pointer casts, null casters, generic casters I think could remain in cast.h.
the argument loader seems a bit out of place in cast.h—I think it's there mainly for historic reasons, since it was originally part of the std::tuple caster. Perhaps detail/argloader.h?

dean0x7d commented 7 years ago

I definitely agree with both the detail and cast reorganization. I think there's one more thing that might be worth considering as this reorganization is taking place: going from header-only to optionally header-only. For example, see opt-in header-only libraries (but in the case of pybind11, it would be opt-out). I bring this up here because this would influence some of the header organization.

There are more details in the linked blog post, but I'll summarize what it would look like for pybind11 (it's actually slightly simplified since pybind11 would only support header-only and static lib configurations but not dynamic lib).

Optional header-only

This would make it possible to compile pybind11 itself as a static library which would help with compile time. This can be done so that it's 100% backward compatible -- existing users would not notice any change and everything would still function as header-only by default. Compiling as a static library would be opt-in, e.g. by defining PYBIND11_STATIC_LIB.

The main idea is that headers would be split up into the expected .h and .cpp files and follow completely normal C++ development. The only departure would be that all functions would need to annotated with a macro, e.g. PYBIND11_FUNC that looks like this:

#if !defined(PYBIND11_STATIC_LIB)
# define PYBIND11_FUNC inline
#else
# define PYBIND11_FUNC
#endif

This would essentially replace inline which is currently required for all non-template functions.

The header and source files would look like this:

// foo.h
#pragma once

PYBIND11_FUNC void foo();

#if !defined(PYBIND11_STATIC_LIB)
# include "foo.cpp"
#endif

// foo.cpp
#include "foo.h"

PYBIND11_FUNC void foo() {
    // implementation
}

The obvious benefit would be compile time. I made a very quick prototype to measure the speedup that could be expected. I only moved generic_type and cpp_function into the static lib and got the numbers below. Note there is a lot more non-template code that could be moved into the static lib. This is measured by compiling pybind11_tests:

macOS/clang 8.0.0
-----------------
header-only
  time make: 96.01s user 3.03s system 121% cpu 1:21.71 total
  file size: 1460552

static lib (prototype -- most of the library is still header-only)
  time make: 77.64s user 2.79s system 120% cpu 1:06.86 total
  file size: 1468768

--> time: +23% faster compilation
--> size: +0.56% (+8216 bytes) larger binary

macOS/GCC 6.3.0
---------------
header-only
  time make: 130.98s user 6.10s system 99% cpu 2:17.59 total
  file size: 1931124

static lib (prototype -- most of the library is still header-only)
  time make: 102.53s user 5.57s system 99% cpu 1:48.59 total
  file size: 1884812

--> time: +28% faster compilation
--> size: -2.40% (-46312 bytes) smaller binary

The performance would still go up as more of the files split up into .h/.cpp pairs. The speedup also becomes larger as the number of translation units is increased, as would be expected. The binary size goes up with clang, but down with GCC. I'd expect that should go down slightly and the regression with clang is most likely due to the quick prototype that this was measured with.

Most of the .h/.cpp split-up is fairly straightforward. It would just need to be coordinated with the reorganization above to avoid moving the files twice.

Thoughts?

aldanor commented 7 years ago

Interesting idea re: header-only opt-in; I recall a few libraries where this is done quite well, like fmt.

If done properly, this will probably involve a bit more work than a mere .h/.cpp split, like -- considering which templates to extern, maybe a few commonly used specialisations could get precompiled; considering whether the guts of some heavier types could be pimpled away; moving the type-parameter-independent functionality out of template classes (luckily, pybind11 development has always been very focused on binary size, so this is already satisfied for the most part).

wjakob commented 7 years ago

@dean0x7d I think that pretty much the only benefactors would be cpp_function and the generic_type_caster as well as some initialization code, the other big and complicated parts are all templates. 25% sounds nice, but I don't think it's enough for me to get super-excited about this (especially since the change is quite intrusive).

cirosantilli2 commented 4 years ago

@wjakob firstly hello Wenzel and contributors and thank you for this awesome software.

I'm working on the system simulator gem5 which makes heavy use of pybind11.

I was feeling that the build was a bit slower than what seemed reasonable, and I started to poke a bit, and it seemed that pybind11, or at least the way we are using it, was responsible for a large part of the problem.

I then tried to simply split pybind11 into header and cpp (we currently have an in tree copy of pybind11 2.4.1), and that alone reduced our full build time by 40% from 25 minutes to 15 minutes, which is, needless to say, a game changing improvement for our project as described at: https://gem5.atlassian.net/browse/GEM5-572

Because of this I was wondering, is there any chance you would review a patch that does such a split, or is there anyone with an ongoing patch?

It would be of course done in a way along what @dean0x7d suggested in an optional opt-in way that would not affect existing users.

Or alternatively, if you see any alternative that does not require modifying pybind11 source, also do let us know. Someone proposed precompiled headers, but I'm not sure the gains would be as significant (haven't tried it yet).

To give a clearer explanation of how gem5 uses pybind11, we basically just use it as a way to pass configuration parameters from python to C++. Our full build has 1271 object files, and 365 of them are simple auto-generated C++ files that include pybind11 to contain parameters for different C++ classes, and those 365 files dominate build times.

One such sample typical file looks like:

#include "pybind11/pybind11.h"
#include "pybind11/stl.h"

#include "params/ThermalReference.hh"
#include "python/pybind11/core.hh"
#include "sim/init.hh"
#include "sim/sim_object.hh"

#include "sim/power/thermal_model.hh"

namespace py = pybind11;

static void
module_init(py::module &m_internal)
{
    py::module m = m_internal.def_submodule("param_ThermalReference");
    py::class_<ThermalReferenceParams, SimObjectParams, std::unique_ptr<ThermalReferenceParams, py::nodelete>>(m, "ThermalReferenceParams")
        .def(py::init<>())
        .def("create", &ThermalReferenceParams::create)
        .def_readwrite("temperature", &ThermalReferenceParams::temperature)
        ;

    py::class_<ThermalReference, SimObject, std::unique_ptr<ThermalReference, py::nodelete>>(m, "ThermalReference")
        .def("setNode", &ThermalReference::setNode)
        ;

}

static EmbeddedPyBind embed_obj("ThermalReference", module_init, "SimObject");

so we can see that each file individually is pretty simple, and the problem really seems to come from redefining common pybind11 objects 365 times.

As a random fact :-) I also saw that one of the reasons why pytorch does not use pybind11 is the build time impact: https://discuss.pytorch.org/t/how-are-python-bindings-created/46453/2?u=cirosantilli so maybe other big projects would also benefit from this.

carlsonmark commented 4 years ago

I also generate pybinds automatically, and was aware of the build time problem from using luabind in the past. I chose to generate one large file with all of my pybinds in it, instead of a bunch of smaller files. This seems to work pretty well, but since I did not generate separate files first, I don't really have anything to compare to.

Just a thought, in case it would work for someone else.

cirosantilli2 commented 4 years ago

@carlsonmark yes, this is one of the possibilities we are considering if it is decided that the split should not happen yet. Hopefully we can avoid that though to prevent slow rebuilds when touching one of the generation inputs.

pybind / pybind11

Header reorganization #708

Detail headers

Type casters

Optional header-only