aldanor opened this issue 8 years ago
Hi Ivan,
I agree that there is a lot of fluff. I'm curious if you've faced performance issues due to this?
One minor correction to the first case above: even when copy is used, the object won't be copied if the address matches an existing object.
I'm not sure how the type_info could be cached in a clean way without messing up the separation between type casters and the rest of the codebase. In any case, I suspect that the C++ hashtable lookup is very fast (much faster than a Python dict lookup).
The untracked policies you suggest sound potentially dangerous (though maybe I'm just missing something). Python is free to do whatever it wants with the resulting instances (e.g. stick them into some data structure that persists). How can you ensure correct lifetime management while avoiding leaks in this case?
@wjakob Yes, the only reason I raised this is because of serious performance issues. I have a C++ codebase that parses network files and streams parsed data structures (on the order of tens of millions), with the average time spent per packet being around 200ns. When wrapping it in a make_iterator (with the default reference_internal), the time per packet rises to around 550-600ns; with the rvp set to copy, it rises further to 750-800ns. All of this extra time is pybind11 overhead, which in this case seems absolutely redundant.
I did a bit of profiling, here's an example of top offenders (this is compiled with something like -O2 -pg):
(pprof) top40
Total: 48298 samples
2815 5.8% 5.8% 2820 5.8% _int_malloc malloc.c:0
2480 5.1% 11.0% 2480 5.1% std::_Hash_bytes ??:0
2301 4.8% 15.7% 6467 13.4% std::_Hashtable::find :0
1750 3.6% 19.4% 1753 3.6% _int_free malloc.c:0
1549 3.2% 22.6% 1549 3.2% std::_Hashtable::_M_find_before_node [clone .isra.1086] [clone .lto_priv.2808] :0
1499 3.1% 25.7% 2519 5.2% std::_Hashtable::equal_range :0
1408 2.9% 28.6% 2115 4.4% _mcount ??:0
1369 2.8% 31.4% 2131 4.4% std::vector::vector :0
1215 2.5% 33.9% 1806 3.7% std::_Hashtable::_M_insert_multi_node :0
1119 2.3% 36.2% 3939 8.2% __GI___libc_malloc :0
1054 2.2% 38.4% 1054 2.2% std::_Hashtable::_M_find_before_node [clone .isra.1321] [clone .lto_priv.1320] :0
1020 2.1% 40.5% 1020 2.1% __memcpy_sse2 :0
1002 2.1% 42.6% 10985 22.7% StreamAdaptor::adaptor::fetch :0 # <-- C++ iterator "next()"
899 1.9% 44.5% 1322 2.7% std::_Hashtable::erase :0
829 1.7% 46.2% 33459 69.3% pybind11::cpp_function::initialize::{lambda#3}::operator [clone .constprop.848] :0
798 1.7% 47.8% 3763 7.8% cache_next_packet [clone .constprop.1239] :0 # C++ iterator
792 1.6% 49.5% 2804 5.8% __GI__IO_fread :0
So, out of the top 50% of cumulative time, roughly 25% (half) is spent directly in hash maps; in fact it's more than that, since many of the mallocs, frees and memcpys are triggered by the hash maps as well.
Hash tables are fast, but not that fast... Multi maps are even worse.
Re: the untracked data structures, we still attach a deallocator to them though (with the difference being that it wouldn't try to erase from the hash map)? Could you provide a hypothetical example of a leak you were talking about, so we're on the same page?
Re: caching type_info, wouldn't it be natural for cpp_function to cache it somehow? It knows its own return type, and types cannot be unregistered, which implies that by the second call (or later) the type_info must already exist, so it may as well be cached for future reuse. It seems a bit wasteful for functions that are executed millions of times to repeat the same hashmap lookup every time.
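As a rough illustration of the idea (all names here are hypothetical stand-ins, not pybind11's actual internals), the cache could be a lazily-filled pointer living next to the function's metadata:

#include <typeindex>
#include <unordered_map>

// Hypothetical sketch of a per-function return-type cache: the first call
// pays for the hash lookup, every later call reuses the cached pointer.
struct type_info_stub { /* registered type details */ };
using type_registry = std::unordered_map<std::type_index, type_info_stub *>;

struct function_record_stub {
    type_info_stub *cached_rtype = nullptr;  // filled on first successful call

    type_info_stub *return_type_info(const std::type_index &t, const type_registry &reg) {
        if (!cached_rtype) {                 // first call: one hash lookup
            auto it = reg.find(t);
            if (it != reg.end())
                cached_rtype = it->second;   // types are never unregistered,
        }                                    // so caching the pointer is safe
        return cached_rtype;                 // later calls: no lookup at all
    }
};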
Here's a branch with an (experimental) return-type caching implementation: it passes all the tests, and the cache is indeed being used as intended (it avoids the hash table type lookup for 62/140 return values in the test suite). See how it affects your profiling results; if it looks promising, I'll submit it as a PR.
It caches it in function_record rather than cpp_function, and has to do a little intermediate templated class trickery to be able to pass through the cache variable only where we want it (i.e. generic types) without needing to add a cache variable to all type_casters.
@jagerman Thanks.
I did a bit of testing with a trivial struct with a single int field:
This is gcc5, -O3 -DNDEBUG, Python 3.5, Linux, E5 2687W, times per iteration:
C++                 |  28ns
python empty loop   |  89ns
master              | 342ns
return-type-caching | 324ns
=> pybind11 overhead goes from 225ns to 207ns, a reduction of roughly 8%.
So the proposed commit seems to work as intended (albeit it's slightly clunky that cast now has a fourth parameter, which is an optional double pointer, but I guess there aren't many options for how to pass a mutable cache...).
Should we also have copy_untracked? :)
This speedup seems very small compared to the profiler output mentioned earlier. How come?
Because return value type lookup is the relatively smaller overhead here. Much of the rest of the hashing cost is in the instance lookup, which, for cast, typically requires two hashes for new instances: one to determine whether the instance is known, and when it isn't, another hash when inserting it.
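A minimal sketch of that double-hash pattern (illustrative names and types, not pybind11's actual code):

#include <unordered_map>

using wrapper_t = void *;  // stand-in for a Python instance pointer
std::unordered_multimap<const void *, wrapper_t> registered_instances;

// Returning a new C++ object to Python costs two hashes: an equal_range()
// to check whether a wrapper already exists for this pointer, and an
// emplace() to register the freshly created wrapper when it doesn't.
wrapper_t cast_instance(const void *cpp_ptr, wrapper_t fresh_wrapper) {
    auto range = registered_instances.equal_range(cpp_ptr);   // hash #1
    if (range.first != range.second)
        return range.first->second;        // known instance: just reuse it
    registered_instances.emplace(cpp_ptr, fresh_wrapper);     // hash #2
    return fresh_wrapper;
}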
Also, just for the record, unordered_multimap vs unordered_map isn't incurring any noticeable difference here: you can switch the definition in common.h (no other code changes needed) and see that timings stay essentially unchanged. It breaks tests, of course, but it shouldn't break your example iterator code. (A multimap only gets slower when you have a lot of equal key values, which is unlikely to be the case in pybind11; we only get that when one pybind-registered type is the first member of another pybind-registered type.)
Yea, there's the instance tracking, which takes time. Plus this is a very simplified example (I'll have to try to rebuild the code I have with these changes to see how it affects a real-world large codebase, but I may not be able to do so until Monday).
Here's a simple benchmark (the "untracked" code can be commented out; it's my local change, which adds a "tracked" flag to the instance wrapper, skips registering on cast, and skips unregistering on dealloc):
// test_cast.cpp
#include <pybind11/pybind11.h>
struct Foo { int v; Foo(int i) : v(i) {} };
PYBIND11_PLUGIN(test_cast) {
namespace py = pybind11;
py::module m("test_cast");
py::class_<Foo>(m, "Foo");
m.def("mk_copy", [](int i) -> Foo { return {i}; },
py::return_value_policy::copy);
m.def("mk_untracked", [](int i) -> Foo { return {i}; },
py::return_value_policy::copy_untracked);
m.def("ref", [](int N) {
for (int i = 0; i < N; i++) {
Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
}
});
return m.ptr();
}
from time import time
from test_cast import mk_copy, mk_untracked, ref
def bench(msg, iterable, norm=1):
n = 0
t0 = time()
for _ in iterable:
n += 1
t1 = time()
n *= norm
print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
.format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 10 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [N]), N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# return_value_policy::copy_untracked
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))
Using @jagerman's branch:
C++: calls: 10000000 elapsed: 0.279s ns/call: 27.9ns
Python: calls: 10000000 elapsed: 1.587s ns/call: 158.7ns
mk_untracked: calls: 10000000 elapsed: 2.573s ns/call: 257.3ns
mk_copy: calls: 10000000 elapsed: 3.298s ns/call: 329.8ns
On master:
mk_copy: calls: 10000000 elapsed: 3.496s ns/call: 349.6ns
@aldanor Re: copy_untracked, how much of an overhead reduction does it give? It ought to be relatively easy to implement (even easier if you just change copy behaviour for testing): skip the instances look-up and emplace in type_caster_generic::cast in cast.h, and delete the if (!found) pybind11_fail(...) from generic_type::dealloc() in pybind11.h. (You might also want to add a bool unregistered to the instance struct so that you can avoid the look-up in dealloc entirely for unregistered types.)
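Roughly, the instance-flag part of that suggestion might look like this (hypothetical names; a sketch rather than the actual pybind11 structs):

// Instances produced by an *_untracked policy would never be put into
// registered_instances, and dealloc would skip the erase for them.
struct instance_stub {
    void *value;
    bool tracked = true;   // proposed flag; false under copy_untracked
};

void dealloc_stub(instance_stub *inst) {
    if (inst->tracked) {
        // today's behaviour: find and erase the entry in
        // registered_instances, failing loudly if it is missing
    }
    // destructor / memory release happens the same way in both cases
}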
Okay, there's my answer :)
@jagerman Yes, that's exactly what I did :)
It would be interesting to see a profiler dump for the mk_untracked version, in case that is easy to make.
Something else to think about: would it make more sense to apply this do-not-track attribute to the type, rather than to the return value? Otherwise I could get weird behaviour with something like:
p = best(p1, p2)
where best is some sort of max-like function that returns a reference to the better instance. Suppose it returns a reference to p1. I'm going to get different behaviour depending on whether p1 came from a tracked or untracked return value. If p1 was tracked, I'll get p is p1. If untracked, what happens? I think I would get two variables that Python thinks are different, but that are bound to the same C++ instance; is this going to work properly?
If, on the other hand, the don't-track-me is a class property, I'll always get consistent behaviour: every function that returns a Foo (in your example) will be returning untracked copies. (In the best call above, I'd always get a p copied from p1 and p2, regardless of where p1 or p2 came from.)
@jagerman I think this would throw a cast exception though (correct me if I'm wrong): if p1 is an untracked return value, then the generic load will fail for the first argument, since the object is not listed in registered instances? In other words, you won't be able to pass untracked values as arguments, which is a legit use case.
I don't think so: load just needs a valid instance, which we still have. Whether it's in registered_instances or not doesn't matter (i.e. under untracked, there is still an instance, it just isn't stored in registered_instances); registered_instances only really plays a role in making sure that multiple Python references to the same C++ object also show up that way in Python.
Oh, right. Well, then we can make it fail? If we have a valid instance, and it's not ->tracked, we could reject the load? So it would be free to stay in Python but banned from getting back to C++.
Is there any reason you shouldn't be able to pass these instances to C++ functions (perhaps there's some other overhead I'm not thinking of)? I'd be a bit uncomfortable with the resulting situation if you couldn't: if the python object was created this way, you can't pass it to a function, but if it was created that way, you can.
On the other hand, if the non-copyability applies to all instances of a particular type, the behaviour of an instance is determined by what the instance is, rather than by where it came from. Even if there is a compelling reason that these untracked instances shouldn't be passable to C++ functions, it still seems better that no Python Foo instance should be passable to C++ than that only some of them are, based on the object's history.
@jagerman No other reason, but to me personally it feels like this is actually a return value policy and not a property of a type (e.g. you can have some functions returning untracked instances of a type, but not other ones?). Need to ponder more on this :)
Just out of curiosity (it should probably be done a bit more carefully), I've implemented a simple proof-of-concept example which rejects reference casts for untracked instances, but accepts them by value: branch / test example.
I've implemented a simple proof-of-concept example which rejects reference casts for untracked instances, but accepts them by value
This is a pretty harsh restriction: it means I can't pass a Python instance to a function that takes a const Type &, which is probably the vast majority of functions/methods out there that take non-primitive, read-only arguments. It's also sort of hitting a nail with a sledgehammer: the problem isn't really with accessing references from C++, it's with returning references back to Python.
To summarize, just to be sure I understand the issues correctly (correct me if I have anything wrong here), the main issue here is that the following will result in a double-free:
m.def("get", [&]() { return new SomeType(); }, py::return_value_policy::copy_untracked);
m.def("pass_through", [&](SomeType &s) { return &s; })
a = get()
b = pass_through(a)
because the pass_through return value caster doesn't know about the internal C++ instance associated with a, and so it takes over ownership. Without copy_untracked on get, this wouldn't be a problem: the caster would see the already-registered instance and just increment the reference count on it. (This also means a is b is true without copy_untracked, but false with it.)
Does that sound about right?
I've implemented a simple proof-of-concept example which rejects reference casts for untracked instances, but accepts them by value:
I don't quite understand why, but this breaks a lot of tests (tests are fine with the untracked commit, but not with the restricting one).
@jagerman Thanks for taking a look. Yea, as I said, the second commit was just playing around with it; it could be done more carefully if need be (one of the problems being that by the time it gets to load, the full type information has been erased via intrinsic_type, so you have no idea whether it used to be a reference or not, which is a bit unwieldy). The untracked commit should be fine (aside from not bypassing the weakref part, as you mentioned in a commit comment).
Looking at your last example: yea, that's the kind of thing that should be prevented with untracked instances one way or another. And preventing untracked instances from being passed as reference arguments may indeed not be the best way to go (although I'm certain it's feasible).
I'm wondering, would your previous suggestion on attaching the "untrackedness" property to the type itself solve this? What kind of limitations would be imposed on such types?
I'm wondering, would your previous suggestion on attaching the "untrackedness" property to the type itself solve this? What kind of limitations would be imposed on such types?
I think so. Basically, if the type is declared untrackable, everything in the many-argumented cast(...) would simply see tinfo->untrackable and then force a copy, basically ignoring whatever return_value_policy was specified on the function. Then it becomes perfectly safe to let the instance come back to C++: if C++ code tries to return by reference, that gets preempted into a copy. It has its own set of limitations, of course (you can never return a reference to pybind), but it seems not quite as big as disallowing the object from ever coming back into C++ code.
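In other words, the policy resolution would boil down to something like this (hedged sketch; the enum is a stand-in and untrackable is the flag being discussed, not existing pybind11 API):

enum class rvp_stub { automatic, copy, reference, reference_internal };

// Whatever policy the function asked for, a type flagged as untrackable
// gets copied, so no reference to the C++ object ever escapes to Python
// and the instance registry can be skipped entirely.
rvp_stub effective_policy(rvp_stub requested, bool type_untrackable) {
    if (type_untrackable)
        return rvp_stub::copy;   // preempt references into copies
    return requested;
}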
Yea, I see. So for simple/small POD types (like the one in the example) this would make sense: you pay the price of copying (which could be negligible) but avoid the hashmap lookup, insert and erase (and, with the proposed change in your branch, avoid any hashmap operations at all beyond the first call).
So when it's returned from C++, it's force-copied. But if I understand correctly this does not prohibit it crossing the boundary from Python to C++ as e.g. a mutable lvalue reference? This is good then. In fact, for near-pointer-sized types, this would often be pretty much a free win.
This is good then. In fact, for near-pointer-sized types, this would often be pretty much a free win.
I think that's the main use case here. If you're writing code that needs to iterate through many large objects, you'll be much better off with a continually updated reference, like in your original reference_internal code from way back in the initial post.
@aldanor take a look at my only-copy-types branch: it's an implementation of the type-based approach.
I adapted your simple benchmark code as follows: test_cast.cpp:
#include <pybind11/pybind11.h>
struct Foo { int v; Foo(int i) : v(i) {} };
struct Bar { int v; Bar(int i) : v(i) {} };
PYBIND11_PLUGIN(test_cast) {
namespace py = pybind11;
py::module m("test_cast");
py::class_<Foo>(m, "Foo", py::always_copy());
py::class_<Bar>(m, "Bar");
m.def("mk_copy", [](int i) -> Bar { return {i}; },
py::return_value_policy::copy);
m.def("mk_untracked", [](int i) -> Foo { return {i}; });
m.def("py_ref", [](int i) { static Bar b(0); b.v = i; return &b; },
py::return_value_policy::reference);
m.def("ref", [](int N) {
for (int i = 0; i < N; i++) {
Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
}
});
return m.ptr();
}
test_cast.py:
from time import time
from test_cast import mk_copy, mk_untracked, py_ref, ref
def bench(msg, iterable, norm=1):
n = 0
t0 = time()
for _ in iterable:
n += 1
t1 = time()
n *= norm
print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
.format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 20 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [10*N]), 10*N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# returns an always_copy type
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# rvp::reference
bench('py_ref', (py_ref(i) for i in range(N)))
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))
which gives the following timings:
C++: calls: 200000000 elapsed: 4.585s ns/call: 22.9ns
Python: calls: 20000000 elapsed: 2.555s ns/call: 127.8ns
mk_untracked: calls: 20000000 elapsed: 3.727s ns/call: 186.4ns
py_ref: calls: 20000000 elapsed: 3.277s ns/call: 163.9ns
mk_copy: calls: 20000000 elapsed: 5.277s ns/call: 263.8ns
or, applied to master (i.e. without the return value type caching):
C++: calls: 200000000 elapsed: 4.636s ns/call: 23.2ns
Python: calls: 20000000 elapsed: 2.423s ns/call: 121.2ns
mk_untracked: calls: 20000000 elapsed: 4.348s ns/call: 217.4ns
py_ref: calls: 20000000 elapsed: 3.683s ns/call: 184.1ns
mk_copy: calls: 20000000 elapsed: 6.206s ns/call: 310.3ns
I think it's an ok compromise if you do want/need to make a copy. For a user-facing API, I find that an iterator with an internal-reference type may be quite confusing, since you will essentially get the same object in Python on every step, and if you start collecting the yielded items in a collection the results may be unexpected.
Also, if you want the absolutely fastest version with an internal-reference iterator, you can slightly modify make_iterator and iterator_state so that the handle is cached; then you can just incref it on each iteration, and this would do no hashmap lookups either.
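A hedged sketch of that modification (modelled on the iterator_state pattern of that era; the cached member is the proposed addition, not existing pybind11 API):

#include <pybind11/pybind11.h>
namespace py = pybind11;

// Assuming the C++ iterator always dereferences to the same underlying
// object (the reference_internal case above), the Python wrapper can be
// created once and merely increfed on every subsequent iteration.
template <typename Iterator, typename Sentinel>
struct caching_iterator_state {
    Iterator it;
    Sentinel end;
    bool first = true;
    py::object cached;   // proposed cache of the yielded wrapper

    py::object next() {
        if (!first) ++it; else first = false;
        if (it == end)
            throw py::stop_iteration();
        if (!cached)     // only the very first call performs a cast
            cached = py::cast(*it, py::return_value_policy::reference);
        return cached;   // returning a py::object copy is just an incref
    }
};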
On my home OS X box, I get slower timings but same ordering:
C++: calls: 20000000 elapsed: 1.015s ns/call: 50.8ns
Python: calls: 2000000 elapsed: 0.332s ns/call: 166.1ns
mk_untracked: calls: 2000000 elapsed: 0.589s ns/call: 294.6ns
py_ref: calls: 2000000 elapsed: 0.416s ns/call: 208.2ns
mk_copy: calls: 2000000 elapsed: 0.884s ns/call: 442.0ns
What I find a bit strange is that 208 + 50.8 < 294.6, since copying an int-sized struct shouldn't take twice as much time as inserting into and erasing from a hashmap? Maybe the compiler optimizes something in an unobvious way, or maybe one version is less cache friendly; not sure about that.
As Wenzel mentioned above, it would be nice to pprof this, also to see where the rest of the time goes in pybind11 (unfortunately, pprof/gperftools doesn't resolve all symbols on OS X, so if someone on Linux could check it that'd be great; otherwise I'll try to check it next week myself).
What I find a bit strange is that 208 + 50.8 < 294.6, since copying an int-sized struct shouldn't take twice as much time as inserting into and erasing from a hashmap?
It has to do more than just that: there's also the python instance initialisation and deallocation, which isn't needed when the registered instance already exists.
perf profiles:
With all the hashtable stuff gone with always_copy, there's really not much left in terms of obvious pybind11 overhead.
Thanks for the profiles, that's what I would expect. I was thinking, would this have to be restricted to copying? Does always_move make sense? Or always_copy_or_move?
I think move support would have to come in via the return_value_policy: so if the object is always_copy, we check for return_value_policy::move and if so, move; otherwise we copy.
Yes, good point. always_copy might be a bit misleading then though, as a name :)
Agreed, but always_copy_or_move seems overly verbose.
untracked / no_track / no_reference / never_reference as alternatives? always_copy is not too bad either though, aside from the slight move rvp confusion.
So, everything looks pretty good, are we going ahead with all this then?
I vote for no_reference.
I'd sort of like to wait for #385 (which looks like it will depend on #372 for the all_of_t bits), so that this can be declared via
py::class_<Name, py::no_reference>(m, "Name")
which feels like a more natural declaration to me than:
py::class_<Name>(m, "Name", py::no_reference())
(mainly because it keeps the tag with the C++ class).
I'm a bit lost in the discussion above which considered a variety of possible changes. What is the specific proposal that is under consideration now? Is there candidate code to look at?
There will be shortly; I have to get #385 updated first as this now depends on that.
I've updated my only-copy-types branch: it now supports return_value_policy::move, and takes the (renamed) py::no_reference as a class_ template option instead of a constructor parameter.
It doesn't currently have documentation or test code committed; these are the current versions of the test/benchmark code I've been using (which also show how it works):
test_cast.cpp
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
struct Foo { int v; Foo(int i) : v(i) {} };
struct Bar { int v; Bar(int i) : v(i) {} };
PYBIND11_PLUGIN(test_cast) {
namespace py = pybind11;
py::module m("test_cast");
py::class_<Foo, py::no_reference>(m, "Foo");
py::class_<Bar>(m, "Bar");
m.def("mk_copy", [](int i) -> Bar { return {i}; },
py::return_value_policy::copy);
m.def("mk_untracked", [](int i) -> Foo { return {i}; });
m.def("py_ref", [](int i) { static Bar b(0); b.v = i; return &b; }, py::return_value_policy::reference);
m.def("ref", [](int N) {
for (int i = 0; i < N; i++) {
Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
}
});
m.def("ref_lambda", [](int N, py::object o) {
for (int i = 0; i < N; i++) {
o(Foo(i));
}
});
return m.ptr();
}
test_cast.py
from time import time
from test_cast import mk_copy, mk_untracked, py_ref, ref, ref_lambda
def bench(msg, iterable, norm=1):
n = 0
t0 = time()
for _ in iterable:
n += 1
t1 = time()
n *= norm
print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
.format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 20 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [10*N]), 10*N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# returns an always_copy type
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# rvp::reference
bench('py_ref', (py_ref(i) for i in range(N)))
# Loop in C++ calling python lambda
bench('C++ w/ lambda', (ref_lambda(x, lambda z: z) for x in [N]), N)
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))
There are really two separate issues here: "Iterators with internal reference" and "No internal reference or copy r/v policy"; #390 addresses the first one, while only-copy-types is a potential solution to the second.
@aldanor Just out of curiosity, I added a "C++ w/ lambda" case into the test above: it manages to perform slightly better than either the reference or copy approach.
C++: calls: 200000000 elapsed: 4.753s ns/call: 23.8ns
Python: calls: 20000000 elapsed: 2.373s ns/call: 118.6ns
mk_untracked: calls: 20000000 elapsed: 3.712s ns/call: 185.6ns
py_ref: calls: 20000000 elapsed: 3.254s ns/call: 162.7ns
C++ w/ lambda: calls: 20000000 elapsed: 2.975s ns/call: 148.8ns
mk_copy: calls: 20000000 elapsed: 5.695s ns/call: 284.7ns
Re: C++ w/ lambda, I guess it's partly because the loop is now in C++ so Python doesn't have to read/write from the locals dict a gazillion times? That, plus no hashmap lookups. Quite surprising it's so fast nonetheless, this is really good stuff :)
@aldanor I'm not going to have time to finish this off in the next little while, but if you want to clone the branch and submit a PR, please feel free. I think it's pretty much done, it just needs some documentation and tests (and a proper PR).
@jagerman Alright no worries, I'll try to finish it up, I think it's worth doing, especially if return type cache PR lands as well.
The return value caching in PR #390 was retracted in favour of including it as part of the eventual PR to address both issues here at once.
In the current implementation, the iterator wrapping turns out to be very inefficient at times (to the point where it adds overhead that is an order of magnitude bigger than the iterator logic itself).
Iterators with internal reference
Consider a hypothetical example (this would be quite typical for stream-like / input ranges):
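(The snippet from the original post isn't preserved here; a minimal sketch of the kind of range being described, with illustrative names, might look like this:)

#include <cstddef>

struct point { float x, y; };

// A stream-like input range: dereferencing always returns a reference to
// the same internal buffer, which advancing overwrites with the next value.
struct point_stream {
    struct iterator {
        point current;
        std::size_t remaining;
        const point &operator*() const { return current; }
        iterator &operator++() { current.x += 1.f; --remaining; return *this; }
        bool operator!=(const iterator &other) const { return remaining != other.remaining; }
    };
    std::size_t n;
    iterator begin() const { return {{0.f, 0.f}, n}; }
    iterator end() const { return {{0.f, 0.f}, 0}; }
};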
Assuming point has a corresponding py::class_ binding, wrapping this range in a py::make_iterator() with the default return value policy of reference_internal and iterating over it in Python will do roughly the following for each yielded point instance: look up its type_info (why isn't this cached per each py::cpp_function?), then look up the instance itself in the registered instances map. Note that this will essentially yield the same Python object on every iteration, but it will still perform two map lookups every time.
In this example, this could be completely avoided if py::iterator_state cached the resulting py::object (whose ->value always points to the same C++ instance in this example, so it doesn't even need to be reallocated) and not just the current state of the begin/end iterators. Instead of doing the cast, it could just incref the object and return the handle.
No internal reference or copy r/v policy
If the C++ iterator returns an object with a different address on each dereference, or if we specify the copy return value policy, things get even worse.
The sequence of steps is now, for each yielded point instance: look up its type_info, allocate a new Python instance, copy the C++ object into it, and register it in the instances multimap; later, on deallocation, look it up again to unregister it and call the destructor.
If the downstream Python code doesn't care about yielded values outside of one iteration cycle and doesn't pass them as arguments to other functions, which is quite often the case (e.g. a plain for loop that only reads each yielded item's fields, or a comprehension that extracts plain values from them), then registering/unregistering instances adds overhead that is completely unneeded. Plus, between garbage collection cycles, the registered instances multimap will grow quite fast here, which will slow things down even further. Here, the sequence of steps might as well just be: allocate the instance, copy the value in, and destroy it on deallocation, with no registry involvement at all.
Would it make sense to have a return value policy copy_untracked (cast doesn't do the registered-instance lookup; dealloc calls the dtor and doesn't try to unregister)? Or reference_untracked (cast doesn't do the registered-instance lookup; dealloc doesn't call the dtor and doesn't try to unregister), or something like that? The only catch here is that some sort of flag must be stored in the instance itself, so that the deallocator knows not to try and erase it from the registered instances multimap when the time comes.