aldanor opened this issue 8 years ago
Hi Ivan,
I agree that there is a lot of fluff. I'm curious if you've faced performance issues due to this?
One minor correction to the first case above: even when copy is used, the object won't be copied if the address matches an existing object.
I'm not sure how the type_info could be cached in a clean way without messing up the separation between type casters and the rest of the codebase. In any case, I suspect that the C++ hashtable lookup is very fast (much faster than a Python dict lookup).
The untracked policies you suggest sound potentially dangerous (though maybe I'm just missing something). Python is free to do whatever it wants with the resulting instances (e.g. stick them into some data structure that persists). How can you ensure correct lifetime management while avoiding leaks in this case?
@wjakob Yes, the only reason I raised this is because of serious performance issues. I have a C++ codebase that parses network files and streams parsed data structures (on the order of tens of millions), with the average time spent per packet being around 200ns. When wrapping it in a make_iterator (with the default reference_internal), the time per packet rises to around 550-600ns; with the rvp set to copy, it rises further to 750-800ns. All of this extra time is pybind11 overhead, which in this case seems absolutely redundant.
I did a bit of profiling, here's an example of top offenders (this is compiled with something like -O2 -pg):
(pprof) top40
Total: 48298 samples
2815 5.8% 5.8% 2820 5.8% _int_malloc malloc.c:0
2480 5.1% 11.0% 2480 5.1% std::_Hash_bytes ??:0
2301 4.8% 15.7% 6467 13.4% std::_Hashtable::find :0
1750 3.6% 19.4% 1753 3.6% _int_free malloc.c:0
1549 3.2% 22.6% 1549 3.2% std::_Hashtable::_M_find_before_node [clone .isra.1086] [clone .lto_priv.2808] :0
1499 3.1% 25.7% 2519 5.2% std::_Hashtable::equal_range :0
1408 2.9% 28.6% 2115 4.4% _mcount ??:0
1369 2.8% 31.4% 2131 4.4% std::vector::vector :0
1215 2.5% 33.9% 1806 3.7% std::_Hashtable::_M_insert_multi_node :0
1119 2.3% 36.2% 3939 8.2% __GI___libc_malloc :0
1054 2.2% 38.4% 1054 2.2% std::_Hashtable::_M_find_before_node [clone .isra.1321] [clone .lto_priv.1320] :0
1020 2.1% 40.5% 1020 2.1% __memcpy_sse2 :0
1002 2.1% 42.6% 10985 22.7% StreamAdaptor::adaptor::fetch :0 # <-- C++ iterator "next()"
899 1.9% 44.5% 1322 2.7% std::_Hashtable::erase :0
829 1.7% 46.2% 33459 69.3% pybind11::cpp_function::initialize::{lambda#3}::operator [clone .constprop.848] :0
798 1.7% 47.8% 3763 7.8% cache_next_packet [clone .constprop.1239] :0 # C++ iterator
792 1.6% 49.5% 2804 5.8% __GI__IO_fread :0
So, out of the top 50% of cumulative time, roughly 25% (half) is spent directly in hash maps; in fact it's more than that, since many of the mallocs, frees and memcpys are triggered by the hash maps as well.
Hash tables are fast, but not that fast... Multi maps are even worse.
Re: the untracked data structures, we still attach a deallocator to them though (with the difference being that it wouldn't try to erase from the hash map)? Could you provide a hypothetical example of a leak you were talking about, so we're on the same page?
Re: caching type_info, wouldn't it be natural for cpp_function to cache it somehow? It knows its own return type, and types cannot be unregistered, which implies that by the second call (or later) the type_info must already exist, so it may as well be cached for future reuse. It seems a bit wasteful for functions that are executed millions of times to repeat the same hashmap lookup every time.
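As a rough illustration of the idea (all names here are hypothetical stand-ins, not pybind11's actual internals), the cache could be a lazily-filled pointer living next to the function's metadata:

#include <typeindex>
#include <unordered_map>

// Hypothetical sketch of a per-function return-type cache: the first call
// pays for the hash lookup, every later call reuses the cached pointer.
struct type_info_stub { /* registered type details */ };
using type_registry = std::unordered_map<std::type_index, type_info_stub *>;

struct function_record_stub {
    type_info_stub *cached_rtype = nullptr;  // filled on first successful call

    type_info_stub *return_type_info(const std::type_index &t, const type_registry &reg) {
        if (!cached_rtype) {                 // first call: one hash lookup
            auto it = reg.find(t);
            if (it != reg.end())
                cached_rtype = it->second;   // types are never unregistered,
        }                                    // so caching the pointer is safe
        return cached_rtype;                 // later calls: no lookup at all
    }
};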
Here's a branch with an (experimental) return-type caching implementation: it passes all the tests, and the cache is indeed being used as intended (it avoids the hash table type lookup for 62/140 return values in the test suite). See how it affects your profiling results; if it looks promising, I'll submit it as a PR.
It caches it in function_record rather than cpp_function, and has to do a little intermediate templated class trickery to be able to pass through the cache variable only where we want it (i.e. generic types) without needing to add a cache variable to all type_casters.
@jagerman Thanks.
I did a bit of testing with a trivial struct with a single int field:
This is gcc5, -O3 -DNDEBUG, Python 3.5, Linux, E5 2687W, times per iteration:
C++                 |  28ns
python empty loop   |  89ns
master              | 342ns
return-type-caching | 324ns
=> pybind11 overhead goes from 225ns to 207ns, a reduction of roughly 8%.
So the proposed commit seems to work as intended (albeit it's slightly clunky that cast now has a fourth parameter, which is an optional double pointer, but I guess there aren't many options for how to pass a mutable cache...).
Should we also have copy_untracked? :)
This speedup seems very small compared to the profiler output mentioned earlier. How come?
Because return value type lookup is the relatively smaller overhead here. Much of the rest of the hashing cost is in the instance lookup, which, for cast, typically requires two hashes for new instances: one to determine whether the instance is known, and when it isn't, another hash when inserting it.
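A minimal sketch of that double-hash pattern (illustrative names and types, not pybind11's actual code):

#include <unordered_map>

using wrapper_t = void *;  // stand-in for a Python instance pointer
std::unordered_multimap<const void *, wrapper_t> registered_instances;

// Returning a new C++ object to Python costs two hashes: an equal_range()
// to check whether a wrapper already exists for this pointer, and an
// emplace() to register the freshly created wrapper when it doesn't.
wrapper_t cast_instance(const void *cpp_ptr, wrapper_t fresh_wrapper) {
    auto range = registered_instances.equal_range(cpp_ptr);   // hash #1
    if (range.first != range.second)
        return range.first->second;        // known instance: just reuse it
    registered_instances.emplace(cpp_ptr, fresh_wrapper);     // hash #2
    return fresh_wrapper;
}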
Also, just for the record, unordered_multimap vs unordered_map isn't incurring any noticeable difference here: you can switch the definition in common.h (no other code changes needed) and see that timings stay essentially unchanged. It breaks tests, of course, but it shouldn't break your example iterator code. (A multimap only gets slower when you have a lot of equal key values, which is unlikely to be the case in pybind11; we only get that when one pybind-registered type is the first member of another pybind-registered type.)
Yea, there's the instance tracking, which takes time. Plus this is a very simplified example (I'll have to try to rebuild the code I have with these changes to see how it affects a real-world large codebase, but I may not be able to do so until Monday).
Here's a simple benchmark (the "untracked" code can be commented out; it's my local change, which adds a "tracked" flag to the instance wrapper, skips registering on cast, and skips unregistering on dealloc):
// test_cast.cpp
#include <pybind11/pybind11.h>
struct Foo { int v; Foo(int i) : v(i) {} };
PYBIND11_PLUGIN(test_cast) {
namespace py = pybind11;
py::module m("test_cast");
py::class_<Foo>(m, "Foo");
m.def("mk_copy", [](int i) -> Foo { return {i}; },
py::return_value_policy::copy);
m.def("mk_untracked", [](int i) -> Foo { return {i}; },
py::return_value_policy::copy_untracked);
m.def("ref", [](int N) {
for (int i = 0; i < N; i++) {
Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
}
});
return m.ptr();
}
from time import time
from test_cast import mk_copy, mk_untracked, ref
def bench(msg, iterable, norm=1):
n = 0
t0 = time()
for _ in iterable:
n += 1
t1 = time()
n *= norm
print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
.format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 10 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [N]), N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# return_value_policy::copy_untracked
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))
Using @jagerman's branch:
C++: calls: 10000000 elapsed: 0.279s ns/call: 27.9ns
Python: calls: 10000000 elapsed: 1.587s ns/call: 158.7ns
mk_untracked: calls: 10000000 elapsed: 2.573s ns/call: 257.3ns
mk_copy: calls: 10000000 elapsed: 3.298s ns/call: 329.8ns
On master:
mk_copy: calls: 10000000 elapsed: 3.496s ns/call: 349.6ns
@aldanor Re: copy_untracked, how much of an overhead reduction does it give? It ought to be relatively easy to implement (even easier if you just change copy behaviour for testing): skip the instances look-up and emplace in type_caster_generic::cast in cast.h, and delete the if (!found) pybind11_fail(...) from generic_type::dealloc() in pybind11.h. (You might also want to add a bool unregistered to the instance struct so that you can avoid the look-up in dealloc entirely for unregistered types.)
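Roughly, the instance-flag part of that suggestion might look like this (hypothetical names; a sketch rather than the actual pybind11 structs):

// Instances produced by an *_untracked policy would never be put into
// registered_instances, and dealloc would skip the erase for them.
struct instance_stub {
    void *value;
    bool tracked = true;   // proposed flag; false under copy_untracked
};

void dealloc_stub(instance_stub *inst) {
    if (inst->tracked) {
        // today's behaviour: find and erase the entry in
        // registered_instances, failing loudly if it is missing
    }
    // destructor / memory release happens the same way in both cases
}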
Okay, there's my answer :)
@jagerman Yes, that's exactly what I did :)
It would be interesting to see a profiler dump for the mk_untracked version, in case that is easy to make.
Something else to think about: would it make more sense to apply this do-not-track attribute to the type, rather than to the return value? Otherwise I could get weird behaviour with something like:
p = best(p1, p2)
where best is some sort of max-like function that returns a reference to the better instance. Suppose it returns a reference to p1. I'm going to get different behaviour depending on whether p1 came from a tracked or untracked return value. If p1 was tracked, I'll get p is p1. If untracked, what happens? I think I would get two variables that Python thinks are different, but that are bound to the same C++ instance; is this going to work properly?
If, on the other hand, the don't-track-me is a class property, I'll always get consistent behaviour: every function that returns a Foo (in your example) will be returning untracked copies. (In the best call above, I'd always get a p copied from p1 and p2, regardless of where p1 or p2 came from.)
@jagerman I think this would throw a cast exception though (correct me if I'm wrong): if p1 is an untracked return value, then the generic load will fail for the first argument, since the object is not listed in registered instances? In other words, you won't be able to pass untracked values as arguments, which is a legit use case.
I don't think so: load just needs a valid instance, which we still have. Whether it's in registered_instances or not doesn't matter (i.e. under untracked, there is still an instance, it just isn't stored in registered_instances); registered_instances only really plays a role in making sure that multiple Python references to the same C++ object also show up that way in Python.
Oh, right. Well, then we can make it fail? If we have a valid instance, and it's not ->tracked, we could reject the load? So it would be free to stay in Python but banned from getting back to C++.
Is there any reason you shouldn't be able to pass these instances to C++ functions (perhaps there's some other overhead I'm not thinking of)? I'd be a bit uncomfortable with the resulting situation if you couldn't: if the python object was created this way, you can't pass it to a function, but if it was created that way, you can.
On the other hand, if the non-copyability applies to all instances of a particular type, the behaviour of an instance is determined by what the instance is, rather than by where it came from. Even if there is a compelling reason that these untracked instances shouldn't be passable to C++ functions, it still seems better that no Python Foo instance should be passable to C++ than that only some of them are, based on the object's history.
@jagerman No other reason, but to me personally it feels like this is actually a return value policy and not a property of a type (e.g. you can have some functions returning untracked instances of a type, but not other ones?). Need to ponder more on this :)
Just out of curiosity (it should probably be done a bit more carefully), I've implemented a simple proof-of-concept example which rejects reference casts for untracked instances, but accepts them by value: branch / test example.
I've implemented a simple proof-of-concept example which rejects reference casts for untracked instances, but accepts them by value
This is a pretty harsh restriction: it means I can't pass a Python instance to a function that takes a const Type &, which is probably the vast majority of functions/methods out there that take non-primitive, read-only arguments. It's also sort of hitting a nail with a sledgehammer: the problem isn't really with accessing references from C++, it's with returning references back to Python.
To summarize, just to be sure I understand the issues correctly (correct me if I have anything wrong here), the main issue here is that the following will result in a double-free:
m.def("get", [&]() { return new SomeType(); }, py::return_value_policy::copy_untracked);
m.def("pass_through", [&](SomeType &s) { return &s; })
a = get()
b = pass_through(a)
because the pass_through return value caster doesn't know about the internal C++ instance associated with a, and so it takes over ownership. Without copy_untracked on get, this wouldn't be a problem: the caster would see the already-registered instance and just increment the reference count on it. (This also means a is b is true without copy_untracked, but false with it.)
Does that sound about right?
I've implemented a simple proof-of-concept example which rejects reference casts for untracked instances, but accepts them by value:
I don't quite understand why, but this breaks a lot of tests (tests are fine with the untracked commit, but not with the restricting one).
@jagerman Thanks for taking a look. Yea, as I said, the second commit was just playing around with it; it could be done more carefully if need be (one of the problems being that by the time it gets to load, the full type information has been erased via intrinsic_type, so you have no idea whether it used to be a reference or not, which is a bit unwieldy). The untracked commit should be fine (aside from not bypassing the weakref part, as you mentioned in a commit comment).
Looking at your last example: yea, that's the kind of thing that should be prevented with untracked instances one way or another. And preventing untracked instances from being passed as reference arguments may indeed not be the best way to go (although I'm certain it's feasible).
I'm wondering, would your previous suggestion on attaching the "untrackedness" property to the type itself solve this? What kind of limitations would be imposed on such types?
I'm wondering, would your previous suggestion on attaching the "untrackedness" property to the type itself solve this? What kind of limitations would be imposed on such types?
I think so. Basically, if the type is declared untrackable, everything in the many-argumented cast(...) would simply see tinfo->untrackable and then force a copy, basically ignoring whatever return_value_policy was specified on the function. Then it becomes perfectly safe to let the instance come back to C++: if C++ code tries to return by reference, that gets preempted into a copy. It has its own set of limitations, of course (you can never return a reference to pybind), but it seems not quite as big as disallowing the object from ever coming back into C++ code.
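In other words, the policy resolution would boil down to something like this (hedged sketch; the enum is a stand-in and untrackable is the flag being discussed, not existing pybind11 API):

enum class rvp_stub { automatic, copy, reference, reference_internal };

// Whatever policy the function asked for, a type flagged as untrackable
// gets copied, so no reference to the C++ object ever escapes to Python
// and the instance registry can be skipped entirely.
rvp_stub effective_policy(rvp_stub requested, bool type_untrackable) {
    if (type_untrackable)
        return rvp_stub::copy;   // preempt references into copies
    return requested;
}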
Yea, I see. So for simple/small POD types (like the one in the example) this would make sense: you pay the price of copying (which could be negligible) but avoid the hashmap lookup, insert and erase (and, with the proposed change in your branch, avoid any hashmap operations at all beyond the first call).
So when it's returned from C++, it's force-copied. But if I understand correctly this does not prohibit it crossing the boundary from Python to C++ as e.g. a mutable lvalue reference? This is good then. In fact, for near-pointer-sized types, this would often be pretty much a free win.
This is good then. In fact, for near-pointer-sized types, this would often be pretty much a free win.
I think that's the main use case here. If you're writing code that needs to iterate through many large objects, you'll be much better off with a continually updated reference, like in your original reference_internal code from way back in the initial post.
@aldanor take a look at my only-copy-types branch: it's an implementation of the type-based approach.
I adapted your simple benchmark code as follows: test_cast.cpp:
#include <pybind11/pybind11.h>
struct Foo { int v; Foo(int i) : v(i) {} };
struct Bar { int v; Bar(int i) : v(i) {} };
PYBIND11_PLUGIN(test_cast) {
namespace py = pybind11;
py::module m("test_cast");
py::class_<Foo>(m, "Foo", py::always_copy());
py::class_<Bar>(m, "Bar");
m.def("mk_copy", [](int i) -> Bar { return {i}; },
py::return_value_policy::copy);
m.def("mk_untracked", [](int i) -> Foo { return {i}; });
m.def("py_ref", [](int i) { static Bar b(0); b.v = i; return &b; },
py::return_value_policy::reference);
m.def("ref", [](int N) {
for (int i = 0; i < N; i++) {
Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
}
});
return m.ptr();
}
test_cast.py:
from time import time
from test_cast import mk_copy, mk_untracked, py_ref, ref
def bench(msg, iterable, norm=1):
n = 0
t0 = time()
for _ in iterable:
n += 1
t1 = time()
n *= norm
print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
.format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 20 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [10*N]), 10*N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# returns an always_copy type
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# rvp::reference
bench('py_ref', (py_ref(i) for i in range(N)))
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))
which gives the following timings:
C++: calls: 200000000 elapsed: 4.585s ns/call: 22.9ns
Python: calls: 20000000 elapsed: 2.555s ns/call: 127.8ns
mk_untracked: calls: 20000000 elapsed: 3.727s ns/call: 186.4ns
py_ref: calls: 20000000 elapsed: 3.277s ns/call: 163.9ns
mk_copy: calls: 20000000 elapsed: 5.277s ns/call: 263.8ns
or, applied to master (i.e. without the return value type caching):
C++: calls: 200000000 elapsed: 4.636s ns/call: 23.2ns
Python: calls: 20000000 elapsed: 2.423s ns/call: 121.2ns
mk_untracked: calls: 20000000 elapsed: 4.348s ns/call: 217.4ns
py_ref: calls: 20000000 elapsed: 3.683s ns/call: 184.1ns
mk_copy: calls: 20000000 elapsed: 6.206s ns/call: 310.3ns
I think it's an ok compromise if you do want/need to make a copy. For a user-facing API, I find that an iterator with an internal-reference type may be quite confusing, since you will essentially get the same object in Python on every step, and if you start collecting the yielded items in a collection the results may be unexpected.
Also, if you want the absolutely fastest version with an internal-reference iterator, you can slightly modify make_iterator and iterator_state so that the handle is cached; then you can just incref it on each iteration, and this would do no hashmap lookups either.
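A hedged sketch of that modification (modelled on the iterator_state pattern of that era; the cached member is the proposed addition, not existing pybind11 API):

#include <pybind11/pybind11.h>
namespace py = pybind11;

// Assuming the C++ iterator always dereferences to the same underlying
// object (the reference_internal case above), the Python wrapper can be
// created once and merely increfed on every subsequent iteration.
template <typename Iterator, typename Sentinel>
struct caching_iterator_state {
    Iterator it;
    Sentinel end;
    bool first = true;
    py::object cached;   // proposed cache of the yielded wrapper

    py::object next() {
        if (!first) ++it; else first = false;
        if (it == end)
            throw py::stop_iteration();
        if (!cached)     // only the very first call performs a cast
            cached = py::cast(*it, py::return_value_policy::reference);
        return cached;   // returning a py::object copy is just an incref
    }
};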
On my home OS X box, I get slower timings but same ordering:
C++: calls: 20000000 elapsed: 1.015s ns/call: 50.8ns
Python: calls: 2000000 elapsed: 0.332s ns/call: 166.1ns
mk_untracked: calls: 2000000 elapsed: 0.589s ns/call: 294.6ns
py_ref: calls: 2000000 elapsed: 0.416s ns/call: 208.2ns
mk_copy: calls: 2000000 elapsed: 0.884s ns/call: 442.0ns
What I find a bit strange is that 208 + 50.8 < 294.6, since copying an int-sized struct shouldn't take twice as much time as inserting into and erasing from a hashmap? Maybe the compiler optimizes something in an unobvious way, or maybe one version is less cache friendly; not sure about that.
As Wenzel mentioned above, it would be nice to pprof this, also to see where the rest of the time goes in pybind11 (unfortunately, pprof/gperftools doesn't resolve all symbols on OS X, so if someone on Linux could check it that'd be great; otherwise I'll try to check it next week myself).
What I find a bit strange is that 208 + 50.8 < 294.6, since copying an int-sized struct shouldn't take twice as much time as inserting into and erasing from a hashmap?
It has to do more than just that: there's also the python instance initialisation and deallocation, which isn't needed when the registered instance already exists.
perf profiles:
With all the hashtable stuff gone with always_copy, there's really not much left in terms of obvious pybind11 overhead.
Thanks for the profiles, that's what I would expect. I was thinking, would this have to be restricted to copying? Does always_move make sense? Or always_copy_or_move?
I think move support would have to come in via the return_value_policy: so if the object is always_copy, we check for return_value_policy::move and if so, move; otherwise we copy.
Yes, good point. always_copy might be a bit misleading then though, as a name :)
Agreed, but always_copy_or_move seems overly verbose.
untracked / no_track / no_reference / never_reference as alternatives? always_copy is not too bad either though, aside from the slight move rvp confusion.
So, everything looks pretty good, are we going ahead with all this then?
I vote for no_reference.
I'd sort of like to wait for #385 (which looks like it will depend on #372 for the all_of_t bits), so that this can be declared via
py::class_<Name, py::no_reference>(m, "Name")
which feels like a more natural declaration to me than:
py::class_<Name>(m, "Name", py::no_reference())
(mainly because it keeps the tag with the C++ class).
I'm a bit lost in the discussion above which considered a variety of possible changes. What is the specific proposal that is under consideration now? Is there candidate code to look at?
There will be shortly; I have to get #385 updated first as this now depends on that.
I've updated my only-copy-types branch: it now supports return_value_policy::move, and takes the (renamed) py::no_reference as a class_ template option instead of a constructor parameter.
It doesn't currently have documentation or test code committed; these are the current versions of the test/benchmark code I've been using (which also show how it works):
test_cast.cpp
#include <pybind11/pybind11.h>
#include <pybind11/functional.h>
struct Foo { int v; Foo(int i) : v(i) {} };
struct Bar { int v; Bar(int i) : v(i) {} };
PYBIND11_PLUGIN(test_cast) {
namespace py = pybind11;
py::module m("test_cast");
py::class_<Foo, py::no_reference>(m, "Foo");
py::class_<Bar>(m, "Bar");
m.def("mk_copy", [](int i) -> Bar { return {i}; },
py::return_value_policy::copy);
m.def("mk_untracked", [](int i) -> Foo { return {i}; });
m.def("py_ref", [](int i) { static Bar b(0); b.v = i; return &b; }, py::return_value_policy::reference);
m.def("ref", [](int N) {
for (int i = 0; i < N; i++) {
Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
}
});
m.def("ref_lambda", [](int N, py::object o) {
for (int i = 0; i < N; i++) {
o(Foo(i));
}
});
return m.ptr();
}
test_cast.py
from time import time
from test_cast import mk_copy, mk_untracked, py_ref, ref, ref_lambda
def bench(msg, iterable, norm=1):
n = 0
t0 = time()
for _ in iterable:
n += 1
t1 = time()
n *= norm
print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
.format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 20 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [10*N]), 10*N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# returns an always_copy type
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# rvp::reference
bench('py_ref', (py_ref(i) for i in range(N)))
# Loop in C++ calling python lambda
bench('C++ w/ lambda', (ref_lambda(x, lambda z: z) for x in [N]), N)
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))
There are really two separate issues here: "Iterators with internal reference" and "No internal reference or copy r/v policy"; #390 addresses the first one, while only-copy-types is a potential solution to the second.
@aldanor Just out of curiosity, I added a "C++ w/ lambda" case into the test above: it manages to perform slightly better than either the reference or copy approach.
C++: calls: 200000000 elapsed: 4.753s ns/call: 23.8ns
Python: calls: 20000000 elapsed: 2.373s ns/call: 118.6ns
mk_untracked: calls: 20000000 elapsed: 3.712s ns/call: 185.6ns
py_ref: calls: 20000000 elapsed: 3.254s ns/call: 162.7ns
C++ w/ lambda: calls: 20000000 elapsed: 2.975s ns/call: 148.8ns
mk_copy: calls: 20000000 elapsed: 5.695s ns/call: 284.7ns
Re: C++ w/ lambda, I guess it's partly because the loop is now in C++ so Python doesn't have to read/write from the locals dict a gazillion times? That, plus no hashmap lookups. Quite surprising it's so fast nonetheless, this is really good stuff :)
@aldanor I'm not going to have time to finish this off in the next little while, but if you want to clone the branch and submit a PR, please feel free. I think it's pretty much done, it just needs some documentation and tests (and a proper PR).
@jagerman Alright no worries, I'll try to finish it up, I think it's worth doing, especially if return type cache PR lands as well.
The return value caching in PR #390 was retracted in favour of including it as part of the eventual PR to address both issues here at once.
In the current implementation, the iterator wrapping turns out to be very inefficient at times (to the point where it adds overhead that is an order of magnitude bigger than the iterator logic itself).
Iterators with internal reference
Consider a hypothetical example (this would be quite typical for stream-like / input ranges):
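(The snippet from the original post isn't preserved here; a minimal sketch of the kind of range being described, with illustrative names, might look like this:)

#include <cstddef>

struct point { float x, y; };

// A stream-like input range: dereferencing always returns a reference to
// the same internal buffer, which advancing overwrites with the next value.
struct point_stream {
    struct iterator {
        point current;
        std::size_t remaining;
        const point &operator*() const { return current; }
        iterator &operator++() { current.x += 1.f; --remaining; return *this; }
        bool operator!=(const iterator &other) const { return remaining != other.remaining; }
    };
    std::size_t n;
    iterator begin() const { return {{0.f, 0.f}, n}; }
    iterator end() const { return {{0.f, 0.f}, 0}; }
};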
Assuming point has a corresponding py::class_ binding, wrapping this range in a py::make_iterator() with the default return value policy of reference_internal and iterating over it in Python will do roughly the following for each yielded point instance: look up its type_info (why isn't this cached per each py::cpp_function?), then look up the instance itself in the registered instances map. Note that this will essentially yield the same Python object on every iteration, but it will still perform two map lookups every time.
In this example, this could be completely avoided if py::iterator_state cached the resulting py::object (whose ->value always points to the same C++ instance in this example, so it doesn't even need to be reallocated) and not just the current state of the begin/end iterators. Instead of doing the cast, it could just incref the object and return the handle.
No internal reference or copy r/v policy
If the C++ iterator returns an object with a different address on each dereference, or if we specify the copy return value policy, things get even worse.
The sequence of steps is now, for each yielded point instance: look up its type_info, allocate a new Python instance, copy the C++ object into it, and register it in the instances multimap; later, on deallocation, look it up again to unregister it and call the destructor.
If the downstream Python code doesn't care about yielded values outside of one iteration cycle and doesn't pass them as arguments to other functions, which is quite often the case (e.g. a plain for loop that only reads each yielded item's fields, or a comprehension that extracts plain values from them), then registering/unregistering instances adds overhead that is completely unneeded. Plus, between garbage collection cycles, the registered instances multimap will grow quite fast here, which will slow things down even further. Here, the sequence of steps might as well just be: allocate the instance, copy the value in, and destroy it on deallocation, with no registry involvement at all.
Would it make sense to have a return value policy copy_untracked (cast doesn't do the registered-instance lookup; dealloc calls the dtor and doesn't try to unregister)? Or reference_untracked (cast doesn't do the registered-instance lookup; dealloc doesn't call the dtor and doesn't try to unregister), or something like that? The only catch here is that some sort of flag must be stored in the instance itself, so that the deallocator knows not to try and erase it from the registered instances multimap when the time comes.