Closed jhjourdan closed 4 years ago
I like the new API. One tiny suggestion: it might be a good idea to make alloc_info
into an abstract type. I don't think users gain anything by knowing its a record and this should make it a little easier to change the type if necessary. I'd also probably just call it allocation
, but that's totally subjective.
I like the new API. One tiny suggestion: it might be a good idea to make
alloc_info
into an abstract type. I don't think users gain anything by knowing its a record and this should make it a little easier to change the type if necessary. I'd also probably just call itallocation
, but that's totally subjective.
What would you think about using a private type, so that we avoid an heavy list of accessors for this type in the API? As far as I know, private record can be extended without compatibility issues.
This is version is quite nice. However, the previous version provided a direct access to the value which allowed to track it specifically, which this one is lacking. For instance if you wanted to traverse the memory graph to know why a specific value was alive.
One other nice addition would be to allow the user to add a value to the tracked list. But this can be done later, outside of this PR.
What would you think about using a private type, so that we avoid an heavy list of accessors for this type in the API?
A private record is good too (although I don't think the list of accessors is actually longer than the record definition).
@chambart
However, the previous version provided a direct access to the value which allowed to track it specifically, which this one is lacking. For instance if you wanted to traverse the memory graph to know why a specific value was alive.
This access to the value as an Obj.t
was part of a previous iteration of the API, and it was removed to simplify the API. There were at last the following concerns with that part of the API:
At the time where the callback is invoked, the Obj.t
value is invalid: the memory space has been allocated (with a valid header) but the value itself is still missing -- in particular, it is not a valid memory representation at the value's type. It would be possible to store the Obj.t
somewhere and use it later, after it has been initialized, but there is no way using the API to tell when that initialization has occurred, so you have to poke at the Obj.t
value using the Obj
API to try to guess whether it is a valid value, and which type it has. This seems deeply unsafe and fairly error-prone.
If I understand correctly, the trick of calling the callback after the value is allocated only works for allocations in the major heap (so callback_alloc_major
and callback_promote
), not for allocations in the minor heap; @jhjourdan thus had different allocation_info
types for minor and major allocations, which made the API irritatingly less uniform.
At the time we discussed this (last Tuesday) we knew of no actual client making use of this Obj.t
value. (One use-case we discussed is someone trying to analyze a program where most allocations happen at a very specific type whose memory representation is easily inspectable and contains useful information (typically, a meaningful unique identifier), but we didn't have any concrete example in mind.)
Given all this it seemed reasonable to remove the Obj.t
access from this iteration of the API.
.
I would be curious to understand your proposed use-case. Your idea is that at a given time in the program execution we can trace the whole live heap, record a path from memory roots to each value, and store that path for those values that are currently being sampled?
The callbacks let you return a None -- in fact it looks like the previous API also did, I had just never wondered about it. What's the semantics if you return None, the value is not instrumented, but does this affect the sampling in any way? (How do we reason about uniformity in this case?)
As documented:
If such a callback returns [None], then the tracking of this particular block is cancelled.
As for uniformity: if you use this mechanism to filter out some blocks, then you sample according to the probability distribution of blocks that satisfy the given filter. You can imagine, for example, considering only blocks that are allocated during the execution of some parts of the program.
You can imagine, for example, considering only blocks that are allocated during the execution of some parts of the program.
Another use case would be sampling allocations without caring about how long they were live. Spacetime's viewer has a graph of just allocations over time and it is often quite useful.
Apparently, one test fails with MinGW32, with a weird error code. This looks like a segfault, but I cannot reproduce on linux (even with Valgrind with the debug runtime).
@gasche I don't need to have access to the value itself, but to be able to follow a specific value. For instance a fresh int would be sufficient, and would in fact be easier to follow than an Obj.t. My use case would be to produce a dump of the heap and be able to point in the memory graph which are the values followed by memprof.
Otherwise a specific type which usually matter a lot for memory usage are bigarrays, and can be worth following specifically and introspecting. Given that this comes from C side allocations, it could have a different API that triggers when returning from C to OCaml, where we know that the value should be completely initialized. But this seems outside of the scope of that PR.
Maybe we could have a generic approach to track custom values, not just bigarrays; in particular, we could use their "virtual size" to decide how to sample them. This is indeed outside the scope of the present PR.
@stedolan : gentle ping. Are you aware that this PR exists? Are you planning to do a review?
@jhjourdan Yes and yes! Apologies for the delay, I've been pretty busy (I was visiting the JS NYC office over the last two weeks, just back now).
The new API looks great. (I'm particularly looking forward to sampling blocks that get promoted).
I haven't finished reading the implementations yet, but abstractly I prefer the malloc-based version, both for performance and because it has less impact on the GC behaviour of the program being traced.
I am surprised by the amount of code involved here. I'd hoped that it would be possible to reuse the finalisation mechanism (the existing code in finalise.c
) to trigger these callbacks, rather than having to reimplement finalisation in memprof.c
. Did you try this? Is there some deep problem I'm missing? (If not, I'll have a go prototyping this approach on Friday)
Ephemerons are likely to be dropped from multicore OCaml, since there is no easy way to make them work in the new parallel GC.
To clarify: Ephemerons are supported in multicore, as @kayceesrk's already done the hard work of making them work. However, I would like to revisit the API (particularly the mutable bits like set_key), since the current one causes quite a lot of headache in multicore. So, I'd much prefer if the Memprof API didn't depend on mutating the key of an ephemeron.
Here is the relevant issue in multicore repo that discusses the ephemerons implementation https://github.com/ocaml-multicore/ocaml-multicore/issues/88
But as @stedolan pointed out, I’d much prefer not using set_key as a cleaner, alternative API does not provide set_key.
I am surprised by the amount of code involved here. I'd hoped that it would be possible to reuse the finalisation mechanism (the existing code in finalise.c) to trigger these callbacks, rather than having to reimplement finalisation in memprof.c. Did you try this? Is there some deep problem I'm missing? (If not, I'll have a go prototyping this approach on Friday)
I did not try this, but I don't think this will simplify the code significantly. Much of the complexity is in a data structure which makes it possible to:
All this complexity already arises with promotion only, and delegating deallocation to finalisers will only marginally simplify the code.
Moreover, note that the semantics for finilisers is slightly different: a finiliser is a closure attached to a block, while memprof attaches some user data that has to be given to a globally defined closure. Resetting this globally defined closure cancels the tracking of the corresponding blocks, which is not something which is currently possible with finalisers. That's not an untrackable issue with your approach, but at least this is an annoyance.
Ah, now that I think about it, I understand what you mean by reusing the mechanism of finalizers. You will have to extend finilizers to support "promotion finalizers", and do some tricks (using references?) to allow threading user data through the various callbakcs.
This can certainly work, that's effectively an intermediary solution between the malloc-based and the heap-based solution I proposed, since some of the data structures will be in the OCaml heap (the references used to thread the user data) and some will be in the malloc heap (the finalizers information). This will probably have more overhead (since data structures would be allocated both on the OCaml and the malloc heaps), but that might be acceptable.
Feel free to give it a try. I don't forsee any more difficulties than the ones I presented above.
Thanks for the explanation! I'll have a go at the finaliser-based one today and see how it looks.
I wrote the finaliser-style one. Because of the differences between finalisers and memprof that you mentioned (particularly promotion callbacks), I ended up reimplementing some of the data structures from finalise.c and not sharing code after all.
Despite that, it still ends up being less code than the malloc-based linked lists. Instead of multiple linked lists, there's a single array of active samples. The code's available here, I'm going to run some benchmarks now and post the results here.
Here's some benchmark results, for both short-lived and long-lived objects, and for tracked (i.e. callback returns Some ()) and untracked (callback returns None) samples:
short, untracked | short, tracked | long, untracked | long, tracked | |
---|---|---|---|---|
09c16c7a0f | 0.88 | 0.93 | 2.81 | 3.03 |
6e035594fc | 0.85 | 0.85 | 2.81 | 2.94 |
trunk | 0.84 | 1.30 | 2.72 | 5.69 |
Benchmark code here. All numbers are median-of-5 executions in seconds.
Edit: I've just added numbers for trunk, using the previous ephemeron-based API. The 'untracked' versions do not create an ephemeron, while the 'tracked' versions create an ephemeron and store the last 16k ephemerons in an array. (If the ephemeron is just discarded, then it's much faster as the ephemeron itself gets collected. But then no actual lifetime tracking gets done).
With the trunk numbers, either implementation of this API seems to be much, much faster than using ephemerons to track lifetimes.
I wrote the finaliser-style one. Because of the differences between finalisers and memprof that you mentioned (particularly promotion callbacks), I ended up reimplementing some of the data structures from finalise.c and not sharing code after all.
Despite that, it still ends up being less code than the malloc-based linked lists. Instead of multiple linked lists, there's a single array of active samples. The code's available here, I'm going to run some benchmarks now and post the results here.
Thanks for this attempt. Indeed, this is particularly shorter with this single array of tracked blocks. However, as is, I have several concerns:
caml_memprof_handle_postponed
needs to be reentrant, and, as far as I understand, this is not the case currently. At least two assertions are broken: CAMLassert(tracked.len == prev_len);
on line 396 and i == tracked.len - 1
on line 427.caml_memprof_oldify_young_roots
, you start the traversal at tracked.young
, but don't update tracked.young
in run_callbacks
when updating user_data
. This can end up in a callback returning a young value which is not taken into account at the next minor collection. Of course, the issue here is that it is (probably ?) intractable to update the young pointer when writing user_data
, because this could end up in reexamining a large part of the array at the next minor collection. That was the purpose of my young_user_data
list.Here's some benchmark results, for both short-lived and long-lived objects, and for tracked (i.e. callback returns Some ()) and untracked (callback returns None) samples:
Thanks for the benchmarks. My conclusion is that (not so surprisingly) ephemerons add a significant overhead for tracked blocks. But the two different implementation of the new API are close to tie: yours is a bit faster (<= 5%), but I suspect it requires a bit of debugging (see my previous message).
In a multithreaded setting, a context switch can happen during a callback. As a result, the function caml_memprof_handle_postponed needs to be reentrant, and, as far as I understand, this is not the case currently. At least two assertions are broken: CAMLassert(tracked.len == prev_len); on line 396 and i == tracked.len - 1 on line 427.
Hmmm. Thinking about this, I think the simplest approach is to make memprof callbacks single-threaded, using a lock. As well as not having to worry about reentrancy in the runtime, this also makes life simpler for the user of the memprof API, who need not worry about making the callbacks themselves threadsafe.
In caml_memprof_oldify_young_roots, you start the traversal at tracked.young, but don't update tracked.young in run_callbacks when updating user_data. This can end up in a callback returning a young value which is not taken into account at the next minor collection. Of course, the issue here is that it is (probably ?) intractable to update the young pointer when writing user_data, because this could end up in reexamining a large part of the array at the next minor collection. That was the purpose of my young_user_data list.
Thanks for spotting this bug, but I think the fix is very easy! Updating the young pointer when writing user_data is fine: at the next minor GC, the part of the array examined will be the shortest suffix of the array containing recently sampled blocks and recently executed callbacks.
Thanks for the benchmarks. My conclusion is that (not so surprisingly) ephemerons add a significant overhead for tracked blocks. But the two different implementation of the new API are close to tie: yours is a bit faster (<= 5%), but I suspect it requires a bit of debugging (see my previous message).
Agreed. I'll do some debugging and see how it goes. A performance improvement is nice, but my main motivation here is shorter code.
The other feature I haven't implemented yet is handling of exceptions raised from memprof callbacks. I'm wondering what the right thing to do here is: is there any reason to want these? Shouldn't they just be a fatal error?
I think the simplest approach is to make memprof callbacks single-threaded, using a lock
Aren't we risking a sequential bottleneck at moderate sampling rates (say 1/100)? If a few minor allocations are both much slower than the other and sequential, this sounds like a bad yet fairly common/natural scenario.
The most common memprof callback is to just return/store the backtrace. We can expect most callbacks to be performing aggregation, which is easy to do in a thread-local way and gather at statistics-production time. This sounds like a problem domain where thread-safety is natural.
(Of course it also makes sense to aim for implementation simplicity at first, and refine the implementation if we do observe a bottleneck in practice.)
Hmmm. Thinking about this, I think the simplest approach is to make memprof callbacks single-threaded, using a lock. As well as not having to worry about reentrancy in the runtime, this also makes life simpler for the user of the memprof API, who need not worry about making the callbacks themselves threadsafe.
That would work for current OCaml, but this could result in a performance penalty in Multicore if the sampling rate is set too large. In any case, if we do that, we should take care to suspend the current thread if we try to reentrantly run callbacks rather than postponning until the lock get unlocked, because the thread currently running the callback may very well never get control back.
Thanks for spotting this bug, but I think the fix is very easy! Updating the young pointer when writing user_data is fine: at the next minor GC, the part of the array examined will be the shortest suffix of the array containing recently sampled blocks and recently executed callbacks.
Yes, but this means re-traversing (and possibly shifting) a large part of the array of tracked block at the next minor collection, which can be costly. Remember that this array can easily be rather large (> 10^5 entries) for large heaps (> 1Go) and reasonable sampling rates (1/1000).
I've pushed the fixes to https://github.com/stedolan/ocaml/tree/memprof_new_api (with tests)
@jhjourdan
Yes, but this means re-traversing (and possibly shifting) a large part of the array of tracked block at the next minor collection, which can be costly. Remember that this array can easily be rather large (> 10^5 entries) for large heaps (> 1Go) and reasonable sampling rates (1/1000).
It won't do much traversal, even when the array is large. The whole array is only scanned every major GC cycle (when callback
is reset to 0, triggering a rescan of the whole array to check if deallocation callbacks need to run). During minor collections only a suffix of the array is scanned. The length of this suffix is bounded by the number of samples on the minor heap, or the number of allocation / promotion callbacks called since the last minor GC (whichever is larger).
@jhjourdan, @gasche
That would work for current OCaml, but this could result in a performance penalty in Multicore if the sampling rate is set too large.
Aren't we risking a sequential bottleneck at moderate sampling rates (say 1/100)? If a few minor allocations are both much slower than the other and sequential, this sounds like a bad yet fairly common/natural scenario.
(@gasche: I presume you're also talking about multicore)
Generally, I'd rather wait until someone is working on and benchmarking memprof-for-multicore before we make the API more complicated. Needing to deal with concurrency in memprof callbacks is a significant burden to the user, from which the only benefit is hypothetical. (It's not obvious to me that there will even be a benefit in multicore: doing so much sampling that one core can't keep up with the callbacks sounds like an extreme use of memprof)
How about we call memprof callbacks with the same scheme that's currently used for finalisers? Only one finaliser can run at a time, but other threads are not blocked: if a thread calls caml_final_do_calls
while finalisers are being processed by another, it simply continues running user code and leaves the finalisers for the other thread.
In any case, if we do that, we should take care to suspend the current thread if we try to reentrantly run callbacks rather than postponning until the lock get unlocked, because the thread currently running the callback may very well never get control back.
Is it really worth worrying about this? If the OS scheduler is that unfair, there's not much we can do. (Also, the same concern applies to finalisers)
It won't do much traversal, even when the array is large. The whole array is only scanned every major GC cycle (when callback is reset to 0, triggering a rescan of the whole array to check if deallocation callbacks need to run). During minor collections only a suffix of the array is scanned. The length of this suffix is bounded by the number of samples on the minor heap, or the number of allocation / promotion callbacks called since the last minor GC (whichever is larger).
Pardon me, but there might be something I don't understand: when a major callback traversal is triggered, then every callback will potentially move the young pointer towards the beginning of the array, hence triggering a (mostly) full traversal during the next minor collection. Sure, callbacks should not allocate much, so such full traversals should not happen too often, but still I don't see where is the bound you are speaking of.
That said, if we decide not to be reentrant, then there is a simple fix: when a callback returns a minor block, and if the corresponding entry is not in the "young zone", then swap the entry with the one immediately before the young pointer, and decrement that young pointer. Thus, the young zone will only contain entries that actually contain a young value.
Generally, I'd rather wait until someone is working on and benchmarking memprof-for-multicore before we make the API more complicated. Needing to deal with concurrency in memprof callbacks is a significant burden to the user, from which the only benefit is hypothetical. (It's not obvious to me that there will even be a benefit in multicore: doing so much sampling that one core can't keep up with the callbacks sounds like an extreme use of memprof).
Fair enough. I am a little worried by the fact that it might be difficult to evolve the API to lift the non-reentrancy guarantee, but I suppose this is just a matter of adding an optional parameter to the start
function.
Is it really worth worrying about this? If the OS scheduler is that unfair, there's not much we can do.
See #1533 and #2112: fairness with systhread is not good. That's not the OS scheduler's fault, that's because we are using a global lock which has no reason to be fair. Sure, #2112 improved the situation a lot, but this does not work on Windows, and offers no guarantee with more than 2 threads...
(Also, the same concern applies to finalisers)
True. But that's not because finalisers are broken that we should design memprof in a broken way.
Pardon me, but there might be something I don't understand: when a major callback traversal is triggered, then every callback will potentially move the young pointer towards the beginning of the array, hence triggering a (mostly) full traversal during the next minor collection. Sure, callbacks should not allocate much, so such full traversals should not happen too often, but still I don't see where is the bound you are speaking of.
Ah, I see! Thank you for the explanation, I hadn't considered the minor GC's triggered by the amount allocated by the memprof callbacks themselves. However, thinking about it some more, I think the algorithm I posted above is fine (although since I hadn't thought of this issue, that's more by coincidence than design!). It turns out that this isn't true:
when a major callback traversal is triggered, then every callback will potentially move the young pointer towards the beginning of the array
Only allocation and promotion callbacks can move the young pointer back, because only those can return user data. However, the callbacks triggered by a major collection are only major deallocation callbacks, and these never reset the young pointer (as there is no user data once an object's deallocated).
So, after the major cycle completes, you can only move the young pointer back after you've processed all of the old samples, and that can only move it back a small amount.
See #1533 and #2112: fairness with systhread is not good. That's not the OS scheduler's fault, that's because we are using a global lock which has no reason to be fair. Sure, #2112 improved the situation a lot, but this does not work on Windows, and offers no guarantee with more than 2 threads...
Thanks for the links, I'll have a look. (although I'm off next week, so it'll be the week after). Probably easiest just to make it reentrant for now and discuss a single-threaded API in a separate PR.
So, after the major cycle completes, you can only move the young pointer back after you've processed all of the old samples, and that can only move it back a small amount.
Wow, indeed, that's a nice trick!
I have just rebased and pushed your version, which I think is indeed a better approach than the one I had. I will work on making the run_callbacks
function reentrant, which should also make it easy to support exception raised from memprof callbacks (I am not sure whether we actually want them or not).
Thanks for cleaning this up! Sorry I haven't found time to recently. I agree that a reentrant run_callbacks is the way to go, regardless of whether exceptions raised by memprof make sense.
I've just pushed a new version which:
This should be ready to review. I asked for Inria CI in debug mode.
In this post, I discuss the following question:
which should also make it easy to support exception raised from memprof callbacks (I am not sure whether we actually want them or not).
I previously advocated that if they do not raise then we can in principle run them promptly in resource release functions, in the presence of masking. However, this would come with other kinds of code restrictions on memprof callbacks so I am not sure it is a good idea anymore.
Looking at the circumstances in which finalisers raise can be helpful as a comparison point (cf. the audit at https://github.com/ocaml/ocaml/pull/8873#issuecomment-521782243).
The latter is most appropriate for checking the total allocated memory and stopping the computation if it exceeds a user-specified bound, since the rate at which the GC alarm is called is somewhat related to allocation. This is an approach for dealing with memory exhaustion that solves many problems of the Out_of_memory exception.
For 1., we saw that people do write code that explicitly raises to report errors (Failure
, Assert_failure
, maybe others), and I am not sure that in all cases it can be shown that these exceptions do not arise. I do not know (and am not going to argue) if one is better than the other for reporting the error: raise into the main program, or force dealing with the error at the top-level of the memprof callback (where one can log and either continue or abort). But once set, people rely on the behaviour, and for instance it would be hard now to change the raising behaviour of finalisers for the reason that one would have to make up error-handling code as each author intended. So, choose wisely and early. Also, if one forbids raising, then one has to choose a default behaviour (ignore the exception or abort), and therefore a customisable handler à la Printexc.set_uncaught_exception_handler
is probably desirable.
Regarding 2., here is maybe a long shot but something to have in mind. Memprof callbacks, just like GC alarms, are good candidates for implementing something like resource limits; perhaps even better than GC alarms in the multithreaded case. Indeed, the fiber (currently thread) in which the memprof exception is raised is well-defined for minor or major allocations (and the domain in which the exception is raised would be well-defined for promotions and minor deallocations). I have found two references that tend to back up this a priori wild idea.
1) Allocation limits in Haskell (https://hackage.haskell.org/package/base-4.12.0.0/docs/System-Mem.html#v:enableAllocationLimit), as advertised here with a real-world use-case: https://simonmar.github.io/posts/2017-01-24-asynchronous-exceptions.html. They limit allocations for a thread without taking into account deallocations, thus they are some measure of the “work done” (reported to be more reliable than clock duration). So if memprof callbacks are allowed to raise, then the user could already very easily (IIUC) implement something similar to per-thread allocation limits.
2) In addition, I would like to share the following paper with you: http://www.scs.stanford.edu/~dm/home/papers/yang:space-limits.pdf. It is essentially the idea of limiting memory use as with GC alarms, but per "resource container". I can think of use-cases similar to GC alarms (e.g. limit memory consumption for some unpredictable computation) and to the Out_of_memory exception (e.g. preventing denial of service on untrusted input), where this is much more reliable than either. However, I do not know how the particular design from the paper would apply. But it gives a more general insight: the question of imposing a memory limit has much in common with profiling. So, future improvements to profiling capabilities (e.g. regarding per-thread localisation of consumption) would reflect into better accuracy in limiting memory.
As a conclusion (putting aside current issues with asynchronous exceptions) these are meaningful examples where one can raise from a memprof callback. But the idea is to supply primitives to the user as is, and let them figure out how to use them appropriately. I would be interested in experimenting per-thread allocation limits as a library for instance. For people who want to remove asynchronous exceptions, keeping the currently-raising behaviour does not change anything because they will want to remove them all at once anyway.
Thanks, @gadmm. So, if I understand well, you support the idea that statmemprof callbacks should raise exceptions?
Inria CI precheck is failing on ARM 32, on the test statmemprof/intern.ml
. This is rather hard to debug...
Using address sanitizer uncovered a bug in caml_stat_resize_noexc
(see #9119), but the failure on ARM 32 is still there.
There have been several semi-random failures on ARM 32 bits lately. It could be a hardware problem on the CI machine (flaky power supply). I'll investigate as soon as I can. In the meantime, I'll turn ARM 32 testing off.
That could be an explanation, but the failure seemed to be reproducible, since that test was the only failure for 3 successive builds.
Thanks, @gadmm. So, if I understand well, you support the idea that statmemprof callbacks should raise exceptions?
The message is that both choices make sense, and then it is all a matter of making a language design decision. What this rules out is the idea that the choice does not matter and that we can change it later (i.e. that no sane program can ever rely on raising from memprof callbacks).
As for what I think... To me there is enough evidence that bounding memory usage is useful, perhaps not for everybody, but in use-cases where it is needed it can be crucial. Now, current solutions are not good at all: as you know Out_of_memory is not reliable for memory exhaustion (as opposed to single large allocations) and unlikely to be fixed anytime soon, and the Gc.alarm trick is unlikely to scale to systhreads anytime soon either. Memprof callbacks look much more suitable for that sort of trick because they are finer-grained and one knows where they raise. On the other hand, I admit that it is a long shot.
A concrete use-case would be for implementing memory limits in Coq. Coq currently struggles on memory exhaustion, and for instance somebody asked here https://coq.discourse.group/t/memory-limit-for-commands/408 whether there was a tactic similar to Timeout
for memory limits. The Gc.alarm trick is not applicable though because it uses threads. I spoke with @ppedrot and it is not farfetched that with memprof callbacks one would be able to implement such a tactic (with caveats similar to the Timeout
tactic).
Maybe what one could do is review the API and implementation details in the perspective of the above use-cases, for instance by implementing per-thread allocation limits as an exercise (unless you tell me it will obviously work well). For instance I see that caml_memprof_suspended
is global, which seems suboptimal for the memory limit use-case, for which it would be more natural to enable memprof on a per-thread basis. On the other hand the example of allocation limits would require such a low sampling rate that the overhead is negligible AFAIU. I would volunteer to do this exercise, but my time is currently limited and I would like not to spent too much time on it. I hope this gives you enough information already to make a choice confidently.
Edit: reading my post again, I realise that it is not clear how I reply to your question. I was not sure how to formulate it because I cannot promise to use it myself, but yes, I propose that memprof callbacks can raise and that this sort of use-case is explicitly taken into account.
The bug which lead to failures with arm32 is now fixed. @stedolan : you can review this PR when you have time !
Could we make progress on this?
Yes! Apologies about the delay, I'll review today.
Thanks for cleaning this up!
I've come around to the point (made by you and @gasche) that memprof callbacks should allow reentrancy from multiple systhreads. I no longer think that "the simplest approach is to make memprof callbacks single-threaded, using a lock": that could delay callbacks for a long time, and raises some weird deadlock / lock-ordering issues if the callbacks themselves take locks. So thanks for making this reentrant!
The code for reentrancy (the idx_ptr
stuff) is tricky, though (Is this the pointers-to-thread-stacks thing you mentioned that motivated #9100 ?). The code looks good, but when reading it, a possible alternative occurred to me. I was wondering if you'd already considered it:
If we maintained a counter for the number of currently-executing callbacks (incremented/decremented by run_callback_exn), then we could make flush_delete
a no-op if counter > 0
. That way, indexes never change while a callback is running, so we don't need to update *idx_ptr
.
Regardless of which approach is taken, this is subtle code that needs some tests that explicitly block in memprof callbacks using systhreads.
If we maintained a counter for the number of currently-executing callbacks (incremented/decremented by run_callback_exn), then we could make flush_delete a no-op if counter > 0. That way, indexes never change while a callback is running, so we don't need to update *idx_ptr.
But then, there is the possibility that the counter never (or rarely) reaches 0, which would create a leak.
Regardless of which approach is taken, this is subtle code that needs some tests that explicitly block in memprof callbacks using systhreads.
Indeed. However, good coverage is particularly difficult.
If we maintained a counter for the number of currently-executing callbacks (incremented/decremented by run_callback_exn), then we could make flush_delete a no-op if counter > 0. That way, indexes never change while a callback is running, so we don't need to update *idx_ptr.
But then, there is the possibility that the counter never (or rarely) reaches 0, which would create a leak.
Ah, good point! (I don't think this is very likely, but it's certainly possible)
Regardless of which approach is taken, this is subtle code that needs some tests that explicitly block in memprof callbacks using systhreads.
Indeed. However, good coverage is particularly difficult.
True. I'm not expecting perfect coverage of all conditions here, just something that exercises the blocking-in-memprof logic. The simplest example I could think of is a memprof callback that looks like this:
let (rd, wr) = Unix.pipe ()
let block = ref false
let callback =
if !block then begin
block := false;
Unix.write (Bytes.make 1 '\0') 0 1 |> ignore
end else begin
block := true;
Unix.read (Bytes.make 1 '\0') 0 1 |> ignore
end
Some ()
Half of the Memprof calls read from the pipe and block, and the other half write to the pipe. (You'll need a single Unix.write at the end of the program, in case there are an odd number of Memprof calls in total).
Thanks, @stedolan, for the review. This should be ready to merge: I squashed the commits and added a Change entry and wrote some documentation about the fact that [Thread.exit] is unsafe in callbacks.
I also added a modified version of your test: on my computer in native mode, it turned out that most of the callbacks were run on the same thread. I am not entering too much in the details, but I don't think one pipe is enough for doing what you tried. I used two of them. Also, the trick consisting in writing a final byte in the pipe for unblocking the callback unfortunately does not work since we cannot make sure it is not written before all the callbacks have been started.
One last point: both of us are both authors and reviewers of this PR. I don't know what's the policy for code review, but it might be a good idea to get an external review.
OTOH, I don't see anyone else who understands enough of memprof.c
for reviewing this.
I would be happy to help in the next few days, ideally on the occasion of an in-person meeting to get a faster feedback loop. It may be nice to ask Damien to look at this as well -- the three of us could meet for this.
OTOH, I don't see anyone else who understands enough of memprof.c for reviewing this.
This sounds precisely like an issue that the requirement for external reviews helps solving.
Thanks @damiendoligez and @gasche , for this detailed review. I integrated all your change proposal. I also wrote a test for the bug you found. It is not clear how to write a test for the other bug.
So this should now be ready to merge, once the tests pass.
The current API for memprof uses ephemerons to registers tracked blocks. While this is an elegant use for ephemerons, but @stedolan suggested that it may not be the best thing to do, for two reasons :
This PR presents a new API for memprof: instead of using ephemerons, the client provides 5 callbacks for various events that can occur durang a sampled object lifetime: allocation (minor or major), promotion and deallocation (minor or major).
The main difficulty when implementing this new API is that the runtime now has to maintain a data structure for tracked memory blocks. The code related to this data structure is what constitutes most of the changes of this PR.
I can see two possible choices for storing this data structure: either we use
malloc
-based linked lists, or we used the same linked lists, but allocated in the major OCaml heap. The two commits of this PR are the implementations of these two choices, and we need to make a choice between these. Using the OCaml heap requires slightly simpler code, because we do not need to manage all the pointers to the OCaml heap which occur in this data structure (the GC does it for us automatically, since these are just OCaml memory blocks), and because we need not deallocate manually withfree
. On the other hand, preliminary benchmarks seem to indicate that themalloc
version is faster. Indeed,malloc
/free
is apparently not a bottleneck even if called once for each sampled block, but the extra work performed by the GC if the lists are allocated in the major heap happens to have a significant impact on the running time. In practice, when using the micro-benchmark of #8731, we have a slowdown of about 10%-15% when using the major heap.So, what should we do? I would be in favor of the
malloc
/free
version, since it is faster and a bit more readable.