Memprof new api - Githubissues

jhjourdan commented 4 years ago

The current API for memprof uses ephemerons to registers tracked blocks. While this is an elegant use for ephemerons, but @stedolan suggested that it may not be the best thing to do, for two reasons :

Ephemerons are likely to be dropped from multicore OCaml, since there is no easy way to make them work in the new parallel GC.
Ephemerons are great for noticing the deallocation of an object, but they do not give any other information on the promotion of an object from the minor to the major heap. This is unfortunate, since the client may want to exclude, for performance reasons, short-lived non-promoted objects from its statistics.

This PR presents a new API for memprof: instead of using ephemerons, the client provides 5 callbacks for various events that can occur durang a sampled object lifetime: allocation (minor or major), promotion and deallocation (minor or major).

The main difficulty when implementing this new API is that the runtime now has to maintain a data structure for tracked memory blocks. The code related to this data structure is what constitutes most of the changes of this PR.

I can see two possible choices for storing this data structure: either we use malloc-based linked lists, or we used the same linked lists, but allocated in the major OCaml heap. The two commits of this PR are the implementations of these two choices, and we need to make a choice between these. Using the OCaml heap requires slightly simpler code, because we do not need to manage all the pointers to the OCaml heap which occur in this data structure (the GC does it for us automatically, since these are just OCaml memory blocks), and because we need not deallocate manually with free. On the other hand, preliminary benchmarks seem to indicate that the malloc version is faster. Indeed, malloc/free is apparently not a bottleneck even if called once for each sampled block, but the extra work performed by the GC if the lists are allocated in the major heap happens to have a significant impact on the running time. In practice, when using the micro-benchmark of #8731, we have a slowdown of about 10%-15% when using the major heap.

So, what should we do? I would be in favor of the malloc/free version, since it is faster and a bit more readable.

lpw25 commented 4 years ago

I like the new API. One tiny suggestion: it might be a good idea to make alloc_info into an abstract type. I don't think users gain anything by knowing its a record and this should make it a little easier to change the type if necessary. I'd also probably just call it allocation, but that's totally subjective.

jhjourdan commented 4 years ago

I like the new API. One tiny suggestion: it might be a good idea to make alloc_info into an abstract type. I don't think users gain anything by knowing its a record and this should make it a little easier to change the type if necessary. I'd also probably just call it allocation, but that's totally subjective.

What would you think about using a private type, so that we avoid an heavy list of accessors for this type in the API? As far as I know, private record can be extended without compatibility issues.

chambart commented 4 years ago

This is version is quite nice. However, the previous version provided a direct access to the value which allowed to track it specifically, which this one is lacking. For instance if you wanted to traverse the memory graph to know why a specific value was alive.

One other nice addition would be to allow the user to add a value to the tracked list. But this can be done later, outside of this PR.

lpw25 commented 4 years ago

What would you think about using a private type, so that we avoid an heavy list of accessors for this type in the API?

A private record is good too (although I don't think the list of accessors is actually longer than the record definition).

gasche commented 4 years ago

@chambart

However, the previous version provided a direct access to the value which allowed to track it specifically, which this one is lacking. For instance if you wanted to traverse the memory graph to know why a specific value was alive.

This access to the value as an Obj.t was part of a previous iteration of the API, and it was removed to simplify the API. There were at last the following concerns with that part of the API:

At the time where the callback is invoked, the Obj.t value is invalid: the memory space has been allocated (with a valid header) but the value itself is still missing -- in particular, it is not a valid memory representation at the value's type. It would be possible to store the Obj.t somewhere and use it later, after it has been initialized, but there is no way using the API to tell when that initialization has occurred, so you have to poke at the Obj.t value using the Obj API to try to guess whether it is a valid value, and which type it has. This seems deeply unsafe and fairly error-prone.
If I understand correctly, the trick of calling the callback after the value is allocated only works for allocations in the major heap (so callback_alloc_major and callback_promote), not for allocations in the minor heap; @jhjourdan thus had different allocation_info types for minor and major allocations, which made the API irritatingly less uniform.
At the time we discussed this (last Tuesday) we knew of no actual client making use of this Obj.t value. (One use-case we discussed is someone trying to analyze a program where most allocations happen at a very specific type whose memory representation is easily inspectable and contains useful information (typically, a meaningful unique identifier), but we didn't have any concrete example in mind.)

Given all this it seemed reasonable to remove the Obj.t access from this iteration of the API.

.

I would be curious to understand your proposed use-case. Your idea is that at a given time in the program execution we can trace the whole live heap, record a path from memory roots to each value, and store that path for those values that are currently being sampled?

jhjourdan commented 4 years ago

The callbacks let you return a None -- in fact it looks like the previous API also did, I had just never wondered about it. What's the semantics if you return None, the value is not instrumented, but does this affect the sampling in any way? (How do we reason about uniformity in this case?)

As documented:

If such a callback returns [None], then the tracking of this particular block is cancelled.

As for uniformity: if you use this mechanism to filter out some blocks, then you sample according to the probability distribution of blocks that satisfy the given filter. You can imagine, for example, considering only blocks that are allocated during the execution of some parts of the program.

lpw25 commented 4 years ago

You can imagine, for example, considering only blocks that are allocated during the execution of some parts of the program.

Another use case would be sampling allocations without caring about how long they were live. Spacetime's viewer has a graph of just allocations over time and it is often quite useful.

jhjourdan commented 4 years ago

Apparently, one test fails with MinGW32, with a weird error code. This looks like a segfault, but I cannot reproduce on linux (even with Valgrind with the debug runtime).

chambart commented 4 years ago

@gasche I don't need to have access to the value itself, but to be able to follow a specific value. For instance a fresh int would be sufficient, and would in fact be easier to follow than an Obj.t. My use case would be to produce a dump of the heap and be able to point in the memory graph which are the values followed by memprof.

chambart commented 4 years ago

Otherwise a specific type which usually matter a lot for memory usage are bigarrays, and can be worth following specifically and introspecting. Given that this comes from C side allocations, it could have a different API that triggers when returning from C to OCaml, where we know that the value should be completely initialized. But this seems outside of the scope of that PR.

gasche commented 4 years ago

Maybe we could have a generic approach to track custom values, not just bigarrays; in particular, we could use their "virtual size" to decide how to sample them. This is indeed outside the scope of the present PR.

jhjourdan commented 4 years ago

@stedolan : gentle ping. Are you aware that this PR exists? Are you planning to do a review?

stedolan commented 4 years ago

@jhjourdan Yes and yes! Apologies for the delay, I've been pretty busy (I was visiting the JS NYC office over the last two weeks, just back now).

stedolan commented 4 years ago

The new API looks great. (I'm particularly looking forward to sampling blocks that get promoted).

I haven't finished reading the implementations yet, but abstractly I prefer the malloc-based version, both for performance and because it has less impact on the GC behaviour of the program being traced.

I am surprised by the amount of code involved here. I'd hoped that it would be possible to reuse the finalisation mechanism (the existing code in finalise.c) to trigger these callbacks, rather than having to reimplement finalisation in memprof.c. Did you try this? Is there some deep problem I'm missing? (If not, I'll have a go prototyping this approach on Friday)

Ephemerons are likely to be dropped from multicore OCaml, since there is no easy way to make them work in the new parallel GC.

To clarify: Ephemerons are supported in multicore, as @kayceesrk's already done the hard work of making them work. However, I would like to revisit the API (particularly the mutable bits like set_key), since the current one causes quite a lot of headache in multicore. So, I'd much prefer if the Memprof API didn't depend on mutating the key of an ephemeron.

kayceesrk commented 4 years ago

Here is the relevant issue in multicore repo that discusses the ephemerons implementation https://github.com/ocaml-multicore/ocaml-multicore/issues/88

But as @stedolan pointed out, I’d much prefer not using set_key as a cleaner, alternative API does not provide set_key.

jhjourdan commented 4 years ago

I am surprised by the amount of code involved here. I'd hoped that it would be possible to reuse the finalisation mechanism (the existing code in finalise.c) to trigger these callbacks, rather than having to reimplement finalisation in memprof.c. Did you try this? Is there some deep problem I'm missing? (If not, I'll have a go prototyping this approach on Friday)

I did not try this, but I don't think this will simplify the code significantly. Much of the complexity is in a data structure which makes it possible to:

remember postponed callbacks,
thread the data provided by the user through the various callbacks,
track the events happening to the tracked blocks,
register the right weak and strong roots to the GC.

All this complexity already arises with promotion only, and delegating deallocation to finalisers will only marginally simplify the code.

Moreover, note that the semantics for finilisers is slightly different: a finiliser is a closure attached to a block, while memprof attaches some user data that has to be given to a globally defined closure. Resetting this globally defined closure cancels the tracking of the corresponding blocks, which is not something which is currently possible with finalisers. That's not an untrackable issue with your approach, but at least this is an annoyance.

jhjourdan commented 4 years ago

Ah, now that I think about it, I understand what you mean by reusing the mechanism of finalizers. You will have to extend finilizers to support "promotion finalizers", and do some tricks (using references?) to allow threading user data through the various callbakcs.

This can certainly work, that's effectively an intermediary solution between the malloc-based and the heap-based solution I proposed, since some of the data structures will be in the OCaml heap (the references used to thread the user data) and some will be in the malloc heap (the finalizers information). This will probably have more overhead (since data structures would be allocated both on the OCaml and the malloc heaps), but that might be acceptable.

Feel free to give it a try. I don't forsee any more difficulties than the ones I presented above.

stedolan commented 4 years ago

Thanks for the explanation! I'll have a go at the finaliser-based one today and see how it looks.

stedolan commented 4 years ago

I wrote the finaliser-style one. Because of the differences between finalisers and memprof that you mentioned (particularly promotion callbacks), I ended up reimplementing some of the data structures from finalise.c and not sharing code after all.

Despite that, it still ends up being less code than the malloc-based linked lists. Instead of multiple linked lists, there's a single array of active samples. The code's available here, I'm going to run some benchmarks now and post the results here.

stedolan commented 4 years ago

Here's some benchmark results, for both short-lived and long-lived objects, and for tracked (i.e. callback returns Some ()) and untracked (callback returns None) samples:

	short, untracked	short, tracked	long, untracked	long, tracked
09c16c7a0f	0.88	0.93	2.81	3.03
6e035594fc	0.85	0.85	2.81	2.94
trunk	0.84	1.30	2.72	5.69

Benchmark code here. All numbers are median-of-5 executions in seconds.

Edit: I've just added numbers for trunk, using the previous ephemeron-based API. The 'untracked' versions do not create an ephemeron, while the 'tracked' versions create an ephemeron and store the last 16k ephemerons in an array. (If the ephemeron is just discarded, then it's much faster as the ephemeron itself gets collected. But then no actual lifetime tracking gets done).

With the trunk numbers, either implementation of this API seems to be much, much faster than using ephemerons to track lifetimes.

jhjourdan commented 4 years ago

I wrote the finaliser-style one. Because of the differences between finalisers and memprof that you mentioned (particularly promotion callbacks), I ended up reimplementing some of the data structures from finalise.c and not sharing code after all.

Despite that, it still ends up being less code than the malloc-based linked lists. Instead of multiple linked lists, there's a single array of active samples. The code's available here, I'm going to run some benchmarks now and post the results here.

Thanks for this attempt. Indeed, this is particularly shorter with this single array of tracked blocks. However, as is, I have several concerns:

In a multithreaded setting, a context switch can happen during a callback. As a result, the function caml_memprof_handle_postponed needs to be reentrant, and, as far as I understand, this is not the case currently. At least two assertions are broken: CAMLassert(tracked.len == prev_len); on line 396 and i == tracked.len - 1 on line 427.
In caml_memprof_oldify_young_roots, you start the traversal at tracked.young, but don't update tracked.young in run_callbacks when updating user_data. This can end up in a callback returning a young value which is not taken into account at the next minor collection. Of course, the issue here is that it is (probably ?) intractable to update the young pointer when writing user_data, because this could end up in reexamining a large part of the array at the next minor collection. That was the purpose of my young_user_data list.

jhjourdan commented 4 years ago

Here's some benchmark results, for both short-lived and long-lived objects, and for tracked (i.e. callback returns Some ()) and untracked (callback returns None) samples:

Thanks for the benchmarks. My conclusion is that (not so surprisingly) ephemerons add a significant overhead for tracked blocks. But the two different implementation of the new API are close to tie: yours is a bit faster (<= 5%), but I suspect it requires a bit of debugging (see my previous message).

stedolan commented 4 years ago

In a multithreaded setting, a context switch can happen during a callback. As a result, the function caml_memprof_handle_postponed needs to be reentrant, and, as far as I understand, this is not the case currently. At least two assertions are broken: CAMLassert(tracked.len == prev_len); on line 396 and i == tracked.len - 1 on line 427.

Hmmm. Thinking about this, I think the simplest approach is to make memprof callbacks single-threaded, using a lock. As well as not having to worry about reentrancy in the runtime, this also makes life simpler for the user of the memprof API, who need not worry about making the callbacks themselves threadsafe.

In caml_memprof_oldify_young_roots, you start the traversal at tracked.young, but don't update tracked.young in run_callbacks when updating user_data. This can end up in a callback returning a young value which is not taken into account at the next minor collection. Of course, the issue here is that it is (probably ?) intractable to update the young pointer when writing user_data, because this could end up in reexamining a large part of the array at the next minor collection. That was the purpose of my young_user_data list.

Thanks for spotting this bug, but I think the fix is very easy! Updating the young pointer when writing user_data is fine: at the next minor GC, the part of the array examined will be the shortest suffix of the array containing recently sampled blocks and recently executed callbacks.

Thanks for the benchmarks. My conclusion is that (not so surprisingly) ephemerons add a significant overhead for tracked blocks. But the two different implementation of the new API are close to tie: yours is a bit faster (<= 5%), but I suspect it requires a bit of debugging (see my previous message).

Agreed. I'll do some debugging and see how it goes. A performance improvement is nice, but my main motivation here is shorter code.

The other feature I haven't implemented yet is handling of exceptions raised from memprof callbacks. I'm wondering what the right thing to do here is: is there any reason to want these? Shouldn't they just be a fatal error?

gasche commented 4 years ago

I think the simplest approach is to make memprof callbacks single-threaded, using a lock

Aren't we risking a sequential bottleneck at moderate sampling rates (say 1/100)? If a few minor allocations are both much slower than the other and sequential, this sounds like a bad yet fairly common/natural scenario.

The most common memprof callback is to just return/store the backtrace. We can expect most callbacks to be performing aggregation, which is easy to do in a thread-local way and gather at statistics-production time. This sounds like a problem domain where thread-safety is natural.

(Of course it also makes sense to aim for implementation simplicity at first, and refine the implementation if we do observe a bottleneck in practice.)

jhjourdan commented 4 years ago

Hmmm. Thinking about this, I think the simplest approach is to make memprof callbacks single-threaded, using a lock. As well as not having to worry about reentrancy in the runtime, this also makes life simpler for the user of the memprof API, who need not worry about making the callbacks themselves threadsafe.

That would work for current OCaml, but this could result in a performance penalty in Multicore if the sampling rate is set too large. In any case, if we do that, we should take care to suspend the current thread if we try to reentrantly run callbacks rather than postponning until the lock get unlocked, because the thread currently running the callback may very well never get control back.

Thanks for spotting this bug, but I think the fix is very easy! Updating the young pointer when writing user_data is fine: at the next minor GC, the part of the array examined will be the shortest suffix of the array containing recently sampled blocks and recently executed callbacks.

Yes, but this means re-traversing (and possibly shifting) a large part of the array of tracked block at the next minor collection, which can be costly. Remember that this array can easily be rather large (> 10^5 entries) for large heaps (> 1Go) and reasonable sampling rates (1/1000).

stedolan commented 4 years ago

I've pushed the fixes to https://github.com/stedolan/ocaml/tree/memprof_new_api (with tests)

@jhjourdan

Yes, but this means re-traversing (and possibly shifting) a large part of the array of tracked block at the next minor collection, which can be costly. Remember that this array can easily be rather large (> 10^5 entries) for large heaps (> 1Go) and reasonable sampling rates (1/1000).

It won't do much traversal, even when the array is large. The whole array is only scanned every major GC cycle (when callback is reset to 0, triggering a rescan of the whole array to check if deallocation callbacks need to run). During minor collections only a suffix of the array is scanned. The length of this suffix is bounded by the number of samples on the minor heap, or the number of allocation / promotion callbacks called since the last minor GC (whichever is larger).

@jhjourdan, @gasche

That would work for current OCaml, but this could result in a performance penalty in Multicore if the sampling rate is set too large.

Aren't we risking a sequential bottleneck at moderate sampling rates (say 1/100)? If a few minor allocations are both much slower than the other and sequential, this sounds like a bad yet fairly common/natural scenario.

(@gasche: I presume you're also talking about multicore)

Generally, I'd rather wait until someone is working on and benchmarking memprof-for-multicore before we make the API more complicated. Needing to deal with concurrency in memprof callbacks is a significant burden to the user, from which the only benefit is hypothetical. (It's not obvious to me that there will even be a benefit in multicore: doing so much sampling that one core can't keep up with the callbacks sounds like an extreme use of memprof)

How about we call memprof callbacks with the same scheme that's currently used for finalisers? Only one finaliser can run at a time, but other threads are not blocked: if a thread calls caml_final_do_calls while finalisers are being processed by another, it simply continues running user code and leaves the finalisers for the other thread.

In any case, if we do that, we should take care to suspend the current thread if we try to reentrantly run callbacks rather than postponning until the lock get unlocked, because the thread currently running the callback may very well never get control back.

Is it really worth worrying about this? If the OS scheduler is that unfair, there's not much we can do. (Also, the same concern applies to finalisers)

jhjourdan commented 4 years ago

It won't do much traversal, even when the array is large. The whole array is only scanned every major GC cycle (when callback is reset to 0, triggering a rescan of the whole array to check if deallocation callbacks need to run). During minor collections only a suffix of the array is scanned. The length of this suffix is bounded by the number of samples on the minor heap, or the number of allocation / promotion callbacks called since the last minor GC (whichever is larger).

Pardon me, but there might be something I don't understand: when a major callback traversal is triggered, then every callback will potentially move the young pointer towards the beginning of the array, hence triggering a (mostly) full traversal during the next minor collection. Sure, callbacks should not allocate much, so such full traversals should not happen too often, but still I don't see where is the bound you are speaking of.

That said, if we decide not to be reentrant, then there is a simple fix: when a callback returns a minor block, and if the corresponding entry is not in the "young zone", then swap the entry with the one immediately before the young pointer, and decrement that young pointer. Thus, the young zone will only contain entries that actually contain a young value.

Generally, I'd rather wait until someone is working on and benchmarking memprof-for-multicore before we make the API more complicated. Needing to deal with concurrency in memprof callbacks is a significant burden to the user, from which the only benefit is hypothetical. (It's not obvious to me that there will even be a benefit in multicore: doing so much sampling that one core can't keep up with the callbacks sounds like an extreme use of memprof).

Fair enough. I am a little worried by the fact that it might be difficult to evolve the API to lift the non-reentrancy guarantee, but I suppose this is just a matter of adding an optional parameter to the start function.

Is it really worth worrying about this? If the OS scheduler is that unfair, there's not much we can do.

See #1533 and #2112: fairness with systhread is not good. That's not the OS scheduler's fault, that's because we are using a global lock which has no reason to be fair. Sure, #2112 improved the situation a lot, but this does not work on Windows, and offers no guarantee with more than 2 threads...

(Also, the same concern applies to finalisers)

True. But that's not because finalisers are broken that we should design memprof in a broken way.

stedolan commented 4 years ago

Pardon me, but there might be something I don't understand: when a major callback traversal is triggered, then every callback will potentially move the young pointer towards the beginning of the array, hence triggering a (mostly) full traversal during the next minor collection. Sure, callbacks should not allocate much, so such full traversals should not happen too often, but still I don't see where is the bound you are speaking of.

Ah, I see! Thank you for the explanation, I hadn't considered the minor GC's triggered by the amount allocated by the memprof callbacks themselves. However, thinking about it some more, I think the algorithm I posted above is fine (although since I hadn't thought of this issue, that's more by coincidence than design!). It turns out that this isn't true:

when a major callback traversal is triggered, then every callback will potentially move the young pointer towards the beginning of the array

Only allocation and promotion callbacks can move the young pointer back, because only those can return user data. However, the callbacks triggered by a major collection are only major deallocation callbacks, and these never reset the young pointer (as there is no user data once an object's deallocated).

So, after the major cycle completes, you can only move the young pointer back after you've processed all of the old samples, and that can only move it back a small amount.

See #1533 and #2112: fairness with systhread is not good. That's not the OS scheduler's fault, that's because we are using a global lock which has no reason to be fair. Sure, #2112 improved the situation a lot, but this does not work on Windows, and offers no guarantee with more than 2 threads...

Thanks for the links, I'll have a look. (although I'm off next week, so it'll be the week after). Probably easiest just to make it reentrant for now and discuss a single-threaded API in a separate PR.

jhjourdan commented 4 years ago

So, after the major cycle completes, you can only move the young pointer back after you've processed all of the old samples, and that can only move it back a small amount.

Wow, indeed, that's a nice trick!

jhjourdan commented 4 years ago

I have just rebased and pushed your version, which I think is indeed a better approach than the one I had. I will work on making the run_callbacks function reentrant, which should also make it easy to support exception raised from memprof callbacks (I am not sure whether we actually want them or not).

stedolan commented 4 years ago

Thanks for cleaning this up! Sorry I haven't found time to recently. I agree that a reentrant run_callbacks is the way to go, regardless of whether exceptions raised by memprof make sense.

jhjourdan commented 4 years ago

I've just pushed a new version which:

Fixes a few more bugs
Allows callback to raise exception (because we can... but this is probably not useful)
Make the callback system reentrant (hence make this PR compatible with systhreads)

This should be ready to review. I asked for Inria CI in debug mode.

gadmm commented 4 years ago

In this post, I discuss the following question:

which should also make it easy to support exception raised from memprof callbacks (I am not sure whether we actually want them or not).

I previously advocated that if they do not raise then we can in principle run them promptly in resource release functions, in the presence of masking. However, this would come with other kinds of code restrictions on memprof callbacks so I am not sure it is a good idea anymore.

Looking at the circumstances in which finalisers raise can be helpful as a comparison point (cf. the audit at https://github.com/ocaml/ocaml/pull/8873#issuecomment-521782243).

Report an error, by explicit choice of the programmer, or implicitly.
Perform a check in a GC alarm, and interrupt computation if not met. (GC alarms are implemented with finalisers.)

The latter is most appropriate for checking the total allocated memory and stopping the computation if it exceeds a user-specified bound, since the rate at which the GC alarm is called is somewhat related to allocation. This is an approach for dealing with memory exhaustion that solves many problems of the Out_of_memory exception.

For 1., we saw that people do write code that explicitly raises to report errors (Failure, Assert_failure, maybe others), and I am not sure that in all cases it can be shown that these exceptions do not arise. I do not know (and am not going to argue) if one is better than the other for reporting the error: raise into the main program, or force dealing with the error at the top-level of the memprof callback (where one can log and either continue or abort). But once set, people rely on the behaviour, and for instance it would be hard now to change the raising behaviour of finalisers for the reason that one would have to make up error-handling code as each author intended. So, choose wisely and early. Also, if one forbids raising, then one has to choose a default behaviour (ignore the exception or abort), and therefore a customisable handler à la Printexc.set_uncaught_exception_handler is probably desirable.

Regarding 2., here is maybe a long shot but something to have in mind. Memprof callbacks, just like GC alarms, are good candidates for implementing something like resource limits; perhaps even better than GC alarms in the multithreaded case. Indeed, the fiber (currently thread) in which the memprof exception is raised is well-defined for minor or major allocations (and the domain in which the exception is raised would be well-defined for promotions and minor deallocations). I have found two references that tend to back up this a priori wild idea.

1) Allocation limits in Haskell (https://hackage.haskell.org/package/base-4.12.0.0/docs/System-Mem.html#v:enableAllocationLimit), as advertised here with a real-world use-case: https://simonmar.github.io/posts/2017-01-24-asynchronous-exceptions.html. They limit allocations for a thread without taking into account deallocations, thus they are some measure of the “work done” (reported to be more reliable than clock duration). So if memprof callbacks are allowed to raise, then the user could already very easily (IIUC) implement something similar to per-thread allocation limits.

2) In addition, I would like to share the following paper with you: http://www.scs.stanford.edu/~dm/home/papers/yang:space-limits.pdf. It is essentially the idea of limiting memory use as with GC alarms, but per "resource container". I can think of use-cases similar to GC alarms (e.g. limit memory consumption for some unpredictable computation) and to the Out_of_memory exception (e.g. preventing denial of service on untrusted input), where this is much more reliable than either. However, I do not know how the particular design from the paper would apply. But it gives a more general insight: the question of imposing a memory limit has much in common with profiling. So, future improvements to profiling capabilities (e.g. regarding per-thread localisation of consumption) would reflect into better accuracy in limiting memory.

As a conclusion (putting aside current issues with asynchronous exceptions) these are meaningful examples where one can raise from a memprof callback. But the idea is to supply primitives to the user as is, and let them figure out how to use them appropriately. I would be interested in experimenting per-thread allocation limits as a library for instance. For people who want to remove asynchronous exceptions, keeping the currently-raising behaviour does not change anything because they will want to remove them all at once anyway.

jhjourdan commented 4 years ago

Thanks, @gadmm. So, if I understand well, you support the idea that statmemprof callbacks should raise exceptions?

jhjourdan commented 4 years ago

Inria CI precheck is failing on ARM 32, on the test statmemprof/intern.ml. This is rather hard to debug...

jhjourdan commented 4 years ago

Using address sanitizer uncovered a bug in caml_stat_resize_noexc (see #9119), but the failure on ARM 32 is still there.

xavierleroy commented 4 years ago

There have been several semi-random failures on ARM 32 bits lately. It could be a hardware problem on the CI machine (flaky power supply). I'll investigate as soon as I can. In the meantime, I'll turn ARM 32 testing off.

jhjourdan commented 4 years ago

That could be an explanation, but the failure seemed to be reproducible, since that test was the only failure for 3 successive builds.

gadmm commented 4 years ago

Thanks, @gadmm. So, if I understand well, you support the idea that statmemprof callbacks should raise exceptions?

The message is that both choices make sense, and then it is all a matter of making a language design decision. What this rules out is the idea that the choice does not matter and that we can change it later (i.e. that no sane program can ever rely on raising from memprof callbacks).

As for what I think... To me there is enough evidence that bounding memory usage is useful, perhaps not for everybody, but in use-cases where it is needed it can be crucial. Now, current solutions are not good at all: as you know Out_of_memory is not reliable for memory exhaustion (as opposed to single large allocations) and unlikely to be fixed anytime soon, and the Gc.alarm trick is unlikely to scale to systhreads anytime soon either. Memprof callbacks look much more suitable for that sort of trick because they are finer-grained and one knows where they raise. On the other hand, I admit that it is a long shot.

A concrete use-case would be for implementing memory limits in Coq. Coq currently struggles on memory exhaustion, and for instance somebody asked here https://coq.discourse.group/t/memory-limit-for-commands/408 whether there was a tactic similar to Timeout for memory limits. The Gc.alarm trick is not applicable though because it uses threads. I spoke with @ppedrot and it is not farfetched that with memprof callbacks one would be able to implement such a tactic (with caveats similar to the Timeout tactic).

Maybe what one could do is review the API and implementation details in the perspective of the above use-cases, for instance by implementing per-thread allocation limits as an exercise (unless you tell me it will obviously work well). For instance I see that caml_memprof_suspended is global, which seems suboptimal for the memory limit use-case, for which it would be more natural to enable memprof on a per-thread basis. On the other hand the example of allocation limits would require such a low sampling rate that the overhead is negligible AFAIU. I would volunteer to do this exercise, but my time is currently limited and I would like not to spent too much time on it. I hope this gives you enough information already to make a choice confidently.

Edit: reading my post again, I realise that it is not clear how I reply to your question. I was not sure how to formulate it because I cannot promise to use it myself, but yes, I propose that memprof callbacks can raise and that this sort of use-case is explicitly taken into account.

jhjourdan commented 4 years ago

The bug which lead to failures with arm32 is now fixed. @stedolan : you can review this PR when you have time !

jhjourdan commented 4 years ago

Could we make progress on this?

stedolan commented 4 years ago

Yes! Apologies about the delay, I'll review today.

stedolan commented 4 years ago

Thanks for cleaning this up!

I've come around to the point (made by you and @gasche) that memprof callbacks should allow reentrancy from multiple systhreads. I no longer think that "the simplest approach is to make memprof callbacks single-threaded, using a lock": that could delay callbacks for a long time, and raises some weird deadlock / lock-ordering issues if the callbacks themselves take locks. So thanks for making this reentrant!

The code for reentrancy (the idx_ptr stuff) is tricky, though (Is this the pointers-to-thread-stacks thing you mentioned that motivated #9100 ?). The code looks good, but when reading it, a possible alternative occurred to me. I was wondering if you'd already considered it:

If we maintained a counter for the number of currently-executing callbacks (incremented/decremented by run_callback_exn), then we could make flush_delete a no-op if counter > 0. That way, indexes never change while a callback is running, so we don't need to update *idx_ptr.

Regardless of which approach is taken, this is subtle code that needs some tests that explicitly block in memprof callbacks using systhreads.

jhjourdan commented 4 years ago

If we maintained a counter for the number of currently-executing callbacks (incremented/decremented by run_callback_exn), then we could make flush_delete a no-op if counter > 0. That way, indexes never change while a callback is running, so we don't need to update *idx_ptr.

But then, there is the possibility that the counter never (or rarely) reaches 0, which would create a leak.

Regardless of which approach is taken, this is subtle code that needs some tests that explicitly block in memprof callbacks using systhreads.

Indeed. However, good coverage is particularly difficult.

stedolan commented 4 years ago

If we maintained a counter for the number of currently-executing callbacks (incremented/decremented by run_callback_exn), then we could make flush_delete a no-op if counter > 0. That way, indexes never change while a callback is running, so we don't need to update *idx_ptr.

But then, there is the possibility that the counter never (or rarely) reaches 0, which would create a leak.

Ah, good point! (I don't think this is very likely, but it's certainly possible)

Regardless of which approach is taken, this is subtle code that needs some tests that explicitly block in memprof callbacks using systhreads.

Indeed. However, good coverage is particularly difficult.

True. I'm not expecting perfect coverage of all conditions here, just something that exercises the blocking-in-memprof logic. The simplest example I could think of is a memprof callback that looks like this:

let (rd, wr) = Unix.pipe ()
let block = ref false
let callback = 
  if !block then begin
    block := false;
    Unix.write (Bytes.make 1 '\0') 0 1 |> ignore
  end else begin
    block := true;
    Unix.read (Bytes.make 1 '\0') 0 1 |> ignore
  end
  Some ()

Half of the Memprof calls read from the pipe and block, and the other half write to the pipe. (You'll need a single Unix.write at the end of the program, in case there are an odd number of Memprof calls in total).

jhjourdan commented 4 years ago

Thanks, @stedolan, for the review. This should be ready to merge: I squashed the commits and added a Change entry and wrote some documentation about the fact that [Thread.exit] is unsafe in callbacks.

I also added a modified version of your test: on my computer in native mode, it turned out that most of the callbacks were run on the same thread. I am not entering too much in the details, but I don't think one pipe is enough for doing what you tried. I used two of them. Also, the trick consisting in writing a final byte in the pipe for unblocking the callback unfortunately does not work since we cannot make sure it is not written before all the callbacks have been started.

jhjourdan commented 4 years ago

One last point: both of us are both authors and reviewers of this PR. I don't know what's the policy for code review, but it might be a good idea to get an external review.

OTOH, I don't see anyone else who understands enough of memprof.c for reviewing this.

gasche commented 4 years ago

I would be happy to help in the next few days, ideally on the occasion of an in-person meeting to get a faster feedback loop. It may be nice to ask Damien to look at this as well -- the three of us could meet for this.

OTOH, I don't see anyone else who understands enough of memprof.c for reviewing this.

This sounds precisely like an issue that the requirement for external reviews helps solving.

jhjourdan commented 4 years ago

Thanks @damiendoligez and @gasche , for this detailed review. I integrated all your change proposal. I also wrote a test for the bug you found. It is not clear how to write a test for the other bug.

jhjourdan commented 4 years ago

So this should now be ready to merge, once the tests pass.

ocaml / ocaml

Memprof new api #8920