@tpapp @devmotion thoughts? This is a strict addition of features, it does not modify any of the existing dispatches.
I think the test errors are due to the breaking version of Enzyme, which is why #38 might have higher priority
Thanks for the ping, I was a bit busy last week to review this.
This looks like a very lightweight addition that at the same time enables the use of DifferentiationInterface (for all supported backends). It extends the functionality of the package and, in the long run, also allows replacing existing backends with DI as the code matures.
Tests currently do not run, I think Enzyme compat needs to be broadened.
No worries, thanks for the review, I'll take your remarks into account.
> Tests currently do not run, I think Enzyme compat needs to be broadened.
Not possible: the Enzyme v0.13 change was very breaking, and DI cannot afford to support every past version, so I used their breaking change as an opportunity to tag mine as well. Perhaps, as a temporary solution, we could run the DI tests in a separate environment where Enzyme is not included?
@gdalle: note: we just merged #38.
@willtebbutt does this clash with the Mooncake extension for LogDensityProblemsAD?
I have no idea -- if it does, I'm more than happy to remove my extension and rely on the contents of this PR. Me having to look after less code is never a problem.
@gdalle: I am wondering if a wrapper like `with_preparation(ADgradient(backend, ℓ), zeros(3))` could provide a reasonable API, without keywords. We would not even need a separate `DIgradient` struct; the existing one could default to `prep = nothing`, and the above would just replace it with `x`.
My idea here was to mimic the existing API as closely as possible. Some constructors of `ADgradient` using symbols can also take an `x` as an optional keyword argument:
https://github.com/tpapp/LogDensityProblemsAD.jl/blob/2ce49ce6705bbf35e46ee328f793b9eaaf78546c/ext/LogDensityProblemsADForwardDiffExt.jl#L96-L99
https://github.com/tpapp/LogDensityProblemsAD.jl/blob/2ce49ce6705bbf35e46ee328f793b9eaaf78546c/ext/LogDensityProblemsADReverseDiffExt.jl#L45-L47
They also take other kwargs like config or compile information, but with ADTypes this is stored in the backend object itself, so we no longer need to pass it.
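For concreteness, a hedged illustration of that existing keyword-based API (assuming `ℓ` is a log density implementing the LogDensityProblems interface; keyword names follow the linked extensions and may differ across versions):

```julia
using LogDensityProblems, LogDensityProblemsAD
import ForwardDiff, ReverseDiff

x0 = zeros(LogDensityProblems.dimension(ℓ))                       # `ℓ` is an existing log density problem
∇ℓ_fd = ADgradient(:ForwardDiff, ℓ; x = x0)                       # builds the gradient config up front
∇ℓ_rd = ADgradient(:ReverseDiff, ℓ; compile = Val(true), x = x0)  # records (and compiles) a tape at x0
LogDensityProblems.logdensity_and_gradient(∇ℓ_fd, x0)
```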
Tests pass locally
@devmotion @tpapp is this better with the latest changes?
@gdalle: Thanks for the recent updates. I understand and appreciate that you want to keep the API consistent with the existing one.
However, that API predates the AD-unification libraries (like DI) and is not particularly well designed, because it does not reify the AD process. Specifically, I now believe that the ideal API would be something like `ADgradient(how, P)`, where `P` is an ℝⁿ→ℝ function and `how` contains all information on how to AD. In contrast, currently we have `ADgradient(how_backend, P; how_details...)`, and your PR (in its current state) extends the existing API in this direction.
In fact, DI does not reify `how` either: if you want preparation, you do it via one of the API functions.
So there are two questions:
- do we want to keep the existing API `ADgradient(how_backend, P; how_details...)`, either in the short run or forever,
- if not, do we take this opportunity to improve it, i.e. can DI provide an API that reifies `how` in a way that makes sense (I am assuming this is possible, please correct me if it is not).

I appreciate your work on DI and your PRs here, and please understand that I am not pushing back on changes. I think DI is a great idea, but I want to do it right so that this package breaks its own API the fewest times possible (eventually, I want to encourage users to move on to the new API, and deprecate & remove the existing one).
@devmotion, what do you think?
Thanks for your kind answer @tpapp, and for your work on this part of the ecosystem.
> - do we want to keep the existing API `ADgradient(how_backend, P; how_details...)`, either in the short run or forever,
> - if not, do we take this opportunity to improve it,
In my view, the AD extensions of LogDensityProblemsAD filled a crucial void when DI did not exist. Now that DI is in a pretty good state, I don't know if this `ADgradient` API will remain necessary for much longer. Thus, my proposal was a minimally invasive insertion, designed to encourage gradual pivots to DI in the future without needing breaking changes here or in Turing. Perhaps someday, when DI is truly ready, we won't even need LogDensityProblemsAD at all?
Of course, to get peak performance or avoid some bugs, you still want to tune the bindings for every backend. But if every Julia package does that separately, it is a huge waste of time and LOCs. My hope is that this tuning can be done in a single place and fit 99% of use cases, which is what DI is for. I'm always open to suggestions for performance or design improvements. Besides, the case that we are tackling here (gradient of array-input function with constant contexts) is exactly the case where we can be extremely performant with DI, which makes it a prime candidate for the switch.
> - can DI provide an API that reifies `how` in a way that makes sense (I am assuming this is possible, please correct me if it is not).
The DI interface with preparation looks like this: `gradient(f, prep, backend, x, contexts...)`, where `backend` is an object from ADTypes.jl and `prep` is the result of `prepare_gradient(f, backend, typical_x, typical_contexts...)`.
In your terms:
- `backend` encapsulates the `how` that applies to every function and input (number of chunks for ForwardDiff, compilation behavior for ReverseDiff, mode for Enzyme, etc.)
- `prep` encapsulates the `how` that is specific to the function `f` and to the type and size of the input `typical_x` (configs for ForwardDiff, tape for ReverseDiff, etc.)

This shows that there are two sides to the `how`, and I think it makes sense to distinguish them.
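To make the workflow concrete, here is a minimal sketch of that preparation pattern (the toy function `f`, the context value, and the ForwardDiff backend are illustrative assumptions, not part of this PR):

```julia
using DifferentiationInterface
using ADTypes: AutoForwardDiff
import ForwardDiff

f(x, c) = c * sum(abs2, x)            # ℝⁿ → ℝ, with a constant context c
backend = AutoForwardDiff()
typical_x = zeros(3)

prep = prepare_gradient(f, backend, typical_x, Constant(2.0))   # pay the setup cost once
g = gradient(f, prep, backend, [1.0, 2.0, 3.0], Constant(2.0))  # fast repeated evaluations
```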
So where do you wanna go from here?
> Perhaps someday, when DI is truly ready, we won't even need LogDensityProblemsAD at all?
Possibly, but that is speculation. At the moment, there is no generic AD wrapper interface that provides what this package does. Preparation, as you explained above, is one example.
> So where do you wanna go from here?
I want to reflect a bit on this, and also hear comments from the users.
Currently I am leaning towards cleaning up the interface the following way:
- `ADgradient(how, P)`, where `how` encapsulates everything we need for AD,
- each backend gets a constructor that replaces the current `Val{symbol}` and `backend::Symbol` API. This constructor takes keywords and whatever is needed.
We could of course merge your PR as is, then later deprecate this.
> At the moment, there is no generic AD wrapper interface that provides what this package does.
Well, I would love for DI to provide this. What do you think is missing then?
> We could of course merge your PR as is, then later deprecate this.
The idea of this PR was to be minimally invasive, so that you can gradually drop extensions in favor of a better-maintained and tested DI implementation. Therefore, I think it is a good idea to merge it before a possible breaking revamp of LogDensityProblemsAD, especially if you want to use more of DI in the revamp.
> What do you think is missing then?
A way to pack everything in `how` (including prep, and whatever is needed), as explained above.
> I think it is a good idea to merge it before a possible breaking revamp of LogDensityProblemsAD
As I said above, that is a possibility I am considering. I will wait for thoughts from @devmotion.
> A way to pack everything in `how` (including prep, and whatever is needed), as explained above.
If you want to use only DI, this is as simple as something like
```julia
struct How{B,P}
    backend::B
    prep::P
end
```
But if you want to also adapt this to your existing extensions, then of course it's a bit more work. I'll let you weigh the pros and cons.
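As a sketch only (assuming the `How` struct above; `make_how` and `di_logdensity_and_gradient` are invented names, not existing API), packing and using it could look like:

```julia
using ADTypes, DifferentiationInterface
import LogDensityProblems

function make_how(backend::ADTypes.AbstractADType, ℓ)
    f = Base.Fix1(LogDensityProblems.logdensity, ℓ)
    typical_x = zeros(LogDensityProblems.dimension(ℓ))      # typical input taken from the problem
    return How(backend, prepare_gradient(f, backend, typical_x))
end

function di_logdensity_and_gradient(how::How, ℓ, x)
    f = Base.Fix1(LogDensityProblems.logdensity, ℓ)
    return value_and_gradient(f, how.prep, how.backend, x)  # DI call, identical for every backend
end
```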
Hmm... I think conceptually keeping `backend` and `prep` separated feels a bit cleaner to me. There's information about the desired AD backend that is independent from the log density problem, its dimension, etc. (e.g., I want to use Enzyme + reverse mode), and there's information that depends on the problem at hand (e.g., type and length of the input to the log density function, the function itself). Having them separate makes it easier to pass the problem-independent settings around and reuse them for other log-density problems. For instance, in Turing a user might want to specify the AD backend to be used by the sampler, but at that point (when constructing the sampler) the actual log-density problem is not created yet (that only happens internally in Turing).
What I dislike about the `Val` interface is that it does not allow passing around any additional information apart from the AD backend, and hence the keyword arguments contain both problem-independent information (like the Enzyme mode or ForwardDiff chunk size or tags) and problem-dependent information (like a typical input).
I think deprecating or directly replacing the `Val` interface with the ADTypes interface would resolve this issue. Everything that's problem-independent you could store and reuse by specifying the ADType, and problem-dependent settings such as typical inputs you could specify with keyword arguments.
@devmotion I agree that ADTypes are overall more expressive than symbols, which is why they were introduced. But even deprecating the `Val` API won't solve the heterogeneity between backends. Currently, you need to pass different keyword arguments depending on which backend you want to use (`shadow` for forward Enzyme, `chunks` for ForwardDiff, etc.). The appeal of DI is to perform this preparation in the same way everywhere, so that the user can just pass `x` and switch backends transparently while preserving performance.
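As an illustration of that transparency (toy function and backend choices are my own, not from this PR):

```julia
using DifferentiationInterface
using ADTypes: AutoForwardDiff, AutoReverseDiff
import ForwardDiff, ReverseDiff

f(x) = -sum(abs2, x) / 2          # toy log density
x = randn(10)

for backend in (AutoForwardDiff(), AutoReverseDiff(; compile = true))
    prep = prepare_gradient(f, backend, zero(x))   # backend-specific setup hidden behind one call
    g = gradient(f, prep, backend, x)              # identical call for every backend
end
```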
In my previous attempt #29, the main obstacles to full DI adoption were:
- the handling of `Const` for Enzyme,
- an issue with Tracker.

The first one has been resolved, and the second one is much more a Tracker issue than a DI one. @tpapp concluded his review of #29 by saying (emphasis mine):
> Yes, this package by necessity and historical reasons duplicates a lot of functionality in an abstract AD metapackage. This was made much easier by the fact that we only care about ℝⁿ → ℝ functions. But the code is already there and in most cases it works fine.
Sure, your own AD interface has already been written, but it still needs to be updated whenever any backend changes (e.g. #37 and #38 for the latest Enzyme). Since DI is to become the standard (already used in Optimization.jl, NonlinearSolve.jl and more), it will remain actively maintained and react to evolutions of the ecosystem (like the new Mooncake.jl package). The way things work at the moment, you also need to perform the same adaptations in parallel, or give up on the latest features, both of which are a bit of a waste.
> But even deprecating the `Val` API won't solve the heterogeneity between backends. Currently, you need to pass different keyword arguments depending on which backend you want to use (`shadow` for forward Enzyme, `chunks` for ForwardDiff, etc.).
I think it would. I think the only keyword left should be a typical input `x`. The other options seem to belong to, and are part of, the ADTypes: `mode` for Enzyme, `fdm` for FiniteDifferences, `tag` and `chunk` for ForwardDiff, and `compile` for ReverseDiff. `shadow` is a bit strange, but I don't think anyone has ever used it, and it could be constructed based on the typical `x`, so I think it should be removed.
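For reference, these options already have a home in the ADTypes constructors (a sketch; exact keyword names are version-dependent, and the backend packages are only needed for the mode/fdm objects):

```julia
using ADTypes
import Enzyme, FiniteDifferences

AutoForwardDiff(; chunksize = 4)                                    # chunk (and tag) for ForwardDiff
AutoReverseDiff(; compile = true)                                   # compile for ReverseDiff
AutoEnzyme(; mode = Enzyme.Reverse)                                 # mode for Enzyme
AutoFiniteDifferences(; fdm = FiniteDifferences.central_fdm(5, 1))  # fdm for FiniteDifferences
```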
Yes you're right, `shadow` was the only example in the category of "not `backend`, not `x`".
So if Tamas agrees, I guess the question is whether you want to deprecate the `Val` interface by switching directly to DI, or first deprecate it on your own.
@gdalle: I would prefer to do it like this:
1. Add an `ADgradient` method that implements via DI. It should not dispatch on ADTypes though; the user should indicate that they want DI specifically. It is my expectation that in the long run, calling `ADgradient` on ADTypes directly will dispatch to this method, but I want to keep this level of indirection. We can work out the syntax, suggestions welcome.
2. Once that is in place, make the current `Val{}` methods forward to it everywhere it is applicable, after careful examination of each case. This would remove a lot of redundant code from this package and make it easier to maintain, as you suggest.
@devmotion:
> I think it would. I think the only keyword left should be a typical input `x`.
So the only use case for this is preparation? I will need some time to look into the DI code to see what it does exactly: does it need a type (like a `Vector` or `SVector`, does the distinction matter), or a "typical" value, or something else? I am asking because `LogDensityProblems` can supply some of that, i.e. problems know their input length.
I need some time to read up on this, I will be away from my computer for the weekend but I will get back to this topic on Tuesday.
Suggestions welcome. @gdalle, I appreciate your work a lot on DI and want to move forward with this, but I need to understand the details so that we can make a smooth transition, and for that I need time.
I expect that this package is not fully replaceable by DI, as it does a few extra things (again, a "problem" defined through this API knows about its dimension and AD capabilities), but I agree that we should remove redundancies.
> add an `ADgradient` method that implements via DI. It should not dispatch on ADTypes though; the user should indicate that they want DI specifically. It is my expectation that in the long run, calling `ADgradient` on ADTypes directly will dispatch to this method, but I want to keep this level of indirection. We can work out the syntax, suggestions welcome.
Fair enough! How about the following toggle?
```julia
ADgradient(backend::AbstractADType, l, ::Val{DI}=Val(false); kwargs...) where {DI}
```
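In other words (hypothetical usage of this toggle; nothing here is merged API, and `ℓ` is an assumed log density problem):

```julia
∇ℓ_custom = ADgradient(AutoForwardDiff(), ℓ)             # default Val(false): existing custom extension
∇ℓ_di     = ADgradient(AutoForwardDiff(), ℓ, Val(true))  # opt in to the DI-based implementation
```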
> So the only use case for this is preparation?
DI's design is interesting because preparation is effectively unlimited. We can put whatever we want in the `prep` object, as long as it speeds up gradient computations on similar inputs down the road. So we only need this one "use case" to cover everything the backends do: ForwardDiff configs, ReverseDiff tapes, FiniteDiff caches, Enzyme duplicated buffers, and so on.
See examples in the DI tutorial.
> does it need a type (like a `Vector` or `SVector`, does the distinction matter), or a "typical" value, or something else?
It needs an actual value, because things like the size of the vector are also important (and they are usually not part of the type). You can read more about the preparation system in the DI docs.
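A sketch of how the problem itself could supply that value (assuming a LogDensityProblems-compatible `ℓ` and a loaded ForwardDiff; the wiring is illustrative, not part of any package):

```julia
using DifferentiationInterface, LogDensityProblems
using ADTypes: AutoForwardDiff
import ForwardDiff

f = Base.Fix1(LogDensityProblems.logdensity, ℓ)
typical_x = zeros(LogDensityProblems.dimension(ℓ))   # an actual value: the length matters, not just the type
prep = prepare_gradient(f, AutoForwardDiff(), typical_x)
```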
@gdalle: I have read the DI docs and skimmed the source code. First, kudos on trying to organize all DI approaches into a coherent interface, it is a huge undertaking but should have a large payoff for the ecosystem in the long run.
I have some preliminary thoughts regarding the interface of LogDensityProblems and DI.
First, in LogDensityProblems, the interface is focused on being functional:
- the `ℓ` argument can be assumed to have no state (unless explicitly requested, cf #8),
- it can be called with arbitrary `x`s as long as they are `AbstractVector{<:Real}`, and the implementations have complete freedom. Calls are not enforced to be consistent: you can call it one moment with a `Vector{Float64}`, then an `SVector{3,Float32}`, etc. (cf #3).
The interface has no API to handle situations when the caller promises to use the same argument types, or values, in exchange for a potential speed benefit.
I am entertaining the idea that we should expose "preparation" in the API (as defined in the main interface package, LogDensityProblems.jl), where the caller promises to call the problem with the same argument type over and over, in exchange for speedups, and maybe preallocate stuff. The API should allow querying the argument type above and whether the object is mutable (thread safety).
Once we implement that, we can flesh out the AD interface using DI and that API. That is to say, preparation would not be exposed via DI, but via our own API that forwards to DI.
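Purely as a strawman (every name below is invented for illustration; none of it is existing or proposed code):

```julia
import DifferentiationInterface, LogDensityProblems

struct PreparedADGradient{L,B,P}   # hypothetical wrapper, not existing code
    ℓ::L
    backend::B
    prep::P
end

# The caller promises to keep calling with arguments like `typical_x`, and gets a prepared object back.
function prepare(∇ℓ_backend, ℓ, typical_x)
    f = Base.Fix1(LogDensityProblems.logdensity, ℓ)
    prep = DifferentiationInterface.prepare_gradient(f, ∇ℓ_backend, typical_x)
    return PreparedADGradient(ℓ, ∇ℓ_backend, prep)
end
```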
I am still thinking about the details, but this is the general direction I am considering; I also need to understand sparse coloring and its relation to preparation.
> Calls are not enforced to be consistent
This is a big difference indeed, and I understand why you would want to change your interface to accommodate it. Note that, at the moment, some backends already perform preparation when you pass `x`, so I'm not sure what actually happens when you change the input type?
> I need to also understand sparse coloring and its relation to preparation.
Coloring is not relevant for gradients, because a gradient is always dense (or you have some useless inputs) and can be computed in O(1) function-call equivalents. Sparse AD is only useful when matrices are returned (Jacobians and Hessians).
Sorry, I merged the review suggestions without checking for typos; fixed now. The tests pass locally.
@devmotion, @gdalle: would a minor version bump be OK for this? After all, we just add new features, even though the change is extensive.
Yes, I think a minor release is appropriate here.
This PR adds a teeny tiny extension for DifferentiationInterface (#26). It can compute gradients for any `ADTypes.AbstractADType` that is not in the following list:
- `AutoEnzyme`
- `AutoForwardDiff`
- `AutoReverseDiff`
- `AutoTracker`
- `AutoZygote`
That way, your custom implementations remain the default, but for all other AD backends defined by ADTypes (and not symbols), DifferentiationInterface will kick in. This also allows you to gradually remove custom implementations in favor of DifferentiationInterface, if you so desire.
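For example (assuming a LogDensityProblems-compatible `ℓ`; FiniteDiff is just one backend outside the list above):

```julia
using LogDensityProblems, LogDensityProblemsAD, ADTypes
import DifferentiationInterface, FiniteDiff

∇ℓ = ADgradient(AutoFiniteDiff(), ℓ)   # no custom extension for this backend, so DI kicks in
LogDensityProblems.logdensity_and_gradient(∇ℓ, zeros(LogDensityProblems.dimension(ℓ)))
```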
Ping @willtebbutt @torfjelde @adrhill
Note: since DI imposes Enzyme v0.13 in the tests, it may require merging #38 first.