pierreguilmin opened 3 days ago
Hello,

When solving the (trivial) SDE $d y_t = -y_t\ dt + 0.2\ dW_t$, the Diffrax Euler solver is ~200x slower than a naive for loop. Am I doing something wrong? The speed difference is consistent across various SDEs, solvers, time steps `dt`, and numbers of trajectories, and it appears to be specific to the SDE solvers.
I get them to be a lot closer by using `UnsafeBrownianPath`, which has less overhead than VBT. Diffrax is still a bit slower with this change on my machine, but the difference is smaller (and probably due to the other overhead that Diffrax incurs to enable more features).
There are also some risky (but often useful) changes to UBP that we've made internally and that I've been meaning to put in the fork, so you can definitely do a fair amount with modifications to UBP (enough to get through all three of the stated requirements).
Yup, VBT is often the cause of poor SDE performance. Really we need some kind of LRU caching to make it behave properly, but that doesn't seem to be easy in JAX -- I'm pretty sure it'd require both a new primitive (`cached_call_p`) and a new transform. That's a fairly advanced project for someone to take on!
In the meantime I recommend UBP as the go-to for these kinds of normal 'just solve an SDE' applications.
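For reference, a minimal sketch of that setup, assuming the scalar SDE from the opening post (the time grid and key here are illustrative):

```python
import diffrax as dfx
import jax.random as jr

drift = lambda t, y, args: -y       # dy_t = -y_t dt + 0.2 dW_t
diffusion = lambda t, y, args: 0.2

# UnsafeBrownianPath samples Brownian increments on the fly. It is cheap, but
# it is restricted in how it may be used (e.g. no adaptive step sizing).
bm = dfx.UnsafeBrownianPath(shape=(), key=jr.PRNGKey(0))
terms = dfx.MultiTerm(dfx.ODETerm(drift), dfx.ControlTerm(diffusion, bm))

# Note: UBP does not work with the default adjoint (see further down the
# thread), hence DirectAdjoint here.
sol = dfx.diffeqsolve(terms, dfx.Euler(), t0=0.0, t1=1.0, dt0=1e-3, y0=1.0,
                      adjoint=dfx.DirectAdjoint())
```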
I think a lot of people are put off by the `Unsafe` in the name; it might be worth adding a sentence like that to the docs ("In the meantime I recommend UBP as the go-to for these kinds of normal 'just solve an SDE' applications.").
Thanks. Indeed, using UBP does help, but I understand it's quite restricted in terms of usage.
> Diffrax is still a bit slower with this change on my machine, but the difference is smaller (and probably due to the other overhead that Diffrax incurs to enable more features).
It seems there is still a factor of ~10-20 difference (irrespective of the number of time steps) between the homemade solver and Diffrax with UBP. I would have naively thought that any irrelevant computation would be jitted away. Could you elaborate on what Diffrax with UBP does compared to the naive solver?
Diffrax (VBT): 7.51 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Diffrax (UBP): 637 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Naive: 28.5 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Diffrax has a lot more checking/shaping/logging than the naive implementation. You can see this reflected in the jaxprs:
I believe most of this comes from the UBP, since if I do
    # assumes t0, t1, ndt, dt, drift, diffusion and brownian_motion (a UBP) are defined
    @jax.jit
    def homemade_simu():
        ts = jnp.linspace(t0, t1, ndt)

        def step(y, t):
            # query the Brownian increment over [t, t + dt] on the fly
            dw = brownian_motion.evaluate(t, t + dt)
            dy = drift(None, y, None) * dt + diffusion(None, y, None) * dw
            return y + dy, y

        # [-1] selects the stacked per-step outputs from the scan
        return jax.lax.scan(step, 1.0, ts)[-1]
I see that the times are pretty much the same. Perhaps this does indicate that there is room for cutting down the UBP-related overhead.
FWIW I think the speed difference here does seem unacceptably large. This seems like it should be improved.
Starting with the low-hanging fruit, to be sure we're doing more of an equal comparison: can you try setting `EQX_ON_ERROR=nan` and `diffeqsolve(throw=False)`, to disable all error checks? Those are fairly slow.
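Concretely, that looks something like the following sketch (I believe `EQX_ON_ERROR` is read when Equinox is first imported, so it has to be set early; alternatively export it in the shell):

```python
import os

os.environ["EQX_ON_ERROR"] = "nan"  # runtime error checks produce NaNs instead of raising

import diffrax as dfx  # imported only after the environment variable is set

# ...build terms/solver as usual, then additionally pass throw=False so that
# diffeqsolve never raises on an unsuccessful solve:
# sol = dfx.diffeqsolve(terms, solver, t0, t1, dt0, y0, throw=False)
```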
Also, can you try using `stepsize_controller=StepTo(...)`? By default Diffrax does not recompile if the number of steps changes (e.g. because `t1` changes), but a `lax.scan` implementation does. Diffrax pays a small amount of runtime cost for this generality. Using `StepTo` instead bakes in the discretisation in the same way as a `lax.scan`.
With `throw=False`, `EQX_ON_ERROR=nan`, and `StepTo`, this is what I see (Diffrax top, custom bottom):

    2.18 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    109 µs ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without any of those things I had:

    2.43 ms ± 666 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    110 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(All of this is on CPU, just a slower CPU, but the 20-30x slowdown seems to be of the same scale.)
So you definitely don't want `DirectAdjoint`: this is actually really slow and should be avoided if possible. (It exists to handle some autodiff edge cases; I'd love to remove it sometime...) Use the default instead.
Make sure you include an argument (say `y0`) to both jitted functions -- XLA may have different behavior around constant folding.
I'd also try with and without `SaveAt(steps=True)` (and adjusting the scan appropriately). I think this should be equivalent either way, but I'm not 100% certain.
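Putting those suggestions together, the Diffrax side of the comparison might look like this sketch (reusing `terms`, `ts`, `t0`, `t1`, and `ndt` from the earlier snippets):

```python
import jax
import diffrax as dfx

@jax.jit
def diffrax_simu(y0):  # y0 is an argument, so XLA cannot constant-fold the solve
    sol = dfx.diffeqsolve(
        terms, dfx.Euler(), t0, t1, dt0=None, y0=y0,
        stepsize_controller=dfx.StepTo(ts=ts),
        saveat=dfx.SaveAt(steps=True),  # save every step, like the scan's stacked ys
        adjoint=dfx.DirectAdjoint(),
        max_steps=ndt - 1, throw=False,
    )
    return sol.ys
```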
With all of the above in, there shouldn't actually be that much difference between the two implementations. (And if there is, then we should figure out what it is.)
The default actually errors with UBP, which is why I changed to `DirectAdjoint`:

    ValueError: `adjoint=RecursiveCheckpointAdjoint()` does not support `UnsafeBrownianPath`. Consider using `adjoint=DirectAdjoint()` instead.
Ah, right. I've just checked, and in the case of an unsafe SDE we do actually arrange for `DirectAdjoint` to do a scan, so that should be fine.

(In retrospect I think we could have arranged for the default adjoint to do the same thing; that might be a small usability improvement.)
Anyway, that's everything off the top of my head -- I might be forgetting something, but with these settings I think Diffrax should be doing something similar to the simple `lax.scan` implementation. But clearly we're missing something!

(EDIT: I've just noticed we still have one discrepancy: generating the Brownian samples in advance vs on the fly.)
If you'd like to dig into this then it might be time to stare at some jaxprs or HLO for the two programs. If you want to do this at the jaxpr level then you might find `eqxi.finalise_jaxpr` (and friends) to be a useful set of tools here:

https://github.com/patrick-kidger/equinox/blob/main/equinox/internal/_finalise_jaxpr.py
Many primitives exist just to add e.g. an autodiff rule, so we can simplify our jaxprs down to what actually gets lowered by ignoring that and tracing through their impl rules instead.
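The first step of that comparison is just plain `jax.make_jaxpr` (nothing Diffrax-specific; `homemade_simu` and `diffrax_simu` here are the two jitted functions from earlier, and the finalisation helpers linked above can then be applied to the result):

```python
import jax

# Print the jaxpr each implementation lowers to, for a side-by-side comparison.
print(jax.make_jaxpr(homemade_simu)(1.0))
print(jax.make_jaxpr(diffrax_simu)(1.0))
```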
`DirectAdjoint` does slow things down, but not all the way. If I switch to a branch that allows UBP + the recursive adjoint, it's faster but there is still around a ~4x gap. If I account for the fact that UBP has to split keys but the other implementation doesn't, I get the gap down to around ~1.1-1.2x (which maybe isn't ideal, but seems much more reasonable to me given there are probably some other if-statements/logging that might exist).
    from timeit import Timer

    x = Timer(lambda: diffrax_simu(y0).block_until_ready())
    print(x.timeit(number=100))
    x = Timer(lambda: homemade_simu(y0).block_until_ready())
    print(x.timeit(number=100))
Timings (Diffrax first, homemade second):

- with all of the above (`EQX_ON_ERROR=nan`, saved steps, function input, `StepTo`, `max_steps`, etc.) and `DirectAdjoint`: 0.002462916076183319 / 0.0005935421213507652
- with `RecursiveCheckpointAdjoint` (on an internal branch that has some UBP changes to work with checkpointing): 0.002062791958451271 / 0.0005716248415410519
- with both splitting keys: 0.0019747079350054264 / 0.001669874880462885
(The code changed to:

    # assumes dt, drift, diffusion, a PRNG `key` and a `steps` array are defined
    @jax.jit
    def homemade_simu(yy):
        def step(carry, _):
            y, k = carry
            # split a fresh key every step, mirroring the work UBP has to do
            k, subkey = jax.random.split(k)
            dw = jnp.sqrt(dt) * jax.random.normal(subkey)
            dy = drift(None, y, None) * dt + diffusion(None, y, None) * dw
            return (y + dy, k), y
        return jax.lax.scan(step, (yy, key), steps)[-1]

)
Aha, interesting! Good to have more-or-less gotten to the bottom of the cause of this.
So:

1. It would be worth working out what extra work `RecursiveCheckpointAdjoint` does, and how that compares to the unsafe-SDE branch of `DirectAdjoint`.
2. A change to `UnsafeBrownianPath` (perhaps behind a flag) could make it possible to precompute things.

On point 2, I suspect the solution may require allowing the control to have additional state. (Which is also what we'd need to make VBT faster.) Perhaps it's time to bite that bullet and allow for that to happen. Happy to hear suggestions on this one! A couple of wrinkles, with a sketch of the idea after this list:

- The control is currently queried by `t`, not the step index. We'd also have to have a way to pass the number of steps etc. to the control. FWIW I'd probably lean towards not having a flag and just always doing this when possible.
- We'd need `AbstractSolver.step` to also accept the control state, pipe it through appropriately, and then also return the updated state. Unfortunately I think we're looking at a hard break to both the control and the solver APIs here, but c'est la vie.
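As a purely hypothetical sketch of the "control with state" idea (none of these names are real Diffrax API; `PrecomputedBrownianPath`, `create`, and `evaluate_step` are made up for illustration):

```python
import equinox as eqx
import jax
import jax.numpy as jnp
import jax.random as jr

class PrecomputedBrownianPath(eqx.Module):
    """Hypothetical stateful control: every increment is drawn up front, so
    the per-step work is just an indexed read (no key splitting)."""

    increments: jax.Array  # increments[i] ~ Normal(0, dt)

    @staticmethod
    def create(t0, t1, num_steps, key):
        dt = (t1 - t0) / num_steps
        increments = jnp.sqrt(dt) * jr.normal(key, (num_steps,))
        return PrecomputedBrownianPath(increments)

    def evaluate_step(self, i):
        # Queried by step index rather than by time t -- exactly the API
        # change discussed above.
        return self.increments[i]
```

The solver's `step` would then receive this object (plus the current index) as extra state, and return it along with the updated solver state.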