Multi-pathfinder is not reproducible

sefffal commented 1 week ago

Hi @sethaxen , and thanks for the great package.

I noticed recently that multipathfinder with nruns>1 is not reproducible between repeated calls, regardless of seeding the global RNG or passing an rng argument.

I'm not sure how to go about minimizing this, but any help you could provide would be much appreciated!

Thanks, William

sethaxen commented 1 week ago

Thanks for the issue, @sefffal . Multi-path Pathfinder should indeed be reproducible with a seeded RNG. Could you provide a minimal non-reproducible example?

sefffal commented 1 week ago

Okay that's good to hear! In that case, it's probably user-error.

This is the call from within Octofitter:

result_pf = Pathfinder.multipathfinder(
      ldm, ndraws;
      nruns,
      init_sampler=init_sampler,
      progress=verbosity > 1,
      maxiters=25_000,
      reltol=1e-6,
      rng=rng,
      ntries=1,
      executor=Pathfinder.Transducers.PreferParallel(),
      optimizer=Pathfinder.Optim.LBFGS(;
          m=6,
          linesearch=Pathfinder.Optim.LineSearches.BackTracking(),
          alphaguess=Pathfinder.Optim.LineSearches.InitialHagerZhang()
      )
  )

I'll work on a runnable MWE.

sefffal commented 1 week ago

Here is a runnable MWE. I ran it on Julia 1.10, with 8 threads. It's reproducible without the PreferParallel executor.

using Octofitter, # Install `Octofitter#main`
      Distributions
astrom_like = PlanetRelAstromLikelihood(
    # Your data here:
    # units are MJD, mas, mas, mas, mas, and correlation.
    (epoch = 50000, ra = -505.7637580573554, dec = -66.92982418533026, σ_ra = 10, σ_dec = 10, cor=0),
    (epoch = 50120, ra = -502.570356287689, dec = -37.47217527025044, σ_ra = 10, σ_dec = 10, cor=0),
    (epoch = 50240, ra = -498.2089148883798, dec = -7.927548139010479, σ_ra = 10, σ_dec = 10, cor=0),
    (epoch = 50360, ra = -492.67768482682357, dec = 21.63557115669823, σ_ra = 10, σ_dec = 10, cor=0),
    (epoch = 50480, ra = -485.9770335870402, dec = 51.147204404903704, σ_ra = 10, σ_dec = 10, cor=0),
    (epoch = 50600, ra = -478.1095526888573, dec = 80.53589069730698, σ_ra = 10, σ_dec = 10, cor=0),
    (epoch = 50720, ra = -469.0801731788123, dec = 109.72870493064629, σ_ra = 10, σ_dec = 10, cor=0),
    (epoch = 50840, ra = -458.89628893460525, dec = 138.65128697876773, σ_ra = 10, σ_dec = 10, cor=0),
)
@planet b Visual{KepOrbit} begin
    a ~ Uniform(0, 100) # AU
    e ~ Uniform(0.0, 0.99)
    i ~ Sine() # radians
    ω ~ UniformCircular()
    Ω ~ UniformCircular()
    θ ~ UniformCircular()
    tp = θ_at_epoch_to_tperi(system,b,50000) # use MJD epoch of your data here!!
end astrom_like
@system Tutoria begin # replace Tutoria with the name of your planetary system
    M ~ truncated(Normal(1.2, 0.1), lower=0.1)
    plx ~ truncated(Normal(50.0, 0.02), lower=0.1)
end b
model = Octofitter.LogDensityModel(Tutoria)

result_pf = Pathfinder.multipathfinder(
    model, 1000;
    rng=Xoshiro(0),
    nruns=8,
    progress=true,
    maxiters=25_000,
    reltol=1e-6,
    ntries=1,
    executor=Pathfinder.Transducers.PreferParallel(),
    optimizer=Pathfinder.Optim.LBFGS(;
        m=6,
        linesearch=Pathfinder.Optim.LineSearches.BackTracking(),
        alphaguess=Pathfinder.Optim.LineSearches.InitialHagerZhang()
    )
);
result_pf.draws[end]

sethaxen commented 1 week ago

I can confirm that the MWE is not reproducible. This is strange, since we explicitly test for reproducibility: https://github.com/mlcolab/Pathfinder.jl/blob/01e594df950cdd949e14c40578a1258ec6adb1c5/test/multipath.jl#L71-L77.

My initial guess would be that your model has some state (e.g. a cache). In such cases, multithreading would be unsafe, since the state could be modified by any one of the threads, producing different results in each run. This is why the docstring explains that a multi-threading executor should only be used if the log-density function (or in your case, the model) is thread-safe. Do you think this could explain the problem here?

One way we could work around this internally would be to deepcopy the log-density function each time before calling single-path Pathfinder on it. I think there was a good reason we didn't do this (probably because it slowed things down quite a bit for fast models or uses up lots of memory for huge thread-safe models), but if this is the issue, we could maybe e.g. add a keyword to indicate the thread-safeness of the log-density function.

sefffal commented 1 week ago

Thanks @sethaxen, then it certainly sounds like an error on my part. Besides using Bumper.jl for some internal buffers (which is supposed to be thread-safe), the only thing I can think of is this section of the gradient calculation:

ForwardDiff.gradient!(diffresults[Threads.threadid()], ℓπcallback, ...)

I'll run some tests to confirm that the non-determinism is on my end, and is, I'll closeclose this issue. Thanks!

sefffal commented 1 week ago

Thanks @sethaxen , and apologies for the noise. My gradient calculation is indeed not thread safe. Cheers, and thanks again for Pathfinder!

mlcolab / Pathfinder.jl

Multi-pathfinder is not reproducible #219