results depend on integration time

nchoksi4456 commented 3 years ago

I've been simulating two-planet systems and noticed in some cases that I get totally different evolutions depending on the max time I run for. For example, the following code:

sim.automateSimulationArchive("test1.bin", interval=1*cgs.yr,deletefile=True) 
sim.move_to_com()
max_time = 3e4*cgs.yr  # Run for 3e4 yr 
Noutputs = int(( max_time/dt)) # I've been using dt = 1 yr as a default
times = np.linspace(0,  max_time, Noutputs)
for i,tv in enumerate(times): 
      sim.integrate(tv, exact_finish_time = 0)

produces plot test1 below.

But if I halve the runtime to max_time = 1.5e4*cgs.yr (all else left fixed), I get the drastically different evolution seen in test2. For example, in test1, the two planets collide with each other at ~7e3 yr (you can see the pink curve cuts off at that time), and in test2 one of the planets collides with the star at 1e4 yr.

In case it is relevant, I'm using IAS15 and including damping via a (modified) version of your modify_orbits_forces script.

dtamayo commented 3 years ago

Hi Nick,

This is a good question that a lot of people run into. The issue is that when planets are colliding the evolution is chaotic, which means that any change can lead to completely different outcomes. In this case, the fact that your two simulations stop at different arrays of times, means that differences at machine precision grow exponentially to change the phases and times of collision in the two simulations.

If you want to be able to get the exact same trajectory, this is where SimulationArchives are very helpful (look at the two examples in ipython_examples). The idea is that you run a simulation saving snapshots in a simulation archive (and don't take any output). Then when you want to analyze the simulation, you can get output at arbitrary times, and you're always referencing the same trajectory (since you're loading the nearest snapshot of the exact same chaotic realization each time, and integrating to your output time without there being enough time for chaos to completely change the answer). See Rein & Tamayo 2016 for more details.

Once you start using them you'll never go back :)

hannorein commented 3 years ago

I don't think this is correct. Nick is using exact_finish_time = 0 so the times where he samples the simulation should not matter. To say more, I'd need a working example.

hannorein commented 3 years ago

Here's a simple example. Both runs return exactly the same position as expected. It might be that there's something else in Nicks code which breaks reproducibility.

import rebound
import rebound.data
sim = rebound.Simulation()
rebound.data.add_outer_solar_system(sim)
sim.integrate(1,exact_finish_time=0)
sim.integrate(2,exact_finish_time=0)
"%.20f"%sim.particles[1].x

sim = rebound.Simulation()
rebound.data.add_outer_solar_system(sim)
sim.integrate(2,exact_finish_time=0)
"%.20f"%sim.particles[1].x

nchoksi4456 commented 3 years ago

Hi Hanno and Dan. I've uploaded a full working example "test.py" here as requested. One issue is that I am using my own version of modify_orbits_forces -- I've also included the script "df2_full.c" for that here but you'll just have to attach it to REBOUNDx on your machines to run "test.py" (i.e., add to "core.c" and "core.h" in REBOUNDx, etc.). (BTW, the point of this revised modify_orbits_forces is to implement a Chandrasekhar-like dynamical friction based on the local gas properties.)

Thanks so much for your help!

nchoksi4456 commented 3 years ago

Hi Hanno and Dan. FYI the "test.py" script I uploaded ~40 minutes ago used the wrong random seed. It should be np.random.seed(0) to get the same initial conditions as I've been using. I updated the file at the link with the correct random seed, but just in case you already downloaded the script already you'll want to make this change. Sorry for the confusion!

hannorein commented 3 years ago

I did a quick test without REBOUNDx. That leads to reproducible results for me, independent on max_time. @dtamayo, can you think of anything that REBOUNDx does to make it non-reproducible?

dtamayo commented 3 years ago

Oops, thanks @hannorein ! I don't see anything obvious in your code Nick. I made a similar test but using the actual modify_orbits_forces routine, and got reproducible results independent of max time, so I don't think it's something intrinsic to REBOUNDx.

My advice would be to start from a similar test case --changing what you have to using a simple toy case with "modify_orbits_forces" with a real 'tau_a', and once you verify that that test case gives you reproducible results independent of tmax, then switch in your new function one step at a time, e.g., switch to your function, but at first comment out everything in the function so that it isn't doing anything, then add in blocks of your function one at a time to isolate which one is causing the problem.

I'm happy to help more with what you find!

hannorein commented 3 years ago

Note that you don't need to run it for very long or need to do any plots. Just compare any coordinate of any particle at any given time. The coordinates should be identical to the last digit. As soon as you see a difference in the last digit, you can stop (the differences will grow exponentially from thereon, as Dan explained).

nchoksi4456 commented 3 years ago

Thanks Hanno and Dan, I'll try out these suggestions. Could I ask one more question right now, which might or might not be related? To record output, I was using automateSimulationArchive with a fixed interval = 100 yr. And yet, I noticed that in the 1.5e4 yr run that the output intervals of the simulationArchive change after one of the planets collides with the star. The plot below shows the # of planets vs. time, and you can see that the outputs become more sparse after the collision (before the collision all the outputs are spaced at 100 yr intervals as they should be). Do you know why this happens?

hannorein commented 3 years ago

Is the semi-major axis so large that the timestep is larger than 100yr?

nchoksi4456 commented 3 years ago

Thanks Hanno, that's probably it... a(t) is wild.

hannorein commented 3 years ago

Keep us posted! I'm curious to find out what the issue is. Just to make sure: are you using the latest version of REBOUND? Did you make any modifications?

nchoksi4456 commented 3 years ago

Will do! And yes, I'm using the latest version of REBOUND, but haven't made any changes to it.

nchoksi4456 commented 3 years ago

Here's another diagnostic test I tried which continues to add to my confusion. If I just replace

Noutputs = int(( max_time/dt)) # I've been using dt = 1 yr as a default
times = np.linspace(0,  max_time, Noutputs)
for i,tv in enumerate(times): 
      sim.integrate(tv, exact_finish_time = 0)

with

times = np.arange(0,  max_time, step = 1*yr)
for i,tv in enumerate(times): 
      sim.integrate(tv, exact_finish_time = 0)

The results don't depend on integration time...

hannorein commented 3 years ago

Hm. Strange. Just to make sure: Are the timesteps always shorter than the output interval? If not, you might end up giving sim.integrate() a time in the past (and it will integrate backwards). Maybe add a check in your loop to make sure?

if tv<sim.t:
    raise RuntimeError("requested time is in the past")

danielk333 commented 3 years ago

Hm, i got curious since I always use np.arange to ensure predicable time-steps without any real reason (just something in the back of mind that tells me i should) so i wrote the following short test and this is the result:

import numpy as np

yr = 3600.0*24*365.25
dt = 1*yr

def time_diff(max_time1, max_time2):

    times_ln1 = np.linspace(0, max_time1, num=int(max_time1/dt), dtype=np.float64)
    times_ln2 = np.linspace(0, max_time2, num=int(max_time2/dt), dtype=np.float64)
    times_ar1 = np.arange(0,  max_time1, step = dt, dtype=np.float64)
    times_ar2 = np.arange(0,  max_time2, step = dt, dtype=np.float64)

    tdf1 = times_ln1 - times_ln2[:len(times_ln1)]
    tdf2 = times_ar1 - times_ar2[:len(times_ar1)]
    return tdf1, tdf2

tdf1, tdf2 = time_diff(1.5e4*yr, 3e4*yr)
print(tdf1)
print(tdf2)

With output (on numpy==1.20.0)

[0.00000000e+00 1.05202520e+03 2.10405040e+03 ... 1.57772219e+07
 1.57782740e+07 1.57793260e+07]
[0. 0. 0. ... 0. 0. 0.]

nchoksi4456 commented 3 years ago

Hanno, you were right that the output timestep was less than the simulation timestep -- your suggested code caught that, thanks. But (somewhat surprisingly) this doesn't seem to fix the issue of results depending on max_time. I modified your code as follows

for i,tv in enumerate(times):
    if tv<sim.t:
        print("Requested time is in the past")
        continue
    sim.integrate(tv, exact_finish_time = 0)

The results remain the same even with the if statement included (that is, I still get different evolutions depending on max_time). I think this is actually expected, since exact_finish_time = 0. See the plots below (edit: y-axis on plots is eccentricity):

Daniel -- thanks for your post. I agree that tdf1 should != 0, whereas tdf2 should == 0 always, makes sense. But I think this doesn't explain the issue I'm finding either (since exact_finish_time = 0).

hannorein commented 3 years ago

Maybe this is obvious: The if statement doesn't solve the issue of integrating backwards. It just reports it.

Are you saying in the new tests, the if statement never triggers but you still get different results?

nchoksi4456 commented 3 years ago

I modified your if statement (see the code snippet in my last post). The new if statement avoids integrating backwards by continuing to the next iteration of the loop (until sim.t < requested time) instead of raising an exception. Edit: In the new tests, the if statement is triggered for the run with max_time = 1.5e4 yr, but not for max_time = 3e4 yr. But I still get different results.

hannorein commented 3 years ago

Oops. I missed that. Sorry!

hannorein commented 3 years ago

I still have a feeling this is somehow related to the issue. What happens if you run it with max_time = 6e4 yr?

nchoksi4456 commented 3 years ago

For 6e4 yr, the evolution matches the evolution for 1.5e4 yr. So to summarize: 1.5e4 yr and 6e4 yr match, and in both cases the if statement gets invoked. 3e4 yr has a different evolution and the if statement never gets invoked. Weird...

hannorein commented 3 years ago

Weird indeed. I'll put my thinking cap on.

nchoksi4456 commented 3 years ago

Below is a plot which is pretty striking. First a little description: I've basically been finding two different evolutions -- call them evolutions A and B (corresponding to max_time = 1.5e4 yr and = 3e4 yr, respectively; see the plots above). In both evolution A and B there is a collision, but under totally different circumstances (in A one planet collides with the star at time ~10,200 yr, in B the two planets collide with each other at 6500 yr).

In the plot below, I'm just showing the collision time vs. max_time. You can see that every run I tried gives t_coll ~ 10,200 yr (corresponding to "evolution A"), except for the case where max_time = 3e4 yr, in which case t_coll = 6500 yr (evolution B).

So, it seems like there is something "special" about 3e4 yr!

hannorein commented 3 years ago

I've integrated two simulations for some ~7200 yrs, one with max_time=3e4yr_cgs and 1.5e4yr_cgs. Both give exactly the same answer down to the last digit. According to your plot, one should have had a collisions.

Can you triple check that you have the latest version of REBOUND and REBOUND (the ones on github's main branch). For reference, here is the code I've used:

yr_cgs = 3.154e7
G_cgs=  6.67e-8
Msun_cgs = 1.988e33    
Rjup_cgs=  6.99e9  
Mjup_cgs = 1.898e30
AU_cgs = 1.496E13 

dt_output = 1*yr_cgs
np.random.seed(0)

sim = rebound.Simulation()
sim.integrator = "ias15"; sim.G = G_cgs; sim.collision = 'direct'; sim.collision_resolve = 'merge'

rebx = reboundx.Extras(sim)
mof = rebx.load_force("df2_full")
rebx.add_force(mof)

sim.add(m=Msun_cgs)

a0 = 1.58*AU_cgs
m1 = 1*Mjup_cgs; 
e0 = 0.9; i0 = 0.0
f0 = np.random.uniform(-np.pi, np.pi)
sim.add(m=m1, r = Rjup_cgs, a=a0,e=e0, inc=i0, Omega = 0.0, omega=0, f=f0)
torb = 2*np.pi*np.sqrt(a0**3/(G_cgs*Msun_cgs))

a2 = 6*AU_cgs 
m2 = 1*Mjup_cgs; 
e2 = 0.9; i2 = 0.0
f2 = np.random.uniform(-np.pi, np.pi) 
sim.add(m=m2, r = Rjup_cgs, a=a2,e=e2, inc=i2, Omega = 0, omega=np.pi, f=f2)

particles = sim.particles
particles[1].params["tau_a"] = 3e-11 # Kludgey way to set the gas density in my modified modify_orbits_forces (didn't know how to add another param for rho_gas)
particles[2].params['tau_a'] = 3e-11

min_time = 0
max_time = 1.5e4*yr_cgs
#max_time = 3e4*yr_cgs
Noutputs = int(( (max_time - min_time)/dt_output)) # Output every 0.1 yr
times = np.linspace(min_time,  max_time, Noutputs)

sim.automateSimulationArchive("test1.bin", interval=100*dt_output,deletefile=True)
sim.move_to_com()
for i,tv in enumerate(times):
    sim.integrate(tv, exact_finish_time = 0)
    if sim.t>7000*yr_cgs:
        break
sim.integrate(7200*yr_cgs, exact_finish_time = 0)
print(sim.particles[1].x)

nchoksi4456 commented 3 years ago

Thanks Hanno. I played around with different versions of REBOUND, both on my laptop and desktop (all my runs the past few days have been on the desktop) and found some interesting things:

(1)On the desktop: I uninstalled REBOUND and re-installed the latest version (3.14.0). When I did this, I still got discrepant results. (2)Also on the desktop: I installed version 3.12.3 of REBOUND (which is what I happened to have on my laptop, so I just copied that over). With version 3.12.3, I got identical results independent of max_time, up to machine precision. Hooray!

(3)On my laptop, either version of REBOUND above gives identical results regardless of max_time.

I'm not sure how to explain this behavior, but I guess I'll take it and just keep using version 3.12.3 of REBOUND on the desktop, since that seems to work...

hannorein commented 3 years ago

It can sometimes be tricky to tell which version of rebound and reboundx python is using. Especially if you've compiled reboundX from scratch. The shared library files get installed in different directories depending on how you install it (pip versus make. It could be that you're rebound and reboundx libraries don't match. There are three variables in the rebound module (similar in reboundx) to help you identify that you're using the correct version of the shared library:

rebound.__version__
rebound.__build__
rebound.__githash__

nchoksi4456 commented 3 years ago

Here's the setup that worked on my desktop:

>>> rebound.__version__
'3.12.3'
>>> rebound.__build__
'Feb  3 2021 02:03:53'
>>> rebound.__githash__
'faedbed16c82f710f2550ad1fc0ebcfce1b5de6f'
>>> reboundx.__version__
'3.1.0'
>>> reboundx.__build__
'Feb  3 2021 02:04:08'
>>> reboundx.__githash__
'bd321ad9b705631863df214d749a04ecbb44670e'

and the setup that didn't work (discrepant results):

>>> import rebound; import reboundx;
>>> rebound.__version__
'3.14.0'
>>> rebound.__build__
'Feb  3 2021 10:48:32'
>>> rebound.__githash__
'd7902091b19d807cadea5890d27ba5cc9d856633'
>>> reboundx.__version__
'3.2.0'
>>> reboundx.__build__
'Feb  3 2021 10:48:51'
>>> reboundx.__githash__
'8c5299d810408b94d7a65bc441d3e10e5118dda2'
>>>

hannorein commented 3 years ago

I'm at a loss. And the only thing you've changed in REBOUNDx was to copy and paste the code you've linked above into REBOUNDx's core.c file? I assume you also added something like this:

else if (strcmp(name, "df2_full") == 0){
        force->update_accelerations = rebx_df2_full;
        force->force_type = REBX_FORCE_VEL;
}

(the REBX_FORCE_VEL part is important)

nchoksi4456 commented 3 years ago

yup, that's right...

nchoksi4456 commented 3 years ago

Out of curiosity, I added a print statement: printf(dt_done) into the ias15_integrator.c script to inspect what was going on under the hood and confirmed that max_time = 1.5e4 yr and = 3e4 yr results in IAS15 using a different timestep. Interestingly, the timesteps are identical (down to the last digit) up until t = 6538 yrs, after which they diverge.

dta  = np.loadtxt('nohupA.out', unpack = True , usecols = [0]) # File with every timestep from the 1.5e4 yr run
ta = np.cumsum(dta)

dtb  = np.loadtxt('nohupB.out', unpack = True , usecols = [0]) # Same for 3e4 yr run
tb = np.cumsum(dtb)

dt_ratio = dta[0:len(dtb)]/dtb
print(np.amin(tb[dt_ratio != 1.0]/cgs.yr))
[6538.22308344]

hannorein commented 3 years ago

Ok. Great. So something is happening around that time. I would normally just do some sort of bisection debugging now, trying to get closer and closer to the issue. But I can't do that if I can't reproduce the issue myself. So I'm not sure what to suggest.

Side note: I see that you're using the pow() function in your code. That is not machine independent and might change from compiler to compiler or operating system to operating system. This doesn't explain why the results depend on max_time, but it might explain why I can't reproduce it.

nchoksi4456 commented 3 years ago

Hi Hanno, here's a summary of what I understand so far, and some new tests I tried:

(1) without REBOUNDx, timesteps (=dt_done in IAS15 C file) are independent of max_time (2) with REBOUNDx and Dan's modify_orbits_forces, timesteps are independent of max_time (3) with REBOUNDx and my modified script df2_full.c, timesteps depend on max_time on my desktop but not on my laptop. I re-installed python/numpy from scratch (different versions, too) and that didn't make a difference.

The desktop is a Linux machine, my laptop is Mac. I wonder if the problem is associated with that. The changes I made to modify_orbits_forces in my df2_full script were pretty light, so I am still at a loss for what could cause this (plus, as I mentioned, things work fine on my Mac).

Good to know about pow() !

hannorein commented 3 years ago

I've finally figured this one out. 🎉

In short:

Before doing any timesteps, the integrate() function sets dt_last_done to zero. This has the effect that if the very first timestep in integrate() fails, it won't use any old values to predict the new positions but starts from scratch.
In your simulation, at some point an integration step failed. This happens very rarely. If it does happen, the code is redoing the step with a slightly smaller timestep. Now if the step that fails is the first in the integrate() call, then it won't use any previous information. but if the step is somewhere in the middle of the integrate() call, then it will. That's why the maximum integration time mattered in your case.
This is a tiny difference at the level of floating point precision and does not affect the precision. But because the system is chaotic and any small change grows exponentially, it break reproducibility between the two different simulations.
It's a very rare occurrence. That's why absolutely everything had to match to reproduce it.
All the differences you observed between operating systems is probably due to slightly different implementations of the pow() function. Again, we needed to recreate the exact case where a step fails in the first call to integrate() to reproduce this.

I'm not sure if this qualifies as a bug. But I guess it would be great to avoid this issue in the future. One could just comment out this line. I need to think a bit about any possible negative side-effects.

nchoksi4456 commented 3 years ago

Awesome! Thanks so much Hanno. Out of curiosity, what causes an integration step to fail?

hannorein commented 3 years ago

If the forces change more rapidly than predicted, for example due to a close encounter or some rapidly changing external forces.

Thanks for helping me to debug this. I was getting worried this could be something more serious. I'm glad it was just a very specific edge case.

hannorein commented 3 years ago

Mainly a mental note for myself: I'm still thinking about whether this is something that should/needs to be fixed. One possible solution is to move the prediction of new e/b coefficients from the end of each timestep to the beginning of each step. This way the integrator does not need to know the step size of the next step, only the size of the previous step. But the number of logic lines which would need to change in the code is too much for my comfort. From past experience, there are so many rare edge-cases to consider (just as this one)...

hannorein / rebound

results depend on integration time #485