matthewholman / assist

ASSIST is a software package for ephemeris-quality integrations of test particles.
https://assist.readthedocs.io/
GNU General Public License v3.0

Roundtrip unit test #43

Closed · hannorein closed this 1 year ago

hannorein commented 1 year ago

While coding up the roundtrip test from the paper as a unit test, I've noticed something that doesn't seem quite right. The error depends on the integration range in a very non-smooth way (have a look at the output below; a minimal sketch of the test setup follows it). The plot in the notebook and paper doesn't have that many data points, so I'm not sure if this is a new problem. In any case, I don't think this is correct as of right now. I suspect it is related to how the reb_integrate function handles the last timestep, which, depending on where the timesteps fall, might have to be much smaller than a "normal" timestep (but it could also be something completely different).

trange = 41161.355954   error: 0.139017m
trange = 43219.423752   error: 2897.963781m
trange = 45380.394939   error: 0.061683m
trange = 47649.414686   error: 0.146603m
trange = 50031.885420   error: 0.314383m
trange = 52533.479691   error: 0.077800m
trange = 55160.153676   error: 0.495380m
trange = 57918.161360   error: 0.321240m
trange = 60814.069428   error: 4068.455833m
trange = 63854.772899   error: 0.308811m
trange = 67047.511544   error: 7.797359m
trange = 70399.887121   error: 0.589816m
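
For reference, this is roughly what the roundtrip test does. The following is only a minimal sketch assuming ASSIST's C API: the ephemeris paths, epoch, and initial conditions are placeholders, and exact function names may differ between versions.

```c
#include <stdio.h>
#include <math.h>
#include "rebound.h"
#include "assist.h"

int main(void){
    // Load planet and small-body ephemerides (placeholder paths).
    struct assist_ephem* ephem = assist_ephem_create(
        "data/linux_p1550p2650.440",
        "data/sb441-n16.bsp");
    struct reb_simulation* sim = reb_create_simulation();
    struct assist_extras* ax = assist_attach(sim, ephem);
    sim->t = 7304.5;                  // example epoch, relative to the ephemeris reference time

    // One test particle with placeholder barycentric state (au, au/day).
    struct reb_particle p = {0};
    p.x = 3.33;    p.y = -0.93;   p.z = -0.10;
    p.vx = 0.0028; p.vy = 0.0099; p.vz = 0.0038;
    reb_add(sim, p);

    double t0 = sim->t;
    double trange = 41161.355954;     // one of the ranges from the output above

    reb_integrate(sim, t0 + trange);  // out ...
    reb_integrate(sim, t0);           // ... and back

    // Roundtrip error: distance from the stored initial position, in meters.
    double dx = sim->particles[0].x - p.x;
    double dy = sim->particles[0].y - p.y;
    double dz = sim->particles[0].z - p.z;
    printf("trange = %f   error: %fm\n", trange, sqrt(dx*dx+dy*dy+dz*dz)*1.495978707e11);

    assist_free(ax);
    assist_ephem_free(ephem);
    reb_free_simulation(sim);
    return 0;
}
```
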
dmhernan commented 1 year ago

Is the data above an extension of Fig. 2 in the paper? Those error jumps are by 1e4. It would be interesting to see all of that plotted. There are some jumps in error in the JPL data by two orders of magnitude, see below. I was thinking about this issue recently. Something useful could be to calculate the Lyapunov time of each trajectory.

158489.3 d: 1.213 0.803081617005794 km
199526.2 d: 1.570 0.330985838744210 km
251188.6 d: 1.857 1.23937541852833 km
316227.8 d: 2.150 144.538201415714 km
398107.2 d: 2.973 1.12990333629259 km
501187.2 d: 3.728 1.25118876023040 km
630957.3 d: 4.521 98.0357605187952 km
794328.2 d: 6.196 6.37728083404834 km

hannorein commented 1 year ago

I don't think it's related to whether the trajectory is chaotic or not. These are all the same trajectories, just some of them integrated a little longer.

I've done some more digging. I only see these spikes when using ASSIST. When I run an equivalent REBOUND simulation (9 planets and the test particle), it looks as expected. I've tried turning off individual force routines in ASSIST, but multiple combinations seem to show this issue, so it's not a bug in one of the force routines (but it could be a bug in ephem_all, which is used by all routines).

[image]

matthewholman commented 1 year ago

Just to confirm, you did turn off ein_GR, right? I ask because, in addition to being very long (and thus a likelier source of bugs), in that routine we use the accelerations of the bodies as computed from the Chebyshev polynomials.

That said, we have a number of confirmations of the long term behavior of the code. I wonder if this might be a problem just with the output (or with the last step, as you suggested).

hannorein commented 1 year ago

Yes, I've simplified the test so that it now just includes the force from the Sun and I still see this. If I add the test particle to a normal REBOUND simulation with the Sun, I don't see this.

Given that we're looking at really small numbers here, do you think this could just be the finite precision of the ephemeris? What is the position error at any given time coming from the interpolation (this test doesn't care about the true solar system)?

matthewholman commented 1 year ago

I think the ephemeris is good to a part in 10^12, but I will check on that. Rob, @cylon359, do you happen to recall?

hannorein commented 1 year ago

Hm. This is now just one test particle in a time-varying potential of another particle (the Sun). Even if the Sun's motion is completely unphysical, we should be able to integrate out and back as long as the Sun doesn't make any jumps. Right? Does the ephemeris have jumps at the interfaces of the polynomials, or is it smooth to some degree?

I need to think more about it. It could just be a bug somewhere...

cylon359 commented 1 year ago

> I think the ephemeris is good to a part in 10^12, but I will check on that. Rob, @cylon359, do you happen to recall?

Yes, it should be good to one part in 10^11 or 10^12.

As for the jumps - it is possible there is a boundary issue in choosing which specific Chebyshev coefficients to use. If you can find a precise timestamp when the jump happens, we can check that.

matthewholman commented 1 year ago

That's a good point, Rob. The polynomials quickly blow up outside of their valid range.
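
To put a number on that (this is just the Chebyshev identity T_n(x) = cosh(n*arccosh(x)) for x >= 1, not ASSIST code, and the degree is only an example):

```c
#include <stdio.h>
#include <math.h>

int main(void){
    // On its valid interval [-1,1], |T_n(x)| <= 1. Just outside it,
    // T_n(x) = cosh(n*acosh(x)) for x >= 1, so any overshoot of the segment
    // boundary is amplified rapidly with the polynomial degree n.
    int n = 13;                                   // example degree, not the actual ephemeris degree
    double xs[] = {1.0, 1.001, 1.01, 1.05, 1.1};  // 0%, 0.1%, 1%, 5%, 10% past the boundary
    for (int i = 0; i < 5; i++){
        printf("T_%d(%.3f) = %g\n", n, xs[i], cosh(n*acosh(xs[i])));
    }
    return 0;
}
```
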

hannorein commented 1 year ago

Thanks. I think that could be what's going on. If we integrate backwards, the timesteps will be almost the same as in the forward integration, but not exactly the same (because they are adaptive). If we're unlucky, we'll hit a boundary. That would also explain why the effect is more pronounced for longer integrations. In practice this might not matter much, but this test would be good at picking it up. I'll see if I can find a specific boundary where this happens...

hannorein commented 1 year ago

Two more arguments for why this might be what's going on:

dmhernan commented 1 year ago

@hannorein You could check if fixing the timestep resolves the issue.

dmhernan commented 1 year ago

oops! :)

dmhernan commented 1 year ago

@hannorein The plot you produced also suggests to me a bias may be present in ASSIST at large times. I attach a similar plot from Davide (not sure if on Github) in which the JPL method may also be transitioning to some bias at large times. Obviously, this is not necessarily even a problem, just a feature.

[image: jpl_num_err]

hannorein commented 1 year ago

I think this can safely be merged in now.

matthewholman commented 1 year ago

Sorry for the slow response. I believe I have been using a variable time step for all the recent tests, although there might have been an older test with fixed time steps in examples/simplest/problem.c.

hannorein commented 1 year ago

Yes, you used a variable timestep in ias15, but you also used a fixed interval in the wrapper assist_integrate function. The interval was comparable to the adaptively chosen ias15 timestep. This effectively synchronizes the timesteps going forward and backward: the out and back timesteps never drift apart.

matthewholman commented 1 year ago

Oh, man! Thank you for catching that. I need to redo the corresponding figures.

hannorein commented 1 year ago

I dug a bit deeper to see if I can find any discontinuities near the polynomial boundaries. I can see a small jump if I sample the coordinates both just before and just after a boundary. But it's really small, close to floating point precision. I'm not sure how significant it is for the integration.

During the process, I noticed another issue. The following is a bit long. You don't need to read it or respond; writing it down just helps me.

When integrating, REBOUND needs to somehow detect whether the user wants to integrate forward or backward. As a first hint, it uses the sign of the timestep. So ideally, one should also switch the sign of the timestep when switching directions. If that doesn't happen (like in the original unit test), REBOUND eventually figures it out and changes the integration direction, but the first timestep after the switch will be very large. Usually that's not a problem because the large timestep will just be rejected in an adaptive scheme. It matters in this case because we're interested in such high precision and the predictor-corrector loop retains some memory of the too-large timestep (because it's trying to predict values for the smaller timestep). There are two solutions:

a) Simply change the sign of the timestep when integrating backwards, thus telling REBOUND explicitly which direction we want to integrate (see the sketch below).

b) Somehow fix the logic that automatically determines the direction in which REBOUND integrates. That sounds easy, but it's really tricky because:

1. The timestep can be fixed or adaptive.
2. REBOUND needs to reduce the final timestep so that the final time matches exactly the requested time (if exact_finish_time=1).
3. The final timestep might not actually be final because the step might get rejected. Then it needs to reduce the timestep once again. So what used to be one final step might become many final steps.
4. After the final timestep(s) are done, it needs to reset the timestep to the previous value so that a simulation can continue with the normal, not reduced, timestep.
5. REBOUND runs into floating point issues when the time variable is large compared to the timestep. That easily happens if one integrates for billions of timesteps. There is all kinds of extra logic to prevent things from blowing up in that case.
6. All of this needs to be implemented so that it works when integrating backward or forward. Again, that sounds easy because all that changes is a sign, but every < and > comparison needs to be able to handle it.
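
A minimal sketch of solution (a), continuing the setup from the first comment and assuming REBOUND's C API (the trange value is just an example):

```c
double t0 = sim->t;
double trange = 41161.355954;       // example range from the output at the top
reb_integrate(sim, t0 + trange);    // outward leg, sim->dt > 0
sim->dt = -fabs(sim->dt);           // flip the sign: tell REBOUND explicitly to step backwards
reb_integrate(sim, t0);             // backward leg
```
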

I'll try to improve the current way this is implemented in REBOUND, but it's hard to foresee all the possible consequences. In the meantime, solution (a) significantly reduces the number of spikes we see in this test. It's not quite as good as using a fixed timestep, which I think makes sense: we'd expect a larger error when adaptive timesteps are turned on because the particle sees the planets at slightly different times on the way back, leading to slightly different errors than on the way out.

FWIW, similar to the argument before, @matthewholman, you did not encounter this issue because you didn't change the direction of integration; you started a new integration when calling assist_integrate the second time.

tl;dr: this is a complex test! It's good to find these kinds of issues, but I also don't think any of this matters in the 99% of applications where one just integrates in one direction.

dmhernan commented 1 year ago

@hannorein When you reverse the timestep sign, is the magnitude of spikes still ~4 orders magnitude?

As to how common these spikes are, and how problematic they are in the forward direction, I assume their rate of occurrence scales with the timestep.

hannorein commented 1 year ago

@dmhernan This first plot shows what reversing the timestep sign does (old = not reversing the timestep sign, new = reversing the timestep sign): [image]

This second plot shows the issue that still remains. And yes, the spikes are still large. For the run labeled "adaptive perturbed", I've perturbed the initial position by 1e-6 to see how sensitive the integration is. As you can see, the spikes first appear at the same time, but then appear randomly. I can't think of anything other than the ephemeris being the origin of this. But I'd be happy to hear other ideas!

[image]

hannorein commented 1 year ago

@dmhernan Results from one more experiment, if you want to think about it further. For the green curve, I'm integrating out with one fixed timestep, then changing it by ~10% and integrating back (choosing the timestep so that the start and end times match perfectly). No more spikes. šŸ¤·ā€ā™‚ļø [image]
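
For reference, a backward step like that can be chosen roughly as follows. This is only a sketch with placeholder numbers, assuming fixed (non-adaptive) timesteps on both legs, not necessarily the exact values used here:

```c
#include <stdio.h>
#include <math.h>

int main(void){
    // Pick a backward step ~10% longer than the outward one, rounded so that an
    // integer number of steps lands exactly back at the start time.
    double dt_out  = 20.0;                                  // placeholder fixed step (days)
    double trange  = 41161.355954;                          // example range from the output above
    long   n_back  = (long) ceil(trange / (1.1 * dt_out));  // number of backward steps
    double dt_back = trange / (double) n_back;              // ~1.1*dt_out, divides trange exactly
    printf("dt_back = %.6f days (%ld steps)\n", dt_back, n_back);
    return 0;
}
```
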

hannorein commented 1 year ago

Alright, so maybe this is just IAS15 not converging to machine precision in those cases. If I slightly reduce epsilon to 1e-10, the issue also goes away.
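
In code (assuming REBOUND's IAS15 settings; the default is epsilon = 1e-9, and setting it to 0 switches to a fixed timestep):

```c
sim->ri_ias15.epsilon = 1e-10;    // stricter than the default 1e-9 -> smaller adaptive timesteps
// sim->ri_ias15.epsilon = 0.0;   // alternative: disable adaptive timesteps and use sim->dt as a fixed step
```
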

[image]

hannorein commented 1 year ago

(Sorry for the flood of posts, don't feel obliged to follow)

Indeed, that seems to be what's going on. Here is the same data, but also plotting the maximum adaptive timestep that occurred at some point during the integration. The spikes start exactly when the timestep increases. I'll need to look into why IAS15 thinks the timestep should be larger after some random time. No idea right now...

[image]

dmhernan commented 1 year ago

What a turn of events! If the timestep guess is off, that seems to imply the error estimates are off. And the timestep problem persists even if all effects are removed except Newtonian gravity? Because if so, then the only difference from a usual simulation is the ephemeris. Could ephemeris error cause the error estimates from the RK method to be off, which then affects the timestep?

hannorein commented 1 year ago

Yes, the problem persists with only the direct forces. I don't see the issue if I run the same simulation but use real N-body particles to calculate the forces instead of the ephemeris data.

hannorein commented 1 year ago

Another thing that still doesn't make sense is why these are spikes. If this is due to the timestep being too large at some point, it should occur both in the out and back portions of the integration (50/50). But if this occurs on the outgoing path just once, then it should occur for all tranges larger than some critical value, because all the outgoing integrations are identical. The plot should show the error jumping up after some critical trange and staying high, because the error is cumulative. Instead we see spikes.

matthewholman commented 1 year ago

Sorry if you already mentioned this and I missed it, but do you see the same thing when the caching is turned off?

hannorein commented 1 year ago

Yes. I've tried it with caching off.

dmhernan commented 1 year ago

No ideas now. Of course, one could integrate backwards first and then forwards to check if the spike is associated with a problem in one of the time directions.

hannorein commented 1 year ago

We've just tried that. Looks the same.

hannorein commented 1 year ago

Progress: I've convinced myself that we should expect to see spikes rather than a plateau if IAS15, in very rare cases, chooses a timestep that is slightly too large. Here's why:

Say the chance of guessing the timestep wrong once during a 1000-day integration is 1%. Then we will most likely not see an error in an outward integration up to 1000 days, because it's only a 1% chance. However, if we run 1000 different simulations (because we sample different tranges), we have 1000 different backward integrations. So in total we should expect ~10 spikes. These appear as spikes in the plot rather than a plateau because the error only occurs in the backward trajectory (we effectively have 1000 distinct backward integrations but only one outward trajectory, since all the outward legs are identical). If trange is much larger, say 100*1000 days, then we have a ~100% chance of getting the timestep wrong in the outward integration. Then we should see a plateau. That would explain why we sometimes see a plateau in rare cases. In most cases the plateau is buried in noise, because the noise increases with trange whereas the height of the plateau stays the same.

It would also explain why this is not an issue in a normal integration that just goes in one direction. The error is so rare and so small that it is buried in the (always increasing) numerical noise.
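
As a quick back-of-the-envelope check of this argument (assuming an independent ~1% chance of one bad timestep guess per 1000-day leg, as above):

```c
#include <stdio.h>
#include <math.h>

int main(void){
    double p = 0.01;          // assumed chance of one bad timestep guess per 1000-day leg
    int n_samples = 1000;     // number of different tranges, i.e. distinct backward legs

    // Each backward leg fails independently with probability ~p, so across all
    // sampled tranges we expect about n_samples * p spikes.
    printf("expected spikes from backward legs: %.0f\n", n_samples * p);

    // A single outward leg of N x 1000 days produces a plateau with probability
    // 1 - (1-p)^N, which grows toward 1 for long integrations.
    for (int N = 1; N <= 100; N *= 10){
        printf("outward leg of %3d x 1000 days: P(plateau) = %.2f\n",
               N, 1.0 - pow(1.0 - p, (double) N));
    }
    return 0;
}
```
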

I think this explains everything. But I thought that before, so take it with a grain of salt.

matthewholman commented 1 year ago

I am very impressed with your detective work, @hannorein! This does seem to explain most/all of the observed features of the issue. What do you think sets the ~1% scale for the rate of timestep miscalculations?

hannorein commented 1 year ago

I'm not sure. Clearly, having just one particle is a bit of an unusual setup that I've never tested before. Normally, if there are multiple particles (at least two), the chance for this to occur would be much smaller, ~0.01^N I guess.

It looks like the somewhat-too-long timestep follows a somewhat short timestep. That would make sense, because it will be very hard to predict the timestep accurately if the previous one was on the small side (less curvature to observe). In IAS15, there is a safety_factor variable that determines how fast the timestep can grow/shrink from step to step. Currently it is set to 0.25 (increase/decrease by up to a factor of 4). It might make sense to set this closer to 0.5 or 0.75 for these high-accuracy runs.

dmhernan commented 1 year ago

@hannorein Alternatively... how about also perturbing the initial conditions many times? If your explanation is correct, eventually one of those initial conditions will give a spiked forward-backward integration.

dmhernan commented 1 year ago

Or possibly changing the guess for the initial time step randomly? Not that these things need to be tested!

dmhernan commented 1 year ago

But @hannorein, does it make sense why the normal integrations that also evolve the planets never have spikes?

hannorein commented 1 year ago

"Normal" integrations have more than 1 particle, so the probability of getting the timestep slightly wrong is much lower. If the spikes are rare then they disappear into the floating point noise. That was kind of the idea of choosing the specific combination of epsilon and safety_factor. I'm sure we could find other cases where this becomes an issue but I think it works well in almost all cases. For this specific case, we just need to change epsilon or safety_factor a little bit.

Further evidence along those lines: I've just run a test with two test particles, and the issue is almost completely gone. With three test particles, it is completely gone and the timestep remains essentially constant.

dmhernan commented 1 year ago

@hannorein Ah, thanks. So your two test particles must be interacting with each other. Conceptually, I understand that with more particles the timestep changes would be smoother. Perhaps with the RK method one should be able to estimate the new time step, as IAS15 does, but it may also be possible to estimate a reasonable time step derivative by getting a timescale out of the positions, velocities, and accelerations. This could be used to make sure the time step jumps are reasonable.

dmhernan commented 1 year ago

Hi, I'm curious if there was a resolution to the large time step jumps, or is this left for future work? For an analytic estimate of a time step jump, reasonable functional forms are in eq. (5) and (6) in Boekholt et al. (2022): https://arxiv.org/pdf/2212.09745.pdf

aryaakmal commented 1 year ago

I think this is addressed by @hannorein in integrate_or_interpolate?

dmhernan commented 1 year ago

I looked into the function assist_integrate_or_interpolate(), and it seems it's used to produce output at specified times. I thought the problem described in this thread occurs even if we don't need outputs?

hannorein commented 1 year ago

I'm sorry, I don't understand what the issue is.

dmhernan commented 1 year ago

@hannorein, in the last plot of this thread, you showed the time step having large jumps during forward-backward integrations. I'm asking if something was changed to stop this, or is the current plan to ignore such error/time step jumps since they're rare?

hannorein commented 1 year ago

Let me know if I'm missing something, but I can't think of a use case where the results would be affected by the current timestep choice. We're talking about precisions of the order of a meter! But if you want a smaller timestep for whatever reason, you can just change the epsilon value, use a maximum timestep, or a fixed timestep.

dmhernan commented 1 year ago

I see, yes, 1 m precision would be irrelevant for asteroid Holman. But how about such jumps in error (by orders of magnitude) affecting a measurement of chaos, for instance? I guess a user could investigate further and quickly determine that the chaos is not real...

hannorein commented 1 year ago

Again, I fail to see how this could possibly affect anything. The Lyapunov timescale is probably millions of years. We're integrating for hundreds of days with ASSIST.

dmhernan commented 1 year ago

One would need to be testing the limits of the code, like the long integrations in the paper (1e5 days), and consider asteroids with kyr Lyapunov times, like massive asteroids, for those scales to become comparable, if that's what you're getting at. Anyway, if the plan is to ignore the time step jumps, that seems reasonable.

hannorein commented 1 year ago

Sigh. Nothing is getting ignored. I just don't believe there is any issue to begin with. If you think otherwise, please provide some evidence.

dmhernan commented 1 year ago

Hey @hannorein, the only thing I have is that if a user is integrating two nearby asteroids and the difference in their positions jumps by four orders of magnitude due to a time step jump, they may be confused as to where the divergence is coming from. If it's an issue that has confused me (for instance, in Fig. 2 in the paper, perhaps a time step jump is responsible for the penultimate point) and confused us in this thread, it could perhaps confuse other researchers.

aryaakmal commented 1 year ago

@dmhernan - Do you mean figure 6?
