Is the data above an extension of Fig. 2 in the paper? Those error jumps are by a factor of 1e4. It would be interesting to see all of that plotted. There are some jumps in error in the JPL data by two orders of magnitude, see below. I was thinking about this issue recently. Something useful could be to calculate the Lyapunov time of each trajectory.
158489.3 d: 1.213 0.803081617005794 km
199526.2 d: 1.570 0.330985838744210 km
251188.6 d: 1.857 1.23937541852833 km
316227.8 d: 2.150 144.538201415714 km
398107.2 d: 2.973 1.12990333629259 km
501187.2 d: 3.728 1.25118876023040 km
630957.3 d: 4.521 98.0357605187952 km
794328.2 d: 6.196 6.37728083404834 km
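If someone wants to follow up on the Lyapunov-time suggestion, here is a minimal sketch using plain REBOUND's MEGNO machinery (assuming the REBOUND 3.x C API with `reb_tools_megno_init` and `reb_tools_calculate_lyapunov`; this uses a toy central mass rather than the ASSIST ephemeris, and whether variational particles play nicely with ASSIST's extra forces would need to be checked first):

```c
#include "rebound.h"

// Rough Lyapunov-time estimate for a single test particle (sketch only).
double estimate_lyapunov_time(struct reb_particle p, double tmax){
    struct reb_simulation* sim = reb_create_simulation();
    sim->integrator = REB_INTEGRATOR_IAS15;
    struct reb_particle sun = {0};
    sun.m = 1.0;                        // toy central mass in code units (G=1 by default)
    reb_add(sim, sun);
    reb_add(sim, p);                    // the trajectory we want to characterize
    reb_tools_megno_init(sim);          // adds variational particles, enables MEGNO
    reb_integrate(sim, tmax);
    double ly = reb_tools_calculate_lyapunov(sim);  // Lyapunov exponent [1/time]
    reb_free_simulation(sim);
    return 1.0/ly;                      // Lyapunov timescale
}
```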
I don't think it's related to whether the trajectory is chaotic or not. These are all the same trajectories, just some of them integrated a little longer.
I've done some more digging. I only see these spikes when using ASSIST. When I run an equivalent REBOUND simulation (9 planets and the test particle), it looks as expected. I've tried turning off individual force routines in ASSIST, but multiple combinations seem to show this issue, so it's not a bug in one of the force routines (though it could be a bug in `ephem_all`, which is used by all routines).
Just to confirm, you did turn off the ein_GR routine, right? I ask because, in addition to being very long (and thus a likelier source of bugs), in that routine we use the accelerations of bodies as computed from the Chebyshev polynomials.
That said, we have a number of confirmations of the long term behavior of the code. I wonder if this might be a problem just with the output (or with the last step, as you suggested).
Yes, I've simplified the test so that it now just includes the force from the Sun and I still see this. If I add the test particle to a normal REBOUND simulation with the Sun, I don't see this.
Given that we're looking at really small numbers here, do you think this could just be the finite precision of the ephemeris? What is the position error at any given time coming from the interpolation (this test doesn't care about the true solar system)?
I think the ephemeris is good to a part in 10^12, but I will check on that. Rob, @cylon359, do you happen to recall?
Hm. This is now just one test particle in a time-varying potential of another particle (the Sun). Even if the Sun's motion is completely unphysical, we should be able to integrate out and back as long as the Sun doesn't make any jumps. Right? Does the ephemeris have jumps in it at the interfaces of the polynomials, or is it smooth to some degree?
I need to think more about it. It could just be a bug somewhere...
> I think the ephemeris is good to a part in 10^12, but I will check on that. Rob, @cylon359, do you happen to recall?
Yes, it should be good to one part in 10^11 or 10^12.
As for the jumps - it is possible there is a boundary issue in choosing which specific Chebyshev coefficients to use. If you can find a precise timestamp when the jump happens, we can check that.
That's a good point, Rob. The polynomials quickly blow up outside of their valid range.
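To illustrate the point about the blow-up (a generic Clenshaw evaluation, not the actual ASSIST ephemeris code): a Chebyshev fit is only meaningful on its own sub-interval, mapped to [-1,1]; for |x|>1, T_k(x) = cosh(k·acosh(|x|)) grows quickly with the order k, so accidentally evaluating a segment's coefficients just past its boundary produces a small discontinuity relative to the neighbouring segment.

```c
#include <stdio.h>

// Evaluate sum_k c[k]*T_k(x) with the Clenshaw recurrence.
double cheb_eval(const double* c, int n, double x){
    double b1 = 0.0, b2 = 0.0;
    for (int k = n-1; k > 0; k--){
        double b0 = 2.0*x*b1 - b2 + c[k];
        b2 = b1; b1 = b0;
    }
    return x*b1 - b2 + c[0];
}

int main(void){
    double c[16] = {1.0};                     // toy coefficients, not real ephemeris data
    for (int k = 1; k < 16; k++) c[k] = 1e-3/(double)(k*k);
    printf("x=1.000: %.15e\n", cheb_eval(c, 16, 1.000));
    printf("x=1.001: %.15e\n", cheb_eval(c, 16, 1.001));  // slightly past the valid range
    printf("x=1.100: %.15e\n", cheb_eval(c, 16, 1.100));  // far outside: high-order terms dominate
    return 0;
}
```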
Thanks. I think that could be what's going on. If we integrate backwards, the timesteps will be almost the same as in the forward integration, but not exactly (because they are adaptive). If we're unlucky, we'll hit a boundary. That would also explain why the effect is more pronounced for longer integrations. In practice this might not matter much, but this test would be good at picking it up. I'll see if I can find a specific boundary where this happens...
Two more arguments for why this might be what's going on: @matthewholman used the `assist_integrate` function with a fixed interval of 20 days, thus effectively synchronizing the timesteps when going out and back.

@hannorein You could check if fixing the timestep resolves the issue.
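In case it's useful, here is a minimal sketch of what "fixing the timestep" could look like in the unit test (assuming the REBOUND 3.x C API; the ASSIST ephemeris/force setup is omitted):

```c
#include "rebound.h"

// Sketch: configure a simulation with a fixed 20-day IAS15 timestep, so the
// backward leg retraces exactly the same steps as the forward leg.
struct reb_simulation* setup_fixed_dt(void){
    struct reb_simulation* sim = reb_create_simulation();
    sim->integrator        = REB_INTEGRATOR_IAS15;
    sim->ri_ias15.epsilon  = 0.0;   // epsilon = 0 disables adaptive timesteps in IAS15
    sim->dt                = 20.0;  // fixed 20-day step (ASSIST uses days as the time unit)
    sim->exact_finish_time = 0;     // don't shorten the final step to hit tmax exactly
    return sim;
}
```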
oops! :)
@hannorein The plot you produced also suggests to me that a bias may be present in ASSIST at large times. I attach a similar plot from Davide (not sure if it's on GitHub) in which the JPL method may also be transitioning to some bias at large times. Obviously, this is not necessarily even a problem, just a feature.
I think this can safely be merged in now.
Sorry for the slow response. I believe I have been using a variable time step for all the recent tests, although there might have been an older test with fixed time steps in examples/simplest/problem.c.
Yes, you used a variable timestep in IAS15, but you also used a fixed interval in the wrapper `assist_integrate` function. The interval was comparable to the adaptively chosen IAS15 timestep. This has the effect that you effectively synchronize the timesteps going forward and backward. The out and back timesteps never drift apart.
Oh, man! Thank you for catching that. I need to redo the corresponding figures.
I dug a bit deeper to see if I can find any discontinuities near the polynomial boundaries. I can see a small jump if I sample the coordinates both just before and just after a boundary. But it's really small, close to floating point precision. I'm not sure how significant it is for the integration.
During the process, I noticed another issue. The following is a bit long. You don't need to read it or respond; it just helps me to write it down.
When integrating, REBOUND needs to somehow detect whether the user wants to integrate forward or backward. As a first hint, it uses the sign of the timestep. So ideally, one should also switch the sign of the timestep when switching directions. If that doesn't happen (like in the original unit test), REBOUND eventually figures it out and changes the integration direction, but the first timestep after the switch will be very large. Usually that's not a problem because the large timestep will just be rejected in an adaptive scheme. It matters in this case because we're interested in such high precision and the predictor-corrector loop retains some memory of the too-large timestep (because it's trying to predict values for the smaller timestep). There are two solutions: a) simply change the sign of the timestep when integrating backwards, thus telling REBOUND explicitly which direction we want to integrate; b) somehow fix the logic that automatically determines the direction in which REBOUND integrates. That sounds easy, but it's really tricky because:
1) The timestep can be fixed or adaptive.
2) REBOUND needs to reduce the final timestep so that the final time matches exactly the requested time (if exact_finish_time=1).
3) The final timestep might not actually be final because the step might get rejected. Then it needs to reduce the timestep once again. So what used to be one final step might be many final steps.
4) After the final timestep(s) are done, it needs to reset the timestep to the previous value so that a simulation can continue with the normal, not reduced, timestep.
5) REBOUND runs into floating point issues when the time variable is large compared to the timestep. That easily happens if one integrates for billions of timesteps. There is all kinds of extra logic to prevent things from blowing up in that case.
6) All of this needs to be implemented so that it works when integrating backward or forward. Again, that sounds easy because all that changes is a sign, but every < and > statement needs to be able to handle this.
I'll try to improve the current way this is implemented in REBOUND. But it's hard to foresee all the possible consequences. In the meantime solution a) significantly reduces the number of spikes we see in this test. It's not quite as good as using a fixed timestep, which I think makes sense. We'd expect a larger error when adaptive timesteps are turned on because the particle sees the planets at slightly different times on the way back, leading to slightly different errors than on the way out.
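For reference, solution a) is just a small change in the test. A minimal sketch (assuming the REBOUND 3.x C API; particle and force setup omitted):

```c
#include <math.h>
#include "rebound.h"

// Out-and-back roundtrip with an explicit sign flip of the timestep,
// so REBOUND knows the integration direction from the start.
void roundtrip(struct reb_simulation* sim, double trange){
    double t0 = sim->t;
    sim->dt = fabs(sim->dt);            // forward
    reb_integrate(sim, t0 + trange);
    sim->dt = -fabs(sim->dt);           // solution a): flip the sign before going back
    reb_integrate(sim, t0);             // integrate back to the starting time
    // Compare sim->particles[...] against the stored initial conditions here.
}
```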
FWIW, similar to the argument before: @matthewholman, you did not encounter this issue because you didn't change the direction of integration; you started a new integration when using `assist_integrate` the second time.

tl;dr: this is a complex test! It's good to find these kinds of issues, but I also don't think any of this matters in 99% of applications where one just integrates in one direction.
@hannorein When you reverse the timestep sign, is the magnitude of the spikes still ~4 orders of magnitude?
As to how common these spikes are, and how problematic they are in the forward direction, I assume their rate of occurrence has a scaling with the timestep.
@dmhernan This first plot shows what reversing the timestep sign does (old = not reversing the timestep sign, new = reversing the timestep sign).
This second plot shows the issue that still remains. And yes, the spikes are still large. For the run labeled "adaptive perturbed", I've perturbed the initial position by 1e-6 to see how sensitive the integration is. As you can see, the spikes appear first at the same time, but then appear randomly. I can't think of anything else other than the ephemeris being the origin for this. But I'd be happy to hear about other ideas!
@dmhernan Results from one more experiment if you want to think further about it. In the green curve, I'm integrating out with one fixed timestep, then change it by ~10% and integrate back (choosing the timestep so that start and end time match perfectly). No more spikes. 🤷
Alright, so maybe this is just IAS15 not converging to machine precision in those cases. If I slightly reduce epsilon to 1e-10, then the issue also goes away.
(Sorry for the flood of posts, don't feel obliged to follow)
Indeed, that seems to be what's going on. Here is the same data, but also plotting the maximum adaptive timestep that occurred at some point during the integration. The spikes start exactly when the timestep increases. I'll need to look into why IAS15 thinks the timestep should be larger after some random time. No idea right now...
What a turn of events! If the time step guess is off, that seems to imply the error estimates are off. And the timestep problem persists even if all effects are removed except Newtonian gravity? Because if so, then the only difference with a usual simulation is the ephemeris issue. Could ephemeris error cause error estimates from the RK method to be off which then affects the timestep?
Yes, the problem persists with only the direct forces. I don't see the issue if I run the same simulation but use real N-body particles to calculate the forces instead of the ephemeris data.
Another thing that still doesn't make sense is why these are spikes. If this is due to the timestep being too large at some point, it should occur both in the out and back portions of the integration (50/50). But if this occurs on the outgoing path just once, then it should occur for all `trange`s larger than some critical value, because all the outgoing integrations are identical. The plot should then show the error jumping up after some critical `trange` and staying high, because the error is cumulative. Instead, we see spikes.
Sorry if you already mentioned this and I missed it, but do you see the same thing when the caching is turned off?
Yes. I've tried it with caching off.
No ideas now. Of course, one could integrate backwards first and then forwards to check if the spike is associated with a problem in one of the time directions.
We've just tried that. Looks the same.
Progress: I've convinced myself that we should expect to see spikes rather than a plateau if IAS15 is in very rare cases choosing a timestep which is slightly too large. Here's why:
Say the chance of guessing the timestep wrong once during a 1000-day integration is 1%. Then we will most likely not see an error in an outward integration up to 1000 days, because it's only a 1% chance. However, if we run 1000 different simulations (because we sample different `trange`s), we have 1000 different backwards integrations. So in total we should expect about 10 spikes. These show up as spikes in the plot rather than a plateau because the error only occurs in the backwards trajectory (all the outgoing integrations are identical, so we effectively have 1000 distinct backwards integrations but only one forward integration). If `trange` is much larger, say 100*1000 days, then we have a ~100% chance of getting the timestep wrong in the outward integration as well. Then we should see a plateau. That would explain why we sometimes see a plateau in rare cases. In most cases the plateau is buried in noise, because the noise increases with `trange` whereas the height of the plateau stays the same.
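To make that expectation explicit (my own back-of-the-envelope version of the argument above, with a per-1000-day failure probability $p \approx 0.01$):

$$
P(\text{bad step in a leg of length } T) \;\approx\; 1-(1-p)^{\,T/1000\,\mathrm{d}},
\qquad
E[\text{spikes}] \;\approx\; \sum_{i=1}^{N} P(T_i) \;\approx\; N\,p \;\approx\; 10
\quad\text{for } N=1000 \text{ legs with } T_i \sim 1000\,\mathrm{d}.
$$

Once $T \gtrsim 10^5\,\mathrm{d}$, $P(T) \to 1$, so the shared outgoing leg is also affected and the isolated spikes merge into a plateau.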
It would also explain why this is not an issue in a normal integration that just goes in one direction. The error is so rare and so small, that it is buried in (the always increasing) numerical noise.
I think this explains everything. But I thought that before, so take it with a grain of salt.
I am very impressed with your detective work, @hannorein! This does seem to explain most/all of the observed features of the issue. What do you think sets the ~1% scale for the rate of timestep miscalculations?
I'm not sure. Clearly having just one particle is a bit of an unusual setup that I've never tested before. Normally, if there are multiple particles, at least 2, then the chance for this to occur would be much smaller, ~0.01^N I guess.
It looks like the somewhat long timestep follows a somewhat short timestep. That would make sense because it will be very hard to predict the timestep accurately if the previous one was on the small side (less curvature to observe). In IAS15, there is this `safety_factor` variable that determines how fast timesteps can grow/shrink from step to step. Currently this is set to 0.25 (increase/decrease by at most a factor of 4). It might make sense to set this closer to 0.5 or 0.75 for these high accuracy runs.
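Roughly what I mean (a paraphrased sketch of the kind of logic involved, not the actual REBOUND source; see `integrator_ias15.c` for the real implementation):

```c
// safety_factor limits how much the timestep may change from one step to the next.
// With safety_factor = 0.25, the step can grow or shrink by at most a factor of 4.
double limit_dt_change(double dt_done, double dt_new, double safety_factor){
    if (dt_new > dt_done/safety_factor){
        dt_new = dt_done/safety_factor;     // cap the growth
    }
    // If dt_new < dt_done*safety_factor, IAS15 instead rejects the step it
    // just took and redoes it with the smaller dt_new.
    return dt_new;
}
```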
@hannorein Alternatively... how about perturbing initial conditions many times as well? If your explanation is correct, eventually one of those initial conditions will always give spiked forwards-backwards integration.
Or possibly changing the guess for the initial time step randomly? Not that these things need to be tested!
But @hannorein , does it make sense why the normal integrations that also evolve the planets never have spikes?
"Normal" integrations have more than 1 particle, so the probability of getting the timestep slightly wrong is much lower. If the spikes are rare then they disappear into the floating point noise. That was kind of the idea of choosing the specific combination of epsilon
and safety_factor
. I'm sure we could find other cases where this becomes an issue but I think it works well in almost all cases. For this specific case, we just need to change epsilon
or safety_factor
a little bit.
Further evidence along those lines: I've just run a test with two test particles and the issue is almost completely gone. With three test particles, it is completely gone and the timestep remains nearly constant.
@hannorein Ah, thanks. So your two test particles must be interacting with each other. Conceptually, I understand that with more particles the timestep change would be smoother. Perhaps with the RK method one should be able to estimate the new timestep, as IAS15 does, but it may also be possible to estimate a reasonable rate of change of the timestep by deriving a timescale from the positions, velocities, and accelerations. This could be used to make sure the timestep jumps are reasonable.
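One generic way to get such a timescale (just an illustration of the idea, not necessarily what is meant above and not the specific forms in Boekholt et al.): compare the acceleration to its rate of change and refuse to grow the timestep faster than that timescale suggests.

```c
#include <math.h>

// Sketch: a characteristic timescale from the current acceleration and jerk
// of a single particle, tau = |a| / |da/dt|. A timestep controller could
// limit dt changes per step to a fraction of this timescale.
double acc_jerk_timescale(const double a[3], const double jerk[3]){
    double anorm = sqrt(a[0]*a[0] + a[1]*a[1] + a[2]*a[2]);
    double jnorm = sqrt(jerk[0]*jerk[0] + jerk[1]*jerk[1] + jerk[2]*jerk[2]);
    if (jnorm == 0.0) return INFINITY;   // acceleration not changing: no constraint
    return anorm/jnorm;
}
```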
Hi, I'm curious if there was a resolution to the large time step jumps, or is this left for future work? For an analytic estimate of a time step jump, reasonable functional forms are in eq. (5) and (6) in Boekholt et al. (2022): https://arxiv.org/pdf/2212.09745.pdf
I think this is addressed by @hannorein in integrate_or_interpolate?
I looked into the function assist_integrate_or_interpolate(), and it seems it's used to produce output at specified times. I thought the problem described in this thread occurs even if we don't need outputs?
I'm sorry, I don't understand what the issue is.
@hannorein, in the last plot of this thread, you showed the timestep having large jumps during forward-backward integrations. I'm asking if something was changed to stop this, or is the current plan to ignore such error/timestep jumps since they're rare?
Let me know if I'm missing something, but I can't think of a use case where the results would be affected by the current timestep choice. We're talking about precisions of the order of a meter! But if you want a smaller timestep for whatever reason, you can just change the epsilon value, use a maximum timestep, or a fixed timestep.
I see, yes, 1 m precision would be irrelevant for asteroid Holman. But how about such jumps in error (by orders of magnitude) affecting a measurement of chaos, for instance? I guess a user could investigate further and quickly determine the chaos is not real...
Again, I fail to see how this could possibly affect anything. The Lyapunov timescale is probably millions of years. We're integrating for hundreds of days with ASSIST.
One would need to be testing the limits of the code, like the long integrations in the paper (1e5 days), and consider asteroids with kyr Lyapunov times, like massive asteroids, for those scales to get comparable, if that's what you're getting at. Anyway, if the plan is to ignore the time step jumps, that seems reasonable to do.
Sigh. Nothing is getting ignored. I just don't believe there is any issue to begin with. If you think otherwise, please provide some evidence.
Hey @hannorein, the only thing I have is that if a user is integrating two nearby asteroids, and the difference in their positions jumps by four orders of magnitude, due to a time step jump, they may be confused as to where this divergence is coming from. If it's an issue that has confused me (for instance in Fig. 2 in the paper, perhaps a time step jump is responsible for the penultimate point), and confused us in this thread, it could perhaps confuse other researchers.
@dmhernan - Do you mean figure 6?
While coding up the roundtrip test from the paper as a unit test, I've noticed something that doesn't seem quite right. The error seems to depend on the range in a very non-smooth way (have a look at the output below). The plot in the notebook and paper doesn't have that many data points, so I'm not sure if this is a new problem. In any case, I don't think this is correct as of right now. I suspect this is related to how the `reb_integrate` function handles the last timestep, which, depending on where the timesteps fall, might have to be much smaller than a "normal" timestep (but it could also be something completely different).