mumax / 3

GPU-accelerated micromagnetic simulator

Minimizer() gives inconsistent results for the same input, between runs and in the same run. #329

Open jsampaio opened 7 months ago

jsampaio commented 7 months ago

The Minimize() function runs for a different number of steps for the same input state, and gives a different end state.

Demonstrator code below. In this code, Minimize() is called five times with exactly the same initial magnetisation m (in a for loop), and the output is checked (exemplified here by the magnetisation at an arbitrary cell). This script was run several times. Result: the end result is different every time the script is run, as well as for every one of the executions with the same initial state (in the for loop). Expected result: the same input should produce exactly the same output, especially between runs of the script, but also between runs with the same initial state.

This is a problem because it makes simulation results irreproducible.

I've quickly checked the code of engine/minimizer.go and cuda/minimize.cu and it is not clear to me what causes this random behaviour.

Demonstrator code:

SetMesh(128, 64, 1, 4e-9, 4e-9, 4e-9, 0, 0, 0); 
Msat = 1e6; Aex = 10e-12; alpha = 1.;
tableaddvar(step, "step", "");
tableadd(MaxTorque);
tableAdd(Crop(m, 30, 31, 30, 31, 0, 1)); // mag in one arbitrary cell (more sensitive to small differences than the whole average)

for i := 0; i < 5; i++ {
    m = TwoDomain(0.,0,1., 0,1,0, 0,0.,-1.);
    step=0;
    Minimize(); 
    tablesave(); print(t, step, MaxTorque, Crop(m, 30, 31, 30, 31, 0, 1).average()); 
}

Outputs.

Here's the output for two runs on the same computer. The columns are: time t, number of steps used by Minimize(), final MaxTorque, and m at the arbitrary cell (30, 30, 0).

Run 1:

//0 1708 1.3587993969904421e-06 [0.24186624586582184 -0.9649359583854675 0.10197720676660538]
//0 706 1.9388212427259442e-06 [0.7151626944541931 -0.6850084662437439 -0.13894522190093994]
//0 1622 1.4579250321644758e-06 [-0.7151928544044495 0.6849732398986816 -0.13896377384662628]
//0 1650 1.943069133851686e-06 [0.5473531484603882 -0.8349050283432007 0.057777922600507736]
//0 1214 1.4078213793882306e-06 [0.5473809838294983 -0.8348858952522278 -0.05779007822275162]

Run 2:

//0 980 1.620687266051533e-06 [0.5473484992980957 -0.8349082469940186 -0.057776421308517456]
//0 842 1.4022564510987115e-06 [0.7151708602905273 -0.6849989295005798 -0.13895045220851898]
//0 924 2.0761421675785363e-06 [0.5473533868789673 -0.8349048495292664 -0.05777806416153908]
//0 776 1.9706511693829087e-06 [0.5473541021347046 -0.8349043130874634 -0.05777836963534355]
//0 964 1.8854423685616054e-06 [0.5473534464836121 -0.8349047303199768 -0.05777810141444206]
MathieuMoalic commented 7 months ago

Hi, check this issue from 7 days ago, it might answer your questions:

https://github.com/mumax/3/issues/328#issue-1989298198

DeSanz commented 7 months ago

Hello, I had a similar problem some years ago, related to the solver not flushing its memory properly. You can try to force a solver flush to see if the problem comes from there. See the issue here: #260

jsampaio commented 7 months ago

Thanks for your inputs, @DeSanz and @MathieuMoalic!

It seems very similar to issue #260. However, the cache-flush code from that issue (SetSolver(1); steps(10); SetSolver(5);) doesn't work when using the Minimize() function (see output below, showing different outputs from the same loop; these outputs also differ between separate runs). I think this is because Minimize() does not use the solvers for its steps.

I have another question that perhaps @DeSanz and @JeroenMulkers know the answer to: with #261 (merged to master in May 2020), issue #260 should no longer occur, right? However, when I run the demonstration code that @DeSanz included in his issue, with a build compiled last month, I still observe the original (undesired) behaviour...

Demonstrator code (with cache flush)

SetMesh(128, 64, 1, 4e-9, 4e-9, 4e-9, 0, 0, 0); 
Msat = 1e6; Aex = 10e-12; alpha = 1.;
for i := 0; i < 5; i++ {
    m = TwoDomain(0.,0,1., 0,1,0, 0,0.,-1.);
    SetSolver(1); steps(10); SetSolver(5); //flushing cache
    m = TwoDomain(0.,0,1., 0,1,0, 0,0.,-1.);
    t = 0; step=0;
    Minimize(); 
    print(t, step, MaxTorque, Crop(m, 30, 31, 30, 31, 0, 1).average()); }

output

The columns are: time / number of steps used by Minimize() / final MaxTorque / m at the arbitrary cell (30, 30, 0).

//0 812 1.9369425157848546e-06 [0.24995791912078857 -0.9629429578781128 0.10130093991756439]
//0 2140 1.2993552509700089e-06 [0.9499351382255554 -0.31243622303009033 0.0026495191268622875]
//0 1514 1.3886210098205397e-06 [0.8548673391342163 -0.5159158110618591 0.05507013946771622]
//0 738 1.6608056415286746e-06 [0.7151972055435181 -0.684968113899231 0.1389666497707367]
//0 1128 1.3945995817212263e-06 [0.5473573207855225 -0.8349020481109619 0.05777895078063011]
jplauzie commented 7 months ago

Hi,

I think that issue should indeed be fixed by #261; part of the fix includes changes to the minimizer. The fact that the results differ between separate runs also suggests it's not a cache issue, because nothing is cached between separate program instances.

To test it further, I tried running a modified version of the test file steppercache.mx3 that was included in that merge (slightly modified to work with minimize instead). I forced minimize to do only 1 step by setting Minimizerstop and Minimizersteps to 1. (This is slightly kludgy, but you normally can't control how many steps minimize does.) It's attached as steppercache_minimizer.txt (GitHub didn't like the .mx3 and .go extensions). It seems to function correctly. Below are the time, steps, and torque values. The torque values are what you'd expect and don't seem to be getting improperly reused:

//0 1 -2.4752475624723047e-08 0.049504950642585754 0.004950495436787605 [1 6.675998321995304e-22 4.999999873689376e-06]
//0 1 2.4752475624723047e-08 -0.049504950642585754 0.004950495436787605 [-1 -6.675998321995304e-22 4.999999873689376e-06]

I also reran the script after modifying the minimizer Go function to end after 1 step (attached as Minimizer_mod.txt), by changing line 157 of the minimizer to "return (NSteps < 1)", so that it returns after one step. I also printed out the k values prior to the descent. Again, I think the values are fine, and not what you'd expect if the buffer weren't being cleared. The k values are

// -0, 6.6759985e-18, 0.05
// -0, -6.6759985e-18, 0.05

and the (time, steps, torque, magnetization of the single cell):

//0 1 -2.4752475624723047e-08 0.049504950642585754 0.004950495436787605 [1 6.675998321995304e-22 4.999999873689376e-06]
//0 1 2.4752475624723047e-08 -0.049504950642585754 0.004950495436787605 [-1 -6.675998321995304e-22 4.999999873689376e-06]

I'm not as proficient with Go, so it's difficult for me to parse the minimizer code. But the relevant lines in the minimizer seem to do what you would expect for clearing the buffer: before minimizing, the stepper, if it is not nil, is freed at line 146 (and it is fully freed, instead of being recycled), and the k value is nil'd at line 152 and recalculated from scratch.

Also, is there any particular reason you chose the grid size you did? That might be influencing the results. The exchange length for those materials is ~4 nm. If I drop the cell size to 2 nm with no other changes, I get

//0 6278 6.829139697192452e-06 [0.6459949016571045 0.506130576133728 0.5714214444160461]
//0 9970 6.9236472013054965e-06 [0.626321017742157 0.517693281173706 0.5828513503074646]
//0 5552 5.696882147088674e-06 [0.706771194934845 0.48067420721054077 0.5190634727478027]
//0 10820 6.807892596907236e-06 [0.6255554556846619 0.5181707143783569 0.5832491517066956]
//0 5550 5.579739075319664e-06 [0.6891061067581177 0.4887142777442932 0.5350618958473206]

While normally the cell size shouldn't matter much for consistency (even an improper cell size should give a consistent result), I think in this case it might matter a lot, since the vortex core is around that size, especially because the cell you happened to choose is very close to the vortex core. If the vortex core shifts slightly, even one cell over, the magnetization of cell (30, 30, 0) (which is quite close to the core) will shift quite a lot. I've marked the cell (after the simulation was done, so it didn't influence the dynamics) here, in black:

[figure: final magnetization state, with cell (30, 30, 0) marked in black]

If you look at the values you reported, they seem to correspond to where exactly the cell sits relative to the vortex core: if it's just below the core, it'll be ~(0.9, -0.3, 0), whereas if it's to the left of the core, it'll be more like ~(0.2, -0.9, 0). So some of that instability might come from your cell size, the end magnetic state (a diamond shape), and the fact that the cell you picked is very close to the center of one of the vortices.

A better comparison would be to decrease the cell size to 2 nm but increase the number of cells to 256/128/1 (to keep the same simulation size), and look at cell (60, 60, 0), which is in approximately the same physical spot. With that change the result also becomes more consistent, despite still being quite close to the core:

//0 15832 6.818977263708361e-06 [0.9508864283561707 -0.18134881556034088 0.25085410475730896]
//0 15274 6.950267380873044e-06 [0.9508272409439087 -0.18172381818294525 0.25080665946006775]
//0 15678 7.329368784995427e-06 [0.9508079290390015 -0.1817532330751419 0.2508585751056671]
//0 15820 8.692941164399005e-06 [0.9505621790885925 -0.18322229385375977 0.2507213056087494]
//0 15952 7.645881771758554e-06 [0.95073401927948 -0.1822352409362793 0.25078916549682617]

It's also worth mentioning that, as noted in this discussion, minimize() seems to struggle with vortices in particular: https://groups.google.com/g/mumax2/c/o_SfyV7CNek/m/Z3uUR-r0BgAJ

So I think, ultimately, what you're seeing is a combination of the limits of minimize (things like single precision and the non-associativity of floating-point arithmetic, the method itself, etc.), a problem (vortices) that is difficult for it, and a test that is inadvertently very sensitive due to the unlucky cell you arbitrarily picked. It is a bit interesting that relax() does not have this issue (it was completely consistent, even with the original 4 nm script). If someone wanted to tinker with it, minimize could possibly be improved, perhaps by renormalizing the magnetization at each step as suggested in #46.

(Side note: I did test loading in an initial m file as suggested by Mathieu; it didn't seem to help in this particular case. Handy trick, though.)

Best, Josh Lauzier

steppercache_minimizer.txt minimizer_mod.txt