ehannigan opened this issue 4 years ago
This may have been a better question to post in the DiffTaichi repo. I will post it there instead (https://github.com/yuanming-hu/difftaichi/issues/31#issue-664509249) and summarize any response I get here.
Hey @ehannigan! I just played around with this and found that there's randomness deriving from both Python's stdlib random and np.random. I tried setting random seeds on both of them in https://github.com/yuanming-hu/difftaichi/pull/34, and I'm now seeing deterministic results when running
python examples/mass_spring.py 0 train
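For reference, the change amounts to seeding both RNGs near the top of the script; a minimal sketch (the exact placement and seed value in the PR may differ):

```python
import random
import numpy as np

# Seed Python's stdlib RNG and NumPy's global RNG before any weights or
# initial conditions are drawn, so every run starts from the same state.
random.seed(0)
np.random.seed(0)
```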
Hey! Thank you @samuela! I thought I had tried setting a seed (using numpy's RandomState object), but I must have messed up somewhere. I'll go back and try running it with your fix.
I tried adding those two lines, and I am still not getting repeatable results. Maybe you are using a different setup? What are you checking to decide that your results are the same each time? I'm looking at the loss values.
Here is my current setup: Python 3.7.3, taichi==0.6.21, LLVM==10.0.0, numpy==1.18.5
Here are the loss outputs after running the same command twice in a row:
First run: python mass_spring.py 2 train
n_objects= 20 n_springs= 46
Iter= 0 Loss= -0.21222630143165588
0.1129670010156612
Iter= 1 Loss= -0.21599465608596802
0.06594551447062441
Iter= 2 Loss= -0.25001487135887146
0.13671642659517222
Second run: python mass_spring.py 2 train
n_objects= 20 n_springs= 46
Iter= 0 Loss= -0.21222639083862305
0.11296541652168635
Iter= 1 Loss= -0.21599650382995605
0.0659292521378909
Iter= 2 Loss= -0.25000977516174316
0.1367099362027803
Iter= 3 Loss= -0.29366904497146606
0.10108566298371975
Is there anything else I could be missing to get the same results you are getting? I'm tearing my hair out on this one lol.
Sorry for my absence - recent days have been rather hectic for me.
Do you get any improvements if you use ti.f64 in
https://github.com/yuanming-hu/difftaichi/blob/4742b1c84b045ea64da2eae99a3240b2ae0ebad0/examples/mass_spring.py#L11 ?
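i.e. roughly the following (a sketch, assuming that line currently reads real = ti.f32 and that the precision is passed to ti.init via default_fp; your local copy may differ slightly):

```python
import taichi as ti

real = ti.f64                 # was: real = ti.f32
ti.init(default_fp=real)      # leave the integer type (ti.i32) unchanged
```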
Some strange results:
I tried changing to f64 and that didn't change anything.
I also tried setting f64 and i64, but I got this error:
Assertion failed: (S1->getType() == S2->getType() && "Cannot create binary operator with two operands of differing type!"), function Create, file /Users/th3charlie/dev/taichi-exp/
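For reference, the f64 + i64 attempt looked roughly like this (a sketch; I'm assuming the integer switch goes through ti.init's default_ip, and it's that part which seems to trip the assertion, since f64 alone runs fine):

```python
import taichi as ti

real = ti.f64
# Switching the default integer type is what appears to trigger the
# "Cannot create binary operator with two operands of differing type!" assertion.
ti.init(default_fp=real, default_ip=ti.i64)
```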
So I started a jupyter notebook to keep track of my debugging so I could post it here. ~~But in the jupyter notebook, when I ran mass_spring.py 2 train three times in a row using %run -i mass_spring.py 2 train, the output was deterministic; all the losses matched even when I was using f32.~~ (Edit: I tried this again and did not get the same result. Maybe I made a mistake, or read the results wrong.)
I will finish up my debugging notebook and post it tomorrow. If you have any insights, let me know.
I've created a jupyter notebook to outline my debugging process. Since there were some updates to difftaichi due to updates in taichi, I went ahead and updated my version just to make sure we weren't debugging old code.
Here is the notebook: https://github.com/ehannigan/difftaichi/blob/testing_determinism/examples/debug_determinism-current.ipynb
I tried running mass_spring.py without any modifications. I tried switching to f64. I also tried changing i32->i64 (which caused an error), and I tried using np.random.RandomState() instead of np.random.seed() (a sketch of both seeding variants is below). At least on my system, the results are still not deterministic.
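```python
import numpy as np

# Variant 1: seed NumPy's global RNG once, up front.
np.random.seed(0)
w_global = np.random.randn(4, 8) * 0.1   # illustrative shape, not the real controller dimensions

# Variant 2: draw from a dedicated generator instead of the global one.
rng = np.random.RandomState(0)
w_local = rng.randn(4, 8) * 0.1
```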
Could someone try running my jupyter notebook on their machine to see if you get the same results?
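If the notebook is a hassle, here is a smaller standalone check of the same thing (a sketch; it assumes you run it from the examples directory, that losses are printed in the "Iter= N Loss= ..." format shown above, and it will take as long as two full training runs):

```python
# check_determinism.py -- run the example twice and compare the printed losses.
import subprocess

def losses(output):
    # Pull the loss value out of every "Iter= N Loss= ..." line.
    return [line.split('Loss=')[1].split()[0]
            for line in output.splitlines() if 'Loss=' in line]

cmd = ['python', 'mass_spring.py', '2', 'train']
run1 = subprocess.run(cmd, capture_output=True, text=True).stdout
run2 = subprocess.run(cmd, capture_output=True, text=True).stdout

print('identical losses:', losses(run1) == losses(run2))
```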
Is there cuda in the backend? Is it possible that a function similar to this one needs to be added?
torch.backends.cudnn.deterministic=True
https://github.com/pytorch/pytorch/issues/7068
> Is there cuda in the backend? Is it possible that a function similar to this one needs to be added?
> torch.backends.cudnn.deterministic=True
> pytorch/pytorch#7068
There is, although I think you need to select it for it to be enabled. The default for mass_spring is CPU-only, IIRC.
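To rule CUDA out completely, one option is to pin the backend explicitly at init time; a sketch (mass_spring may already be doing the equivalent, and the exact ti.init arguments depend on the taichi version):

```python
import taichi as ti

real = ti.f32
# Force the CPU/x64 backend so a CUDA device can't be picked up implicitly.
ti.init(arch=ti.cpu, default_fp=real)
```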
Hmmm, then idk why I am still getting stochastic results. @samuela, you said you were able to get repeatable results? Were they just similar, or did the losses match exactly? If so, what is your system setup?
I was debugging some modifications I made to mass_spring.py when I realized that the result of each run is non-deterministic. I went back to the original mass_spring.py and made sure the controller network weights were initialized to the same value each time. But even when I can guarantee that there are no random variables being assigned anywhere, the resulting loss differs in each run.
Here are two different runs of the exact same code. You can see that the controller weights are exactly the same, but the loss values begin to diverge.
Run 1: mass_spring.py 2 train
n_objects= 20 n_springs= 46
weights1[0,0] -0.23413006961345673
weights2[0,0] 0.46663400530815125
Iter= 0 Loss= -0.2193218171596527
0.19502715683487248
Iter= 1 Loss= -0.21754804253578186
0.07976935930575488
Iter= 2 Loss= -0.3397877812385559
0.055776006347379746
Iter= 3 Loss= -0.3514309227466583
0.03870257399629174

Run 2: mass_spring.py 2 train
n_objects= 20 n_springs= 46
weights1[0,0] -0.23413006961345673
weights2[0,0] 0.46663400530815125
Iter= 0 Loss= -0.21932175755500793
0.1950520028177551
Iter= 1 Loss= -0.21754644811153412
0.07983238023710348
Iter= 2 Loss= -0.3397367000579834
0.055822440269175766
Iter= 3 Loss= -0.3514898419380188
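One way to rule out initialization differences beyond the printed precision would be to dump the weights from each run and compare them bit-for-bit; a rough sketch, assuming each run is modified to save its initial weights with np.save('weights1_runN.npy', weights1.to_numpy()) right after they are generated:

```python
# compare_weights.py -- compare the initial controller weights of two runs.
import numpy as np

a = np.load('weights1_run1.npy')
b = np.load('weights1_run2.npy')
print('bit-identical:', np.array_equal(a, b))
print('max abs diff :', float(np.abs(a - b).max()))
```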
In my own modifications, this non-determinism was causing inconsistent failures of the simulation (v_inc would explode and all values would go to nan). I assume this is due to instabilities in the Euler integration, but it would be nice to get consistent results each time to make debugging easier.
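A cheap guard that would make those failures easier to catch: read the field back after each step and abort at the first non-finite value. A sketch, assuming v_inc is a Taichi tensor exposing .to_numpy():

```python
import numpy as np

def assert_finite(step, field, name='v_inc'):
    # Abort at the first non-finite value so the failing step is easy to pin down.
    arr = field.to_numpy()
    if not np.isfinite(arr).all():
        raise RuntimeError(f'{name} became non-finite at step {step}')
```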
Where could the non-deterministic behavior be coming from? Is it something we can fix, or are there stochastic processes that are a result of the compiler?
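One thing I would like to rule out is reduction order: floating-point addition is not associative, so if the runtime ever sums contributions in a different order between runs (e.g. in a parallel loop), the last digits can change even with identical inputs. A tiny illustration of that effect, not a diagnosis of this particular case:

```python
import numpy as np

x, y, z = np.float32(1e8), np.float32(-1e8), np.float32(0.1)
print((x + y) + z)   # 0.1
print(x + (y + z))   # 0.0 -- same numbers, different grouping, different result
```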