uber-research / safemutations

In sm_simple.py, SM-G-SUM and SM-G-ABS scaling differ by sz^2 #2

Open GLJeff opened 5 years ago

GLJeff commented 5 years ago

Note that torch.autograd.backward() calculates the sum of the gradients over all states (at least in 0.4.1: https://pytorch.org/docs/stable/autograd.html?highlight=backward#torch.autograd.backward)
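
For reference, here is a minimal toy check of that summing behaviour (a hypothetical linear model, not code from sm_simple.py):

```python
import torch

# Toy check: calling backward() with a grad_output of ones accumulates the
# SUM of the per-state gradients into w.grad.
torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
states = torch.randn(4, 3)      # 4 "states", 3 features
out = states @ w                # one scalar output per state

out.backward(torch.ones_like(out), retain_graph=True)
summed = w.grad.clone()

# The same quantity, computed state by state and summed manually.
per_state = torch.stack([torch.autograd.grad(o, w, retain_graph=True)[0] for o in out])
assert torch.allclose(summed, per_state.sum(dim=0))
```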

SM-G-SUM feeds backward() outputs of 1 and then uses the returned gradients unaltered (i.e. their sum across states). SM-G-ABS feeds backward() outputs of 1/sz and then manually calculates the mean of the gradients of the individual states, whereas in SM-G-SUM they were already summed inside backward().

The result is that SM-G-SUM uses a scale that is a factor of sz^2 larger in magnitude than SM-G-ABS's. This is difficult to notice when the number of states is only 2, as in the example, and especially so since SM-G-ABS naturally returns a larger scale anyway because taking absolute values prevents washout.
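
A back-of-the-envelope illustration of that sz^2 factor (not the repo's code, and deliberately ignoring the separate abs/washout effect): with identical per-state gradients, "outputs of 1, summed by backward()" gives sz times the per-state gradient, while "outputs of 1/sz, then averaged" gives 1/sz times it.

```python
import torch

sz = 10
g = torch.randn(5)                     # stand-in for a single per-state gradient
per_state = g.expand(sz, -1)           # sz identical state gradients

sum_style = (1.0 * per_state).sum(0)   # SM-G-SUM style weighting/aggregation
abs_style = (per_state / sz).mean(0)   # SM-G-ABS style weighting/aggregation

print(sum_style / abs_style)           # every entry is sz**2 == 100
```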

Absolutely awesome work on your genetic and evolutionary research! Safe mutations are an incredible milestone in genetic optimization! Now just throw away TensorFlow and PyTorch and start coding in pure CUDA like you ought to be :)

GLJeff commented 5 years ago

To further clarify: I believe both implementations are wrong in the sense that neither finds a scaling vector that is independent of the number of states.

SM-G-SUM should set grad_output[:, i] = 1.0 / len(_states), since the gradients get summed by the backward() pass.

SM-G-ABS should EITHER:
a) set grad_output[:, i] = 1.0, since these values are then averaged along axis 2, OR
b) use mean_abs_jacobian = torch.abs(jacobian).sum(2) to sum them instead of averaging them.
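
To show what I mean by "independent of the number of states", here is a hypothetical sketch on a toy linear policy (not the repo's network, and the function names are mine) of how the proposed weightings behave: with 1/len(states) outputs for the SUM variant, or outputs of 1 plus a mean of absolute per-state gradients for the ABS variant, duplicating the states no longer changes the scale.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)

def sm_g_sum_scale(states):
    # Proposed SM-G-SUM weighting: outputs of 1/len(states), summed inside backward()
    out = states @ w
    grad, = torch.autograd.grad(out, w, grad_outputs=torch.full_like(out, 1.0 / len(states)))
    return grad.abs()

def sm_g_abs_scale(states):
    # Proposed SM-G-ABS option (a): outputs of 1.0, abs of each per-state
    # gradient, then a mean over states
    out = states @ w
    rows = [torch.autograd.grad(o, w, retain_graph=True)[0].abs() for o in out]
    return torch.stack(rows).mean(0)

states = torch.randn(4, 3)
doubled = torch.cat([states, states])    # same data, twice as many states

print(sm_g_sum_scale(states), sm_g_sum_scale(doubled))   # unchanged by duplication
print(sm_g_abs_scale(states), sm_g_abs_scale(doubled))   # unchanged by duplication
```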