snipsco / ntm-lasagne

Neural Turing Machines library in Theano with Lasagne
https://medium.com/snips-ai/ntm-lasagne-a-library-for-neural-turing-machines-in-lasagne-2cdce6837315#.63t84s5r5
MIT License

Cost gradient is NaN if the prediction is 0 or 1 #5

Closed · tristandeleu closed this 8 years ago

tristandeleu commented 8 years ago

As the NTM is trained on the cross-entropy cost function, its gradient is not defined when a prediction is exactly 0 or 1: the gradient with respect to a prediction p with target t is (1 - t)/(1 - p) - t/p, which involves a division by zero at p = 0 or p = 1.

In [17]: output
Out[17]:
array([[[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
        [ 1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  0.]]])

In [18]: ntm_fun(input)
Out[18]:
array([[[  1.32529286e-35,   1.26697327e-23,   1.05187953e-46,
           7.28068425e-57,   3.54038684e-42,   3.13652896e-48,
           1.06308336e-22,   2.13761781e-93,   1.46246878e-45],
        [  6.83159575e-07,   2.58929423e-03,   2.93741369e-04,
           1.88398819e-08,   1.14350232e-04,   8.40953354e-12,
           6.11956356e-06,   5.67079701e-12,   1.30867115e-14],
        [  9.31182794e-01,   7.83335632e-01,   9.78310319e-01,
           9.99838718e-01,   9.92572563e-01,   9.95089144e-01,
           1.62297365e-01,   1.00000000e+00,   2.65213791e-08]]])

In [19]: new_params
Out[19]: 
...
b_dense:
array([ 1.62206065,  0.91350565,  0.07703372,  1.46771803,  1.06758648,
         0.60542304,  1.51142395,         nan, -4.70645698])]
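For reference, the failure is easy to reproduce outside the NTM. Here is a minimal standalone sketch with Theano (illustrative, not the library's code): the symbolic gradient of the binary cross-entropy with respect to the prediction evaluates to NaN as soon as the prediction saturates.

import theano
import theano.tensor as T

# Minimal reproduction: binary cross-entropy and its gradient
# with respect to the prediction, evaluated at a saturated value.
prediction = T.dscalar('prediction')
target = T.dscalar('target')
cost = T.nnet.binary_crossentropy(prediction, target)
grad_fn = theano.function([prediction, target],
                          T.grad(cost, prediction))

print(grad_fn(0.9, 1.0))  # finite: -1/0.9
print(grad_fn(1.0, 1.0))  # nan: the (1 - t)/(1 - p) term is 0/0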

One solution could be to rescale the predictions into the [epsilon, 1 - epsilon] range.
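A minimal sketch of that workaround (variable names and the epsilon value are illustrative, not the committed fix), assuming the predictions live in [0, 1]:

import theano.tensor as T

# Illustrative sketch of the proposed workaround: affinely rescale
# predictions from [0, 1] into [epsilon, 1 - epsilon] so that
# log(p) and log(1 - p) stay finite in the cross-entropy.
epsilon = 1e-7
prediction = T.dtensor3('prediction')
target = T.dtensor3('target')
prediction_rescaled = epsilon + (1.0 - 2.0 * epsilon) * prediction
cost = T.nnet.binary_crossentropy(prediction_rescaled, target).mean()

Alternatively, T.clip(prediction, epsilon, 1 - epsilon) would only touch the saturated values instead of shifting every prediction.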

tristandeleu commented 8 years ago

Fixed in c0f85e14e8546f310db468cf27a6eb845821d90e