Joshuaalbert closed this issue 6 years ago
This is a cool modification, thanks for sharing! We are going to keep this repo frozen as a reference implementation from the nature paper so we will not update the addressing mechanisms here but it's interesting to hear about improvements.
I have tried your method that avoids top_k, replacing the `_allocation` function in the addressing class with:

```python
def _allocation(self, usage):
  with tf.name_scope('allocation'):
    relative_usage = tf.nn.softmax(usage)
    relative_non_usage = 1. - relative_usage
    relative_non_usage -= tf.reduce_min(relative_non_usage)
    allocation_weights = tf.nn.softmax(relative_non_usage)
    return allocation_weights
```
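For reference, the same computation can be checked outside the TensorFlow graph. A minimal NumPy sketch of this allocation rule, with toy usage values I made up for illustration:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy usage vector: slot 0 heavily used, slot 1 unused, slot 2 lightly used.
usage = np.array([3.0, 0.0, 1.0])

relative_usage = softmax(usage)
relative_non_usage = 1.0 - relative_usage
relative_non_usage -= relative_non_usage.min()
allocation_weights = softmax(relative_non_usage)

# The weights form a distribution and favour the least-used slot.
print(allocation_weights.sum())     # ~1.0
print(allocation_weights.argmax())  # 1 (the unused slot)
```

Note that no sorting or `top_k` is needed: every step is a smooth elementwise operation, which is exactly what keeps the write weights differentiable.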
and also removed the stop_gradient line you mentioned. Compared with the original DNC, the loss looks like:
It seems to work (orange is the new DNC, blue is the original one), and it does perform copy with fewer glitches in the loss. However, if I look at what happens inside the memory, the following are the memory and link matrix in the new DNC:
and in the original DNC:
It turns out the original DNC learns this memory-content and ordering pattern, while the new one using your suggested approach does not seem to. Do you encounter such problems, or did I do something wrong in my modification?
Thanks
Context
I have replicated the DNC as described in the Nature paper and implemented in your repository, with several modifications to addressing. In my case, I am using Keras rather than Sonnet. Implementing the DNC exactly as presented in the Nature paper led to a fairly unstable model on some problems (initialization could have a huge impact on learnability). This led me to reformulate each of the dynamic addressing mechanisms.
Enhancement
Here I request/point out an enhancement to the usage allocation weighting. I have chosen to implement it without sorting, which means you can remove this line. This also allows the user to specify an inferrable batch_size. Pardon me if I'm mistaken, but I think it is impossible to have inferrable dimensions and use `tf.unstack` without resorting to `TensorArray`s or dynamic partitioning. This was done as follows (you can infer what the variable names are, and ignore the `self`s, as it is pasted from some classes):

before write weights

get write weights
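The pasted code for these two steps did not survive. As a rough, hypothetical sketch of what a sort-free version of them could look like (function and variable names are my own, not the author's; NumPy stands in for the graph ops):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Before the write weights (assumed form): treat usage as unbounded
# access counts, accumulated by writes and decayed by the free gates.
def update_usage(usage, prev_write_weights, free_gates, prev_read_weights):
    usage = usage + prev_write_weights  # accumulate write access counts
    # Each read head's free gate (in [0, 1]) releases the slots it read.
    for gate, read_w in zip(free_gates, prev_read_weights):
        usage = usage * (1.0 - gate * read_w)
    return usage

# Getting the write weights (assumed form): no sorting or top_k, just a
# softmax over relative non-usage, matching the snippet quoted earlier.
def allocation_weights(usage):
    relative_non_usage = 1.0 - softmax(usage)
    relative_non_usage -= relative_non_usage.min()
    return softmax(relative_non_usage)

usage = np.zeros(4)
usage = update_usage(usage,
                     prev_write_weights=np.array([0.9, 0.1, 0.0, 0.0]),
                     free_gates=[0.0],
                     prev_read_weights=[np.zeros(4)])
a = allocation_weights(usage)  # concentrates on the never-written slots
```

Because nothing here depends on a static tensor shape, the batch dimension can stay unknown at graph-construction time, which is the point of avoiding `tf.unstack`.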
Intuition of change
The usage is better represented as an unbounded, positive count of accesses per slot rather than a number between 0 and 1. The free gates can reset these counts as in the original implementation. The allocation weighting is then a simple (albeit approximate) distribution over the relative non-usage. It deviates from the way a computer works (in that memory locations on a computer cannot be both used and unused), but it results in a smoother response to changes in memory access patterns. This approximation is counter-balanced by the fact that the write weights remain differentiable, and the sharpness of the allocation weights, as a result, remains quite nominal.
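As a toy illustration of the free-gate reset mentioned above (the numbers and a single fully-focused read head are made up for the example):

```python
import numpy as np

# Unbounded usage counts per memory slot after many writes.
usage = np.array([5.0, 0.3, 2.0])

# One read head fully focused on slot 0, with its free gate open.
free_gate = 1.0
read_weights = np.array([1.0, 0.0, 0.0])

# The multiplicative reset from the original formulation still applies:
# a freed slot's count drops to zero; untouched slots keep their counts.
usage = usage * (1.0 - free_gate * read_weights)
print(usage)  # [0.  0.3 2. ]
```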
Result
In the problems I applied it to, I saw noticeably faster training, and the allocation gates were slightly more often close to 1 (a usage-addressing preference).
Note: the faster learning might also be related to the temporal-linking modifications that were also implemented.
System
Keras 2.0.8, tensorflow 1.3.0