here's Alex Ororbia et al's Delta RNN, which is IMHO a simpler version of the Differentiable Neural Computer (this is a cell; i'm not using the RNN wrapper for my project, but it's easily adapted for that ... it's just a stateful layer)
nature.Layer is just my way to switch layers project-wide super fast.
AI holds data and methods for hyperparameter optimization with Optuna via an "AI.pull" method ... with id=False it skips appending a nano-id to uniquify the parameter name, so the parameter stays the same across runs of the project. Normally it just generates a new parameter each time, but i should probably make False the default now that i think about it
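for context, a minimal sketch of how a pull() like that could sit on top of an Optuna trial (a guess at the wiring, not the actual nature code; a uuid stands in for the nano-id, and suggest_categorical / suggest_int are standard Optuna trial methods):

import uuid


class AI:
    """Guessed sketch of the pull() interface used in the snippets below."""

    def __init__(self, trial):
        self.trial = trial  # an optuna.trial.Trial

    def pull(self, name, *options, id=True):
        if id:
            # uniquify the name so each call registers a fresh parameter
            name = f"{name}_{uuid.uuid4().hex[:6]}"
        if len(options) == 1 and isinstance(options[0], list):
            # e.g. pull("delta_p_drop", [0., 0.5], id=False) -> categorical choice
            return self.trial.suggest_categorical(name, options[0])
        # e.g. pull("swag_power", 2, 8) -> integer range
        low, high = options
        return self.trial.suggest_int(name, low, high)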
# https://gist.github.com/tam17aki/f8bebcc427f99a3432592e5ca0186cb8
# https://arxiv.org/pdf/1703.08864.pdf Ororbia et al 2017
import tensorflow as tf
import nature

L = tf.keras.layers
LAYER = nature.Layer
DROP_OPTIONS = [0., 0.5]


class Delta(L.Layer):

    def __init__(self, AI, units=None):
        super().__init__()
        self.p_drop = AI.pull("delta_p_drop", DROP_OPTIONS, id=False)
        self.ai = AI

    def build(self, shape):
        d_in = shape[-1]
        self.gate_bias = self.add_weight("gate_bias", [d_in], trainable=True)
        self.z_t_bias = self.add_weight("z_t_bias", [d_in], trainable=True)
        self.state = self.add_weight("state", shape, trainable=False)
        self.alpha = self.add_weight("alpha", [d_in], trainable=True)
        self.b1 = self.add_weight("b1", [d_in], trainable=True)
        self.b2 = self.add_weight("b2", [d_in], trainable=True)
        self.fc1 = LAYER(self.ai, units=d_in)
        self.fc2 = LAYER(self.ai, units=d_in)
        self.out = nature.Fn(self.ai)
        super().build(shape)

    @tf.function
    def call(self, x):
        # inner
        V_h = self.fc1(self.state)
        W_x = self.fc2(x)
        d1 = self.alpha * V_h * W_x
        d2 = self.b1 * V_h + self.b2 * W_x
        z_t = tf.nn.dropout(tf.nn.tanh(d1 + d2 + self.z_t_bias), self.p_drop)
        # outer
        gate = tf.nn.sigmoid(W_x + self.gate_bias)
        self.state.assign(self.out((1. - gate) * z_t + gate * self.state))
        return self.state
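for reference, the update above restated with plain ops and no nature dependencies (dropout omitted; V and W stand in for the two nature.Layer projections, fn for nature.Fn):

import tensorflow as tf


def delta_step(x, h, V, W, alpha, b1, b2, z_bias, gate_bias, fn=tf.nn.tanh):
    # inner function: second-order interaction between input and previous state
    V_h, W_x = V(h), W(x)
    z = tf.nn.tanh(alpha * V_h * W_x + b1 * V_h + b2 * W_x + z_bias)
    # outer function: a gate interpolates between the candidate and the old state
    gate = tf.nn.sigmoid(W_x + gate_bias)
    return fn((1. - gate) * z + gate * h)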
this resizer helps a ton in making stuff fit together easily. the constructor takes a target shape and figures out how to produce that shape (flatten, project, reshape), optionally applying an activation function afterward.
import tensorflow as tf
import nature

L = tf.keras.layers


class Resizer(L.Layer):

    def __init__(self, AI, out_shape, key=None, layer=None):
        super().__init__()
        # get_size (author's helper, not shown) presumably returns the product of out_shape's dims
        size = get_size(out_shape)
        self.resize = nature.Layer(AI, units=size, layer_fn=layer)
        self.reshape = L.Reshape(out_shape)
        self.fn = nature.Fn(AI, key=key)
        self.out_shape = out_shape
        self.flatten = L.Flatten()
        self.built = True

    @tf.function
    def call(self, x):
        x = self.flatten(x)
        x = self.resize(x)
        x = self.reshape(x)
        x = self.fn(x)
        return x

    def compute_output_shape(self, shape):
        return self.out_shape
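hypothetical usage, assuming get_size returns the product of out_shape's dims so the projection lines up with the Reshape:

resizer = Resizer(AI, out_shape=(8, 8, 3))   # flatten -> project to 8*8*3 = 192 units -> reshape
y = resizer(tf.random.normal((4, 100)))      # y.shape == (4, 8, 8, 3)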
this one is along the lines of ResNet, but instead of adding we multiply elementwise (Hadamard product), so it's more like SWAG (although I think SWAG is a dense version) and also related to quadratic networks.
in my experience it works remarkably well on classification and regression.
# The SWAG Algorithm (loosely based on)
# https://arxiv.org/abs/1811.11813
# also similar to an Nth order generalization of:
# Universal Approximation with Quadratic Deep Networks
# https://arxiv.org/pdf/1808.00098.pdf
import tensorflow as tf
import nature

L = tf.keras.layers
LAYER = nature.Layer
MIN_POWER, MAX_POWER = 2, 8


class SWAG(L.Layer):

    def __init__(self, AI, layer_fn=LAYER, units=None):
        super().__init__()
        power = AI.pull("swag_power", MIN_POWER, MAX_POWER)
        self.zero = LAYER(AI, units=units)
        self.fn = nature.Fn(AI)
        self.layers = []
        for p in range(power):
            np = nature.NormPreact(AI)
            super().__setattr__(f"np_{p}", np)
            one = LAYER(AI, units=units)
            super().__setattr__(f"one_{p}", one)
            self.layers.append((np, one))
        self.addnorm = nature.AddNorm()
        self.built = True

    @tf.function
    def call(self, x):
        ys = [self.zero(self.fn(x))]
        for np, one in self.layers:
            x = np(ys[-1])
            ys.append(x * one(x))
        return self.addnorm(ys)
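schematic view of what the loop builds up, ignoring NormPreact and the activation (each pass multiplies the previous term by a fresh linear map of itself, so the interaction order roughly doubles before AddNorm sums everything):

# y0 = W0(f(x))        # ~1st-order term
# y1 = y0 * W1(y0)     # ~2nd-order (quadratic) interactions
# y2 = y1 * W2(y1)     # ~4th-order, and so on
# out = AddNorm([y0, y1, y2, ...])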
ConcatCoords lets us add channels encoding the index of each position in a tensor, which the CoordConv paper showed can dramatically improve results from convolutions on coordinate-dependent tasks.
# An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution
# https://arxiv.org/abs/1807.03247
# https://github.com/titu1994/keras-coordconv/blob/master/coord.py
import tensorflow as tf
from tools import normalize, log

K, L, B = tf.keras, tf.keras.layers, tf.keras.backend


# reducer to select the correct rank layer automatically.
# this could probably all be merged into 1 layer
def Coordinator(shape):
    log("build coordinator with shape:", shape, debug=True, color='green')
    if len(shape) == 2:
        return ConcatCoords2D()
    elif len(shape) == 3:
        return ConcatCoords3D()
    elif len(shape) == 4:
        return ConcatCoords4D()
    else:
        raise Exception(f"{shape} not supported by coordinator")


class ConcatCoords2D(L.Layer):

    def __init__(self):
        super().__init__()
        self.handler = ConcatCoords3D()
        self.built = True

    @tf.function
    def call(self, x):
        x = tf.expand_dims(x, -1)
        return self.handler(x)

    def compute_output_shape(self, input_shape):
        output_shape = list(input_shape)
        output_shape.append(1)
        output_shape[-1] = output_shape[-1] + 1
        return tuple(output_shape)


class ConcatCoords3D(L.Layer):

    def __init__(self):
        super().__init__()
        self.built = True

    @tf.function
    def call(self, x):
        shape = tf.shape(x)
        coords = tf.range(shape[1])
        coords = tf.expand_dims(coords, 0)
        coords = tf.expand_dims(coords, -1)
        coords = tf.tile(coords, [shape[0], 1, 1])
        coords = tf.cast(coords, tf.float32)
        coords = normalize(coords)
        return tf.concat([x, coords], -1)

    def compute_output_shape(self, input_shape):
        output_shape = list(input_shape)
        output_shape[-1] = output_shape[-1] + 1
        return tuple(output_shape)


class ConcatCoords4D(L.Layer):

    def __init__(self):
        super().__init__()

    def build(self, shape):
        h = tf.range(shape[1], dtype=tf.float32)
        w = tf.range(shape[2], dtype=tf.float32)
        h, w = normalize(h), normalize(w)
        hw = tf.stack(tf.meshgrid(h, w, indexing='ij'), axis=-1)
        hw = tf.expand_dims(hw, 0)
        self.hw = tf.tile(hw, [shape[0], 1, 1, 1])
        super().build(shape)

    @tf.function
    def call(self, x):
        return tf.concat([x, self.hw], -1)

    def compute_output_shape(self, input_shape):
        output_shape = list(input_shape)
        output_shape[-1] = output_shape[-1] + 2
        return tuple(output_shape)
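hypothetical usage for an image-like input (rank 4 picks ConcatCoords4D, which appends two normalized coordinate channels):

coords = Coordinator((None, 32, 32, 3))         # rank 4 -> ConcatCoords4D
y = coords(tf.random.normal((8, 32, 32, 3)))    # y.shape == (8, 32, 32, 5)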
"Circulator" is a VAE which predicts inputs, encodes them, then reconstructs them, in a loop, and this way you can get much more training data out of it. It passes along prediction errors so it's good to put L.ActivityRegularization() immediately after this one so we punish it for error (could tinker with long-run surprisal as curiousity signal a la Friston Free Energy
import tensorflow as tf
import nature

L = tf.keras.layers
LAYER_OPTIONS = [nature.Layer, nature.Attention, nature.MLP, nature.SWAG]
LOOP_OPTIONS = [1, 2, 3, 4]


# @tf.function(experimental_relax_shapes=True)
def ERR(true, pred):
    return tf.math.abs(tf.math.subtract(true, pred))


class Circulator(L.Layer):

    def __init__(self, AI, units=None, layer_fn=None):
        super().__init__()
        # use the provided layer_fn, otherwise fall back to a tuned choice
        self.layer = layer_fn if layer_fn else AI.pull(
            "circulator_layer", LAYER_OPTIONS, id=False)
        self.n_loops = AI.pull("circulator_loops", LOOP_OPTIONS, id=False)
        self.ai = AI

    def build(self, shape):
        self.code = self.add_weight(
            "code", shape,
            initializer=nature.Init(), regularizer=nature.L1L2())
        self.encode = self.layer(self.ai, units=shape[-1])
        self.decode = self.layer(self.ai, units=shape[-1])
        self.fn = nature.Fn(self.ai)
        self.out = L.Add()
        super().build(shape)

    @tf.function
    def call(self, x):
        prev_prediction = prev_reconstruction = x
        prev_code = self.code
        errors = []
        for n in range(self.n_loops):
            prediction = self.fn(self.decode(prev_code))
            code = self.fn(self.encode(prev_reconstruction))
            reconstruction = self.fn(self.decode(code))
            errors.extend([
                ERR(prev_prediction, prediction),
                ERR(x, prediction) * 420.,
                ERR(prev_reconstruction, reconstruction),
                ERR(x, reconstruction) * 420.,
                ERR(prev_code, code)])
            prev_reconstruction = reconstruction
            prev_prediction = prediction
            prev_code = code
        y = self.out(errors)
        return y
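and the wiring mentioned above, punishing the error tensor the layer returns (the l1 weight is just a placeholder):

x = nature.Circulator(AI)(x)
x = L.ActivityRegularization(l1=1e-3)(x)   # penalize accumulated prediction/reconstruction error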
finally here's an all-attention transformer, which just means we drop the feed-forward sublayer in the block and instead fold those parameters into the attention mechanism as extra (persistent memory) vectors. supposedly simpler. I used the Delta RNN to update the memory state, but that's not strictly necessary.
import tensorflow as tf
import nature

L = tf.keras.layers
INIT = nature.Init
REG = nature.L1L2
LAYER = nature.Layer
D_MODEL_OPTIONS = [8, 16, 32, 64, 128, 256]
MEMORY_SIZE_OPTIONS = [32, 512]
N_HEADS_OPTIONS = [1, 2, 4]
DROP_OPTIONS = [0., 0.5]
UNITS = None


class Attention(L.Layer):

    def __init__(self, AI, units=UNITS, layer_fn=LAYER):
        super().__init__()
        self.memory_size = AI.pull("attn_memory_size", MEMORY_SIZE_OPTIONS)
        self.d_model = AI.pull("attn_d_model", D_MODEL_OPTIONS)
        self.n_heads = AI.pull("attn_n_heads", N_HEADS_OPTIONS)
        self.p_drop = AI.pull("attn_p_drop", DROP_OPTIONS)
        assert self.d_model % self.n_heads == 0
        self.depth = self.d_model // self.n_heads
        self.delta = nature.Delta(AI)
        self.memory = self.add_weight(
            'memory', (1, self.memory_size, self.d_model), initializer=INIT(),
            regularizer=REG(), trainable=False)
        self.dense = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.wq = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.wk = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.wv = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.layer_fn = layer_fn
        self.units = units
        self.ai = AI

    def build(self, shape):
        units = self.units if self.units else shape[-1]
        self.channel_changer = tf.identity
        if units != self.d_model:
            self.channel_changer = nature.Layer(
                self.ai, units=units, layer_fn=self.layer_fn)
        super().build(shape)

    @tf.function
    def split_heads(self, x, batch_size):
        """Split the last dimension into (n_heads, depth).
        Transpose to (batch_size, n_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.n_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    @tf.function
    def call(self, sequence):
        shape = tf.shape(sequence)
        batch_size = shape[0]
        seq_len = shape[1]
        q = self.wq(sequence)  # (batch_size, seq_len, d_model)
        k = self.wk(sequence)  # (batch_size, seq_len, d_model)
        v = self.wv(sequence)  # (batch_size, seq_len, d_model)
        memory = tf.tile(self.memory, [batch_size, 1, 1])
        q = tf.concat([q, memory], 1)
        k = tf.concat([k, memory], 1)
        v = tf.concat([v, memory], 1)
        q = tf.nn.dropout(q, self.p_drop)
        k = tf.nn.dropout(k, self.p_drop)
        v = tf.nn.dropout(v, self.p_drop)
        q = self.split_heads(q, batch_size)  # (batch_size, n_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, n_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, n_heads, seq_len_v, depth)
        # scaled_attention.shape == (batch_size, n_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, n_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, n_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
        attended = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)
        # WHERE THE MAGIC HAPPENS:
        attended, memory = tf.split(attended, [seq_len, self.memory_size], axis=1)
        memory = tf.math.reduce_mean(memory, 0, keepdims=True)
        memory = self.delta(memory)
        self.memory.assign(memory)
        attended = self.channel_changer(attended)
        return attended


@tf.function
def scaled_dot_product_attention(q, k, v, mask=None):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look ahead)
    but it must be broadcastable for addition.
    Args:
        q: query shape == (..., seq_len_q, depth)
        k: key shape == (..., seq_len_k, depth)
        v: value shape == (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable
            to (..., seq_len_q, seq_len_k). Defaults to None.
    Returns:
        output, attention_weights
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    # softmax is normalized on the last axis (seq_len_k) so scores sum to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights
import tensorflow as tf
import nature


class Transformer(tf.keras.layers.Layer):

    def __init__(self, AI, units=None):
        super().__init__()
        self.units = units
        self.ai = AI

    def build(self, shape):
        self.attention = nature.Attention(self.ai, units=shape[-1])
        self.add_norm = nature.AddNorm()
        # self.layer = LAYER(units=units)
        # self.add_norm_2 = nature.AddNorm()
        super().build(shape)

    @tf.function
    def call(self, x):
        y = self.attention(x)
        y = self.add_norm([x, y])
        # z = self.layer(x)
        # y = self.add_norm_2([y, z])
        return y
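hypothetical usage on a (batch, seq_len, features) tensor, stacking a few blocks:

x = tf.random.normal((2, 16, 64))
for _ in range(4):
    x = Transformer(AI)(x)   # all-attention + AddNorm, output width matches the input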
edit: here's a working NoiseDrop (parameter noise + DropConnect) ... i don't recall why i made the functional interface, but it definitely works. Init and L1L2 just let me switch initializers and regularizers fast.
note: this one tends to be over-regularized, and that can sometimes cause NaN issues ... it might do better with more units and a lower P_DROP.
# Parameter Noise: https://openai.com/blog/better-exploration-with-parameter-noise/
# DropConnect: http://yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdf
import tensorflow as tf
from nature import Init, L1L2

INITIALIZER = Init
REGULARIZER = L1L2
ACTIVATION = None
STDDEV = 0.04
P_DROP = 0.5
UNITS = 16


def NoiseDrop(
        units=UNITS,
        activation=ACTIVATION,
        kernel_regularizer=REGULARIZER,
        activity_regularizer=REGULARIZER,
        bias_regularizer=REGULARIZER,
        kernel_initializer=INITIALIZER,
        bias_initializer=INITIALIZER,
        **kwargs):
    return _NoiseDrop(
        units=units,
        activation=activation,
        kernel_regularizer=kernel_regularizer(),
        # activity_regularizer=activity_regularizer(),
        bias_regularizer=bias_regularizer(),
        kernel_initializer=kernel_initializer(),
        bias_initializer=bias_initializer(dist='truncated'),
        **kwargs)


class _NoiseDrop(tf.keras.layers.Dense):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @tf.function
    def add_noise(self):
        return (
            self.kernel + tf.random.truncated_normal(
                tf.shape(self.kernel), stddev=STDDEV),
            self.bias + tf.random.truncated_normal(
                tf.shape(self.bias), stddev=STDDEV))

    @tf.function
    def call(self, x):
        kernel, bias = self.add_noise()
        return self.activation(
            tf.nn.bias_add(
                tf.keras.backend.dot(x, tf.nn.dropout(kernel, P_DROP)),
                bias))
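hypothetical drop-in usage in place of Dense (every call adds fresh truncated-normal parameter noise and drops half the kernel connections):

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer((32,)),
    NoiseDrop(units=64, activation=tf.nn.relu),
    NoiseDrop(units=1),
])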
i have lots more but those are some favorites which seem nice for Addons
Wish list:
Deformable Conv https://arxiv.org/abs/1703.06211
Network Deconvolution https://arxiv.org/abs/1905.11926 (supposedly better than batch norm)
Differentiable Neural Dictionary (from Neural Episodic Control) https://arxiv.org/abs/1703.01988
Hypernetworks are definitely high-yield (I tried a wrapper but it breaks in graph mode because it tries to convert tensors into numpy arrays) ... so we'd need individual layers for this for now :( Dense, Conv1D, and Conv2D are usually sufficient for most projects https://arxiv.org/abs/1609.09106
Reservoirs like echo state networks have posted some insane results... i think we just need Spectral Normalization near 1.0 http://www.scholarpedia.org/article/Echo_state_network
Optimization Layer (https://arxiv.org/pdf/1703.00443.pdf) / https://locuslab.github.io/qpth/
A "quantizer" layer from VQ-VAE would be "sick nasty" https://arxiv.org/abs/1906.00446
torch.nn.Bilinear looks cool and trivial to implement, like the interpolation part of the Delta RNN (see the sketch after this list)
Mixed convolutions https://arxiv.org/abs/1907.09595
Clockwork RNN is a sweet concept https://arxiv.org/abs/1402.3511
perhaps David Ha can help us add Weight Agnostic Neural Networks https://ai.googleblog.com/2019/08/exploring-weight-agnostic-neural.html
Capsules are really interesting https://en.wikipedia.org/wiki/Capsule_neural_network
Neural ODE is clearly groundbreaking and won Best paper at NeurIPS 2018 but so far doesn't have great implementations in TF... can 2.0 help us implement Neural ODE keras layers? https://arxiv.org/abs/1806.07366
Sparsely gated mixture of experts is really cool... constructor could just take a list of layers and automatically gate them https://arxiv.org/abs/1701.06538
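re: the bilinear item above, a rough sketch of what a torch.nn.Bilinear-style keras layer could look like (a guess, not an existing Addons or nature layer):

import tensorflow as tf


class Bilinear(tf.keras.layers.Layer):
    """computes y_k = x1^T W_k x2 + b_k, one interaction matrix per output unit"""

    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        shape1, shape2 = input_shape
        self.w = self.add_weight(
            "w", (self.units, int(shape1[-1]), int(shape2[-1])), trainable=True)
        self.b = self.add_weight(
            "b", (self.units,), initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, inputs):
        # contract both feature dimensions against the per-unit interaction matrices
        x1, x2 = inputs
        return tf.einsum("bi,kij,bj->bk", x1, self.w, x2) + self.b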
I am really interested in Transformer, but it seems that tensorflow/models has already covered this. Not so sure if we should, or could, maintain a mirror for it. Multi-head attention is also an attractive feature, although single-head attention, tf.keras.layers.Attention, exists in core TF.
I will leave this open as it's a large amount of information. However, I would request that a separate issue be opened if anyone is attempting to work on one of these. That will allow us to discuss the specific layer and properly assign someone to work on it.
Closing this now for maintainability... please feel free to re-open any specific layers as their own separate issues.
Feel free to point others toward this issue though because there is a good amount of information here. Thanks!
Describe the feature and the current behavior/state. layers are super fun to write and use and can be quite powerful
Relevant information
Which API type would this fall under (layer, metric, optimizer, etc.) layer
Who will benefit with this feature? keras users
Any other info. some ideas to follow