tensorflow / addons

Useful extra functionality for TensorFlow 2.x maintained by SIG-addons
Apache License 2.0

please add more layers #555

Closed: bionicles closed this issue 4 years ago

bionicles commented 5 years ago

Describe the feature and the current behavior/state: Layers are super fun to write and use, and they can be quite powerful.

Relevant information

Which API type would this fall under (layer, metric, optimizer, etc.)? Layer
Who will benefit from this feature? Keras users
Any other info: some ideas to follow

bionicles commented 5 years ago

here's Alex Ororbia et al.'s Delta RNN, which is IMHO a simpler version of the Differentiable Neural Computer (this is a cell; I'm not using the RNN wrapper for my project, but it's easily adapted for that ... it's just a stateful layer)

nature.Layer is just my way to switch layers project-wide super fast.

AI holds data and methods for hyperparameter optimization with Optuna via an "AI.pull" method ... with id=False it skips adding a nano-id to uniquify the parameter name, so the parameter stays the same across runs of the project. Normally it generates a new parameter each time, but I should probably make False the default now that I think about it.
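
For context, here is a minimal sketch of what such a pull helper could look like on top of Optuna; the AI class below and its nano-id handling are assumptions based on the description above, not the actual nature code:

import optuna
from uuid import uuid4

class AI:
    """Hypothetical stand-in for the hyperparameter helper described above."""

    def __init__(self, trial):
        self.trial = trial  # an optuna.Trial

    def pull(self, name, options_or_min, maybe_max=None, id=True):
        if id:
            # uniquify the name so a fresh parameter is created on every call/run
            name = f"{name}_{uuid4().hex[:6]}"
        if maybe_max is not None:
            # two positional numbers are treated as an integer range
            return self.trial.suggest_int(name, options_or_min, maybe_max)
        # otherwise choose from a list of categorical options
        return self.trial.suggest_categorical(name, options_or_min)

# usage inside an Optuna objective:
# def objective(trial):
#     ai = AI(trial)
#     p_drop = ai.pull("delta_p_drop", [0., 0.5], id=False)
# optuna.create_study().optimize(objective, n_trials=10)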

# https://gist.github.com/tam17aki/f8bebcc427f99a3432592e5ca0186cb8
# https://arxiv.org/pdf/1703.08864.pdf Ororbia et al 2017

import tensorflow as tf
import nature

L = tf.keras.layers
LAYER = nature.Layer
DROP_OPTIONS = [0., 0.5]

class Delta(L.Layer):

    def __init__(self, AI, units=None):
        super().__init__()
        self.p_drop = AI.pull("delta_p_drop", DROP_OPTIONS, id=False)
        self.ai = AI

    def build(self, shape):
        d_in = shape[-1]
        self.gate_bias = self.add_weight("gate_bias", [d_in], trainable=True)
        self.z_t_bias = self.add_weight("z_t_bias", [d_in], trainable=True)
        self.state = self.add_weight("state", shape, trainable=False)
        self.alpha = self.add_weight("alpha", [d_in], trainable=True)
        self.b1 = self.add_weight("b1", [d_in], trainable=True)
        self.b2 = self.add_weight("b2", [d_in], trainable=True)
        self.fc1 = LAYER(self.ai, units=d_in)
        self.fc2 = LAYER(self.ai, units=d_in)
        self.out = nature.Fn(self.ai)
        super().build(shape)

    @tf.function
    def call(self, x):
        # inner
        V_h = self.fc1(self.state)
        W_x = self.fc2(x)
        d1 = self.alpha * V_h * W_x
        d2 = self.b1 * V_h + self.b2 * W_x
        z_t = tf.nn.dropout(tf.nn.tanh(d1 + d2 + self.z_t_bias), self.p_drop)
        # outer
        gate = tf.nn.sigmoid(W_x + self.gate_bias)
        self.state.assign(self.out((1. - gate) * z_t + gate * self.state))
        return self.state
bionicles commented 5 years ago

this resizer helps a ton in making stuff fit together easily. The constructor takes a target shape and figures out how to produce that shape, optionally applying a function afterward.
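
The get_size helper isn't shown; presumably it just returns the number of elements implied by the target shape, along these lines (an assumption, not the actual tools code):

import numpy as np

def get_size(shape):
    # number of units a Dense resize needs so the Reshape to `shape` works out
    return int(np.prod(shape))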

class Resizer(L.Layer):

    def __init__(self, AI, out_shape, key=None, layer=None):
        super(Resizer, self).__init__()
        size = get_size(out_shape)
        self.resize = nature.Layer(AI, units=size, layer_fn=layer)
        self.reshape = L.Reshape(out_shape)
        self.fn = nature.Fn(AI, key=key)
        self.out_shape = out_shape
        self.flatten = L.Flatten()
        self.built = True

    @tf.function
    def call(self, x):
        x = self.flatten(x)
        x = self.resize(x)
        x = self.reshape(x)
        x = self.fn(x)
        return x

    def compute_output_shape(self, shape):
        return self.out_shape
bionicles commented 5 years ago

this one is along the lines of ResNet, but instead of adding we multiply elementwise (Hadamard product), so it's more like SWAG (although I think SWAG is a dense version) and also like quadratic networks

it works insanely well at classification and regression

# The SWAG Algorithm (loosely based on)
# https://arxiv.org/abs/1811.11813

# also similar to a Nth order generalization of:
# Universal Approximation with Quadratic Deep Networks
# https://arxiv.org/pdf/1808.00098.pdf

import tensorflow as tf
import nature

L = tf.keras.layers
LAYER = nature.Layer
MIN_POWER, MAX_POWER = 2, 8

class SWAG(L.Layer):

    def __init__(self, AI, layer_fn=LAYER, units=None):
        super(SWAG, self).__init__()
        power = AI.pull("swag_power", MIN_POWER, MAX_POWER)
        self.zero = LAYER(AI, units=units)
        self.fn = nature.Fn(AI)
        self.layers = []
        for p in range(power):
            np = nature.NormPreact(AI)
            super().__setattr__(f"np_{p}", np)
            one = LAYER(AI, units=units)
            super().__setattr__(f"one_{p}", one)
            self.layers.append((np, one))
        self.addnorm = nature.AddNorm()
        self.built = True

    @tf.function
    def call(self, x):
        ys = [self.zero(self.fn(x))]
        for np, one in self.layers:
            x = np(ys[-1])
            ys.append(x * one(x))
        return self.addnorm(ys)
bionicles commented 5 years ago

ConcatCoords lets us add channels holding the index of each point in a tensor; the CoordConv paper showed this can dramatically improve results from convolutions on coordinate-dependent tasks
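
The normalize helper imported from tools below isn't shown; a plausible sketch, assuming it just rescales the coordinate channel into [-1, 1] as the CoordConv paper does:

import tensorflow as tf

def normalize(x):
    # rescale values into the [-1, 1] range
    x_min = tf.reduce_min(x)
    x_max = tf.reduce_max(x)
    return 2. * (x - x_min) / (x_max - x_min + 1e-8) - 1.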

# An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution
# https://arxiv.org/abs/1807.03247
# https://github.com/titu1994/keras-coordconv/blob/master/coord.py
import tensorflow as tf

from tools import normalize, log

K, L, B = tf.keras, tf.keras.layers, tf.keras.backend

# reducer to select the correct rank layer automatically. 
# this could probably all be merged into 1 layer
def Coordinator(shape):
    log("build coordinator with shape:", shape, debug=True, color='green')
    if len(shape) == 2:
        return ConcatCoords2D()
    elif len(shape) == 3:
        return ConcatCoords3D()
    elif len(shape) == 4:
        return ConcatCoords4D()
    else:
        raise Exception(f"{shape} not supported by coordinator")

class ConcatCoords2D(L.Layer):
    def __init__(self):
        super(ConcatCoords2D, self).__init__()
        self.handler = ConcatCoords3D()
        self.built = True

    @tf.function
    def call(self, x):
        x = tf.expand_dims(x, -1)
        return self.handler(x)

    def compute_output_shape(self, input_shape):
        output_shape = list(input_shape)
        output_shape.append(1)
        output_shape[-1] = output_shape[-1] + 1
        return tuple(output_shape)

class ConcatCoords3D(L.Layer):
    def __init__(self):
        super(ConcatCoords3D, self).__init__()
        self.built = True

    @tf.function
    def call(self, x):
        shape = tf.shape(x)
        coords = tf.range(shape[1])
        coords = tf.expand_dims(coords, 0)
        coords = tf.expand_dims(coords, -1)
        coords = tf.tile(coords, [shape[0], 1, 1])
        coords = tf.cast(coords, tf.float32)
        coords = normalize(coords)
        return tf.concat([x, coords], -1)

    def compute_output_shape(self, input_shape):
        output_shape = list(input_shape)
        output_shape[-1] = output_shape[-1] + 1
        return tuple(output_shape)

class ConcatCoords4D(L.Layer):
    def __init__(self):
        super(ConcatCoords4D, self).__init__()

    def build(self, shape):
        h = tf.range(shape[1], dtype=tf.float32)
        w = tf.range(shape[2], dtype=tf.float32)
        h, w = normalize(h), normalize(w)
        hw = tf.stack(tf.meshgrid(h, w, indexing='ij'), axis=-1)
        hw = tf.expand_dims(hw, 0)
        self.hw = tf.tile(hw, [shape[0], 1, 1, 1])
        super().build(shape)

    @tf.function
    def call(self, x):
        return tf.concat([x, self.hw], -1)

    def compute_output_shape(self, input_shape):
        output_shape = list(input_shape)
        output_shape[-1] = output_shape[-1] + 2
        return tuple(output_shape)
bionicles commented 5 years ago

"Circulator" is a VAE which predicts inputs, encodes them, then reconstructs them, in a loop, and this way you can get much more training data out of it. It passes along prediction errors so it's good to put L.ActivityRegularization() immediately after this one so we punish it for error (could tinker with long-run surprisal as curiousity signal a la Friston Free Energy

import tensorflow as tf
import nature

L = tf.keras.layers
LAYER_OPTIONS = [nature.Layer, nature.Attention, nature.MLP, nature.SWAG]
LOOP_OPTIONS = [1, 2, 3, 4]

# @tf.function(experimental_relax_shapes=True)
def ERR(true, pred):
    return tf.math.abs(tf.math.subtract(true, pred))

class Circulator(L.Layer):

    def __init__(self, AI, units=None, layer_fn=None):
        super().__init__()
        if not layer_fn:
            layer_fn = AI.pull("circulator_layer", LAYER_OPTIONS, id=False)
        self.layer = layer_fn  # build() reads self.layer, so set it in both cases
        self.n_loops = AI.pull("circulator_loops", LOOP_OPTIONS, id=False)
        self.ai = AI

    def build(self, shape):
        self.code = self.add_weight(
            "code", shape,
            initializer=nature.Init(), regularizer=nature.L1L2())
        self.encode = self.layer(self.ai, units=shape[-1])
        self.decode = self.layer(self.ai, units=shape[-1])
        self.fn = nature.Fn(self.ai)
        self.out = L.Add()
        super().build(shape)

    @tf.function
    def call(self, x):
        prev_prediction = prev_reconstruction = x
        prev_code = self.code
        errors = []
        for n in range(self.n_loops):
            prediction = self.fn(self.decode(prev_code))
            code = self.fn(self.encode(prev_reconstruction))
            reconstruction = self.fn(self.decode(code))
            errors.extend([
                ERR(prev_prediction, prediction),
                ERR(x, prediction) * 420.,
                ERR(prev_reconstruction, reconstruction),
                ERR(x, reconstruction) * 420.,
                ERR(prev_code, code)])
            prev_reconstruction = reconstruction
            prev_prediction = prediction
            prev_code = code
        y = self.out(errors)
        return y
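
As mentioned above, the error signal can be penalized by putting L.ActivityRegularization right after this layer. A hypothetical wiring (this assumes the nature package and an AI helper instance `ai` from this thread; the l1 weight and shapes are arbitrary, and the batch size is fixed so the stateful code weight can be built):

inputs = tf.keras.Input((32, 8), batch_size=4)
errors = Circulator(ai)(inputs)
errors = tf.keras.layers.ActivityRegularization(l1=1e-3)(errors)
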
bionicles commented 5 years ago

Finally, here's an all-attention transformer, which just means we drop the output (feed-forward) sublayer in the block and instead add those parameters as extra vectors for the attention mechanism (supposedly simpler). I used the Delta RNN to update the memory state, but that's not strictly necessary.

import tensorflow as tf
import nature

L = tf.keras.layers

INIT = nature.Init
REG = nature.L1L2
LAYER = nature.Layer
D_MODEL_OPTIONS = [8, 16, 32, 64, 128, 256]
MEMORY_SIZE_OPTIONS = [32, 512]
N_HEADS_OPTIONS = [1, 2, 4]
DROP_OPTIONS = [0., 0.5]
UNITS = None

class Attention(L.Layer):

    def __init__(self, AI, units=UNITS, layer_fn=LAYER):
        super(Attention, self).__init__()
        self.memory_size = AI.pull("attn_memory_size", MEMORY_SIZE_OPTIONS)
        self.d_model = AI.pull("attn_d_model", D_MODEL_OPTIONS)
        self.n_heads = AI.pull("attn_n_heads", N_HEADS_OPTIONS)
        self.p_drop = AI.pull("attn_p_drop", DROP_OPTIONS)
        assert self.d_model % self.n_heads == 0
        self.depth = self.d_model // self.n_heads
        self.delta = nature.Delta(AI)
        self.memory = self.add_weight(
            'memory', (1, self.memory_size, self.d_model), initializer=INIT(),
            regularizer=REG(), trainable=False)
        self.dense = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.wq = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.wk = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.wv = nature.Layer(AI, units=self.d_model, layer_fn=layer_fn)
        self.layer_fn = layer_fn
        self.units = units
        self.ai = AI

    def build(self, shape):
        units = self.units if self.units else shape[-1]
        self.channel_changer = tf.identity
        if units != self.d_model:
            self.channel_changer = nature.Layer(
                self.ai, units=units, layer_fn=self.layer_fn)
        super().build(shape)

    @tf.function
    def split_heads(self, x, batch_size):
        """Split the last dimension into (n_heads, depth).
        Transpose to (batch_size, n_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.n_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    @tf.function
    def call(self, sequence):
        shape = tf.shape(sequence)
        batch_size = shape[0]
        seq_len = shape[1]

        q = self.wq(sequence)  # (batch_size, seq_len, d_model)
        k = self.wk(sequence)  # (batch_size, seq_len, d_model)
        v = self.wv(sequence)  # (batch_size, seq_len, d_model)

        memory = tf.tile(self.memory, [batch_size, 1, 1])
        q = tf.concat([q, memory], 1)
        k = tf.concat([k, memory], 1)
        v = tf.concat([v, memory], 1)

        q = tf.nn.dropout(q, self.p_drop)
        k = tf.nn.dropout(k, self.p_drop)
        v = tf.nn.dropout(v, self.p_drop)

        q = self.split_heads(q, batch_size)  # (batch_size, n_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, n_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, n_heads, seq_len_v, depth)
        # scaled_attention.shape == (batch_size, n_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, n_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, n_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
        attended = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        # WHERE THE MAGIC HAPPENS:
        attended, memory = tf.split(attended, [seq_len, self.memory_size], axis=1)
        memory = tf.math.reduce_mean(memory, 0, keepdims=True)
        memory = self.delta(memory)
        self.memory.assign(memory)

        attended = self.channel_changer(attended)
        return attended

@tf.function
def scaled_dot_product_attention(q, k, v, mask=None):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead)
    but it must be broadcastable for addition.

    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
    output, attention_weights
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    # softmax is normalized on the last axis (seq_len_k) so scores sum to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
    return output, attention_weights

# Transformer block that wraps the Attention layer above
import tensorflow as tf
import nature

class Transformer(tf.keras.layers.Layer):

    def __init__(self, AI, units=None):
        super(Transformer, self).__init__()
        self.units = units
        self.ai = AI

    def build(self, shape):
        self.attention = nature.Attention(self.ai, units=shape[-1])
        self.add_norm = nature.AddNorm()
        # self.layer = LAYER(units=units)
        # self.add_norm_2 = nature.AddNorm()
        super().build(shape)

    @tf.function
    def call(self, x):
        y = self.attention(x)
        y = self.add_norm([x, y])
        # z = self.layer(x)
        # y = self.add_norm_2([y, z])
        return y
bionicles commented 5 years ago

edit: here's a working NoiseDrop (parameter noise + DropConnect) ... I don't recall why I made the functional interface, but it definitely works. Init and L1L2 just let me switch initializers and regularizers fast.

note this one tends to be over-regularized, and that can sometimes cause issues with NaNs ... it might do better with more units and a lower P_DROP

# Parameter Noise: https://openai.com/blog/better-exploration-with-parameter-noise/
# DropConnect: http://yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdf
import tensorflow as tf
from nature import Init, L1L2

INITIALIZER = Init
REGULARIZER = L1L2
ACTIVATION = None
STDDEV = 0.04
P_DROP = 0.5
UNITS = 16

def NoiseDrop(
        units=UNITS,
        activation=ACTIVATION,
        kernel_regularizer=REGULARIZER,
        activity_regularizer=REGULARIZER,
        bias_regularizer=REGULARIZER,
        kernel_initializer=INITIALIZER,
        bias_initializer=INITIALIZER,
        **kwargs):
    return _NoiseDrop(
            units=units,
            activation=activation,
            kernel_regularizer=kernel_regularizer(),
            # activity_regularizer=activity_regularizer(),
            bias_regularizer=bias_regularizer(),
            kernel_initializer=kernel_initializer(),
            bias_initializer=bias_initializer(dist='truncated'),
            **kwargs)

class _NoiseDrop(tf.keras.layers.Dense):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    @tf.function
    def add_noise(self):
        return (
            self.kernel + tf.random.truncated_normal(
                tf.shape(self.kernel), stddev=STDDEV),
            self.bias + tf.random.truncated_normal(
                tf.shape(self.bias), stddev=STDDEV))

    @tf.function
    def call(self, x):
        kernel, bias = self.add_noise()
        return self.activation(
                    tf.nn.bias_add(
                        tf.keras.backend.dot(x, tf.nn.dropout(kernel, P_DROP)),
                        bias))

I have lots more, but those are some favorites which seem nice for Addons.

bionicles commented 5 years ago

Wish list:

  1. Deformable Conv https://arxiv.org/abs/1703.06211

  2. Network Deconvolution https://arxiv.org/abs/1905.11926 (supposedly better than batch norm)

  3. Differentiable Neural Dictionary (from Neural Episodic Control) https://arxiv.org/abs/1703.01988

  4. Hypernetworks are definitely high-yield (I tried a wrapper, but it breaks in graph mode because it tries to convert tensors into numpy arrays) ... so we'd need individual layers for this for now :( Dense, Conv1D, and Conv2D are usually sufficient for most projects https://arxiv.org/abs/1609.09106

  5. Reservoirs like echo state networks have posted some insane results... I think we just need spectral normalization near 1.0 http://www.scholarpedia.org/article/Echo_state_network

  6. Optimization Layer (https://arxiv.org/pdf/1703.00443.pdf) / https://locuslab.github.io/qpth/

  7. A "quantizer" layer from VQ-VAE would be "sick nasty" https://arxiv.org/abs/1906.00446

  8. torch.nn.Bilinear looks cool and trivial to implement, like the interpolation part of the Delta RNN (see the sketch after this list)

  9. Mixed convolutions https://arxiv.org/abs/1907.09595

  10. Clockwork RNN is a sweet concept https://arxiv.org/abs/1402.3511

  11. perhaps David Ha can help us add Weight Agnostic Neural Networks https://ai.googleblog.com/2019/08/exploring-weight-agnostic-neural.html

  12. Capsules are really interesting https://en.wikipedia.org/wiki/Capsule_neural_network

  13. Neural ODE is clearly groundbreaking and won Best Paper at NeurIPS 2018, but so far doesn't have great implementations in TF... can 2.0 help us implement Neural ODE Keras layers? https://arxiv.org/abs/1806.07366

  14. Sparsely gated mixture of experts is really cool... the constructor could just take a list of layers and automatically gate them https://arxiv.org/abs/1701.06538
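
Since item 8 looks like the lowest-hanging fruit, here is a minimal sketch of a torch.nn.Bilinear-style Keras layer (an illustration of the idea, not a proposed final API); it computes y_k = x1^T A_k x2 + b_k per output unit k:

import tensorflow as tf

class Bilinear(tf.keras.layers.Layer):

    def __init__(self, units):
        super().__init__()
        self.units = units

    def build(self, input_shape):
        in1, in2 = int(input_shape[0][-1]), int(input_shape[1][-1])
        self.kernel = self.add_weight(
            "kernel", shape=(self.units, in1, in2), trainable=True)
        self.bias = self.add_weight(
            "bias", shape=(self.units,), initializer="zeros", trainable=True)
        super().build(input_shape)

    def call(self, inputs):
        x1, x2 = inputs
        # contract both feature dimensions against the 3D kernel
        return tf.einsum("bi,kij,bj->bk", x1, self.kernel, x2) + self.bias

# usage: Bilinear(8)([tf.random.normal((4, 10)), tf.random.normal((4, 20))]) -> shape (4, 8)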

WindQAQ commented 5 years ago

I am really interested in Transformer, but it seems that tensorflow/models has already covered this. I'm not sure whether we should, or could, maintain a mirror of it. Multi-head attention is also an attractive feature, although single-head attention, tf.keras.layers.Attention, already exists in core TF.
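
For reference, the core single-head layer mentioned here (Luong-style dot-product attention) is already usable as-is:

import tensorflow as tf

query = tf.random.normal((2, 5, 16))   # (batch, seq_len_q, dim)
value = tf.random.normal((2, 7, 16))   # (batch, seq_len_v, dim)
attended = tf.keras.layers.Attention()([query, value])
print(attended.shape)  # (2, 5, 16)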

seanpmorgan commented 5 years ago

I will leave this open as it's a large amount of information. However, I would request that a separate issue be opened if anyone is attempting to work on one of these. That will allow us to discuss the specific layer and properly assign someone to work on it.

seanpmorgan commented 4 years ago

Closing this now for maintainability... please feel free to re-open any specific layer as its own separate issue.

Feel free to point others toward this issue though because there is a good amount of information here. Thanks!