statefull? - Githubissues

DefinitlyEvil commented 5 years ago

Hello, I have to apologize that I'm not a math expert, but I wish to implement a GAN network using TCN in the discriminator model to process a generated sequence with dynamic length. Is it possible to make TCN stateful so it can process dynamic length data?

And if yes, is it possible to integrate stateful TCN to seq2seq model?

Thanks so much and I'm very appreciated to you and your project, Tobias.

philipperemy commented 5 years ago

@DefinitlyEvil no problem You would use stateful only when you don't know the whole sequence in advance. For example, imagine you want to train in a streaming fashion. Data is coming to your system one point at a time. Stateful was designed for that. In your case, you can train on sequence of different length without a stateful model. Have a look at this example:

https://github.com/philipperemy/keras-tcn/blob/master/tasks/multi_length_sequences.py

And of course within the same batch all sequences should have the same length. Otherwise just pad them.

philipperemy commented 5 years ago

And thanks for your support :)

DefinitlyEvil commented 5 years ago

@philipperemy Hi, thank you soooo much for your reply. I did look into this example and I found it very helpful. But dynamic length made it impossible to train on a big batch so it will have terrible efficiency. I am trying to use it to generate midi music using Bach's pieces by GAN.

I used PrettyMIDI to get a piano roll at 20 samples/sec and fed it into a generator TCN with nb_stacks=4 and default parameters. The discriminator is the same but its return_sequence = False and Dense(1, activation="sigmoid"). I used the way from the copy memory task example where there is one more dimension telling its time to generate or not. so the input and output of the generator are like(129=128 notes + 1 control flag):

in (None, None, 129):
000000000000000000|111111111111
a short slice of music |0000000000000
----
out(None, None, 129):
000000000000000000|111111111111
000000000000000000| generated music

after hours of training, the model can only produce a single holding note. I switched all final-layer activation functions from softmax to sigmoid but that doesn't help at all.

Could you please give my code an inspection to find the problems?

philipperemy commented 5 years ago

Dense(1, 'sigmoid') with a binary_crossentropy loss is exactly the same as Dense(2, 'softmax') with categorical_crossentropy.
Why do you need a time parameter in the first place? I think you should simplify the problem first. Like truncate all your MIDI files to a length of 128 and make it work. Then, you can add a bit of complexity.

When GANs generate images, they see the image in its whole. Can't you just do the same with your MIDI files? You can see MIDI files as a flattened image.

The idea of the copy memory task is to test that the network can remember a sequence over the course of many steps. You probably don't need that. TCN is a network that takes into input a sequence and outputs whatever you want. Also you probably don't need the temporal feature (unless you want to streaming your output one note at a time).

You would have the generator: random_number -> array of 128 elements.
The discriminator would take this array of 128 elements and outputs 0 or 1, depending whether it thinks it's a legit music or not.

DefinitlyEvil commented 5 years ago

Ah, thank you, now I think I understood more about GAN networks. I think I'm gonna do more experiences, thanks! ;)

philipperemy commented 5 years ago

Great ;)

Luux commented 5 years ago

Just wondering: what would be the best way if you want to replace a stateful lstm? Using a "cache"/sliding window of size (2 * receptive_field - 1) and predict every (receptive_field) timesteps? This way the last new data point still has the entire history that is part of the receptive field. You would have a delay of (receptive field size) timesteps. Or am I wrong? Is there any elegant way to do this for a single timestep while not having to throw away all the information calculated for the previous ones?

philipperemy commented 5 years ago

@Luux yes I know what you mean. At the moment the computations are lost between two calls. In a stateful LSTM you keep the states (c, h) in memory. But in a TCN you dont have such states. So stateful or stateless does not really make sense for TCN. However, we could always cache some computations instead of recomputing all of them. But Im not sure on how to do that. It would definitely be difficult in my opinion. I dont have a clear idea how to replace a stateful LSTM.

Luux commented 5 years ago

@philipperemy Yep, that's the problem. I think that such an approach could be an improvement to State-of-the-Art implementations when it comes to real time prediction of temporal data, where stateful LSTM are still the way to go as far as I know.

I have no idea if this is even possible with keras; I think it should be somehow if you take pure tensorflow, but an implementation of this is currently beyond my knowledge.

philipperemy commented 5 years ago

@Luux yes I agree with you. A pure tensorflow implementation is definitely do-able. In Keras, I'm not even sure it's possible. I guess it would be but would require a lot of work.

beasteers commented 5 years ago

If I recall correctly, I've seen some wavenet implementations (and I think the original paper too?) that use a FIFO queue as a way of tracking states for the generative model. I'll have to see if I can track down the repo that I'm thinking of, I don't see it right now.

Update: Just found one: https://github.com/ibab/tensorflow-wavenet/blob/3c973c038c8c2c20fef0039f111cb04139ff594b/wavenet/model.py#L451

I'm not sure how queues play with backprop+Keras so it may or may not integrate well

beasteers commented 5 years ago

Here's a PyTorch implementation that seems to use a (custom implemented) DilatedQueue class alongside Conv1D layers (instead of the hand-rolled convolutions from the tf implementation) so it might be a useful example.

Luux commented 5 years ago

@beasteers You probably forgot the link :P It's this one I guess? https://github.com/vincentherrmann/pytorch-wavenet/blob/master/wavenet_model.py

Luux commented 5 years ago

@philipperemy @beasteers I don't have the time this or next week to further look into this (maybe after that :P), but the implementation @beasteers found leads to this paper that proposes special queues for caching already calculated activations in Wavenet: https://arxiv.org/abs/1611.09482 I've only had a quick look at this by now, but that could be the one we're looking for.

beasteers commented 4 years ago

Hey @philipperemy @Luux! I hope you've been well. :)

My work diverged a bit, but now I'm starting to look back on TCNs and this was one of the issues that I wanted to get implemented.

Currently, I'm trying to train some TCN layers on top of a fairly large CNN that processes in 1 second chunks. I'm aiming for a larger receptive field (about 10-15 seconds or so), but my coworker has been expressing concern about memory constraints from duplicating the model that many times. So that got my mind back on the idea of stateful TCNs.

Working off of: https://github.com/tomlepaine/fast-wavenet/blob/master/wavenet/models.py

NOTE: They only use queues for generating - does it make sense to allow them for training as well?

I'm trying to formulate it in a way that doesn't require duplicating the existing architecture definition, so I've started on a Conv1D drop-in replacement.

But the outstanding problems/questions are:

tf.fifoqueue has no way of getting values from the queue without dequeueing. This is a problem because the current assumption in the above library is that the input is a single sample, and the kernel size is 2. But to support one sample input with a kernel size of 3, you'd have to dequeue 2, but only step forward 1.
How do we handle return_sequences? Currently this returns sample by sample
When do we reset states? I'm not sure how LSTMs control this, but that's probably the place to look
Is extending Conv1D the right approach? this seems like the least opinionated way of doing it, but idk if there are benefits of doing it differently

please let me know your thoughts! 🙃

import tensorflow as tf

class QueueConv1D(tf.keras.layers.Conv1D):
    def reset_states(self):
        self.q.enqueue_many(tf.zeros(self.dilation_rate + tuple(self.kernel.shape)))

    def build(self, input_shape):
        super().build(input_shape)
        # initialize queue
        self.q = tf.queue.FIFOQueue(
            self.dilation_rate[0], dtypes=tf.float32, shapes=self.kernel.shape)
        self.reset_states()

    def call(self, inputs):
        # valid for kernel_size in (1, 2)
        outputs = tf.matmul(inputs, self.kernel[0])
        if self.kernel_size[0] > 1:
            outputs += tf.matmul(self.q.dequeue(), self.kernel[1])
            self.add_update(self.q.enqueue([inputs]))

        # TODO: how to support other kernel/input sizes????
        #       but we can't get values from the queue without dequeueing so this doesn't work...
        # outputs = (
        #     tf.matmul(inputs, self.kernel[0])[None] + 
        #     tf.matmul(self.q[:len(self.kernel) - 1], self.kernel[1:])
        # )
        # self.add_update(self.q.enqueue([inputs]))

        if self.use_bias:
            dformat = 'NCHW' if self.data_format == 'channels_first' else 'NHWC'
            outputs = tf.nn.bias_add(outputs, self.bias, data_format=dformat)

        if self.activation is not None:
            return self.activation(outputs)
        return outputs

Other notes:

I'm not sure if we should be building keras layers using eager mode or if we should still be using tf.keras.backend.function and whatnot idk - eager mode confuses me

philipperemy / keras-tcn

statefull? #33