tristandeleu / ntm-one-shot

One-shot Learning with Memory-Augmented Neural Networks

Input data #3

Closed: link-er closed this issue 7 years ago

link-er commented 8 years ago

Hello! I am trying to implement this model too, in TensorFlow, but I am really lost at several points.

First of all, can you please share how you organized your input data? From the paper I understood that unique classes are chosen for each episode, with 10 examples of each. So is it like getting 50 examples (5 classes with 10 images each) per episode? As far as I can see from your implementation, you did not choose random examples every time. And by episode I mean that after it the memory is wiped and the accuracy is calculated.

tristandeleu commented 8 years ago

Each episode is sampled from the OmniglotGenerator generator using the OmniglotGenerator.sample() method. It works on the Omniglot data organized on disk like this:

omniglot
|-- Alphabet_of_the_Magi
    |-- character01
        |-- 0709_01.png
        |-- 0709_02.png
        |-- ...
        |-- 0709_20.png
    |-- character02
    |-- ...
    |-- character20
|-- Anglo-Saxon_Futhorc
    |-- character01
    |-- ...
|-- Arcadian
    |-- character01
    |-- ...
|-- ...

That being said, this method is not quite the proper way to sample episodes either. Here I am always sampling 50 images per sequence inside the episode. However, to prevent the model from just learning to count, one should sample between 10 and 20 images of the 5 characters for each sequence (not always the same number 10).
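For illustration, here is a rough sketch of that kind of sampling; sample_episode and its arguments are hypothetical names, not the actual OmniglotGenerator API, and drawing the per-class count at random is just one way to avoid a fixed count of 10:

import os
import random

def sample_episode(character_folders, nb_classes=5, min_samples=10, max_samples=20):
    # pick the classes used in this episode
    episode_classes = random.sample(character_folders, nb_classes)
    sequence = []  # list of (label, image_path) pairs
    for label, folder in enumerate(episode_classes):
        # draw a *random* number of examples per class, so the model
        # cannot simply learn how many times each class appears
        nb_samples = random.randint(min_samples, max_samples)
        filenames = os.listdir(folder)
        chosen = random.sample(filenames, min(nb_samples, len(filenames)))
        sequence.extend((label, os.path.join(folder, name)) for name in chosen)
    random.shuffle(sequence)  # interleave the classes inside the sequence
    return sequence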

link-er commented 8 years ago

Hm, it might be that I understood the dataset completely wrong. What is considered to be the label: the character number or the alphabet? It is just that I have a different number of characters in the different language folders, so if I choose, say, the 25th character I will not be able to sample it from every alphabet. Or is that ok?

tristandeleu commented 8 years ago

A class is a character number from an alphabet. To pick a character, you sample from all the characters of all the alphabets. Inside self.character_folders you have something like:

['data/omniglot/Mkhedruli_(Georgian)/character30',
 'data/omniglot/Arcadian/character23',
 'data/omniglot/Malay_(Jawi_-_Arabic)/character25',
 'data/omniglot/Ojibwe_(Canadian_Aboriginal_Syllabics)/character05',
 'data/omniglot/Mkhedruli_(Georgian)/character37',
 'data/omniglot/Gujarati/character36',
 'data/omniglot/Malay_(Jawi_-_Arabic)/character22',
 'data/omniglot/Futurama/character02',
 'data/omniglot/Anglo-Saxon_Futhorc/character19',
 'data/omniglot/Japanese_(hiragana)/character16',
 'data/omniglot/Anglo-Saxon_Futhorc/character17',
 'data/omniglot/Japanese_(hiragana)/character37',
...
 'data/omniglot/Balinese/character19',
 'data/omniglot/Japanese_(katakana)/character10',
 'data/omniglot/Tifinagh/character50',
 'data/omniglot/Sanskrit/character39']
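For what it's worth, such a flat list can be built with something along these lines (just a sketch; data/omniglot is assumed to be the root directory shown in the tree above):

import os

data_root = 'data/omniglot'
# one class per (alphabet, character) directory, across all alphabets
character_folders = [os.path.join(data_root, alphabet, character)
                     for alphabet in sorted(os.listdir(data_root))
                     for character in sorted(os.listdir(os.path.join(data_root, alphabet)))]

The label an image gets inside an episode is then just the index of its class among the 5 sampled for that episode, independent of the character number within its alphabet.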

link-er commented 8 years ago

Oh, that is it! Thanks.

One more question, about training. I was not able to implement batch training; I am calculating the error for every input and propagating it right away. Can this affect the process badly? I.e. is it critical to compute the cost over the whole batch and only then apply the gradients?

tristandeleu commented 8 years ago

I don't think I quite get the way you train the model.

If you update the weights after each sequence in an episode separately, then you lose the advantage of batch learning (i.e. less noisy estimates of the gradients). Updating the weights given all the sequences in the episode should not be too difficult if you use something like tf.scan (or even tf.dynamic_rnn maybe?).
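To make that concrete, here is a toy sketch of the tf.scan pattern (TF 1.x style; the names and sizes are made up, and the read/write rule below is deliberately simplified, it is not the actual MANN addressing). The per-sequence state (memory, usage) is carried through the scan with a leading batch dimension, and a single cost over the whole episode drives one gradient step:

import tensorflow as tf  # TF 1.x API, matching the snippets in this thread

# made-up sizes, just for the sketch
batch_size, seq_len = 16, 50
input_dim, nb_classes = 20 * 20 + 5, 5
mem_rows, mem_cols = 128, 40

inputs = tf.placeholder(tf.float32, [seq_len, batch_size, input_dim])   # time-major
targets = tf.placeholder(tf.int32, [seq_len, batch_size])

# the only trainable parameters of this toy model (a linear "controller")
W_out = tf.get_variable('W_out', [input_dim + mem_cols, nb_classes])
b_out = tf.get_variable('b_out', [nb_classes])
W_key = tf.get_variable('W_key', [input_dim, mem_cols])

def step(state, x_t):
    # one time step, applied to all batch_size sequences in parallel
    memory, usage, _ = state                        # (B, R, C), (B, R), previous logits
    key = tf.matmul(x_t, W_key)                     # (B, C) read/write key
    read_w = tf.nn.softmax(tf.einsum('brc,bc->br', memory, key))    # (B, R)
    read = tf.einsum('br,brc->bc', read_w, memory)                  # (B, C)
    # write the key into the least-used slot of each sequence's *own* memory
    lu = tf.one_hot(tf.argmin(usage, axis=1), mem_rows)             # (B, R)
    memory = memory * (1. - lu[:, :, None]) + lu[:, :, None] * key[:, None, :]
    usage = 0.95 * usage + read_w + lu
    logits = tf.matmul(tf.concat([x_t, read], axis=1), W_out) + b_out
    return (memory, usage, logits)

init_state = (tf.zeros([batch_size, mem_rows, mem_cols]),   # memory
              tf.zeros([batch_size, mem_rows]),             # usage weights
              tf.zeros([batch_size, nb_classes]))           # dummy initial logits
_, _, all_logits = tf.scan(step, inputs, initializer=init_state)    # (T, B, nb_classes)

# one cost for the whole episode, hence one (less noisy) gradient step per episode
cost = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=all_logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(cost)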

link-er commented 8 years ago

Actually I am almost sure that my way of training is not correct, but I just cannot figure out how to implement it correctly with TF.

What I am doing now is simply taking one (image, shifted label) pair after another and minimizing the cost:

for epoch in range(training_epochs):
    # placeholders (train_pairs, labels, memory, ...) and the session are set up elsewhere
    store_memory = np.zeros((mem_height, mem_width))
    sum_cost = 0.
    sum_acc = 0.
    # get the data
    batch_x, batch_y = generate_data(batch_size)
    # shift the labels by one time step, prepending a zero vector
    shifted_y = np.vstack((np.zeros(n_classes), batch_y[:-1]))
    store_u_weights = np.random.rand(mem_height)
    store_r_weights = np.random.rand(mem_height)
    min_index = 0
    store_least_used = np.zeros(mem_height)

    for i in range(batch_size):
        # find the n-th smallest element, where n is number_of_reads = 1
        min_index = np.argmin(store_u_weights)
        store_least_used = np.zeros(mem_height)
        store_least_used[min_index] = 1

        # prior to writing to memory, the least used memory location is
        # computed and set to zero
        store_memory[min_index] = np.zeros(mem_width)

        feed_dict = {
            train_pairs: np.concatenate((batch_x[i], shifted_y[i])),
            labels: batch_y[i],
            keep_prob: 0.5,
            p_least_used: store_least_used,
            p_u_weights: store_u_weights,
            p_r_weights: store_r_weights,
            memory: store_memory}

        store_memory, store_u_weights, store_r_weights, _, c, acc = sess.run(
            [new_memory, u_weights, r_weights, optimizer, cost, accuracy],
            feed_dict=feed_dict)
        sum_cost = sum_cost + c
        sum_acc = sum_acc + acc

    print "Epoch", epoch, "loss:", sum_cost / (batch_size * 1.0), ", accuracy:", sum_acc / (batch_size * 1.0)

So everything is updated after each example in the batch (the batch_size examples of the unique classes). And actually I do not quite understand how it could be done any other way: shouldn't the memory be changed after each seen input?

tristandeleu commented 8 years ago

Each sequence in an episode has its own memory. Contrary to your store_memory, the memory is not shared between the sequences inside an episode. The way I did that was to consider the memory as a 3D Tensor with shape (batch_size, memory_size[0], memory_size[1]) (instead of a matrix with shape (memory_size[0], memory_size[1])).

Likewise, the different weight vectors and the read vector are not shared inside an episode and are matrices with shape (batch_size, ...) (all these elements are initialized here, and are updated as one processes the batch_size sequences in an episode in parallel).
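To make the shapes concrete, here is a sketch of that per-episode state (the sizes and initial values are only illustrative, not the repository's exact initialization):

import numpy as np

batch_size = 16          # sequences processed in parallel in one episode
mem_rows, mem_cols = 128, 40
nb_reads = 4

# one copy of every piece of state per sequence in the episode
memory        = np.zeros((batch_size, mem_rows, mem_cols))   # 3D instead of a 2D matrix
read_weights  = np.zeros((batch_size, nb_reads, mem_rows))
usage_weights = np.zeros((batch_size, mem_rows))
read_vectors  = np.zeros((batch_size, nb_reads, mem_cols))

Every update rule then works along the leading batch dimension; for instance, the least-used location is found per sequence with np.argmin(usage_weights, axis=1).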

link-er commented 8 years ago

But what is the reason to have separate memories if the classes are the same? As I understood from the paper, the reason to wipe the memory is that the classes differ between episodes, but within one episode all the sequences have the same 5 classes, right?

> The way I did that was to consider the memory as a 3D Tensor with shape (batch_size, memory_size[0], memory_size[1]) (instead of a matrix with shape (memory_size[0], memory_size[1])).

But how is the sequence processed then? The memory for one sequence should be changed after each input, and the least-used array as well as all the weights should be updated at every step. How are you getting the summed-up loss afterwards? I can work with a batch, but I do not get how to work with a sequence.

And by the way, the weights and the memory are not things that should be updated with backpropagation, right? The NTM paper says that they are updated too, but in this paper, as I understood it, they are just computed every time by the corresponding formulas.

tristandeleu commented 8 years ago

Indeed the classes are the same, but the sequences are actually different examples in an episode. Think about it as the way you would train a standard RNN with mini-batches (e.g. an LSTM). The model processes the sequences in a mini-batch in parallel, and updates the hidden states in parallel as well. But these hidden states (and memory cells for an LSTM) are not shared between the sequences inside the mini-batch; each sequence has its own hidden state that is being updated.

The same holds here. Except that instead of having simple objects like the hidden state (a vector), you have to update something more complex like a memory matrix.

> And by the way, the weights and the memory are not things that should be updated with backpropagation, right? The NTM paper says that they are updated too, but in this paper, as I understood it, they are just computed every time by the corresponding formulas.

What you update with backpropagation are the parameters of the model, like the weight matrices in your controller (the LSTM), as well as the weight matrices and biases used to update the read and write weights. These parameters are all gathered here (the names are not always too explicit, unfortunately).

This should not be confused with updating the memory and read/write/least used weight vectors during the forward pass (this is totally different from backpropagation, and similar to updating the hidden states in a standard RNN).
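In TF 1.x terms the distinction looks roughly like this (a sketch with made-up names and sizes): only the tf.Variables are touched by the optimizer, while the memory and the read/write/least-used weights are plain tensors recomputed during the forward pass, just like the hidden state of an ordinary RNN.

import tensorflow as tf  # TF 1.x

# updated by backpropagation: created once as variables, modified by the optimizer
W_controller = tf.get_variable('W_controller', [405, 200])   # e.g. input -> controller
W_key = tf.get_variable('W_key', [200, 40])                  # controller -> read/write key
b_key = tf.get_variable('b_key', [40])

# NOT updated by backpropagation: ordinary tensors, recomputed at every time step
# of the forward pass (these are the initial values of the per-episode state)
memory_0 = tf.zeros([16, 128, 40])
usage_0 = tf.zeros([16, 128])
read_0 = tf.zeros([16, 128])

# only the three variables above show up here, never the memory or the weight vectors
print(tf.trainable_variables())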

link-er commented 7 years ago

Thanks a lot! You really helped me understand how it works.