tensorfreitas / Siamese-Networks-for-One-Shot-Learning

Implementation of Siamese Neural Networks for One-shot Image Recognition

Siamese-Networks-for-One-Shot-Learning

I created this repository to familiarize myself with one-shot learning. The code uses the Keras library and the Omniglot dataset, and attempts to implement the paper Siamese Neural Networks for One-shot Image Recognition by Koch et al.

One-Shot Learning

Most current deep learning models need thousands of labeled samples per class, and data acquisition for most tasks is very expensive. Models that can learn from one or a few samples are therefore much more attractive than models that require acquiring and labeling thousands of samples. One could argue that a young child can learn many concepts without needing a large number of examples. This is where one-shot learning comes in: the task of classifying when only one example of each possible class is available in each test task. This ability to learn from little data could be useful in many machine learning problems.

Although the paper focuses on images, the concept can be applied to many fields. To fully understand the problem, we should describe what counts as a one-shot task. Given a test sample X, a one-shot task aims to classify it into one of N categories. To do so, the model is given a support set containing one sample for each of the N unique categories (an N-way one-shot task) and must decide which class the test sample belongs to. Note that none of the samples used in a one-shot task have been seen by the model during training (the categories are different in training and testing).
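As a minimal sketch of how an N-way one-shot task could be resolved, the snippet below compares the test sample against every support sample and returns the label of the best match. The names `one_shot_classify` and `toy_similarity` are hypothetical and not taken from this repo; the toy similarity on plain numbers only stands in for a trained model.

```python
def one_shot_classify(test_sample, support_set, similarity):
    """Return the label of the support sample most similar to test_sample.

    support_set is a list of (sample, label) pairs, one per category.
    similarity is any callable that returns a higher score for more
    alike inputs (for example, the output of a trained Siamese net).
    """
    best_label, best_score = None, float("-inf")
    for sample, label in support_set:
        score = similarity(test_sample, sample)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy stand-in for a trained model (an assumption for illustration):
# numbers that are closer together count as more similar.
def toy_similarity(a, b):
    return -abs(a - b)


support = [(1.0, "A"), (5.0, "B"), (9.0, "C")]  # a 3-way one-shot task
prediction = one_shot_classify(5.4, support, toy_similarity)
print(prediction)
```

The classifier itself never needs to know the categories in advance; it only needs a similarity function, which is why the support categories can be unseen during training.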

The Omniglot dataset is frequently used to evaluate models on one-shot learning tasks. Let's take a closer look at this dataset, since it is the one used in the paper (MNIST was also tested, but we will stick with Omniglot).

Omniglot Dataset

The Omniglot dataset consists of 50 different alphabets, 30 in a background set and 20 in an evaluation set. Each alphabet has between 14 and 55 characters, each drawn by 20 different subjects, resulting in 20 images of 105x105 pixels per character. The background set should be used in training, for hyperparameter tuning and feature learning, leaving the final results to the remaining 20 alphabets, which are never seen by the models trained on the background set. Despite this, the paper uses 40 background alphabets and 10 evaluation alphabets.

This dataset is considered a sort of transpose of MNIST: the number of possible classes is considerably higher than the number of training samples per class, making it suitable for one-shot tasks.

The authors use a 20-way one-shot task to evaluate performance on the evaluation set. For each alphabet, 40 different one-shot tasks are performed, for a total of 400 tasks across the 10 evaluation alphabets. An example of a one-shot task in this dataset can be seen in the following figure:

One-Shot Task
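The evaluation protocol described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the sampling code of this repo or of the paper: `sample_one_shot_task` and `toy_alphabet` are hypothetical names, drawings are represented by strings, and the alphabet is assumed to contain at least as many characters as the number of ways.

```python
import random


def sample_one_shot_task(alphabet, n_way, rng):
    """Build one N-way one-shot task from a single alphabet.

    alphabet maps a character name to a list of drawings of that character.
    Returns (test_drawing, true_label, support_set), where support_set
    holds exactly one drawing for each sampled character.
    """
    characters = rng.sample(sorted(alphabet), n_way)
    true_label = rng.choice(characters)
    # Two different drawings of the true character: one to classify and
    # one to represent its class in the support set.
    test_drawing, true_support = rng.sample(alphabet[true_label], 2)
    support_set = []
    for name in characters:
        if name == true_label:
            drawing = true_support
        else:
            drawing = rng.choice(alphabet[name])
        support_set.append((drawing, name))
    return test_drawing, true_label, support_set


# Hypothetical toy alphabet: 25 characters with 20 drawings each.
toy_alphabet = {
    f"char{c:02d}": [f"char{c:02d}_drawer{d:02d}" for d in range(20)]
    for c in range(25)
}

rng = random.Random(0)
tasks = [sample_one_shot_task(toy_alphabet, 20, rng) for _ in range(40)]
total_tasks = len(tasks) * 10  # 40 tasks for each of 10 evaluation alphabets
print(total_tasks)
```

One-shot accuracy is then the fraction of tasks in which the model assigns the test drawing to the support sample that shares its label.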

Let's dive into the methodology proposed by Koch et al. to solve this one-shot task problem.

Methodology

To solve this problem, the authors propose the use of deep convolutional Siamese networks. Siamese nets were introduced by Bromley and Yann LeCun in the 90s for a verification problem. They consist of two twin networks that accept distinct inputs but are joined by an energy function that computes a distance metric between the outputs of the two nets. The weights of both networks are tied, so they compute the same function. In this paper, the weighted L1 distance between the twin feature vectors is used as the energy function, combined with a sigmoid activation.
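To make the energy function concrete, here is a small pure-Python sketch of a weighted L1 distance followed by a sigmoid. The `embed` function is a hypothetical stand-in for the shared twin network, and the weights and the bias term are made-up values chosen for illustration; in the real model they would be learned.

```python
import math


def siamese_score(features_a, features_b, weights, bias=0.0):
    """Sigmoid of the weighted L1 distance between two feature vectors."""
    weighted_sum = sum(
        w * abs(a - b) for w, a, b in zip(weights, features_a, features_b)
    )
    return 1.0 / (1.0 + math.exp(-(weighted_sum + bias)))


def embed(x):
    # Stand-in for the twin network. Both inputs go through this SAME
    # function, which is what tying the weights amounts to.
    return [x, x * x]


weights = [-2.0, -1.0]  # hypothetical learned weights
bias = 3.0              # hypothetical bias (an assumption of this sketch)

same = siamese_score(embed(0.5), embed(0.5), weights, bias)
different = siamese_score(embed(0.5), embed(2.0), weights, bias)
print(same, different)
```

Because the absolute difference is symmetric and both inputs share one embedding function, swapping the two inputs yields the same score.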

This architecture seems to be designed for verification tasks, and this is exactly how the authors approach the problem.

In the paper, a convolutional neural net is used. Three blocks of Conv-ReLU-Max Pooling are followed by a Conv-ReLU block connected to a fully-connected layer with a sigmoid activation. This layer produces the feature vectors that are fused by the weighted L1 distance layer. The output is fed to a final layer that outputs a value between 0 and 1 (different class or same class). To find the best architecture, Bayesian hyperparameter tuning was performed. The best architecture is depicted in the following image:

best_architecture
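Since the figure may not be visible here, the sketch below traces how the spatial size of a 105x105 input shrinks through the convolutional stack. The filter counts and kernel sizes are recalled from Koch et al. and are not stated in this text, so treat them as assumptions; unpadded stride-1 convolutions and 2x2 max pooling are assumed as well.

```python
def conv_output_size(size, kernel, stride=1):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1


def pool_output_size(size, pool=2):
    """Spatial size after non-overlapping max pooling."""
    return size // pool


# (filters, kernel size, followed by 2x2 max pooling?)
# Assumed values, recalled from the paper rather than from this README.
layers = [(64, 10, True), (128, 7, True), (128, 4, True), (256, 4, False)]

size = 105  # Omniglot images are 105x105
shapes = []
for filters, kernel, pooled in layers:
    size = conv_output_size(size, kernel)
    if pooled:
        size = pool_output_size(size)
    shapes.append((filters, size, size))

# Number of values fed into the fully-connected layer.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flattened)
```

Under these assumptions the last convolutional block produces 256 feature maps of 6x6, which are flattened before the fully-connected layer that outputs the feature vector.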

L2 regularization is used in each layer, and stochastic gradient descent with momentum is used as the optimizer. As previously mentioned, Bayesian hyperparameter optimization was used to find the best values for the following parameters:

The following details were used for training:

Implementation Details

Compared to the original paper, there are some differences in this implementation, namely:

Code Details

There are two main files to run the code in this repo:

Both files store the TensorFlow curve logs, which can be viewed in TensorBoard (in a logs folder that is created). The models with the highest validation one-shot task accuracy are also saved in a models folder, so the best models are kept.

Regarding the rest of the code:

Notes:

References

Credits

I would like to give credit to a blog post that introduced me to this paper when I was searching for Siamese networks. The blog post also includes code for this paper, although it differs from this repo in some respects (the Adam optimizer is used, and the layer-wise learning-rate option is not available). It is a great blog post, go check it out: