hrantzsch / signature-embedding

A Deep Metric Learning Approach to Signature Verification
GNU General Public License v3.0

Question about loss calculation #1

Open Nbeleski opened 6 years ago

Nbeleski commented 6 years ago

For the MNIST implementation, you have the last layer of the network with 10 outputs (one for each class).

How would that work with signature-embedding, where there isn't an explicit number of classes?

(Based on the paper) Is the loss calculated over the 128 outputs of the last fully connected layer?

Thank you!

hrantzsch commented 6 years ago

The loss isn't calculated directly on the output of the DNN, and the network does not output a specific "class" for a signature.

Instead, the DNN (in the MNIST script, that is the MLP) is used to embed the input samples (either digits or signatures) into Euclidean space. That means the number of outputs of the final fully-connected layer of the DNN is really the number of dimensions of the space we embed the sample in.

The loss is not computed on a single DNN output, but by comparing the distances between multiple (three) embedded samples. Fig. 1 in the paper gives you an overview. This way of calculating the loss is the essence of deep metric embedding.

So if you then want to use the trained model, it doesn't really tell you a class for an input sample. It just embeds it into Euclidean space. This, however, allows you to compare it (by distance) with another embedded sample, indicating how similar they are. The idea, of course, is that samples from the same author should be similar.

You can find a more detailed explanation of deep metric embedding in Hoffer and Ailon: Deep metric learning using triplet network. Another good example of how it can be used is Schroff et al.: FaceNet.

Did that help?

Nbeleski commented 6 years ago

I've understood how the loss is calculated, and the paper was indeed incredibly helpful in that regard. More reading on the matter is always welcome; thank you for the links. About this bit:

> That means the number of outputs of the final fully-connected layer of the DNN is really the number of dimensions of the space we embed the sample in.

and,

> Usually in convolutional networks, a subsequent fully-connected layer is used for classification. In our net this layer is removed, as we are interested in a feature embedding only (Hoffer and Ailon).

These bits were the parts I was missing. If you don't mind me asking: would a higher-dimensional embedding space (more outputs) differentiate inputs better? I have seen implementations for the MNIST dataset with 2 outputs on the final layer, but I would guess that more are necessary for signatures.

This was helpful, yes. Thank you very much. It seems my confusion was much simpler than I thought, yet I'm happy I got in touch.

I am currently trying my hand at implementing a signature-matching DNN in Keras as an exercise, and the triplet embedding network struck me as a very interesting approach.

Thank you again!

hrantzsch commented 6 years ago

That is exactly right. For MNIST, 2 output dimensions could be enough. This has the additional advantage that it's easy to plot the embedding, so you can see whether nice clusters form. (To visualize a higher-dimensional embedding you could use PCA.)

I'm glad I could help you, good luck with your project :)