TLDR: try an approach like the one in this paper, and if you really want to use the SiamFC, post the code that is giving you the NaNs. Have you made sure that your images are in the correct range? The network expects pixels in the range [0, 1], not [0, 255].
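For the range check, a minimal sketch of the rescaling (assuming PyTorch and that your loader gives you uint8 images; not code from this repo) would be:

```python
import torch

def to_unit_range(img_uint8: torch.Tensor) -> torch.Tensor:
    """Convert a uint8 image tensor in [0, 255] to float32 in [0, 1]."""
    return img_uint8.float() / 255.0

# Quick sanity check before feeding the network:
# assert img.min() >= 0.0 and img.max() <= 1.0
```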
Warning: Big text ahead
First of all, I don't believe this approach is the most appropriate for your problem. The SiamFC has a tradeoff between its semantic and spatial information capacity. To understand what I mean, see the following picture, which illustrates how the network maps between the input (left) and its embedding (right). Since the network is fully convolutional, the embedding is a 2D mapping, and each element is a function of a restricted region of the input (shown with the semitransparent colored masks on the input). The size of these regions is the receptive field of the output layer, which depends on the network's parameters and can be changed.

Each element of the embedding therefore only "sees" a part of the input and can only learn from that restricted region, so the smaller the receptive field, the more restricted the kind of semantic categories it can learn. Simplifying a lot, in the image we could say that the network might be capable of learning to discriminate the head or the belly of a zebra, but not the whole zebra, since the receptive field is not big enough to cover the zebra completely. By contrast, the network preserves some of the spatial information about the input image. Even though we don't know what output the network will give for the head, neck and belly of the zebra, we know that in the embedding they will be ordered just as they are ordered in the input.

The larger the receptive field, the more semantic information the network might be able to learn, but the smaller the size of the embedding (I'll let you figure out why; hint: think of a fully convolutional net as a single conv layer with a kernel the size of the receptive field), and the less spatial information the embedding retains. You can see this phenomenon in the following image, where we change the receptive field of the network (phi): when the receptive field is equal to the image size, the embedding is a single vector instead of a 2D mapping of vectors.
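To make the size tradeoff concrete, here is a rough sketch of how the receptive field and the embedding size both follow from the conv layers. The layer parameters below are made up for illustration; they are not the actual SiamFC backbone:

```python
def conv_output_size(input_size, kernel, stride):
    """Spatial size of a conv layer's output (no padding)."""
    return (input_size - kernel) // stride + 1

def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride) conv layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Hypothetical backbone: three conv layers given as (kernel, stride).
layers = [(11, 2), (5, 1), (3, 1)]
size = 127  # example input size, just for illustration
for k, s in layers:
    size = conv_output_size(size, k, s)

print("embedding is", size, "x", size)                  # still a 2D mapping
print("receptive field is", receptive_field(layers), "px")
# Growing the receptive field up to the full input size collapses the
# embedding toward a single 1x1 vector, which is where the spatial
# information is lost.
```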
One of the points of the SiamFC architecture is that the correlation operation is highly dependent on the spatial information, and it can work well even if the network hasn't "learned" to encode that much semantic information. But that assumes that most of the scene is not going to change between your template and your search image. If you're going to compare objects of the same class in different scenes, with very different poses, etc., what you want is a network with some sort of semantic invariance, like in the original FaceNet paper, where faces from the same subject are mapped near each other, independently of pose, background, etc.
In other words: the tracking problem assumes that the appearance of the object doesn't change that much between close frames. If your problem doesn't have this assumption, I would suggest a similarity function based on the Euclidean distance between the embeddings instead of SiamFC's correlation-based one.
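If you go the embedding-distance route, a minimal sketch could look like the following. Here `backbone` is a placeholder for whatever network you train that maps an image to a single embedding vector; it is an assumption, not something provided by this repo:

```python
import torch
import torch.nn.functional as F

def same_object_distance(backbone, img_a, img_b):
    """Euclidean distance between L2-normalised embeddings.
    Smaller distance -> more likely to be the same object."""
    with torch.no_grad():
        emb_a = F.normalize(backbone(img_a.unsqueeze(0)), dim=1)
        emb_b = F.normalize(backbone(img_b.unsqueeze(0)), dim=1)
    return torch.norm(emb_a - emb_b, dim=1).item()
```

You would then pick a threshold on this distance (validated on held-out pairs) to decide "same object" vs "different object", which is essentially what FaceNet does for face verification.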
As you said, SiamFC is not suitable for this type of problem, so I intend to give up this approach. The RepMet you suggested is exactly what I want to use; I will try it.
Thank you very much for your detailed answers and suggestions.
I want to use this project to implement an object matching function: compute the features of two images, as in face recognition, and then get a similarity measure to judge whether they show the same object.
I calculate the score map directly on the two object images, and I often get results of NaN, inf and so on. I saw that the last convolution of the model uses the features of the template image as weights to convolve the input image, which produces very large values.
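What I mean is roughly the following operation (a sketch in PyTorch of my understanding, not the exact code from the repo): the template embedding is used as the convolution kernel over the search embedding.

```python
import torch
import torch.nn.functional as F

def score_map(template_feat, search_feat):
    """Cross-correlate the template embedding over the search embedding.
    template_feat: (1, C, h, w), search_feat: (1, C, H, W)."""
    # Every score is a sum of C*h*w products, so if the inputs were never
    # scaled down (e.g. still in [0, 255]), these sums get huge and can
    # overflow to inf/NaN.
    return F.conv2d(search_feat, template_feat)
```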
Do you have any good suggestions? Thank you.