bunch of questions - Githubissues

I appreciate your quick response. I successfully finished the training and got O1 score arount 66.0 for CUB dataset. I think it makes sense for now.

I actually came up with some other curiosities during the process.

Can I change backbone with pretrained model?
- I am now suceesfully training the Horde model with Resnet backbone, but not sure if it is the correct way. (I am using tf.keras.application.ResNet50 model and load imagenet pretrained weight as backbone.)
- Is it requiring specified trained model for loading? -> I am now suceesfully training the Horde model with Resnet backbone, but not sure if it is the correct way.
After training, I am planing to add custom dataset class and train & test with it. I am wondering if there exists testing script that I can refer as well. If not, will it be something like using mdl.predict after loading Horde model?
- also, when implementing custom dataset, could you tell me if there is certain standard to set options like number of high order moments, images per class, classes per batch...etc?
I can see that the code is splitting train and test set using their labels. As I have understood, for example, in MNIST, the model is using 0 to 4 labeled images to train and using 5 to 9 labeled images for testing. If it is the case, how they can measure performance? Is it the process like making 'gallery' with the certain amount of test dataset and making 'probe' with rest amount of test dataset and get R@K of them?

Sorry for lots of question at a same time. Again, thank you so much for the advice you have given me.

You’re welcome !

Here are the answers to your questions:

Can I change backbone with pre-trained model?

Yes, you can use other pre-trained backbone. In the case of ResNet, you must drop the classification layers (set “include_top” to False). On top of it, you must add a global pooling (e.g., global average pooling or global max pooling) to output a single vector. Optionally, you might add a Dense layer if you want to reduce the size of the embedding (with ResNet50, usually we use 128 or 512 in metric learning) and a L2Normalisation layer (it helps convergence and avoid norms to go to infinity). Note that if you don't add the L2Normalisation layer, you should change the similarity from 'cosine' to 'l2' in the training code.

The only requirement is that your network outputs a single vector. Otherwise, you can use any other networks, pre-trained or not.

Custom dataset class and train & test with it

To add a custom dataset, the simple way is to create a class that inherit from “RetrievalDb” (kerastools / databases / databases.py). Example of available datasets are the Cub or Cars dataset that you can take as an implementation reference. In practice, each function should return a numpy ndarray composed of either the raw images or the images paths, and the labels.

Such parameters are independent from the datasets and they usually work well with the given configuration. The only recommendation that I can give in about the number of class per batch and images per class: Try to have the larger batch size possible and nearly the same number classes and images per class. It tends to generalize better (as shown in recent metric learning papers). However, you might have datasets with fewer images per class. In that case, I recommend to set the number of images per class set to the minimum number of images from the least populated class in the dataset. It will avoid some issue where the metric is trained using two times the same image as the query and its positive example.

Splitting train and test set using their labels

Splitting classes between train and test is only due to the evaluation protocol of metric learning. For example, in person re-id, they have the same classes in train and test. Recall@K is computed by taking all “query_idx” to average the ranking scores. Then, the retrieval part is performed on the collection set using the “collection_idx” and the ranking is evaluated only on those images. With the current code, you can have the same classes as in the train set. However, remark that the collection set and the query set must not overlap if the option “queries_in_collection” is set to true.

I hope that answers your questions. If not, please feel free to comment again.

It is so helpful indeed.

I may have been misunderstanding about testing scenario. can you briefly explain once more about comments that you mentioned above, Recall@K is computed by taking all “query_idx” to average the ranking scores. Then, the retrieval part is performed on the collection set using the “collection_idx” and the ranking is evaluated only on those images.'? And what if I change the code like below?

test_indexes_query = dataset.get_queries_idx(db_set='test')
test_indexes_collection = dataset.get_collection_idx(db_set='train')

Question of testing

Before starting to test new dataset, I would like to make test script. Since the training generate .npy model file along with the epochs, it seems like that I need to load .npy weight after defining model. But when I checked the size of .npy file, it just had 800 bytes in it.
I found that you've said in the paper that we need to concat all the output of HORDE and reduce dimension of them into 512 using PCA when we test, but I could not find the process to deal with those things. (it may exist in code somewhere, My apologies.)
- When I tried define model and load weights form .h5 file, it always returned layer error that is saying 'trying to load a weight file containing 147 layers into a model with 143 layers (BNInception example)), and it seems like because we don't add Dense layer if ho_trainable is False. --> SO If I add concatenating layer as the paper mentioned, is it possible to load weight successfully?.
Can I kindly ask you to guide me to test the model?
- in shell script of source code, which arguments and lines should I properly use? -use_horde -use_abe -cascaded - ...etc it would be very useful if there is guide for setting argument for the test case

Could you take time to see those questions? I appreciate for your response.

Best regards,

Precision about "query_idx"

Let's say you have a test set composed of M images and the 50 first are from the query set. You have to set the query and collection indexes as follows:

test_indexes_query = np.arange(0:50)
test_indexes_collection = np.arange(50:M)
queries_in_collection = False

This uses the first 50 images to query the collection set, and the rest of the images (M-50) as the collection set. R@K is averaged only over the 50 query images.

Also, you cannot do the following lines:

test_indexes_collection = dataset.get_collection_idx(db_set='train') # Should raise error

because the evaluation is only done on the test set only, and it doesn't have access to the training images. Thus, it will take the above indexes into the test set, and might raise error.

If you really want to use training images during test (e.g., to use them inside the collection), your test set must contain the training images, and their respective indexes should be added into the 'test_indexes_collection '.

For example, if you have M images for the query set, and you want to use the N images from the training set into the collection set, you must change the 'get_testing_set' function with something like that:

def get_testing_set(self):
    x_train, y_train = self.get_training_set()
    x_test = np.concatenate([self.x_query, x_train], axis=0) # Now with size M + N
    y_test = np.concatenate([self.y_query, y_train], axis=0) # Now with size M + N

    return x_test, y_test

In the training script, you should change 'test_indexes_query', 'test_indexes_collection' and 'queries_in_collection' as follows:

test_indexes_query = np.arange(0:M) # The first M images are used into the query set
test_indexes_collection = np.arange(M:M+N) # The rest are used into the collection set
queries_in_collection = False

Or by adding these lines into 'get_collection_inx' and 'get_queries_idx' to avoid changes in the training script.

Testing

Test Script

If you simply want to test the model without training, you have to make a new test script (which is mostly a copy-paste from the training set, up to compile and fit functions). The save part should save two things:

The predictions for each image in the test set (with ‘preds’ in its name)
The full model, including HORDE.

For example, in my log folder, I have:

HORDE_Cascaded_BNInception_RET_CUB_200_2011_3_orders_contrastive_loss_best_weights_epoch_0006.h5” (170MB)
HORDE_Cascaded_BNInception_RET_CUB_200_2011_3_orders_contrastive_loss_preds_Recall(at)1_66.2_epoch_0006.npy (35MB)

You should load the first one, which contains all model's layers, including those from HORDE. The second one is only here to check things on the predictions (like doing t-SNE, computing other metrics like mAP, ...)

Evaluation with concatenation

The expected usage of HORDE is to simply drop the high-order moments at testing (like scores reported during training). This concatenation part is not mandatory, but it slightly increases performances, albeit at a huge cost in memory (larger model due to the high-order moments). As such, we don’t have included the script to evaluate HORDE in this setup. Yet, the implementation is simple and can be done as follows:

Load test predictions (e.g., HORDE_Cascaded_BNInception_RET_CUB_200_2011_3_orders_contrastive_loss_preds_Recall(at)1_66.2_epoch_0006.npy)
Concatenate all outputs
Compute the PCA projection matrix (e.g., using scikit-learn)
Project the full representation into the 512 first dimensions related to the largest eingenvalues
Recompute Recall@K

If you have difficulties to implement it, I should have the (quick and dirty) script somewhere on my computer.

Error while loading parameters

This error comes from the fact that the "--trainable" argument is missing. In fact, this argument adds fully connected layers on top of the high-order moment layers, in order to learn a metric, but also make the weights trainable in the moment approximations.

Model's arguments

I thought all parameters have a small description when using python3 train.py --help command… I was thinking of making an update of the repo to TF1.5 and TF2.2 (hopefully in June, but more certainly in July), so I will add them at that time. Meanwhile, here are some explanations about them:

use_horde : Add the HORDE regularization. Without this argument, you can train baseline network
use_abe : Use the ABE-8 network based on GoogleNet architecture. It cannot be changed to BNInception for instance, because it is tricky to implement in practice. It still needs the --feature argument, but it is not taken into account.
cascaded: use the recursive implementation of HORDE. It only uses N layers to approximate the N first high-order moments instead of N(N+1)/2 layers. This argument should always be added, and it was only set to False by default in order to reproduce our ablation studies.
--trainable: Make the weights in the high-order moment approximation trainable, and add a metric layer (FC + L2Normlisation) on top of each of them. Also, this argument should always be added, it was only set to False by default in order to reproduce our ablation studies.

I hope I made things clear 😊 If you have more questions, do not hesitate !

Thanks a lot for detailed feedback. Many things became clear to me now.

About testing script, since I have not much background of Linear algebra methods, if you can support that on git page, it would be great for me so that I can refer those. (I fully understood from the extract features to concatenating them, and the only part that I need to more understanding is projecting representation related to eigenvalues)

I have one more thing to clarify. what I am thinking of making is, feature similarity comparison between known set and unknown set.

Suppose I have Dataset A which has around 50,000 images of 1,000 classes and Dataset B which has around 30,000 images of 700 classes. ( all the 700 classes are included in 1,000 classes of Dataset A)

What I'd like to do is, extract feature of Dataset A to have 50,000 x 512 and make them as collection DB, and after then, extract feature of Dataset B to have 30,000 x 512 . Then, put two feature DB into compute metric function or just normal distance function to get the closest image in DB 'A' with image of DB 'B'. For this, you mentioned above that Dataset B must have Dataset A. And I am not fully get it. (Isn't it possible to find Top-1 closest image from other dataset with feature distance even though I don't have it in the query set?)

I once tried to make feature DB of Dataset A and compare Dataset B extracted feature with DB 'A', but it returned about 2% of acc. compared to 92% of acc. in between DB 'B' and DB 'B' --> I really don't know why the performance differs a lot. The only difference between two tests are whether the data are already in the DB or not, and I even don't use Top-1 which must be itself.

Below is what I have implemented for extracting features.
```
preds = tf.concat([predictions[0],predictions[1],predictions[2],predictions[3],predictions[4]], axis=1, name='concat')
pca = PCA(n_components=512) # set number of principal components
printcipalComponents = pca.fit_transform(preds)
```

What I have thought for now is, feature extraction with model.predict() has huge dependency on dataset sampling and collection set generation. This can be the reason of dramatic performance difference. In here, I came up with the idea that what if I add softmax loss along with the dml losses which are being used in the paper. Can you give me any idea about this trial?

(P.S. I am now implementing with TF2.1 version, and with just changing global_metric part from tf.session one to tf.function decorated function, it is perfectly working!)

I appreciate again for your guide.

Hey, sorry for the delay, I got a bit overwhelmed recently…

Ok, I got it. In this case, it is very simple: your test set should be composed of 80,000 images that are the concatenation of your 50,000 training images followed by your 30,000 testing images.

You put the 50,000 first indexes into the collection set,
You put the last 30,000 indexes into the query set,
You set the variable “queries_in_collection” to false.

The evaluation should work, and hopefully, it will give you good results!

You simply have to ensure that images' labels from A and from B are the same for a given class. Note that it will work in this case because your test labels are the same as your training labels (well, they are a subset, but it doesn't matter). If there are few labels that don’t overlap, it will not work!

Your “concat+PCA” implementation is correct.

Concerning the TF2.2 version, I have a ``cleaned version'' which is better than the current implementation (loss functions are simpler and clearer, metrics are more general, there are more comments, ...). I hope to find some time to post a new version of the repo (definitely in July!).

If you have more questions, do not hesitate!

Thanks for the answers :) I am also implementing and conducting some experiments along with your feedback now. I'll close this issue!

pierre-jacob / ICCV2019-Horde

bunch of questions #7

Question of testing

Precision about "query_idx"

Testing