pierre-jacob / ICCV2019-Horde

Code repository for our paper entitled "Metric Learning with HORDE: High-Order Regularizer for Deep Embeddings", accepted at ICCV 2019.
MIT License

bunch of questions #7

Closed mhyeonsoo closed 4 years ago

mhyeonsoo commented 4 years ago

I appreciate your quick response. I successfully finished the training and got an O1 score around 66.0 on the CUB dataset. I think that makes sense for now.

I came up with some other questions during the process.

  1. Can I change the backbone to another pre-trained model?

    • I am now successfully training the HORDE model with a ResNet backbone, but I am not sure whether this is the correct way. (I am using the tf.keras.applications.ResNet50 model and loading ImageNet pre-trained weights as the backbone.)
    • Does it require a specific pre-trained model to be loaded, or is any backbone acceptable?
  2. After training, I am planning to add a custom dataset class and train & test with it. I am wondering whether there is a testing script that I can refer to as well. If not, would it be something like using mdl.predict after loading the HORDE model?

    • Also, when implementing a custom dataset, is there a standard way to set options such as the number of high-order moments, images per class, classes per batch, etc.?
  3. I can see that the code splits the train and test sets using their labels. As I understand it, in MNIST for example, the model would use images labeled 0 to 4 for training and images labeled 5 to 9 for testing. If that is the case, how is performance measured? Is the process something like building a 'gallery' from part of the test set and a 'probe' set from the rest, and computing R@K over them?

Sorry for asking so many questions at once. Again, thank you so much for the advice you have given me.

pierre-jacob commented 4 years ago

You’re welcome !

Here are the answers to your questions:

  1. Can I change the backbone to another pre-trained model?

Yes, you can use other pre-trained backbones. In the case of ResNet, you must drop the classification layers (set “include_top” to False). On top of that, you must add a global pooling layer (e.g., global average pooling or global max pooling) so the network outputs a single vector. Optionally, you can add a Dense layer if you want to reduce the size of the embedding (with ResNet50, we usually use 128 or 512 in metric learning) and an L2Normalisation layer (it helps convergence and prevents the norms from growing to infinity). Note that if you do not add the L2Normalisation layer, you should change the similarity from 'cosine' to 'l2' in the training code.

The only requirement is that your network outputs a single vector per image. Beyond that, you can use any other network, pre-trained or not.
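For illustration, here is a minimal sketch of such a backbone in tf.keras; the 512-d embedding and the Lambda-based L2 normalization are assumptions for the example, not the repo's exact layers.

```python
import tensorflow as tf

def build_backbone(embedding_size=512):
    # ImageNet-pretrained ResNet50 without the classification head
    base = tf.keras.applications.ResNet50(include_top=False, weights='imagenet')
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)   # single vector per image
    x = tf.keras.layers.Dense(embedding_size)(x)                # optional embedding reduction
    # L2 normalization (omit it and switch the similarity to 'l2' if you prefer)
    x = tf.keras.layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=-1))(x)
    return tf.keras.Model(inputs=base.input, outputs=x)
```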

  2. Custom dataset class and train & test with it

To add a custom dataset, the simplest way is to create a class that inherits from “RetrievalDb” (kerastools/databases/databases.py). Examples of available datasets are the Cub and Cars datasets, which you can take as implementation references. In practice, each function should return a numpy ndarray composed of either the raw images or the image paths, plus the labels.
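As an illustration, here is a hypothetical sketch of such a class; the exact RetrievalDb interface may differ, and the method names below are simply the ones used elsewhere in this thread.

```python
import numpy as np
from kerastools.databases.databases import RetrievalDb  # adjust the import path if needed

class MyDataset(RetrievalDb):
    """Hypothetical custom dataset holding image paths (or raw images) and integer labels."""

    def __init__(self, x_train, y_train, x_test, y_test, **kwargs):
        super().__init__(**kwargs)  # assumes the base class accepts keyword arguments
        self.x_train, self.y_train = np.asarray(x_train), np.asarray(y_train)
        self.x_test, self.y_test = np.asarray(x_test), np.asarray(y_test)

    def get_training_set(self):
        # ndarray of image paths (or raw images) and ndarray of labels
        return self.x_train, self.y_train

    def get_testing_set(self):
        return self.x_test, self.y_test

    def get_queries_idx(self, db_set='test'):
        # indexes (into the test set) of the images used as queries
        return np.arange(len(self.x_test))

    def get_collection_idx(self, db_set='test'):
        # indexes (into the test set) of the images used as the collection;
        # here queries and collection coincide, so 'queries_in_collection' should be True
        return np.arange(len(self.x_test))
```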

These options (number of high-order moments, images per class, classes per batch, etc.) are largely independent of the dataset, and the default configuration usually works well. The only recommendation I can give is about the number of classes per batch and images per class: try to use the largest batch size possible, with roughly as many classes per batch as images per class. This tends to generalize better (as shown in recent metric learning papers). However, you might have a dataset with few images per class. In that case, I recommend setting the number of images per class to the number of images in the least populated class. This avoids the issue of training the metric with the same image used both as the query and as its positive example.
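As a hypothetical numeric example (the variable names are illustrative, not the script's actual arguments):

```python
# Suppose the least populated class has 5 images and the GPU fits about 80 images per batch.
images_per_class = 5                           # minimum number of images in any class
classes_per_batch = 80 // images_per_class     # 16 classes -> effective batch size of 80
```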

  3. Splitting train and test set using their labels

Splitting classes between train and test is only due to the evaluation protocol of metric learning. For example, in person re-id, the same classes appear in train and test. Recall@K is computed by averaging the ranking scores over all “query_idx”. The retrieval part is performed on the collection set using the “collection_idx”, and the ranking is evaluated only on those images. With the current code, you can have the same classes as in the train set. Note, however, that the collection set and the query set must not overlap unless the option “queries_in_collection” is set to True.
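To make the protocol concrete, here is a minimal sketch (not the repo's code) of how Recall@K can be computed from the query and collection indexes, assuming L2-normalized embeddings and integer labels for the whole test set:

```python
import numpy as np

def recall_at_k(emb, labels, query_idx, collection_idx, k=1, queries_in_collection=False):
    # cosine similarities between every query and every collection image
    sims = emb[query_idx] @ emb[collection_idx].T
    hits = 0
    for row, q in enumerate(query_idx):
        order = np.argsort(-sims[row])                 # most similar first
        ranked = [collection_idx[j] for j in order]
        if queries_in_collection:
            ranked = [c for c in ranked if c != q]     # drop the self-match
        hits += int(labels[q] in labels[ranked[:k]])   # correct class in the top-k?
    return hits / len(query_idx)
```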

I hope that answers your questions. If not, please feel free to comment again.

mhyeonsoo commented 4 years ago

It is so helpful indeed.

I may have misunderstood the testing scenario. Could you briefly explain once more the comment you made above: 'Recall@K is computed by taking all “query_idx” to average the ranking scores. Then, the retrieval part is performed on the collection set using the “collection_idx” and the ranking is evaluated only on those images.'? And what happens if I change the code like below?

test_indexes_query = dataset.get_queries_idx(db_set='test')
test_indexes_collection = dataset.get_collection_idx(db_set='train')

Questions about testing

  1. Before testing on a new dataset, I would like to write a test script. Since training generates .npy files along the epochs, it seems I need to load the .npy weights after defining the model. But when I checked the size of the .npy file, it was only 800 bytes.

  2. I saw in the paper that, at test time, we need to concatenate all the outputs of HORDE and reduce their dimension to 512 using PCA, but I could not find where this is done. (It may exist somewhere in the code; my apologies.)

    • When I try to define the model and load the weights from the .h5 file, it always returns a layer error saying 'trying to load a weight file containing 147 layers into a model with 143 layers' (BNInception example), and it seems to be because the Dense layers are not added when ho_trainable is False. --> So if I add the concatenation layer as the paper describes, will the weights load successfully?
  3. Could you kindly guide me through testing the model?

    • In the shell script of the source code, which arguments and lines should I use? -use_horde, -use_abe, -cascaded, etc. A guide for setting the arguments for the test case would be very useful.

Could you take some time to look at these questions? I appreciate your response.

Best regards,

pierre-jacob commented 4 years ago

Clarification about "query_idx"

Let's say you have a test set composed of M images, and the first 50 are from the query set. You have to set the query and collection indexes as follows:

test_indexes_query = np.arange(0, 50)
test_indexes_collection = np.arange(50, M)
queries_in_collection = False

This uses the first 50 images to query the collection set, and the rest of the images (M-50) as the collection set. R@K is averaged only over the 50 query images.

Also, you cannot do the following:

test_indexes_collection = dataset.get_collection_idx(db_set='train') # Should raise error

because the evaluation is done on the test set only, and it does not have access to the training images. Thus, the above indexes would be interpreted as indexes into the test set, which might raise an error.

If you really want to use training images at test time (e.g., to put them in the collection), your test set must contain the training images, and their respective indexes should be added to 'test_indexes_collection'.

For example, if you have M images in the query set and you want to add the N images from the training set to the collection set, you must change the 'get_testing_set' function to something like this:

def get_testing_set(self):
    x_train, y_train = self.get_training_set()
    x_test = np.concatenate([self.x_query, x_train], axis=0) # Now with size M + N
    y_test = np.concatenate([self.y_query, y_train], axis=0) # Now with size M + N

    return x_test, y_test 

In the training script, you should change 'test_indexes_query', 'test_indexes_collection' and 'queries_in_collection' as follows:

test_indexes_query = np.arange(0, M) # The first M images are used as the query set
test_indexes_collection = np.arange(M, M + N) # The rest are used as the collection set
queries_in_collection = False

Alternatively, you can return these indexes directly from 'get_collection_idx' and 'get_queries_idx' to avoid changing the training script, as sketched below.
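For instance, here is a hypothetical version of those overrides inside the custom dataset class (storing M and N as attributes is an assumption for the example):

```python
import numpy as np

def get_queries_idx(self, db_set='test'):
    # the first M images of the augmented test set are the original queries
    return np.arange(0, self.num_queries)

def get_collection_idx(self, db_set='test'):
    # the N training images appended afterwards form the collection
    return np.arange(self.num_queries, self.num_queries + self.num_train)
```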

Testing

Test Script

If you simply want to test the model without training, you have to write a new test script (which is mostly a copy-paste of the training script, up to the compile and fit calls). The saving part should save two things:

  1. The predictions for each image in the test set (with ‘preds’ in its name)
  2. The full model, including HORDE.

For example, in my log folder I have both files. You should load the saved model, which contains all the model's layers, including those from HORDE. The predictions file is only there to check things on the predictions (like doing t-SNE, computing other metrics such as mAP, ...).
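A hypothetical skeleton of such a test script; the file names, batch size, and preprocessing are placeholders, and the HORDE layers may need to be passed as custom objects when loading the .h5 model.

```python
import numpy as np
import tensorflow as tf

# Load the saved full model (placeholder path; register the HORDE custom layers if required).
model = tf.keras.models.load_model('logs/horde_model.h5', compile=False)

# 'dataset' is the custom RetrievalDb instance built exactly as in the training script;
# x_test is assumed to hold already-preprocessed images (otherwise reuse the training generator).
x_test, y_test = dataset.get_testing_set()
preds = model.predict(x_test, batch_size=32)

if isinstance(preds, list):        # multi-output HORDE model: keep the first (embedding) head
    preds = preds[0]

np.save('logs/horde_preds.npy', preds)   # predictions saved for offline evaluation
```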

Evaluation with concatenation

The expected usage of HORDE is to simply drop the high-order moments at test time (as in the scores reported during training). The concatenation is not mandatory, but it slightly increases performance, albeit at a large memory cost (a larger model due to the high-order moments). As such, we have not included a script to evaluate HORDE in this setup. Yet, the implementation is simple and can be done as follows:
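(A minimal sketch, assuming the loaded model returns one output per HORDE order; scikit-learn's PCA stands in for the projection step and is an assumption, not the repo's own code.)

```python
import numpy as np
from sklearn.decomposition import PCA

# model.predict returns a list with the embedding plus one array per high-order moment
outputs = model.predict(x_test, batch_size=32)
features = np.concatenate(outputs, axis=-1)        # concatenate all HORDE outputs

# reduce the concatenated representation to 512 dimensions with PCA
features_512 = PCA(n_components=512).fit_transform(features)

# L2-normalize so that dot products correspond to cosine similarities for R@K
features_512 /= np.linalg.norm(features_512, axis=-1, keepdims=True)
```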

If you have difficulties implementing it, I should have the (quick and dirty) script somewhere on my computer.

Error while loading parameters

This error comes from the fact that the "--trainable" argument is missing. This argument adds fully connected layers on top of the high-order moment layers in order to learn a metric, and it also makes the weights of the moment approximations trainable.

Model's arguments

I thought all parameters had a short description when using the python3 train.py --help command… I was planning to update the repo to TF1.5 and TF2.2 (hopefully in June, but more likely in July), so I will add them at that time. Meanwhile, here are some explanations about them:

I hope I made things clear 😊 If you have more questions, do not hesitate !

mhyeonsoo commented 4 years ago

Thanks a lot for the detailed feedback. Many things are clear to me now.

About the testing script: since I don't have much background in linear algebra methods, it would be great if you could provide it on the git page so that I can refer to it. (I fully understand the steps from extracting the features to concatenating them; the only part I still need to understand better is projecting the representation using the eigenvalues.)


What I think for now is that feature extraction with model.predict() depends heavily on dataset sampling and collection set generation, which could be the reason for the dramatic performance difference. This led me to the idea of adding a softmax loss along with the DML losses used in the paper. Could you give me your thoughts on this?

(P.S. I am now implementing this with TF 2.1, and after just changing the global_metric part from a tf.Session-based implementation to a tf.function-decorated function, it works perfectly!)

Thank you again for your guidance.

pierre-jacob commented 4 years ago

Hey, sorry for the delay, I got a bit overwhelmed recently…

Ok, I got it. In this case, it is very simple: your test set should be composed of 80,000 images that are the concatenation of your 50,000 training images followed by your 30,000 testing images.

The evaluation should work, and hopefully, it will give you good results!

You simply have to ensure that image labels from A and from B are the same for a given class. Note that it will work in this case because your test labels are the same as your training labels (well, they are a subset, but it doesn't matter). If some labels do not overlap, it will not work!

Your “concat+PCA” implementation is correct.

Concerning the TF2.2 version, I have a "cleaned" version which is better than the current implementation (the loss functions are simpler and clearer, the metrics are more general, there are more comments, ...). I hope to find some time to post a new version of the repo (definitely in July!).

If you have more questions, do not hesitate!

mhyeonsoo commented 4 years ago

Thanks for the answers :) I am also implementing things and running some experiments based on your feedback now. I'll close this issue!