tensorflow / models

Models and examples built with TensorFlow

DELF: Training procedure #3387

Closed bkj closed 4 years ago

bkj commented 6 years ago

Are the DELF authors able to give a little more detail about how they train their model? Any insight into things like

- cross-entropy loss and/or accuracy curves during fine-tuning training and/or attention training
- number of epochs of training; number of GPUs; wall clock time
- learning rates; how layers are frozen/unfrozen
- how/whether hyperparameters were tuned on a validation set

would be super helpful. Any specific pointers to other projects (maybe in this repo?) that used a roughly similar procedure would be helpful as well.

EDIT: Also, can you verify that both the fine-tuning and attention models were trained on this dataset, rather than the Google-Landmarks dataset introduced in your paper?

Thanks Ben

cc @andrefaraujo

andrefaraujo commented 6 years ago

The dataset is correct, but the data we ended up using was cleaned and released by the DIR paper (the "full" and "clean" subsets). I believe the data (i.e., URLs and labels) continues to be available on their website here. We used the train and validation sets defined in the same DIR paper (hyperparameters were tuned on the validation set).

We trained with different settings. When using a single GPU, training would take ~20 hours for each stage (fine-tuning and attention training). This would amount to ~50 epochs on the Landmarks-Clean dataset and ~12 epochs on the Landmarks-Full dataset. If using multiple GPUs, this can be sped up considerably. We tried learning rates from 1e-1 to 1e-4 and picked the best run on the validation set. Note that we started from an ImageNet pre-trained model.
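For readers trying to reproduce this, a rough sketch of the learning-rate sweep described above (the train_and_evaluate helper, checkpoint name, and split names are hypothetical placeholders, not part of the DELF code):

```python
candidate_lrs = [1e-1, 1e-2, 1e-3, 1e-4]

best_lr, best_val_acc = None, -1.0
for lr in candidate_lrs:
  # train_and_evaluate() is a hypothetical helper: one full training run
  # (fine-tuning or attention stage) starting from an ImageNet checkpoint,
  # returning accuracy on the Landmarks validation split.
  val_acc = train_and_evaluate(learning_rate=lr,
                               init_checkpoint='resnet_v1_50_imagenet.ckpt',
                               train_split='landmarks_train',
                               val_split='landmarks_val')
  if val_acc > best_val_acc:
    best_lr, best_val_acc = lr, val_acc

print('Best learning rate on the validation set:', best_lr)
```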

bkj commented 6 years ago

Fantastic -- thanks for the detailed response.

Do you happen to have any of the results showing cross-entropy loss and/or accuracy for the models you've trained?

Also, is this a correct summary of the image preprocessing you did before training the two stages of your model:

Fine tuning preprocessing:

    - Center crop to square image
    - Rescale to 250x250
    - Randomly crop 224x224 

Attention preprocessing:

    - Center crop to square image
    - Rescale to 900x900
    - Randomly crop 720x720
    - Randomly rescale with gamma < 1 (How did you sample gamma?)

Did you do any other data augmentation (color perturbations, mirroring, etc.)?
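For concreteness, a minimal TF 1.x-style sketch of the two pipelines summarized above (function names are illustrative; the gamma sampling line is a placeholder, which is exactly the open question, and the discrete sampling actually used is clarified later in this thread):

```python
import tensorflow as tf

def central_square_crop(image):
  # Crop the largest centered square region from the image.
  h, w = tf.shape(image)[0], tf.shape(image)[1]
  side = tf.minimum(h, w)
  return tf.image.crop_to_bounding_box(image, (h - side) // 2, (w - side) // 2,
                                       side, side)

def finetune_preprocess(image):
  # Center crop to square -> rescale to 250x250 -> random 224x224 crop.
  image = central_square_crop(image)
  image = tf.image.resize_images(image, [250, 250])
  return tf.random_crop(image, [224, 224, 3])

def attention_preprocess(image):
  # Center crop to square -> rescale to 900x900 -> random 720x720 crop,
  # then a random downscale by a factor gamma <= 1 (placeholder sampling).
  image = central_square_crop(image)
  image = tf.image.resize_images(image, [900, 900])
  image = tf.random_crop(image, [720, 720, 3])
  gamma = tf.random_uniform([], 0.25, 1.0)
  new_side = tf.cast(720.0 * gamma, tf.int32)
  return tf.image.resize_images(image, tf.stack([new_side, new_side]))
```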

bkj commented 6 years ago

@andrefaraujo

Actually -- I may be able to answer my own question about the loss and accuracy of the fine-tuned models. But could you please point me to the name of the layer in the pretrained model that would precede the softmax layer in the fine-tuned model? I should be able to grab the output using graph.get_tensor_by_name and then train my own linear classifier to estimate the loss/accuracy, but I want to make sure I'm using the right outputs.

Thanks again for your help. ~ Ben

andrefaraujo commented 6 years ago
bkj commented 6 years ago

Ah ok -- thanks. Do you know the names of the layers that I'd need to grab to get a) the features and b) the attention weights? The output of graph.get_tensor_by_name('features:0') looks like it has already gone through non-maximum suppression and PCA -- presumably, if I want to train the softmax classifier, I should be using the original-dimension features?

EDIT: My mistake, had a typo -- it looks like graph.get_tensor_by_name('features:0') is the full 1024 dimensional vector and setting:

  max_feature_num: 9999999
  score_threshold: -1

in the config gets all of the features. Thanks again.

~ Ben

andrefaraujo commented 6 years ago

Right. Also, for the tensor names, you could directly look at the ones used in the example extraction script: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/examples/extract_features.py#L103
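A hedged sketch of loading the released model and grabbing those tensors. 'features:0' and 'boxes:0' are the names discussed in this thread; the SavedModel path, the input tensor name, and any additional required placeholders are assumptions -- check extract_features.py (link above) for the exact names used by your version.

```python
import numpy as np
import tensorflow as tf

model_path = 'parameters/delf_v1_20171026'  # assumed path to the released SavedModel

with tf.Graph().as_default() as graph, tf.Session(graph=graph) as sess:
  tf.saved_model.loader.load(
      sess, [tf.saved_model.tag_constants.SERVING], model_path)

  input_image = graph.get_tensor_by_name('input_image:0')  # assumed input name
  features = graph.get_tensor_by_name('features:0')        # 1024-D local descriptors
  boxes = graph.get_tensor_by_name('boxes:0')              # receptive-field boxes

  # Stand-in for a real decoded image; the real model may also require feeding
  # scale / threshold / max-feature-number placeholders (see the script above).
  image_np = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
  descriptors, rf_boxes = sess.run([features, boxes],
                                   feed_dict={input_image: image_np})
```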

chenbiaolong commented 6 years ago

@andrefaraujo I have some questions about your training. DELF employs a two-step training strategy: first training a classification task, then learning the attention weights. So, at step 1, are the logits based on feature_map * attention_prob with attention_prob initialized with random parameters, or do you bypass attention_prob at step 1?

andrefaraujo commented 6 years ago

In step 1, no attention layers are used. In this case, a simple average pooling is performed.
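In code, step 1 therefore amounts to something like the following (shapes and layer names are illustrative, not the delf_v1.py implementation):

```python
import tensorflow as tf

def step1_logits(feature_map, num_classes):
  # Attention is bypassed entirely: global average pooling of the backbone
  # feature map, followed by the classifier layer.
  pooled = tf.reduce_mean(feature_map, axis=[1, 2])
  return tf.layers.dense(pooled, num_classes)
```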

skelkar9794 commented 6 years ago

@andrefaraujo I would like to train the DELF model on the Places dataset (places.csail.mit.edu/). It would be really helpful to know which modifications I need to make.

andrefaraujo commented 6 years ago

I guess the information in this issue is a good starting point? Let us know if you have specific questions.

offbye commented 6 years ago

@andrefaraujo Could you open-source the training code?

andrefaraujo commented 6 years ago

@offbye We are not planning on open-sourcing the training code at this point, sorry.

reido2012 commented 6 years ago

@andrefaraujo I believe that I have managed to train the model from scratch using delf_v1.py (not using the pretrained model, as my use case is slightly different from landmark classification).

In the extract_features.py code, I load the graph/model from the checkpoint. However, the code is looking for tensors that don't exist in my checkpoint.

For example, it attempts to get the tensor named 'features:0', which in my case is most likely 'Conv2d_13_pointwise' (the target layer in my version of delf_v1, as I've replaced ResNet with MobileNet).

To run feature_extractor.DelfFeaturePostProcessing() I also need the tensor for 'boxes:0'. Looking at the docstring of the feature post-processing function and the output of delf_v1.AttentionModel, it appears to me that the model doesn't return a value for boxes (or a value that can be interpreted as receptive-field boxes).

The docstring for DelfFeaturePostProcessing describes boxes as having shape [number of final feature points, 4], i.e., the boxes that pass the keypoint selection and NMS steps.

Where in delf_v1.py (which tensor) does boxes come from? And could you clarify where in delf_v1.py NMS happens?

Any clarification would be of immense help.

Thanks

andrefaraujo commented 6 years ago

Hi @reido2012 ,

Great to hear you were able to train it!

The step you seem to be missing is to apply some post-processing operations to the extracted features. Essentially, you need to call the ExtractKeypointDescriptor function (from the feature_extractor.py file), which will give you the boxes, features, etc (note that this function requires a model_fn argument, which you should set to the output of BuildModel). A simplified example of how to use ExtractKeypointDescriptor can be seen in the file feature_extractor_test.py.

After extracting those, you can then call DelfFeaturePostProcessing to obtain the final locations and descriptors.

Hope this helps!

yichengwang125 commented 5 years ago

Hi all, I tried to train the DELF network with delf_v1.py, but I was not sure how to do multi-scale training during the attention stage. Here is what I did:

  1. Fine-tune ResNet_v1 with 224x224 images.
  2. Randomly resize images in the range 112x112 to 720x720, apply random_crop_or_pad to 224x224, then feed them into the whole network in delf_v1.DelfV1().
  3. During the inference stage, use full images with 7 scales as inputs to extract features.

But the features extracted without PCA do not seem correct. There are two lines of features on the right and bottom edges of every image; an example is shown here: https://drive.google.com/file/d/18tIR1n4qmpk4WMUIGft-MJyg_eiWDhME/view?usp=sharing

I am not sure whether this is due to an inconsistency between training and testing inputs (e.g., ResNet's SAME zero-padding strategy resulting in a wider black border after the convolution ops). I also tried to resize the input randomly every 10 epochs, like YOLOv2, but after I use inputs, labels = tf.train.shuffle_batch([image, label], batch_size=128, num_threads=4, capacity=1000, min_after_dequeue=616), the batch of images cannot be resized once training starts.

Do you have any ideas or advice on this problem?

Thanks, cc @andrefaraujo @reido2012

SunLoveSheep commented 5 years ago

Hi @andrefaraujo, I am able to run the two-step fine-tuning for DELF on the Landmarks dataset. The first step, fine-tuning the original ResNet-50 as a classification network, seems OK and converges to ~90% top-1 accuracy on the Landmarks-Clean dataset. However, the second step, training the attention layers, seems strange. I built the model with the AttentionModel() function from delf_v1.py (is this the correct one to use?). On a single GPU (GTX 1080 Ti), I can only run batch size ~10 for the 900x900 center-cropped and 720*r randomly downscaled input images. It is not converging; test accuracy stays extremely low. Can you shed some light on how you set your training parameters for the attention layers with a single GPU? Also, I'm fine-tuning the original ResNet on the Landmarks-Clean dataset and the attention layers on the Landmarks-Full dataset -- is this the same as your setting? Thanks!

andrefaraujo commented 5 years ago

Hi @yichengwang125,

Actually, step (2) is different from what we describe in the paper:

In this case, the input images are initially center-cropped to produce square images, and rescaled to 900 × 900. Random 720 × 720 crops are then extracted and finally randomly scaled with a factor γ ≤ 1.

So the final resolution used during training is randomly determined (720 or reduced by a factor γ picked uniformly at random from [0.25, 0.3536, 0.5000, 0.7072, 1.0]).
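A minimal sketch of that downscaling step (crop_720 stands in for the random 720x720 crop produced by the preceding preprocessing steps):

```python
import tensorflow as tf

def random_downscale(crop_720):
  # Keep the 720x720 crop, or shrink it by a factor gamma drawn uniformly at
  # random from the discrete set above.
  gammas = tf.constant([0.25, 0.3536, 0.5000, 0.7072, 1.0])
  gamma = gammas[tf.random_uniform([], maxval=5, dtype=tf.int32)]
  new_side = tf.cast(720.0 * gamma, tf.int32)
  return tf.image.resize_images(crop_720, tf.stack([new_side, new_side]))
```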

Also, as per my comment in this issue on Feb 20: we have more recently been using tf.image.sample_distorted_bounding_box and that seems to deliver similar results as the random scaling preprocessing described in the paper, and is also easier to implement.

The two lines of features on the right/bottom are definitely quite strange... Can you give a try to one of these two ideas, and see if it helps? Let me know how it goes.

andrefaraujo commented 5 years ago

Hi @SunLoveSheep ,

Cool, the first step seems to work.

For the second step: yes, AttentionModel is the right function to use. Be sure to set target_layer_type to resnet_v1_50/block3. And yes, the datasets you are using seem correct: Landmarks-Clean for the ResNet fine-tuning and Landmarks-Full for the attention training stage.

It seems strange that the model is not converging in the second stage... usually it does converge well. As a sanity check, did you try not using random scaling at all during attention training, and simply using the same preprocessing as in the first training step? Usually that should converge (although, even if it converges, in our experience it does not work that well when performing the retrieval experiment). Also, as mentioned in my previous post, more recently we have been using tf.image.sample_distorted_bounding_box and that seems to deliver similar results as the random scaling preprocessing described in the paper, while being simpler to implement.

Also, one extra thing to note: Landmarks-Full is quite a noisy dataset. Usually the loss will converge to about ~1.5-2, and top-1 accuracy on the Landmarks-Full validation set is about 65-70%.

In more recent experiments, we are just using the Google-Landmarks dataset in the two training steps, and we see an improvement in performance.

yichengwang125 commented 5 years ago

Hi @andrefaraujo, thanks for your reply. Can you share some details about how you use tf.image.sample_distorted_bounding_box? I think with it I can get cropped batches of images at various scales. Do you keep the size of the cropped images as input to the attention net, or resize them to 224 × 224?

Thanks

andrefaraujo commented 5 years ago

Hi @yichengwang125 ,

I do resize the output of tf.image.sample_distorted_bounding_box to a fixed resolution for training, such that all batches have the same dimensions. I am currently using 321x321, but I would expect similar results with 224x224.

Since there are no bounding boxes in the dataset, we can just set the argument bounding_boxes to [0, 0, 1, 1]. As for the other arguments, using the defaults should be good enough.
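Putting that together, a small sketch of this preprocessing (the image placeholder and the use of default arguments are assumptions; only the [0, 0, 1, 1] box and the 321x321 resize come from the comment above):

```python
import tensorflow as tf

image = tf.placeholder(tf.uint8, shape=[None, None, 3])  # decoded training image

# No real boxes exist, so the whole image is passed as the single "bounding
# box" ([ymin, xmin, ymax, xmax] in normalized coordinates, shaped [1, 1, 4]).
whole_image_box = tf.constant([[[0.0, 0.0, 1.0, 1.0]]])

begin, size, _ = tf.image.sample_distorted_bounding_box(
    tf.shape(image), bounding_boxes=whole_image_box)
crop = tf.slice(image, begin, size)
crop = tf.image.resize_images(crop, [321, 321])  # fixed training resolution
```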

SunLoveSheep commented 5 years ago

Thanks for the rapid reply @andrefaraujo !

Actually, I just solved the problem... there was a bug in my data preprocessing. Also, thanks for clarifying the random downscaling (720, or reduced by a factor γ picked uniformly at random from [0.25, 0.3536, 0.5000, 0.7072, 1.0]); I was using a factor generated uniformly at random from [0.2, 1], which also works anyway. I will try your factor strategy instead.

nathangq commented 5 years ago

@yichengwang125

Hi, the image values you feed to the training net are in the range 0.0-255.0. If you rescale them to 0.0-1.0 without subtracting the channel mean (I don't know if it works with channel-mean subtraction), it could fix the problem of the attention network giving high scores to patches near the image border.

Maybe this is because [0, 1] is a more compact interval, so the attention net can better distinguish a value of 0 from values near 0, and is thus sensitive to patches with many zero values.

zhaows commented 5 years ago

Hi @andrefaraujo, could you tell me how to compute the PCA projection matrix?

andrefaraujo commented 5 years ago

You can read about PCA on the Wikipedia page: https://en.wikipedia.org/wiki/Principal_component_analysis

There are several implementations in machine learning / statistics libraries. One example is scikit-learn's implementation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
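For example, a minimal scikit-learn sketch of computing a DELF-style projection from a file of stacked raw descriptors (the file name is hypothetical; the 40 output dimensions follow the DELF paper/config):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stacked raw 1024-D descriptors extracted from the training images
# (e.g., Landmarks-Clean); the file name is hypothetical.
descriptors = np.load('train_descriptors.npy')  # shape [num_features, 1024]

pca = PCA(n_components=40)    # DELF projects descriptors to 40 dimensions
pca.fit(descriptors)

pca_mean = pca.mean_          # [1024] mean vector
pca_matrix = pca.components_  # [40, 1024] projection matrix

# At extraction time, project new descriptors the same way.
projected = (descriptors - pca_mean) @ pca_matrix.T
```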

XiaodanLi001 commented 5 years ago

Hi @andrefaraujo, I saw your reply about the training procedure. Did you mean that, for fine-tuning, we can directly train ResNet-50 with block3 + GAP (cutting block4 + fc from ResNet-50)? And that, while training the attention module, we just fix the fine-tuned model and train two conv layers that take the block3 features as input? I'm not sure I understood correctly.

andrefaraujo commented 5 years ago

No, I meant that in the fine-tuning step one should use the standard resnet50 (all blocks, average pooling of block4 features, followed by the classifier's FC layer). And while training attention one should use the finetuned model, but only with blocks up until (and including) block3 (it should do attention-weighted pooling of block3 features, followed by the classifier's FC layer).
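A hedged sketch of the two heads described above (layer sizes and names are illustrative, not the delf_v1.py code; the conv+ReLU / conv+softplus attention structure is the one discussed later in this thread):

```python
import tensorflow as tf

def finetuning_head(block4_features, num_classes):
  # Stage 1: full ResNet-50; average pooling of block4 features + FC classifier.
  pooled = tf.reduce_mean(block4_features, axis=[1, 2])
  return tf.layers.dense(pooled, num_classes, name='classifier')

def attention_head(block3_features, num_classes):
  # Stage 2: backbone frozen through block3; a small attention network scores
  # each location, and block3 features are pooled with those scores before the
  # (retrained) classifier layer.
  a = tf.layers.conv2d(block3_features, 512, 1, activation=tf.nn.relu)
  scores = tf.layers.conv2d(a, 1, 1, activation=tf.nn.softplus)    # [B, H, W, 1]
  weighted = tf.reduce_sum(block3_features * scores, axis=[1, 2])  # attention-weighted pooling
  return tf.layers.dense(weighted, num_classes, name='classifier')
```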

SunLoveSheep commented 5 years ago

Hi @andrefaraujo, sorry, but I still have some problems re-implementing the results in the DELF paper...

I trained the whole model from ImageNet pre-trained weights. When I test and compare my model against your original weights (saved_model.pb) for matching image pairs, they perform similarly on Oxford5k landmark images. But when I test on my own image pair, where the query is, for example, the front page of a magazine and the target image is a person holding that magazine, your model extracts features from the magazine region (in the target image) but mine seems to extract from the whole image. The descriptor locations are everywhere instead of focusing on the magazine region. Any guess as to what the problem is?

Could you reveal how you obtained the PCA matrix and mean used in "delf_config_example.pbtxt"?

Moreover, I tried to perform retrieval on the Oxford5k dataset with your released model. I first extract DELF features for each database and query image with the released code, then perform retrieval for each query and record results in a brute-force fashion. The threshold is the number of inliers (10). Finally, mAP is calculated using the official Oxford5k .cpp script. But I can only get ~65.9 mAP instead of the 83.8 reported in the paper. Are there any other tricks I need to apply to get the paper's result from the released code?

Thanks!

andrefaraujo commented 5 years ago

hi @SunLoveSheep ,

1) When you say "the whole model trained from ImageNet pre-trained weights", do you mean a model that was trained on the Landmarks dataset starting from an Imagenet checkpoint? (I think this is what you mean, just wanted to confirm). In this case, it sounds like the attention part of your network might not be well-trained. However, let me also say that: the model is not expected to focus on magazines, right? (since it was trained on landmarks, it should be doing well on landmarks but not necessarily on other objects)

2) The PCA matrix/mean are simply obtained from standard computations on the Landmarks-Clean dataset. The mean is simply the mean of all descriptors extracted in the dataset. The PCA matrix is computed in the standard way (eigenvectors of descriptor covariance matrix).

3) In terms of Oxford5k numbers: this dataset has a quite large number of expected retrieved results per query image. If you only rank the images based on the number of inliers, the mAP will not be that high. What we did is to, first, record the number of similar features between the query and each database image -- rank all images based on this. Then, for the images with a sufficiently large number of similar features, re-rank based on the number of inliers. This should help boost the mAP quite a bit. I believe that if you reduce the inlier threshold your results might improve even more.
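A rough sketch of that two-stage ranking (the count_inliers helper standing in for the RANSAC verification, the distance threshold, and the re-rank depth are all hypothetical placeholders):

```python
import numpy as np
from scipy.spatial import cKDTree

DISTANCE_THRESHOLD = 0.8  # descriptor-space matching threshold (illustrative)
RERANK_DEPTH = 100        # how many top images get geometric re-ranking (illustrative)

def count_similar_features(query_desc, db_desc):
  # Number of query descriptors that have a database neighbor within the threshold.
  tree = cKDTree(db_desc)
  dists, _ = tree.query(query_desc, distance_upper_bound=DISTANCE_THRESHOLD)
  return int(np.sum(np.isfinite(dists)))

def rank_database(query_desc, database_descs, count_inliers):
  # Stage 1: rank every database image by its number of similar features.
  counts = [count_similar_features(query_desc, d) for d in database_descs]
  order = np.argsort(counts)[::-1]
  # Stage 2: re-rank only the top of the list by RANSAC inlier count;
  # count_inliers(i) is a hypothetical helper returning the inlier count
  # between the query and database image i.
  top = order[:RERANK_DEPTH]
  inliers = [count_inliers(i) for i in top]
  reranked = top[np.argsort(inliers)[::-1]]
  return np.concatenate([reranked, order[RERANK_DEPTH:]])
```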

XiaodanLi001 commented 5 years ago

Hi @andrefaraujo, I want to train your network with the Landmarks dataset mentioned in "Neural Codes for Image Retrieval". When I downloaded the dataset, I got a txt file with image links in it, but some are now invalid and I cannot download the images. Can you provide me with the images? Thanks a lot.

andrefaraujo commented 5 years ago

Hi @XiaodanLi001 , unfortunately I am not able to provide images. Some images were missing when we downloaded the dataset as well. This should be fine, since the dataset is usually only used for training.

SunLoveSheep commented 5 years ago

Hi @andrefaraujo

Thanks for the input! I will try to implement 2 and 3. As for 1, yes, that is what I meant -- sorry for the confusion. It turned out that I had made a mistake in the test image pre-processing; we no longer have a problem with this. The network can focus on the magazine because we managed to repeat the training on our self-collected dataset.

XiaodanLi001 commented 5 years ago

Hi @andrefaraujo, thank you for your reply. When doing feature extraction, there's a PCA step, but the PCA matrix is loaded from a file. So, if I want to apply this to a new dataset, I have to compute the PCA matrix on all features extracted from the training data first, right? And during inference, we just load it directly, am I right?

andrefaraujo commented 5 years ago

@XiaodanLi001 : yes, correct.

XiaodanLi001 commented 5 years ago

Hi @andrefaraujo, sorry for interrupting you again. I tried to reproduce your work, and everything went fine until I tried to match the extracted features between two similar images. The attention map seems good and focuses on the discriminative areas, but the extracted features cannot be matched correctly. I have checked it many times; the problem is caused by the features. Do you have any advice?

andrefaraujo commented 5 years ago

Hi @XiaodanLi001 , hmmm hard to tell. A few things you can try for debugging purposes are: a) turn off PCA, and simply match the 1024D features directly; b) make the feature matching thresholds quite loose (e.g., https://github.com/tensorflow/models/blob/master/research/delf/delf/python/examples/match_images.py#L44); c) increase the number of RANSAC iterations or the residual threshold (https://github.com/tensorflow/models/blob/master/research/delf/delf/python/examples/match_images.py#L83).

Overall, it seems like you are quite close. Getting the attention part is usually the hardest part, which makes the most difference. We usually observe that the descriptors work quite well even if we do not fine-tune (in our paper, we have an experiment like this).
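For reference, a sketch of the kind of matching/verification being tuned here, loosely following match_images.py (the default values below are illustrative, not the released script's; they correspond to the knobs in suggestions (b) and (c)):

```python
import numpy as np
from scipy.spatial import cKDTree
from skimage.measure import ransac
from skimage.transform import AffineTransform

def verify(locations_1, descriptors_1, locations_2, descriptors_2,
           distance_threshold=1.0,   # (b) loosen the descriptor matching threshold
           residual_threshold=30.0,  # (c) loosen the RANSAC residual threshold
           max_trials=2000):         # (c) increase the number of RANSAC iterations
  # Putative matches via nearest-neighbor search in descriptor space.
  tree = cKDTree(descriptors_2)
  dists, idx = tree.query(descriptors_1, distance_upper_bound=distance_threshold)
  matched = np.isfinite(dists)
  if matched.sum() < 3:
    return 0
  # Geometric verification with an affine model; return the inlier count.
  model, inliers = ransac((locations_1[matched], locations_2[idx[matched]]),
                          AffineTransform, min_samples=3,
                          residual_threshold=residual_threshold,
                          max_trials=max_trials)
  return 0 if inliers is None else int(inliers.sum())
```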

XiaodanLi001 commented 5 years ago

Hi @andrefaraujo, I have tried the suggestions you mentioned above and it still doesn't work. I have also tried using "weight_decay" when training this network; it helped, but not that much. I think I should try using the descriptors multiplied by attention instead of the raw descriptors whose attention score is above the threshold. In fact, I have no idea how to fix it. Anyway, thank you so much for your help.

andrefaraujo commented 5 years ago

No, I meant that in the fine-tuning step one should use the standard resnet50 (all blocks, average pooling of block4 features, followed by the classifier's FC layer). And while training attention one should use the finetuned model, but only with blocks up until (and including) block3 (it should do attention-weighted pooling of block3 features, followed by the classifier's FC layer).

To double-check, are you training in this manner? Note that while training attention the features should be frozen and only the attention layers change.

XiaodanLi001 commented 5 years ago

No, I meant that in the fine-tuning step one should use the standard resnet50 (all blocks, average pooling of block4 features, followed by the classifier's FC layer). And while training attention one should use the finetuned model, but only with blocks up until (and including) block3 (it should do attention-weighted pooling of block3 features, followed by the classifier's FC layer).

To double-check, are you training in this manner? Note that while training attention the features should be frozen and only the attention layers change.

Yes. First, I trained ResNet-50 with only the fc classes changed (no layer is frozen in this step), and the validation classification accuracy is about 93.5 or so. Then I freeze all its weights and apply attention after block3. Only the attention part (two conv layers, one with relu, one with softplus) and the last fc (a conv with kernel size 1x1) are trained now. The attention is multiplied with the block3 feature map and then average-pooled, and the result is fed into the fc layer. That's all I did. If I got something wrong, please tell me.

XiaodanLi001 commented 5 years ago

No, I meant that in the fine-tuning step one should use the standard resnet50 (all blocks, average pooling of block4 features, followed by the classifier's FC layer). And while training attention one should use the finetuned model, but only with blocks up until (and including) block3 (it should do attention-weighted pooling of block3 features, followed by the classifier's FC layer).

To double-check, are you training in this manner? Note that while training attention the features should be frozen and only the attention layers change.

Does "frozen" include the last FC layer or not?

andrefaraujo commented 5 years ago

The FC layer should not be frozen. I guess this is what you are doing?

The training procedure you described seems like the correct thing to do.

Other questions:

  • Are you starting the fine-tuning process from an ImageNet model checkpoint?
  • Are you using Tensorflow? If so, are you using the TF-Slim implementation of Resnet?

XiaodanLi001 commented 5 years ago

The FC layer should not be frozen. I guess this is what you are doing?

The training procedure you described seems like the correct thing to do.

Other questions:

  • Are you starting the fine-tuning process from an ImageNet model checkpoint?
  • Are you using Tensorflow? If so, are you using the TF-Slim implementation of Resnet?

Yes, I fine-tune the model from an ImageNet model checkpoint. I'm using PyTorch instead of TensorFlow. I think you are asking about the stride changes in your ResNet version? I tried to match your strides (block1: stride=2, block2: stride=2, block3: stride=2, block4: stride=1) last night; it still doesn't work.

andrefaraujo commented 5 years ago

Right, I was indeed asking about the strides. So, to make sure, if you input just the original image to the network, without other scales, you will obtain a feature map that is 32x smaller than the image on each side. Correct?

Next questions:

  • What accuracies do you get when training the attention part?
  • When performing inference, are you using multiple scales as in the examples provided in this codebase?
  • When performing inference, are you resizing images to a particular resolution? We usually just let the network run in fully-convolutional mode and do not resize the input images at all. (note that this is different from many CNNs which only accept a fixed size image input)

XiaodanLi001 commented 5 years ago

Right, I was indeed asking about the strides. So, to make sure, if you input just the original image to the network, without other scales, you will obtain a feature map that is 32x smaller than the image on each side. Correct?

Next questions:

  • What accuracies do you get when training the attention part?
  • When performing inference, are you using multiple scales as in the examples provided in this codebase?
  • When performing inference, are you resizing images to a particular resolution? We usually just let the network run in fully-convolutional mode and do not resize the input images at all. (note that this is different from many CNNs which only accept a fixed size image input)
  1. For the attention training part, the accuracy on the training set is 99%, and 93% on the validation set. I have checked the attention map; it focuses very well.
  2. I tried two methods. The first is to resize the image to 224x224 and run inference. The second uses multiple scales as your code does. With both methods I checked the interest points, and all of them seem fine and focus on the right places.
  3. When using multi-scale, I didn't resize the input images at all; images are resized only during the feature_extractor step. Since the feature map is extracted from block3 at different scales, the feature map may have multiple sizes depending on the scale setting.

Besides, in the training step, images are center-cropped and resized to 250x250, then randomly cropped to get a 224x224 patch, which is fed into the training network. One more thing: I trained the ResNet as well as the attention part on Landmarks-Clean. I don't know if this matters much.

andrefaraujo commented 5 years ago

Thanks again for these answers.

  • So, to make sure, if you input just the original image to the network, without other scales, you will obtain a feature map which is 32X smaller than the image on each side. Correct?
  • How are you setting the location for each local feature? Are you using the same parameters as here: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/feature_extractor.py#L127 ? Also, you need to take the scale into account, naturally. I am asking this because if the locations are buggy, then the spatial matching would not work.

XiaodanLi001 commented 5 years ago

Thanks again for these answers.

  • So, to make sure, if you input just the original image to the network, without other scales, you will obtain a feature map which is 32X smaller than the image on each side. Correct?
  • How are you setting the location for each local feature? Are you using the same parameters as here: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/feature_extractor.py#L127 ? Also, you need to take the scale into account, naturally. I am asking this because if the locations are buggy, then the spatial matching would not work.

Yes. The obtained feature map is 32x smaller than the input image; for example, 224x224 results in a 7x7 feature map. The location processing module is the same as yours, which I have checked many times. The parameters are also the same as yours: for block3, [rf, stride, padding] = [291.0, 32.0, 145.0], depth = 1024. I have visualized the extracted interest points at each scale and mapped them back to the original-size image following your match_images.py, so I'm sure about this.
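For anyone else checking this step, a small numpy sketch of mapping block3 feature cells back to image coordinates with those parameters, following (approximately) the convention in feature_extractor.py:

```python
import numpy as np

rf, stride, padding = 291.0, 32.0, 145.0  # block3 receptive-field parameters

def feature_locations(fmap_height, fmap_width, scale=1.0):
  # Receptive-field box for cell (y, x): [stride*y - padding, stride*x - padding,
  # stride*y - padding + rf - 1, stride*x - padding + rf - 1]; the keypoint
  # location is the box center, divided by the image scale factor when
  # multi-scale extraction is used.
  y, x = np.meshgrid(np.arange(fmap_height), np.arange(fmap_width), indexing='ij')
  center_y = stride * y - padding + (rf - 1) / 2.0
  center_x = stride * x - padding + (rf - 1) / 2.0
  return np.stack([center_y, center_x], axis=-1).reshape(-1, 2) / scale

print(feature_locations(7, 7)[:3])  # centers for the 7x7 map of a 224x224 input
```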

XiaodanLi001 commented 5 years ago

Thanks again for these answers.

  • So, to make sure, if you input just the original image to the network, without other scales, you will obtain a feature map which is 32X smaller than the image on each side. Correct?
  • How are you setting the location for each local feature? Are you using the same parameters as here: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/feature_extractor.py#L127 ? Also, you need to take the scale into account, naturally. I am asking this because if the locations are buggy, then the spatial matching would not work.

Besides, I have also tried using a large attention threshold to keep only a few points, in case there is a location bug. The distance between two similar points is bigger than for others. That's what makes me believe the extracted features are not that good. I really want to fix this problem, but I don't know what to do.

andrefaraujo commented 5 years ago

One other thought: are you L2-normalizing the descriptors?

XiaodanLi001 commented 5 years ago

One other thought: are you L2-normalizing the descriptors?

Yes

SunLoveSheep commented 5 years ago

Hi @andrefaraujo,

Just want to make sure if my understanding is correct. In your previous reply about improving mAP: "3. In terms of Oxford5k numbers: this dataset has a quite large number of expected retrieved results per query image. If you only rank the images based on the number of inliers, the mAP will not be that high. What we did is to, first, record the number of similar features between the query and each database image -- rank all images based on this. Then, for the images with a sufficiently large number of similar features, re-rank based on the number of inliers. This should help boost the mAP quite a bit. I believe that if you reduce the inlier threshold your results might improve even more."

It seems your 3rd point is similar to the implementation in the official match_images.py, where scipy.spatial.cKDTree is used to get all descriptor pairs from two given images that are similar enough. We can then rank all images in the database based on this "number of similar features". After that, among the top-ranked images, we further apply RANSAC to get inliers and re-rank again. Is this what you mean by the 3rd point?

Thanks!

andrefaraujo commented 5 years ago

Yes, that's exactly what I meant. @SunLoveSheep