DELF: Training procedure

bkj commented 6 years ago

Are the DELF authors able to give a little more detail about how they train their model? Any insight into things like

- cross-entropy loss and/or accuracy curves during fine-tuning training and/or attention training
- number of epochs of training; number of GPUs; wall clock time
- learning rates; how layers are frozen/unfrozen
- how/whether hyperparameters were tuned on a validation set

would be super helpful. Any specific pointers to other projects (maybe in this repo?) that used a roughly similar procedure would be helpful as well.

EDIT: Also, can you verify that both the fine-tuning and attention models were trained on this dataset, rather than the Google-Landmarks dataset introduced in your paper?

Thanks Ben

cc @andrefaraujo

kuznetsoffandrey commented 5 years ago

I have a dataset of images that I want to use for DELF extraction. As far as I understand from this thread, the major steps of fine-tuning are:

1. fine-tune the ResNet classification network using the GetResnet50Subnetwork function from delf_v1
2. use the AttentionModel function from delf_v1 with training_attention=True and an images tensor built from the training dataset for the attention part, using random scales from [0.25, 0.3536, 0.5000, 0.7072, 1.0]
3. DELF post-processing

Am I right?

@andrefaraujo @SunLoveSheep

kuznetsoffandrey commented 5 years ago

I built the model for the first stage as net, end_points = model.GetResnet50Subnetwork(images, global_pool=True, is_training=is_training, reuse=reuse)

How can I train it now? As far as I understand, I should start from the resnet_v1_50.ckpt checkpoint.

andrefaraujo commented 5 years ago

Hi @kuznetsoffandrey ,

If all you want is to extract features using our model that is pre-trained on a landmarks dataset, you can follow the instructions here.

In terms of training: the steps you list are correct. You should first fine-tune the Resnet50, then in a second training step train only the attention part. Once these two training steps are concluded, you can run inference by extracting the features from the network, applying post-processing and selecting them based on attention scores.

In terms of training it with tensorflow: since the model is trained in a classification setting, you can do it very similarly as one would train a model on Imagenet, for example. There are several tutorials online for how to do something like this, for example:

kuznetsoffandrey commented 5 years ago

Thank you @andrefaraujo. I successfully fine-tuned the Resnet50, but I am stuck on the second part. If I use Keras, should I take the attention model from delf_v1, create the same model in Keras, and then fine-tune it with the features created in the first step?

andrefaraujo commented 5 years ago

Yes, that's correct. If you fine-tuned the network, the next step is to train only the attention module, which you can do in a similar manner as your training for the first step, except that now all other layers should be frozen.

As I mentioned in a related question above: "For the second step: yes, AttentionModel is the right function to use. Be sure to set target_layer_type to resnet_v1_50/block3."
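
One way to express "all other layers should be frozen" is to give the optimizer only the variables that belong to the attention and classifier scopes. The sketch below shows just the name filtering in plain Python. The scope prefixes and variable names are hypothetical placeholders chosen for illustration, not the names used in delf_v1.py.

```python
def select_trainable(variable_names, trainable_prefixes):
    """Return the variable names whose scope starts with one of the prefixes."""
    return [
        name
        for name in variable_names
        if any(name.startswith(prefix) for prefix in trainable_prefixes)
    ]


# Hypothetical variable names, for illustration only.
all_variables = [
    "resnet_v1_50/conv1/weights",
    "resnet_v1_50/block3/unit_1/conv1/weights",
    "attention/conv1/weights",
    "attention/conv2/weights",
    "classifier/logits/weights",
]

# Second stage: only attention and classifier variables receive updates.
stage2_variables = select_trainable(all_variables, ["attention/", "classifier/"])
frozen_variables = [v for v in all_variables if v not in stage2_variables]
```

In TF1-style training code, the selected variables would typically be passed as var_list to the optimizer's minimize call, so that gradients are applied to those variables only.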

ArtanisCV commented 5 years ago

unfortunately I am not able to provide images. Some images were missing when we downloaded the dataset as well. This should be fine, since the dataset is usually only used for training.

Hi @andrefaraujo, can you provide the images that you have downloaded? When I try to download the DIR Landmarks dataset, even more URLs are broken and I can only download around 20K images (compared with the 35,382 images you mention in the paper). I have also asked the authors of the DIR paper for help, but they didn't keep the images either.

andrefaraujo commented 5 years ago

Hi @ArtanisCV,

Unfortunately I cannot provide images either. I would suggest that you use the Google Landmarks Dataset (https://www.kaggle.com/google/google-landmarks-dataset), which is a much larger and more comprehensive dataset.

mmxuan18 commented 5 years ago

@andrefaraujo Hi, I would like to know why this project uses only ResNet50, and only block3 as the feature map. As far as I know, deeper networks perform better as feature extractors. Also, what kernel size does the attention conv layer use, and does the kernel size influence the final result?

mmxuan18 commented 5 years ago

@XiaodanLi001 Hi, I am also trying to reimplement this paper in PyTorch. Did you solve the problem above, and can you share your code?

andrefaraujo commented 5 years ago

@mlinxiang Yes, possibly other deeper extractors may improve performance. For attention network kernel size, we have used 1 (as you can see in the code: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/delf_v1.py#L85). We have not observed much difference by using larger values.

KuenstlicheIntelligenz commented 5 years ago

@andrefaraujo Hey, which optimizer did you use? Great work btw. :)

I don't know if it helps, but Adam didn't work for me ;)

andrefaraujo commented 5 years ago

We are using momentum with parameter 0.9.
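
For readers unfamiliar with the update rule, here is a minimal sketch of SGD with momentum 0.9 on a single scalar weight. It uses the accumulate-then-step form (v = momentum * v + g, then w = w - lr * v). The learning rate and the exact variant (for example, whether Nesterov momentum is used) are not stated in this thread, so those are assumptions.

```python
def momentum_step(weight, velocity, gradient, learning_rate, momentum=0.9):
    """One SGD-with-momentum update on a scalar weight.

    The velocity accumulates past gradients; the weight then moves
    against the accumulated direction.
    """
    velocity = momentum * velocity + gradient
    weight = weight - learning_rate * velocity
    return weight, velocity


# Two steps with a constant gradient of 2.0 and an assumed learning rate of 0.1.
w, v = 1.0, 0.0
w, v = momentum_step(w, v, gradient=2.0, learning_rate=0.1)
w, v = momentum_step(w, v, gradient=2.0, learning_rate=0.1)
```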

7rick03ligh7 commented 5 years ago

Do I understand correctly that each descriptor of an image is L2-normalized independently, or do we normalize each channel across several descriptors?

andrefaraujo commented 5 years ago

Each descriptor is normalized independently.
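
A minimal sketch of what "normalized independently" means: each descriptor vector is divided by its own L2 norm, and no statistics are shared across descriptors or channels. The toy 2-D descriptors and the eps guard against division by zero are assumptions made for illustration.

```python
import math


def l2_normalize(descriptor, eps=1e-12):
    """Scale one descriptor to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in descriptor))
    return [x / max(norm, eps) for x in descriptor]


def normalize_descriptors(descriptors):
    """Normalize every descriptor on its own, one row at a time."""
    return [l2_normalize(d) for d in descriptors]


# Two toy descriptors with different magnitudes.
descriptors = [[3.0, 4.0], [0.0, 2.0]]
normalized = normalize_descriptors(descriptors)
```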

VanillaChelle commented 5 years ago

@andrefaraujo Hi, thanks for your code. I'm trying to train DELF on another dataset. I have already fine-tuned the ResNet50 model successfully from the ImageNet-pretrained model, but I am stuck on the second training stage. I use the AttentionModel function from delf_v1.py to train the attention module, with all the ResNet layers frozen. Strangely, after decreasing for a few steps, the loss stops changing. What preprocessing and normalization do you apply to your input images? For example, should an image pixel be in the range [0, 1], [-1, 1] or [0, 255]?

andrefaraujo commented 5 years ago

The image pixel range should mainly depend on what you used in the finetuning stage -- ie, you should probably reuse the same convention, otherwise your Resnet-trained layers would likely not work properly. In our own code, we normalize to [-1, 1].
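
As an illustration, one common way to map 8-bit pixel values to [-1, 1] is shown below. The exact formula is an assumption, since the comment above only states the target range. Whatever convention is chosen, the same one should be used in both training stages.

```python
def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0


def scale_image(pixels):
    """Apply the same scaling to a flat list of pixel values."""
    return [scale_pixel(p) for p in pixels]


scaled = scale_image([0, 255])
```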

Please let me know if you have additional questions. The attention training stage should really be very similar to the first stage, except that you apply the attention model on top of the RN50-block3 features, and do attention-weighted average pooling and feed the pooled 1024D feature into the classifier.
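
The pooling step described above can be sketched in plain Python as follows. Each location's feature vector is multiplied by its attention score and the results are averaged over all locations. Dividing by the number of locations (rather than by the sum of the scores) is an assumption of this sketch, and the toy example uses 2 channels instead of 1024.

```python
def attention_weighted_pool(feature_map, attention_scores):
    """Pool an H x W x C feature map into a single C-dimensional vector.

    feature_map: nested list indexed as [row][col][channel].
    attention_scores: nested list indexed as [row][col].
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for i in range(height):
        for j in range(width):
            score = attention_scores[i][j]
            for c in range(channels):
                pooled[c] += score * feature_map[i][j][c]
    # Average over all spatial locations (assumed convention).
    return [value / (height * width) for value in pooled]


# Toy 1 x 2 feature map with 2 channels; the first location gets zero attention.
features = [[[1.0, 2.0], [3.0, 4.0]]]
scores = [[0.0, 2.0]]
pooled = attention_weighted_pool(features, scores)
```

The pooled vector is what would be fed to the classifier during attention training, so a location with a score of zero contributes nothing to the loss.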

saiaman commented 5 years ago

Does anyone have sample code showing how to train DELF from scratch on another dataset?

LinXiLuo commented 5 years ago

Hi @andrefaraujo, I have some problems with the attention training: the loss doesn't change and stays around 6.7. My steps are as follows:

I implemented DELF with slim, using resnet_v1_50.
Due to the shortage of LC data (27536 images) and the unbalanced number of images per category, the loss of the first step converges to about 3-4, and top-1 accuracy on the LC test set is about 70-72%.
Details about the second step:
- LF data processing: 1) square_crop: crop the image to a square using the shortest edge. 2) tf.image.sample_distorted_bounding_box, then resize to 224x224 (for images that have no bbox, is this any different from a random crop?)
- take the features of ResNet50 block3 (endpoint['resnet_v1_50/block3']), with shape [batch_size, 11, 11, 1024]
- implement l2 norm
- two-layer attention net
- GAP (batch_size, 1, 1, 1024)
- followed by a conv2d logits layer (batch_size, 1, num_classes)

I am not sure whether I missed anything important. If I did, please tell me. Thanks!

Other attempts: using the LC dataset for attention training, the loss converges around 4.5~5 with accuracy 0.27; learning rates from 1e-1 to 1e-5 didn't work; directly training Resnet50 with block3 + GAP + conv logits didn't work.

LinXiLuo commented 5 years ago

@andrefaraujo I am waiting for your advice. Sincerely.

andrefaraujo commented 5 years ago

@LinXiLuo , sorry for the delay!

1. Shouldn't the resnet_v1_50/block3 endpoint have HxW of 7x7 instead of 11x11? The input is 224x224, and there is an effective stride of 32 between the input and resnet_v1_50/block3.
2. What do you mean by "implement l2 norm"?
3. BTW, details of the attention blocks are given here: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/delf_v1.py#L81
4. You are using softmax+cross-entropy loss, right?
5. All the losses you are reporting seem quite high, and the accuracies quite low. Usually, we can easily get very low loss (and quite high accuracy) on LC (LF is noisier, so it plateaus around 70% accuracy). Maybe exploring more learning rates could help. Looking at the training curves should help diagnose the issue; is the loss high even on the training set itself?
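
The 7x7 figure follows from simple arithmetic on the input size and the effective stride. The sketch below assumes that every strided layer rounds the spatial size up, as 'SAME'-style padding does; under that assumption the overall size is the ceiling of input size divided by effective stride.

```python
import math


def feature_map_size(input_size, effective_stride):
    """Spatial size of a feature map, assuming sizes are rounded up at each stride."""
    return math.ceil(input_size / effective_stride)


# 224x224 input with an effective stride of 32, as quoted above.
size_224 = feature_map_size(224, 32)
```

A feature map whose size differs from this value suggests that either the input resolution or the stride of the chosen endpoint is not what was expected.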

LinXiLuo commented 5 years ago

@andrefaraujo I am so glad to hear from you.

At first, I took the last layer of resnet_v1_50/block3, which gives [7x7] features. However, that didn't work. Then I tried endpoint['resnet_v1_50/block3'], because you said

Be sure to set target_layer_type to resnet_v1_50/block3

and I found that this feature is [11x11]. For now I use the [7x7] features from the last layer of resnet_v1_50/block3.

The l2 norm is applied to the [7x7] feature, as in the attention block.
Softmax+cross-entropy loss: yes.
The losses are high, yes, because many of the image links are invalid. I have tried many learning rates. I am now trying to train on Google Landmarks data (1000 randomly selected classes).

I think the most likely cause is the image processing with tf.image.sample_distorted_bounding_box.

First, since the images have no bbox, is tf.image.sample_distorted_bounding_box any different from a random crop?
Second, from visualizing the process, it seems that tf.image.sample_distorted_bounding_box randomly enlarges part of the image far too much, which does not help extract the target features. Its behavior is also quite different from randomly rescaling images by [0.25, 0.3536, 0.5000, 0.7072, 1.0].

andrefaraujo commented 5 years ago

I believe the last layer of resnet_v1_50/block3 should be the same as endpoint['resnet_v1_50/block3']? If not, seems like something strange may be going on.
Regarding tf.image.sample_distorted_bounding_box, there are a few differences with respect to simple random cropping: aspect_ratio_range, area_range. If you are currently having issues with this function, maybe just try center cropping initially to see if the loss goes down; if so, maybe indeed there is a bug in your usage of this function. In any case, your loss seems way too high.
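
For the center-cropping check suggested above, the crop box can be computed as in this sketch. The helper name and the returned tuple layout are choices made here for illustration.

```python
def center_square_box(height, width):
    """Return (offset_y, offset_x, size, size) for the largest centered square crop."""
    size = min(height, width)
    offset_y = (height - size) // 2
    offset_x = (width - size) // 2
    return offset_y, offset_x, size, size


# A 600 x 900 (height x width) image keeps its full height and the middle 600 columns.
box = center_square_box(600, 900)
```

The four values are in the same order as the offset_height, offset_width, target_height and target_width arguments of tf.image.crop_to_bounding_box.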

In the Google Landmarks dataset, we are able to train the model much better than in the LC/LF datasets, the loss goes down below 1 and train/validation accuracy is above 90%.

LinXiLuo commented 5 years ago

Hi @andrefaraujo, I am sorry to bother you again. I am stuck at the attention training stage.

I have a few points to confirm about the preprocessing for the attention stage.

For the multi-scale implementation:

- square crop to 900x900
- random crop to 720x720
- randomly rescale to 720*gamma with 0.25 < gamma < 1
- because all images in a mini-batch must have the same size, zero-pad the rescaled images back to 720x720
- feed the 720x720 padded images with batch_size 16 (do you resize the padded images to 224x224? Training with batch_size 512 would be faster.)

For the distorted-bbox implementation:

- tf.image.sample_distorted_bounding_box(tf.shape(image), bounding_boxes=[[[0, 0, 1, 1]]])
- resize the distorted images to 224x224 with bilinear interpolation

These are the two methods you mentioned in the paper and in this issue. Either way, the model does not converge, and the loss stays around 7.2 in the second (attention) stage.

BTW, I also tried simply using the same preprocessing as in the first stage, but it does not converge either.

andrefaraujo commented 5 years ago

Hi @LinXiLuo

Let me first say that it sounds like something else is likely wrong, I believe, because a loss of 7.2 in the attention training stage is too high. In my experience training several versions of DELF, these image preprocessing techniques don't make that much of a difference in the loss; even with little preprocessing, the loss should already start converging well (in the Google Landmarks dataset, it would get below 1.0).

1) Actually for the method described in the paper, we resize an entire mini-batch at a time, so then there is no need for padding. Eg, say gamma is 0.25 --> resize the entire mini-batch to 720*0.25 = 180 in each spatial dimension.

2) Looks correct.
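
The per-batch rescaling in point 1 can be sketched as follows: one gamma is drawn per mini-batch and every image in that batch is resized to the same side length, so no padding is needed. Drawing gamma uniformly from [0.25, 1] and rounding to the nearest integer are assumptions of this sketch; only the 720 base size and the gamma = 0.25 example come from the thread.

```python
import random

BASE_SIZE = 720  # crop size quoted in the thread


def batch_target_size(gamma, base_size=BASE_SIZE):
    """Side length to which every image in the mini-batch is resized."""
    return int(round(base_size * gamma))


def sample_gamma(rng, low=0.25, high=1.0):
    """Draw one scale factor per mini-batch (uniform sampling is an assumption)."""
    return rng.uniform(low, high)


rng = random.Random(0)
gamma = sample_gamma(rng)
size = batch_target_size(gamma)

# The example from the comment above: gamma = 0.25 gives 720 * 0.25 = 180.
smallest = batch_target_size(0.25)
```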

LinXiLuo commented 5 years ago

@andrefaraujo Thanks for your reply. Unfortunately, the loss stays at 7.2~7.4 throughout training. It seems nothing is being learned! I am also checking other possible issues in my code.

First, the backbone. I replaced my ckpt with your delf_v1_2017026 to focus only on attention training. (I saved only the layers before the attention layer in delf_v1_2017026 and restored them in my implementation.) However, it didn't work.
Second, the implementation. I replaced my code with the official AttentionModel: DelfV1().AttentionModel(images=image, num_classes=1000, training_attention=True). Using the official ckpt and the official implementation, and training only the attention layer and logits, the loss still remained at ~7.2.
Third, a simple task. I tried a simple image recognition task for the attention training stage: two-class landmark recognition, with 40,000 images per class for training and 6800 for testing. This should be more than enough for the network to overfit such a simple task. The loss drops from ~0.88 to ~0.76. However, accuracy is always 0.49-0.51. It behaves like a random guess machine : (
Fourth, optimization.
1. optimizer: momentum with 0.9, lr 1e-1.
2. loss function: slim.losses.softmax_cross_entropy(logits, labels, label_smoothing=FLAGS.label_smoothing, weights=1.0)

I am still confused about what is going wrong when training the attention layers.
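
A quick sanity check on these numbers: a classifier that outputs a uniform distribution over N classes has a softmax cross-entropy of ln(N). The sketch below computes that chance-level loss. It ignores label smoothing, which would shift the value somewhat.

```python
import math


def chance_level_loss(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


loss_1000 = chance_level_loss(1000)  # about 6.91
loss_2 = chance_level_loss(2)        # about 0.69
```

Compared with these values, a loss of ~7.2 with 1000 classes and a loss of ~0.76 with two classes are both no better than uniform guessing, which is consistent with the reported accuracy of about 0.5 on the two-class task.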

andrefaraujo commented 5 years ago

@LinXiLuo indeed, it sounds like there may be something wrong with your code. If you have code on github somewhere, I am happy to take a quick look and see if I can spot something.

LinXiLuo commented 5 years ago

@andrefaraujo You are so kind! I have pushed the reimplementation to my repo, including preprocessing, training and evaluation: https://github.com/LinXiLuo/my_delf/tree/master/research/delf/delf/python/training If you find any potential bugs, please let me know. Thanks a lot!

andrefaraujo commented 5 years ago

@LinXiLuo

I took a quick look, couldn't find anything wrong with it. One idea I had is if you could inspect the checkpoints to make sure training is indeed happening, variables are being updated. Here's a handy TF tool for this: inspect_checkpoint

Note that in the attention training stage only the attention and classifier layers should be changing, so you could also check that in the second stage the RN50 is not changing.
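
The check described above can be sketched as a comparison of two variable snapshots taken at different training steps. The variable names and values below are hypothetical. In practice the snapshots could be filled from checkpoints, for example by reading tensors with tf.train.load_checkpoint, but that loading step is left out here.

```python
def changed_variables(before, after, tolerance=0.0):
    """Names of variables whose values differ between two snapshots.

    before / after: dicts mapping variable name -> flat list of floats.
    """
    changed = []
    for name, old_values in before.items():
        new_values = after[name]
        if any(abs(a - b) > tolerance for a, b in zip(old_values, new_values)):
            changed.append(name)
    return sorted(changed)


# Hypothetical snapshots from two checkpoints of the attention stage.
before = {
    "resnet_v1_50/conv1/weights": [0.10, -0.20],
    "attention/conv1/weights": [0.50, 0.50],
    "classifier/logits/weights": [0.00, 0.30],
}
after = {
    "resnet_v1_50/conv1/weights": [0.10, -0.20],
    "attention/conv1/weights": [0.48, 0.53],
    "classifier/logits/weights": [0.01, 0.30],
}

changed = changed_variables(before, after)
backbone_changed = [n for n in changed if n.startswith("resnet_v1_50/")]
```

If training is working as intended, the attention and classifier variables appear in the changed list and the backbone list stays empty.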

LinXiLuo commented 5 years ago

@andrefaraujo

No more words can describe my gratitude to you. Actually, I have visualized the training process on tensorboard. The changes of the two attention and classifier layers are very slow and subtle.

TianMingChen commented 5 years ago

The dataset is correct, but the data we end up using was cleaned and released by the DIR paper, "full" and "clean" subsets. I believe the data (ie, URLs and labels) continues to be available on their website here. We used the train and validation sets defined in the same DIR paper (hyperparameters were tuned on the validation set).

We trained with different settings. When using a single GPU, training would take ~ 20 hours for each stage (fine-tuning and attention training). This would make for ~50 epochs in the Landmarks-Clean dataset, and ~12 epochs in the Landmarks-Full dataset. If using multiple GPUs, this can be sped up by a lot. We tried learning rates from 1e-1 to 1e-4, and picked the best run in the validation set. Note that we started from a pre-trained Imagenet model.

Can you upload the train code? Thanks.

anmol4210 commented 4 years ago

Hi @andrefaraujo, I am able to run the two-step fine-tuning for DELF on the landmark dataset. The first step, fine-tuning the original ResNet-50 as a classification network, seems fine and converges to ~90% top-1 accuracy on the Landmarks-Clean dataset. However, the second step, training the attention layers, behaves strangely. I built the model with the AttentionModel() function from delf_v1.py (is this the correct one to use?). On a single GPU (GTX 1080 Ti), I can only run a batch size of ~10 for the 900x900 center-cropped and 720*r randomly downscaled input images. It does not converge, and test accuracy stays extremely low. Can you shed some light on how you set your training parameters for the attention layers on a single GPU? Also, I'm fine-tuning the original ResNet on the Landmarks-Clean dataset and the attention layers on the Landmarks-Full dataset; is this the same as your setting? Thanks!

Hi @SunLoveSheep Could you please share your training code

uroosehar1 commented 4 years ago

Great to hear that. You can learn the basic workflow from one of the best videos I have seen: https://youtu.be/tfwQA3jy4Ks

andrefaraujo commented 4 years ago

Hello, I wanted to provide an update to this thread. We have just released a TF2-compatible version of the DELF library, which now also includes starter training code: https://github.com/tensorflow/models/tree/master/research/delf/delf/python/training

Please feel free to open another issue in case you encounter problems, happy to help.

cristyioan2000 commented 2 years ago

Hi, I am not sure how the PR curve in Figure 5 was obtained. Was the model trained on the LF and LC datasets from the DIR paper and validated on the Google Landmarks v2 dataset?

andrefaraujo commented 2 years ago

This curve was obtained with the initial version of the Google Landmarks dataset, which we used in the DELF paper. The dataset was modified before releasing it as GLDv1, so the curve cannot be reproduced exactly with the current externally-available dataset.

cristyioan2000 commented 2 years ago

Hi, the LC and LF datasets are no longer available here. Some URLs in these .txt files no longer work, and there are also a lot of corrupted images. Is there another place where LC and LF are stored as images rather than URLs?

andrefaraujo commented 2 years ago

Not that I know of. You should contact the original authors of the LC and LF datasets for that.

cristyioan2000 commented 2 years ago

Hi, when training with Google Landmarks v2, which dataset was used for validation (during fine-tuning and attention training)?

andrefaraujo commented 2 years ago

Please see the answer to that in these detailed instructions.

tensorflow / models

DELF: Training procedure #3387