bkj closed this issue 4 years ago
I have a dataset of images which I want to use for DELF extraction. As far as I understood from this node the major steps of finetuning are:
Am I right?
@andrefaraujo @SunLoveSheep
I obtained the model for the first stage as net, end_points = model.GetResnet50Subnetwork(images, global_pool=True, is_training=is_training, reuse=reuse)
How can I train it now? As far as I understand, I should start from the checkpoint resnet_v1_50.ckpt.
Hi @kuznetsoffandrey ,
If all you want is to extract features using our model that is pre-trained on a landmarks dataset, you can follow the instructions here.
In terms of training: the steps you list are correct. You should first fine-tune the Resnet50, then in a second training step train only the attention part. Once these two training steps are concluded, you can run inference by extracting the features from the network, applying post-processing and selecting them based on attention scores.
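The last inference step described above (selecting features based on attention scores) can be sketched as follows. This is a minimal NumPy illustration, not the DELF library's own extraction code; the function name, max_features and score_threshold are hypothetical and chosen only for the example.

```python
import numpy as np

def select_by_attention(descriptors, scores, max_features=1000, score_threshold=0.0):
    """Keep the descriptors with the highest attention scores.

    descriptors: (N, D) array of local descriptors.
    scores: (N,) array with one attention score per descriptor.
    max_features and score_threshold are illustrative values only.
    """
    # Drop descriptors whose score falls below the threshold.
    keep = np.flatnonzero(scores >= score_threshold)
    # Sort the survivors by descending score and truncate.
    order = keep[np.argsort(-scores[keep], kind="stable")][:max_features]
    return descriptors[order], scores[order]

descriptors = np.arange(12, dtype=np.float32).reshape(4, 3)
scores = np.array([0.1, 0.9, 0.5, 0.7])
top_desc, top_scores = select_by_attention(descriptors, scores, max_features=2)
```

Here the two highest-scoring rows (indices 1 and 3) are kept, in descending score order.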
In terms of training it with tensorflow: since the model is trained in a classification setting, you can do it very similarly as one would train a model on Imagenet, for example. There are several tutorials online for how to do something like this, for example:
Thank you @andrefaraujo. I successfully fine-tuned the Resnet50, but I am stuck on the second part. If I use Keras, should I take the attention model from delf_v1, recreate the same model in Keras, and then fine-tune it on the features produced in the first step?
Yes, that's correct. If you fine-tuned the network, the next step is to train only the attention module, which you can do in a similar manner as your training for the first step, except that now all other layers should be frozen.
As I mentioned in a related question above: "For the second step: yes, AttentionModel is the right function to use. Be sure to set target_layer_type to resnet_v1_50/block3."
Unfortunately I am not able to provide images. Some images were missing when we downloaded the dataset as well. This should be fine, since the dataset is usually only used for training.
Hi @andrefaraujo, can you provide the images that you have downloaded? When I try to download the DIR Landmarks dataset, many URLs are broken and I can only download around 20K images (compared with the 35,382 images you mention in the paper). I have also asked the authors of the DIR paper for help, but they didn't keep the images either.
Hi @ArtanisCV,
Unfortunately I cannot provide images either. I would suggest that you use the Google Landmarks Dataset (https://www.kaggle.com/google/google-landmarks-dataset), which is a much larger and more comprehensive dataset.
@andrefaraujo Hi, I would like to know why this project uses only ResNet50, and only block3 as the feature map, given that deeper networks are known to perform better as feature extractors. Also, what kernel size does the attention conv layer use, and does the kernel size influence the final result?
@XiaodanLi001 Hi, I am also trying to reimplement this paper in PyTorch. Did you solve the problem above, and can you share your code?
@mlinxiang Yes, possibly other deeper extractors may improve performance. For attention network kernel size, we have used 1 (as you can see in the code: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/delf_v1.py#L85). We have not observed much difference by using larger values.
@andrefaraujo Hey, which optimizer did you use? Great work btw. :)
So I don't know if that helps, but Adam didn't work for me ;)
We are using momentum with parameter 0.9.
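For readers unfamiliar with the update rule, a single SGD-with-momentum step with momentum 0.9 can be sketched as below. This follows the common "accumulate the gradient, then step along the accumulator" form; the learning rate of 0.1 and the toy objective are assumptions for the demonstration, not values confirmed for DELF training.

```python
def momentum_step(param, grad, velocity, learning_rate, momentum=0.9):
    """One SGD-with-momentum update: v = m * v + g, then p = p - lr * v."""
    velocity = momentum * velocity + grad
    param = param - learning_rate * velocity
    return param, velocity

# Toy example: minimize f(x) = x**2 starting from x = 1.
x, v = 1.0, 0.0
for _ in range(2):
    grad = 2.0 * x  # gradient of f(x) = x**2
    x, v = momentum_step(x, grad, v, learning_rate=0.1)
```

The same function works unchanged on NumPy arrays, since it only uses elementwise arithmetic.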
Do I understand correctly that each image's descriptors are L2-normalized independently, or is each channel normalized across several descriptors?
Each descriptor is normalized independently.
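In other words, normalization runs along the descriptor dimension, one row at a time. A minimal NumPy sketch, assuming descriptors are stored as an (N, D) array (the function name and eps guard are illustrative):

```python
import numpy as np

def l2_normalize_rows(descriptors, eps=1e-12):
    """L2-normalize each descriptor (row) independently of the others."""
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero descriptors.
    return descriptors / np.maximum(norms, eps)

descriptors = np.array([[3.0, 4.0], [0.0, 2.0]])
normalized = l2_normalize_rows(descriptors)
```

Each row of normalized has unit length; no statistics are shared across descriptors or channels.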
@andrefaraujo Hi, thanks for your code. I'm trying to use another dataset to train DELF. I have already fine-tuned the ResNet50 model successfully from the ImageNet pretrained model, but I am stuck at the second training stage. I use the AttentionModel function from delf_v1.py to train the attention module, with all the ResNet layers frozen. Strangely, after decreasing for a few steps, the loss doesn't change. What preprocessing and normalization do you apply to your input images? For example, should an image pixel be in the range [0, 1], [-1, 1] or [0, 255]?
The image pixel range should mainly depend on what you used in the finetuning stage -- ie, you should probably reuse the same convention, otherwise your Resnet-trained layers would likely not work properly. In our own code, we normalize to [-1, 1].
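One common way to map 8-bit pixels to [-1, 1] is shown below. The exact formula used in the authors' code is not stated in this thread, so treat this as an assumption; what matters is using the same convention in both training stages.

```python
import numpy as np

def scale_to_unit_range(image_uint8):
    """Map uint8 pixel values in [0, 255] to floats in [-1, 1]."""
    return image_uint8.astype(np.float32) / 127.5 - 1.0

pixels = np.array([0, 255], dtype=np.uint8)
scaled = scale_to_unit_range(pixels)
```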
Please let me know if you have additional questions. The attention training stage should really be very similar to the first stage, except that you apply the attention model on top of the RN50-block3 features, and do attention-weighted average pooling and feed the pooled 1024D feature into the classifier.
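The attention-weighted pooling described above can be sketched in NumPy as follows. It assumes a (H, W, C) feature map and a (H, W) map of non-negative attention scores, and takes the spatial mean of the score-weighted features; whether to divide by the number of positions or by the sum of the scores is a design choice this sketch does not settle.

```python
import numpy as np

def attention_weighted_pool(feature_map, attention_scores):
    """Pool a (H, W, C) feature map into a (C,) vector using (H, W) scores."""
    # Broadcast each position's score across its C channels.
    weighted = feature_map * attention_scores[..., np.newaxis]
    # Average over the two spatial dimensions.
    return weighted.mean(axis=(0, 1))

# Tiny example that can be checked by hand.
feature_map = np.array([[[1.0, 2.0], [3.0, 4.0]],
                        [[5.0, 6.0], [7.0, 8.0]]])
attention_scores = np.array([[1.0, 0.0],
                             [0.0, 1.0]])
pooled = attention_weighted_pool(feature_map, attention_scores)

# Shape check with block3-like dimensions (7x7 positions, 1024 channels).
rng = np.random.default_rng(0)
pooled_1024 = attention_weighted_pool(rng.random((7, 7, 1024)), rng.random((7, 7)))
```

The pooled vector is what would be fed to the classifier in the second training stage.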
Does anyone have sample code showing how to train DELF from scratch on another dataset?
Hi @andrefaraujo, I have some problems with the attention training: the loss doesn't change and stays around 6.7. My steps are as follows:
Other attempts:
LC dataset for attention training: the loss converges to around 4.5~5, acc 0.27.
Learning rates from 1e-1 to 1e-5: didn't work.
Directly training Resnet50 with Block3 + GAP + conv logits: didn't work.
@andrefaraujo I am waiting for your advice. Sincerely.
@LinXiLuo , sorry for the delay!
@andrefaraujo I am so glad to hear from you.
Be sure to set target_layer_type to resnet_v1_50/block3
and I just found that feature is [11x11]. But now I use the [7x7] features from the last layer of resnet_v1_50/block3.
The l2 norm is implemented on [7x7] feature, as attention block.
Softmax+cross-entropy loss. Yes.
The losses are high, yes, because many of the image links are invalid. I have tried many learning rates. I am now trying to train on Google Landmarks data (1000 randomly selected classes).
The most likely cause, I think, is the image processing done by tf.image.sample_distorted_bounding_box.
First, since the images have no bounding boxes, does applying tf.image.sample_distorted_bounding_box differ in any way from a random crop?
Second, from visualizing the process, it seems that tf.image.sample_distorted_bounding_box randomly enlarges part of the image far too much, which is not conducive to extracting target features. The result is also quite different from randomly rescaling images by a scale in [0.25, 0.3536, 0.5000, 0.7072, 1.0].
I believe the last layer of resnet_v1_50/block3 should be the same as endpoint['resnet_v1_50/block3']? If not, seems like something strange may be going on.
Regarding tf.image.sample_distorted_bounding_box, there are a few differences with respect to simple random cropping: aspect_ratio_range, area_range. If you are currently having issues with this function, maybe just try center cropping initially to see if the loss goes down; if so, maybe indeed there is a bug in your usage of this function. In any case, your loss seems way too high.
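For the center-cropping sanity check suggested above, a deterministic crop can be sketched in NumPy like this. The function name and the error handling are illustrative choices, not part of the DELF code.

```python
import numpy as np

def center_crop(image, crop_height, crop_width):
    """Return the central crop_height x crop_width window of an image."""
    height, width = image.shape[:2]
    if crop_height > height or crop_width > width:
        raise ValueError("crop size exceeds image size")
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return image[top:top + crop_height, left:left + crop_width]

image = np.arange(6 * 8).reshape(6, 8)
crop = center_crop(image, 2, 4)
```

Because the crop has no randomness, any remaining failure to converge cannot be blamed on the augmentation.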
In the Google Landmarks dataset, we are able to train the model much better than in the LC/LF datasets, the loss goes down below 1 and train/validation accuracy is above 90%.
Hi @andrefaraujo I am sorry to bother you again. I am stuck with the attention training stage.
I have a few points to confirm about preprocessing of attention stage.
square crop to 900x900
random crop to 720x720
Randomly rescale to 720*gamma with 0.25 < gamma < 1
Because all images in a mini-batch must have the same size, zero-pad the rescaled images back to 720x720.
Feed the 720x720 padded images with batch_size 16 (do you resize the padded images to 224x224? Training with batch_size 512 would be faster).
tf.image.sample_distorted_bounding_box( tf.shape(image), bounding_boxes=[[[0, 0, 1, 1]]])
resize distorted images to 224x224 with bilinear.
These are the two methods you mentioned in the paper and in this issue. However, either way, the model does not converge and the loss stays around 7.2 in the second (attention) stage.
BTW, I tried to just simply use the same preprocessing as the first stage, but it cannot converge either.
Hi @LinXiLuo
Let me first say that it sounds like something else is likely wrong, I believe, because a loss of 7.2 in the attention training stage is too high. In my experience training several versions of DELF, these image preprocessing techniques don't make that much of a difference in the loss; even with little preprocessing, the loss should already start converging well (in the Google Landmarks dataset, it would get below 1.0).
1) Actually for the method described in the paper, we resize an entire mini-batch at a time, so then there is no need for padding. Eg, say gamma is 0.25 --> resize the entire mini-batch to 720*0.25 = 180 in each spatial dimension.
2) Looks correct.
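Point 1 can be illustrated with a small NumPy sketch: one gamma is applied to the whole mini-batch, so every image ends up the same size and no padding is needed. Nearest-neighbor sampling is used here only to keep the example dependency-free; it is an assumption, and a real pipeline would more likely use a bilinear resize from its image library.

```python
import numpy as np

def resize_batch_nearest(batch, gamma):
    """Resize a (N, H, W, C) mini-batch by a single scale factor gamma."""
    _, height, width, _ = batch.shape
    new_h = max(1, int(round(height * gamma)))
    new_w = max(1, int(round(width * gamma)))
    # Map each output row/column back to its source row/column.
    rows = (np.arange(new_h) * height) // new_h
    cols = (np.arange(new_w) * width) // new_w
    return batch[:, rows][:, :, cols]

batch = np.zeros((2, 720, 720, 3), dtype=np.uint8)
resized = resize_batch_nearest(batch, gamma=0.25)
```

With gamma = 0.25 the 720x720 batch becomes 180x180, matching the example in point 1.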
@andrefaraujo Thanks for your reply. Unfortunately, the loss stays at 7.2~7.4 throughout training. It seems nothing is being learned! I am also checking other possible issues in my code.
First, the backbone. I replaced my ckpt with your delf_v1_2017026 so as to focus only on attention training. (I saved only the layers before the attention layer from delf_v1_2017026 and restored them in my implementation.) However, it didn't work.
Second, the implementation.
I replaced my code with AttentionModel:
DelfV1().AttentionModel(images=image, num_classes=1000, training_attention=True)
Using the official ckpt and the official implementation, I trained only the attention layer and the logits.
The loss still remained at ~7.2.
Third, a simple task. I tried a simple image recognition task for the attention training stage: two-class landmark recognition, with 40,000 images per class for training and 6800 for testing. That much data should be enough for the network to overfit on such a small task. The loss drops from ~0.88 to ~0.76, but accuracy stays at 0.49-0.51. It behaves like a random-guess machine : (
Fourth, the optimization.
slim.losses.softmax_cross_entropy(logits, labels, label_smoothing=FLAGS.label_smoothing, weights=1.0)
I am still confused about what is going wrong when training the attention layers.
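A quick way to read these numbers: a classifier that predicts a uniform distribution over K classes has a softmax cross-entropy of ln(K), with or without label smoothing. The sketch below computes that reference value; it is a diagnostic aid, not a fix, and the function name is illustrative.

```python
import math

def chance_level_loss(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes."""
    return math.log(num_classes)

loss_1000 = chance_level_loss(1000)  # about 6.91
loss_2 = chance_level_loss(2)        # about 0.69
```

The reported ~7.2 with num_classes=1000 and ~0.76 with two classes are both at or slightly above these chance-level values, which is consistent with the 0.49-0.51 accuracy and suggests the logits carry essentially no class information.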
@LinXiLuo indeed, it sounds like there may be something wrong with your code. If you have code on github somewhere, I am happy to take a quick look and see if I can spot something.
@andrefaraujo You are so kind! The reimplementation has been pushed to my repo, including preprocessing, train, eval. https://github.com/LinXiLuo/my_delf/tree/master/research/delf/delf/python/training If you find potential bugs, please let me know. Thanks a looooooot!
@LinXiLuo
I took a quick look, couldn't find anything wrong with it. One idea I had is if you could inspect the checkpoints to make sure training is indeed happening, variables are being updated. Here's a handy TF tool for this: inspect_checkpoint
Note that in the attention training stage only the attention and classifier layers should be changing, so you could also check that in the second stage the RN50 is not changing.
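Once variable values from two checkpoints have been read into memory, the check described above reduces to a comparison like the one below. This sketch assumes the values are already available as dictionaries of NumPy arrays; the variable names in the example are hypothetical placeholders, not the actual names in the DELF checkpoint.

```python
import numpy as np

def changed_variables(before, after, atol=0.0):
    """Return the sorted names of variables whose values differ."""
    return sorted(
        name for name in before
        if name in after
        and not np.allclose(before[name], after[name], rtol=0.0, atol=atol)
    )

# Hypothetical snapshots taken at two training steps.
before = {
    "resnet_v1_50/conv1/weights": np.ones(3),
    "attention/weights": np.zeros(3),
}
after = {
    "resnet_v1_50/conv1/weights": np.ones(3),
    "attention/weights": np.full(3, 0.01),
}
changed = changed_variables(before, after)
```

In a healthy second stage, only attention and classifier variables should appear in the result; an empty list means nothing is training, and any backbone name means the freeze is not working.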
@andrefaraujo
No more words can describe my gratitude to you. Actually, I have visualized the training process on tensorboard. The changes of the two attention and classifier layers are very slow and subtle.
The dataset is correct, but the data we end up using was cleaned and released by the DIR paper, "full" and "clean" subsets. I believe the data (ie, URLs and labels) continues to be available on their website here. We used the train and validation sets defined in the same DIR paper (hyperparameters were tuned on the validation set).
We trained with different settings. When using a single GPU, training would take ~ 20 hours for each stage (fine-tuning and attention training). This would make for ~50 epochs in the Landmarks-Clean dataset, and ~12 epochs in the Landmarks-Full dataset. If using multiple GPUs, this can be sped up by a lot. We tried learning rates from 1e-1 to 1e-4, and picked the best run in the validation set. Note that we started from a pre-trained Imagenet model.
Can you upload the train code? Thanks.
Hi @andrefaraujo, I am able to run the two-step fine-tuning for DELF on the landmark dataset. The first step, fine-tuning the original ResNet-50 as a classification network, seems OK and converges to ~90% top-1 accuracy on the Landmark Clean dataset. However, the second step, training the attention layers, behaves strangely. I built the model with the AttentionModel() function from delf_v1.py (is this the right one to use?). On a single GPU (GTX1080ti), I can only run a batch size of ~10 for the 900x900 center-cropped and 720*r randomly downscaled input images. It does not converge, and test accuracy stays extremely low. Can you shed some light on how you set your training parameters for the attention layers on a single GPU? Also, I'm fine-tuning the original ResNet on the Landmark Clean dataset and the attention layers on the Landmark Full dataset; is this the same as your setting? Thanks!
Hi @SunLoveSheep, could you please share your training code?
Great to hear that. You can learn the basic workflow from one of the best videos I have seen: https://youtu.be/tfwQA3jy4Ks
Hello, I wanted to provide an update to this thread. We have just released a TF2-compatible version of the DELF library, which now also includes starter training code: https://github.com/tensorflow/models/tree/master/research/delf/delf/python/training
Please feel free to open another issue in case you encounter problems, happy to help.
Hi, I am not sure how this PR curve (Figure 5) was obtained. Was the model trained on the LF and LC datasets from the DIR paper and validated on the Google Landmarks v2 dataset?
This curve was obtained with the initial version of the Google Landmarks dataset, which we used in the DELF paper. The dataset was modified before releasing it as GLDv1, so the curve cannot be reproduced exactly with the current externally-available dataset.
Hi, the LC and LF datasets are no longer available here. Some URLs in these .txt files no longer work, and there are also a lot of corrupted images. Is there another place where LC and LF are stored as images rather than URLs?
Not that I know of. You should contact the original authors of the LC and LF datasets for that.
Hi, when training with Google Landmarks v2, what dataset was used for validation (during fine-tuning and attention training)?
Please see the answer to that in these detailed instructions.
Are the DELF authors able to give a little more detail about how they train their model? Any insight into things like
would be super helpful. Any specific pointers to other projects (maybe in this repo?) that used a roughly similar procedure would be helpful as well.
EDIT: Also, can you verify that both the fine-tuning and attention models were trained on this dataset, rather than the Google-Landmarks dataset introduced in your paper?
Thanks Ben
cc @andrefaraujo