switchablenorms / DeepFashion2

DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf

Some questions on the paper #14

Open andrefaraujo opened 5 years ago

andrefaraujo commented 5 years ago

Thanks for this great work, seems like a valuable contribution to the computer vision community!

I have a few detailed questions on the paper, which I hope you could clarify (apologies if I missed something in the paper):

  1. There are 491K images. How many are consumer, and how many are commercial?
  2. The paper states that there are 801K items and 43.8K identities (i.e., 18.3 items per identity on average). But the paper also states that "each identity has 12.7 items". I got confused here: shouldn't this be 18.3 instead?
  3. For the retrieval task, the metric is "top-K accuracy", but no exact definition is given. Is this the same definition as in the ImageNet case? Example: if a correct item is retrieved at position 9, my guess is that for this query the top-1/.../top-8 accuracies are zero and the top-9/.../top-20 accuracies are 1. Is this correct?
  4. For the experiments in the paper, is the network trained from scratch, or is training started from an ImageNet/COCO checkpoint?
  5. "For the retrieval task, each unique detected clothing item in consumer-taken image with highest confidence is selected as query". What happens if the detector fails and selects some non-clothing region (false positive detection)? Is this false positive box query simply ignored in retrieval scoring?
  6. I am a little confused by Table 5 ("Consumer-to-Shop Clothes Retrieval"). Do the different variants/rows ("class", "pose", "mask", "pose+class", "mask+class") correspond to models where only some losses are active? For example: for row corresponding to "pose", does this mean that only Lpose and Lpair are used during training? If yes, then how are boxes detected for the retrieval experiment in this case?
  7. Other questions on Table 5: a few combinations seem to be missing: would you have "pose+mask" and "pose+mask+class" results?
  8. Would you have results where the detector and match networks are trained separately? I.e., a case where a model first detects the boxes and crops the original image, and then a separate model extracts features from the cropped boxes and does the matching. I am wondering if this would work better, given that small objects could be captured better when a resized crop is fed through a different network.

Thanks in advance for the clarifications!

geyuying commented 5 years ago
  1. 334K images are from shops and 157K images are from consumers.
  2. There are 801K annotated clothing items in DeepFashion2. A specific clothing item that appears in both consumer and commercial images is called an identity. As shown in the attached figure, the first three images are from consumers and the last two are from shops; all five images share the same 'pair_id'. The clothing items with orange and green bounding boxes belong to one identity and can form positive pairs. Clothing items whose bounding boxes are not shown in the figure do not belong to any identity. Because such items are excluded from the count, each identity has 12.7 items on average, not 801/43.8 = 18.3. (You can refer to https://github.com/switchablenorms/DeepFashion2/issues/10 for details.)
  3. The evaluation metric is top-k retrieval accuracy: a query counts as a hit if the exact same clothing item appears in the top-k results, and as a miss otherwise. Your understanding is correct.
  4. The network is pretrained from an ImageNet checkpoint.
  5. For the clothes retrieval task, we provide a more realistic evaluation setting: instead of being given the ground-truth query clothing item, you must detect clothing items in consumer images yourself. For each detected clothing item, you submit the top-20 retrieved clothing items detected from shop images. During evaluation, for each ground-truth query item (whose style is greater than 0), we select one detected item to represent it: first, a ground-truth label is assigned to each detected query clothing item according to its IoU with all ground-truth items; then, among the detected items that are assigned the given ground-truth label and are classified correctly, the one with the highest confidence score is selected. The retrieved results of this selected query item are what we evaluate. A retrieved item from the shop images is counted as positive if its IoU with one of the ground-truth corresponding gallery items exceeds the threshold (we set the threshold to 0.5). If no detected item is assigned the given query item's label, that query item is counted as missed. (A sketch of this selection and scoring logic is given at the end of this comment.)
  6. In all retrieval experiments, bounding boxes are detected by the class branch. For the row corresponding to "pose", the boxes detected by the class branch are fed into the landmark branch to extract features for retrieval.
  7. "pose+mask" and "pose+mask+class" results are not available yet. We will add experiments later on.
  8. That is a meaningful idea. We may run such experiments later on.

Hope these explanations will be helpful for you.
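
For concreteness, here is a minimal Python sketch of the query-selection and top-k scoring described in answers 3 and 5. The data layout (dicts with `box`, `category`, `score`, and `retrieved` fields) is purely illustrative and not the official evaluation code; only the IoU threshold of 0.5 and the top-20 cutoff come from the answers above.

```python
# Illustrative sketch only -- field names ("box", "category", "score",
# "retrieved") are hypothetical, not the official DeepFashion2 evaluation code.

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_query_detection(gt_item, detections, iou_thresh=0.5):
    """Pick the detection that represents a ground-truth query item:
    assigned to this ground truth by IoU, classified correctly, and with
    the highest confidence. Returns None if the query item is missed.
    (Simplified: assignment here just checks IoU against this one ground truth.)"""
    candidates = [d for d in detections
                  if iou(d["box"], gt_item["box"]) >= iou_thresh
                  and d["category"] == gt_item["category"]]
    return max(candidates, key=lambda d: d["score"]) if candidates else None

def top_k_hit(query_detection, gt_gallery_items, k=20, iou_thresh=0.5):
    """A hit if any of the top-k retrieved shop detections overlaps one of the
    ground-truth gallery items of the same identity with IoU of at least 0.5."""
    return any(iou(shop["box"], g["box"]) >= iou_thresh
               for shop in query_detection["retrieved"][:k]
               for g in gt_gallery_items)
```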

andrefaraujo commented 5 years ago

Thanks for all of these answers! I understand much better now, but am still confused about a few things:

Thanks again :)

geyuying commented 5 years ago
  1. It is correct.
  2. For each ground-truth query item, we select a detected item to represent it (you can refer to the answer above for details). If a false-positive detection is selected to represent a ground-truth query item, it is still considered in the scoring.
  3. Your understanding is correct. In fact, for the pose case the match-network features come from the 14x14x256 feature map (after RoIAlign), and for the class case from the 7x7x256 feature map (after RoIAlign). These achieve better retrieval results than other feature maps. (See the sketch below.)
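
To make the two feature sizes concrete, here is a small sketch using torchvision's `roi_align`. The stride-16 feature map and the box are made up; this only illustrates the 7x7 vs. 14x14 RoI resolutions, not the Match R-CNN implementation.

```python
import torch
from torchvision.ops import roi_align

# A made-up 256-channel feature map at stride 16 for an ~800x1333 input image.
feat = torch.randn(1, 256, 50, 84)

# One detected box in image coordinates, prefixed with its batch index.
boxes = torch.tensor([[0.0, 100.0, 150.0, 400.0, 600.0]])

# 7x7x256 RoI features, as used for the "class" retrieval variant.
class_feat = roi_align(feat, boxes, output_size=(7, 7), spatial_scale=1 / 16)
# 14x14x256 RoI features, as used for the "pose" retrieval variant.
pose_feat = roi_align(feat, boxes, output_size=(14, 14), spatial_scale=1 / 16)

print(class_feat.shape)  # torch.Size([1, 256, 7, 7])
print(pose_feat.shape)   # torch.Size([1, 256, 14, 14])
```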
lu-jian-dong commented 5 years ago

The dataset statistics show 390,884 images, but the released data contains only 191,961 images.

geyuying commented 5 years ago

Half of the training set has been released at present.

lu-jian-dong commented 5 years ago

  1. Is your benchmark trained on the released dataset?
  2. I trained a detection model on the released dataset with Mask R-CNN and got an mAP of 0.60, far from your benchmark. Can you provide some advice?

geyuying commented 5 years ago
  1. The benchmark released on GitHub is obtained by training on the whole dataset, not the released subset. The benchmark released in the DeepFashion2 Challenge is obtained with the released dataset.
  2. Our detection result is 0.638 (mAP) with the released dataset. Which config yaml do you use in mask-rcnn? You can try increasing the image size, pretraining from ImageNet, or increasing the training time; see the sketch below.
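
In an mmdetection 1.x-style config, those suggestions would look roughly like this (key names follow mmdetection conventions, but the specific values, including the larger scale, are illustrative and are not the settings behind the benchmark):

```python
# Sketch of the suggested tweaks in an mmdetection 1.x-style config.
# Values are illustrative only.

# Pretrain the backbone from ImageNet.
model = dict(pretrained='torchvision://resnet50')

# Use a larger input scale than the default (1333, 800).
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='Resize', img_scale=(1600, 1000), keep_ratio=True),  # hypothetical larger scale
    dict(type='RandomFlip', flip_ratio=0.5),
]

# Train longer than the default 12-epoch (1x) schedule.
lr_config = dict(policy='step', step=[16, 22])
total_epochs = 24
```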
lu-jian-dong commented 5 years ago

Thanks very much. The config yaml I used is mask_rcnn_r50_caffe_c4_1x.py. I just converted the released dataset to COCO format and trained on 8 GPUs with 12 total epochs and img_scale (1333, 800). Is your Match R-CNN implemented with mmdetection?

geyuying commented 5 years ago

Match R-CNN is implemented in Detectron (https://github.com/facebookresearch/Detectron) with e2e_faster_rcnn_R-50-FPN_1x.yaml.

andrefaraujo commented 5 years ago

Thanks @geyuying for all answers to my questions! I understand everything much better now :)

LouisLang1002 commented 5 years ago

> Match R-CNN is implemented in Detectron (https://github.com/facebookresearch/Detectron) with e2e_faster_rcnn_R-50-FPN_1x.yaml.

Could you please share your training settings (input size, training epochs, learning rate)? Thanks a lot.

geyuying commented 5 years ago

@LouisLang1002 Input size (800, 1333); initial learning rate 0.02, reduced to 0.002 after epoch 16 and to 0.0002 after epoch 22; training ends at epoch 24.
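
Spelled out as a function (a trivial sketch of the schedule above, not code from the repo):

```python
def learning_rate(epoch):
    """Step schedule from the reply above: 0.02 for epochs 1-16,
    0.002 for epochs 17-22, 0.0002 for epochs 23-24."""
    if epoch <= 16:
        return 0.02
    if epoch <= 22:
        return 0.002
    return 0.0002
```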

LouisLang1002 commented 5 years ago

> @LouisLang1002 Input size (800, 1333); initial learning rate 0.02, reduced to 0.002 after epoch 16 and to 0.0002 after epoch 22; training ends at epoch 24.

Thanks for your guidance. Have you tried a deeper backbone instead of R-50? And why do you use Faster R-CNN instead of Mask R-CNN? Looking forward to your reply.

geyuying commented 5 years ago

@LouisLang1002 A deeper backbone leads to OOM in my experiments. Actually, we use Mask R-CNN, but with e2e_faster_rcnn_R-50-FPN_1x.yaml for the detection model, not Faster R-CNN.

LouisLang1002 commented 5 years ago

> @LouisLang1002 A deeper backbone leads to OOM in my experiments. Actually, we use Mask R-CNN, but with e2e_faster_rcnn_R-50-FPN_1x.yaml for the detection model, not Faster R-CNN.

I'm confused: which model did you use to get mAP 63.8 on the released dataset?

[Attached images: Model A and Model B]

geyuying commented 5 years ago

@LouisLang1002 Model A for detection, which gets mAP 63.8. Sorry that I misunderstood your question. We didn't use Mask R-CNN because we want to evaluate detection performance only with bounding boxes, not masks.

vinjohn commented 5 years ago

@geyuying Hi, I see that different categories have different numbers of landmarks. How do you train the landmarks? Do you use the max number of landmarks across all categories for the landmark output channels so that you can train all categories at the same time, or do you train each category separately with different models? Thanks!

geyuying commented 5 years ago

@vinjohn We use the max number of landmarks across all categories (294 in total) for the landmark output channels.
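
A minimal sketch of what such a shared landmark head looks like, i.e. one heatmap channel per landmark across all categories. This is not the released Match R-CNN code, and the assumption that channels belonging to other categories are masked out of the loss is mine.

```python
import torch
import torch.nn as nn

NUM_LANDMARKS = 294  # max number of landmarks over all categories, as stated above

class LandmarkHead(nn.Module):
    """Illustrative keypoint head with one heatmap channel per landmark
    across all categories (not the actual Match R-CNN implementation)."""

    def __init__(self, in_channels=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # One output channel per landmark; channels of other categories are
        # assumed to be masked out of the training loss.
        self.predict = nn.Conv2d(256, NUM_LANDMARKS, kernel_size=1)

    def forward(self, roi_feats):                   # roi_feats: (N, 256, 14, 14)
        return self.predict(self.convs(roi_feats))  # heatmaps: (N, 294, 14, 14)


# Example: heatmaps for 4 RoIs.
heatmaps = LandmarkHead()(torch.randn(4, 256, 14, 14))
print(heatmaps.shape)  # torch.Size([4, 294, 14, 14])
```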

ronnie-tian commented 5 years ago
> Your understanding is correct. In fact, for the pose case the match-network features come from the 14x14x256 feature map (after RoIAlign), and for the class case from the 7x7x256 feature map (after RoIAlign). These achieve better retrieval results than other feature maps.

I'm confused by your explanation: how do you combine these two features and feed them into the MN, since the feature map sizes are different?

anikola commented 4 years ago

@geyuying Could you please tell me the STEPS_PER_EPOCH and VALIDATION_STEPS values that you used for training? I mean, during a single epoch do you train on the whole dataset, or is STEPS_PER_EPOCH 3750 (90000/24)? Also, if I understood correctly, do you train only the heads of Mask R-CNN? Thank you!!