switchablenorms / DeepFashion2

DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf

Question about the design of Match-Net and the features fed in. #31

Open xwjabc opened 4 years ago

xwjabc commented 4 years ago
  1. According to the paper, the feature extractor of the match-net has 4 conv layers, one pooling layer, and one FC layer. Are these layers:
     - Conv1: 3x3 conv - 256 channels -> ReLU
     - Conv2: 3x3 conv - 256 channels -> ReLU
     - Conv3: 3x3 conv - 1024 channels -> ReLU
     - Conv4: 3x3 conv - 1024 channels -> ReLU
     - Pooling: GlobalAvgPool
     - FC: 1024 to 256 channels (no ReLU)

     Besides, does the similarity learning net have:
     - Subtraction (output 256 channels)
     - Element-wise square (output 256 channels)
     - FC: 256 to 1 channel (no ReLU)
     - Sigmoid function

     Am I correct?

  2. The mask head has the procedure: backbone -> RoI Pooling -> 4x conv (feature extractor) -> 1x deconv + 1x conv (predictor). So for the experiments in the paper that use mask features, the RoI features fed into the match net should be the features after RoI Pooling. Am I correct? Is there an individual RoI Pooling for the match net, or do you just re-use the RoI-pooled features from the mask head?

geyuying commented 4 years ago
  1. You are correct.
  2. Just re-use the RoI-pooled features from the mask head, because after the second stage, the features from RoI Align already contain mask information. We tried using features from other layers, but got worse performance.
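
For readers following the data flow, here is a minimal PyTorch sketch of what re-using the RoI-pooled features looks like. The module names, the 13-category mask predictor, and the 256 x 14 x 14 RoI shape are assumptions for illustration, not the repo's actual code.

```python
import torch
import torch.nn as nn

# Assumed RoI Align output for 8 proposals (the Mask R-CNN mask branch is
# typically 256 channels at 14x14 resolution).
roi_feats = torch.randn(8, 256, 14, 14)

# Mask head: 4 convs (feature extractor) + 1 deconv + 1 conv (predictor).
mask_fcn = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
)
mask_predictor = nn.Sequential(
    nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
    nn.Conv2d(256, 13, 1),            # 13 clothing categories in DeepFashion2
)

mask_logits = mask_predictor(mask_fcn(roi_feats))   # mask branch as usual
match_input = roi_feats                             # match net re-uses the same RoI features
```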
geyuying commented 4 years ago

A few corrections to your list:

- Conv1: 3x3 conv - 256 channels -> ReLU
- Conv2: 3x3 conv - 256 channels -> ReLU
- Conv3: 3x3 conv - 256 channels -> ReLU
- Conv4: 3x3 conv - 1024 channels -> ReLU
- Pooling: GlobalAvgPool -> ReLU
- FC: 1024 to 256 channels (no ReLU) + BN

Besides, the similarity learning net has:
- Subtraction (output 256 channels)
- Element-wise square (output 256 channels)
- FC: 256 to 2 channels (no ReLU). The first channel means similarity, the second channel means difference. Positive pairs are labeled (1, 0); negative pairs are labeled (0, 1).
- Softmax function
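
Putting the corrected layer list together, a minimal PyTorch sketch of the match net could look like the following. The class name, the 14x14 input size, and other details are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class MatchNet(nn.Module):
    """Sketch of the feature extractor + similarity net described above."""
    def __init__(self, in_channels=256, embed_dim=256):
        super().__init__()
        self.extractor = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),   # Conv1
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),           # Conv2
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),           # Conv3
            nn.Conv2d(256, 1024, 3, padding=1), nn.ReLU(),          # Conv4
            nn.AdaptiveAvgPool2d(1), nn.ReLU(),                     # GlobalAvgPool -> ReLU
            nn.Flatten(),
            nn.Linear(1024, embed_dim),                             # FC, no ReLU
            nn.BatchNorm1d(embed_dim),                              # + BN
        )
        # Similarity learning: subtract, square element-wise, FC to 2 logits.
        self.classifier = nn.Linear(embed_dim, 2)

    def forward(self, user_rois, shop_rois):
        u = self.extractor(user_rois)            # (N, 256)
        s = self.extractor(shop_rois)            # (N, 256)
        diff_sq = (u - s) ** 2                   # subtraction + element-wise square
        logits = self.classifier(diff_sq)        # channel 0: similar, channel 1: different
        return logits.softmax(dim=1)

# Example with RoI-pooled features from the mask branch (assumed 256 x 14 x 14):
net = MatchNet()
scores = net(torch.randn(4, 256, 14, 14), torch.randn(4, 256, 14, 14))
```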

xwjabc commented 4 years ago

Thank you for your great help! Besides, I have two more questions:

  1. In the first version of your answer about the match network, I noticed that there are several tile operations:

    INFO net.py: 263: self1 : (64, 256) => self_user : (8, 8, 256) ------- (op: Reshape)
    INFO net.py: 263: self_user : (8, 8, 256) => self_user_ : (8, 8, 256) ------- (op: Transpose)
    INFO net.py: 263: self_user_ : (8, 8, 256) => self_user_after : (64, 256) ------- (op: Reshape)
    INFO net.py: 263: self_user_after : (64, 256) => self_user_after_ : (512, 256) ------- (op: Tile)
    INFO net.py: 263: self2 : (64, 256) => self_shop_before : (64, 2048) ------- (op: Tile)
    INFO net.py: 263: self_shop_before : (64, 2048) => self_shop : (512, 256) ------- (op: Reshape)

    Could you explain the use of the tile operation here? Besides, I see the final output has shape (512, 2). However, according to the discussion, we should have 4096 pairs (512 positive pairs and 3584 negative pairs), which would lead to a shape of (4096, 2). I wonder what causes this gap (see also the sketch after this comment).

  2. In the retrieval evaluation, does Match R-CNN compare the user instance with all shop instances, or only with the shop instances that have the same predicted class as the user instance?
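
For reference, here is a small NumPy illustration of how Tile + Reshape can line up two sets of embeddings row-by-row so that a single FC can score each pair. The shapes are toy values and the extra Transpose on the user side is omitted, so this is an interpretation, not the repo's exact graph. With a tile factor of 8 it produces 512 pairs from 64 embeddings per side rather than all 64 x 64 = 4096 combinations, which is consistent with the reply below that not all pairs are used.

```python
import numpy as np

M, K, D = 64, 8, 256                                 # 64 embeddings per side, tile factor 8
user = np.random.randn(M, D).astype(np.float32)
shop = np.random.randn(M, D).astype(np.float32)

user_rep = np.tile(user, (K, 1))                     # (512, 256): u0..u63 repeated 8 times
shop_rep = np.tile(shop, (1, K)).reshape(M * K, D)   # (64, 2048) -> (512, 256): each s_i repeated 8 times

pair_features = (user_rep - shop_rep) ** 2           # row j pairs user[j % 64] with shop[j // 8]
print(pair_features.shape)                           # (512, 256): 512 selected pairs, not 4096
```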

geyuying commented 4 years ago
  1. 4096 is correct. In our experiment, in order to reduce the number of pairs, we do not use all pairs.
  2. Compare the user instance with all shop instances.
xwjabc commented 4 years ago

Thank you for your great help! In my current implementation, I use the mask features after RoIAlign in the mask branch. However, the number of instances in the mask features is limited (only 1~2 instances per ground-truth garment (unique pair_id + style) in total at the beginning of training). Thus, I wonder how you generate 8 instances per image for the retrieval task? Thanks!

joppichristian commented 4 years ago
> 1. 4096 is correct. In our experiment, in order to reduce the number of pairs, we do not use all pairs.
> 2. Compare the user instance with all shop instances.

How did you compare all the user instances with all shop instances? That means an enormous number of comparisons. I have 4x Titan RTX, and tqdm estimates 6000 hours to complete the evaluation. Have I missed something?
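
One hedged way to make the all-pairs comparison tractable, assuming a match net like the sketch earlier in this thread (the function and attribute names are made up): run the conv feature extractor once per detected instance, cache the 256-d embeddings, and then each pair only costs a subtraction, an element-wise square, and the small 256-to-2 FC, which can be batched per query.

```python
import torch

@torch.no_grad()
def score_all_pairs(match_net, user_rois, shop_rois):
    # Embed every instance once (in practice these calls would also be chunked).
    u = match_net.extractor(user_rois)            # (Nu, 256)
    s = match_net.extractor(shop_rois)            # (Ns, 256)
    scores = torch.empty(u.size(0), s.size(0))
    for i in range(u.size(0)):                    # one cheap batched FC per query
        diff_sq = (u[i].unsqueeze(0) - s) ** 2    # (Ns, 256), broadcast over the gallery
        scores[i] = match_net.classifier(diff_sq).softmax(dim=1)[:, 0]
    return scores                                 # similarity score of every user-shop pair
```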