z-kun opened this issue 6 years ago
Hello jpeyre, I have read your paper and code, and importing spatial features into visual relation detection is a nice idea.

After reading, I am confused by this sentence in Section 3: "To detect and localize such triplets in test images, we assume that the candidate object detections for s and o are given by a detector trained with full supervision. Here we use the object detector Faster-RCNN [14] trained on the Visual Relationship Detection training set [31]." But in the section "Representing Appearance of Objects", you use Fast-RCNN with VGG16 pre-trained on ImageNet to extract the appearance features. Do you mean that you use the same CNN architecture (Fast-RCNN) trained on different datasets in these two steps?

I found "vgg16_fast_rcnn.caffemodel" in the code, but no model trained on the Visual Relationship Dataset, so I wonder if I have misunderstood the paper. Could you share some details about the model trained on VRD that is used for extracting the candidate pairs of objects? Thank you!
Hi z-kun, we use the same model both for extracting the candidate objects and for computing their appearance features. This model is indeed "vgg16_fast_rcnn.caffemodel": a VGG16 network pre-trained on ImageNet and fine-tuned on the VRD training set.
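For concreteness, here is a minimal pycaffe sketch of how one such network can serve both roles, scoring candidate boxes and yielding their fc7 appearance features in a single forward pass. The prototxt path and the example inputs are placeholder assumptions; the blob names ("data", "rois", "cls_prob", "fc7") follow the standard Fast-RCNN VGG16 test definition, not necessarily this repository's exact files.

```python
# Sketch only: one VGG16 Fast-RCNN network used both to score candidate
# objects and to extract their appearance features.
import numpy as np
import caffe

caffe.set_mode_cpu()
net = caffe.Net('vgg16_fast_rcnn_test.prototxt',  # network definition (placeholder path)
                'vgg16_fast_rcnn.caffemodel',     # weights fine-tuned on the VRD training set
                caffe.TEST)

# Preprocessed image in NCHW layout (mean subtraction etc. omitted here).
im = np.random.rand(1, 3, 600, 800).astype(np.float32)

# Candidate boxes as (batch_idx, x1, y1, x2, y2) rows, e.g. from a proposal method.
rois = np.array([[0,  50, 60, 200, 220],
                 [0, 120, 80, 300, 240]], dtype=np.float32)

net.blobs['data'].reshape(*im.shape)
net.blobs['rois'].reshape(rois.shape[0], 5)
net.forward(data=im, rois=rois)

scores = net.blobs['cls_prob'].data        # (num_rois, num_classes): object scores per box
appearance = net.blobs['fc7'].data.copy()  # (num_rois, 4096): appearance features per box
```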
Got it, thanks!