Question regarding deep learning network architecture

Hi, I have noticed that your project employs a two-stage deep learning network. In the first stage, RCNN is used to detect logos and input boxes on web pages, and in the second stage, a Siamese network is used to recognize the brand names of the logos detected in the previous stage? I have read your paper and roughly understand the rationale of the design of each network stage. However, I still don't quite understand why it needs to be designed as two stages but not only one stage. Couldn't a single stage network handle this task? For example, if we train a target detection network (like YOLO) using only the logos in the target list to directly detect the brand names (use brand name as class) and then train another network to detect input boxes simultaneously, what would be the drawbacks of this approach?

(As a beginner in this field, my question might be somewhat naive😣)

Hi Sayoi,

Thank you for your question!

Indeed, object detection models like YOLO can perform object localization and classification simultaneously. However, there are several reasons why we chose a two-stage approach for this task:

Insufficient Training Data for Each Brand: Treating each brand as a distinct class would result in insufficient training data for each class. There may only be a few webpages showing the logo for each brand, leading to overfitting. If we meet another webpage where the logo's position or webpage design change slightly, the model might fail to recognize it correctly.

Distinctiveness of Object Classes: In typical object detection tasks, the object classes are highly distinctive (e.g., pedestrians vs. traffic lights). However, logos from different brands can share similar locations on webpages and have similar surrounding contexts. A single-stage network like YOLO (You Only Look Once) may struggle with the variability and subtle distinctions between similar logos.

Adaptability to Unseen Brands: We want our model to generalize well to unseen brands. The Siamese network excels at projecting logos with similar semantics to closer embeddings. Because it is designed to compare pairs of images and is highly effective in recognizing similarities and differences. In contrast, an object detection model trained for specific brands cannot generalize to new, unseen brands.

Specializations: By separating detection and recognition, each network can be optimized for its specific task. You can imagine the two-stage models as two workers with different specializations. The first stage (RCNN) focuses on detecting potential logos and input boxes with high precision, and the second stage (Siamese network) specializes in recognizing the brand with robust feature extraction.

lindsey98 / Phishpedia

Question regarding deep learning network architecture #28