junxnone / aiwiki

AI Wiki
https://junxnone.github.io/aiwiki

paper YOLOWorld #465

Open junxnone opened 5 months ago

junxnone commented 5 months ago

YOLO-World

Current Status

Related Work

Model Architecture

*   Image Encoder (Darknet backbone, YOLOv8 detector)
*   Text Encoder (CLIP): frozen
*   An n-gram algorithm extracts noun phrases from the input text
*   Text Contrastive Head
*   RepVL-PAN (enhances both the text and image representations)

    *   T-CSPLayer (Text-guided Cross Stage Partial Layer): text info --> image features
    *   max-sigmoid attention
    *   I-Pooling Attention (Image-Pooling Attention): image info --> text embeddings
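
The Text Contrastive Head and the max-sigmoid attention listed above can be illustrated with a minimal sketch. It assumes the formulation usually given for YOLO-World: the head scores an object embedding against a text embedding as a scaled, shifted cosine similarity, and max-sigmoid attention gates each spatial feature by the sigmoid of its best match over all text embeddings. The function names, the plain-list data layout, and the defaults `alpha=1.0`, `beta=0.0` are assumptions for illustration; the real model adds learnable projections and operates on tensors.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / (norm + eps) for x in v]


def sigmoid(x):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def text_contrastive_scores(obj_embs, txt_embs, alpha=1.0, beta=0.0):
    """Return a K x C matrix of object-text similarities.

    s[k][j] = alpha * <norm(e_k), norm(w_j)> + beta
    alpha and beta stand in for the learnable scale and shift
    (hypothetical fixed values here).
    """
    objs = [l2_normalize(e) for e in obj_embs]
    txts = [l2_normalize(w) for w in txt_embs]
    return [[alpha * dot(e, w) + beta for w in txts] for e in objs]


def max_sigmoid_attention(pixels, txt_embs):
    """Gate each spatial feature vector by its best text match.

    pixels: list of D-dim feature vectors, one per spatial location.
    txt_embs: list of D-dim text embeddings.
    """
    out = []
    for x in pixels:
        # Highest similarity over all text embeddings, squashed to (0, 1).
        gate = sigmoid(max(dot(x, w) for w in txt_embs))
        out.append([gate * v for v in x])
    return out


if __name__ == "__main__":
    texts = [[1.0, 0.0], [0.0, 2.0]]
    print(text_contrastive_scores(texts, texts))
    print(max_sigmoid_attention([[3.0, 0.0], [0.0, 0.0]], texts))
```

Because the gate is a scalar per location, features that match none of the prompts are damped rather than removed, which keeps the output the same shape as the input.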


Pre-training Method

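*   Total Loss = Contrastive Loss + λ \* (IoU Loss + Distribution Focal Loss), where λ is an indicator: 1 for detection and grounding data, 0 for image-text data (whose pseudo boxes are too noisy for box regression)
*   Pseudo Labeling:

    *   Use an n-gram algorithm to extract noun phrases from the captions
    *   Use an open-vocabulary detector (GLIP) to generate pseudo boxes, which provide coarse region-text pairs
    *   Use a pre-trained CLIP to score image-text and region-text pairs and filter out low-relevance annotations
    *   Use NMS to remove redundant bounding boxes

*   Training Datasets: Objects365/GQA/Flickr/CC3M
*   Test Datasets: LVIS/COCO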
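
The last two pseudo-labeling steps above (relevance filtering, then NMS) can be sketched as follows. This is only an illustration under stated assumptions: the `score` field stands in for a region-text similarity that CLIP would have produced beforehand, the candidate boxes stand in for GLIP output, and the `PseudoBox` class, the `filter_pseudo_boxes` function, and the thresholds `min_score=0.3` and `iou_thr=0.5` are hypothetical names and values, not taken from the notes.

```python
from dataclasses import dataclass


@dataclass
class PseudoBox:
    phrase: str    # noun phrase extracted from the caption
    box: tuple     # (x1, y1, x2, y2) in pixels
    score: float   # assumed precomputed region-text similarity


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_pseudo_boxes(candidates, min_score=0.3, iou_thr=0.5):
    """Drop low-relevance boxes, then apply greedy per-phrase NMS."""
    # Step 1: keep only boxes whose region-text score is high enough.
    relevant = [c for c in candidates if c.score >= min_score]
    # Step 2: visit boxes from highest to lowest score and keep a box
    # only if it does not overlap a kept box with the same phrase.
    kept = []
    for cand in sorted(relevant, key=lambda c: c.score, reverse=True):
        overlaps = any(
            k.phrase == cand.phrase and iou(k.box, cand.box) > iou_thr
            for k in kept
        )
        if not overlaps:
            kept.append(cand)
    return kept


if __name__ == "__main__":
    candidates = [
        PseudoBox("dog", (0, 0, 10, 10), 0.9),
        PseudoBox("dog", (1, 1, 10, 10), 0.8),    # near-duplicate of the first
        PseudoBox("cat", (0, 0, 10, 10), 0.7),    # same box, different phrase
        PseudoBox("dog", (50, 50, 60, 60), 0.1),  # low relevance
    ]
    for item in filter_pseudo_boxes(candidates):
        print(item)
```

Running NMS per phrase rather than across all boxes is a design choice of this sketch: two different phrases may legitimately describe the same region, so only same-phrase duplicates are suppressed.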

Test Results


YOLO-World S/M/L/X/XL and v1/v2

Reference