What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs - NeurIPS 2022

Paper : link
Demo : link
Video : link

Get Started

The datasets follow the format described here: link

To train a model:

python train_grounding.py -bs 32 -nW 8 -nW_eval 1 -task vg_train -data_path /path_to/vg -val_path /path_to/flicker
python train_grounding.py -bs 32 -nW 8 -nW_eval 1 -task coco -data_path /path_to/coco -val_path /path_to/flicker

To evaluate grounding with our model (XX is the number of the results folder, e.g. for the folder 'gpu22', XX is 22):

python inference_grounding.py -task grounding -dataset refit -val_path /path_to/RefIt -Isize 224 -clip_eval 0 -path_ae XX -nW 1
python inference_grounding.py -task grounding -dataset flicker -val_path /path_to/flicker -Isize 224 -clip_eval 0 -path_ae XX -nW 1
python inference_grounding.py -task grounding -dataset vg -val_path /path_to/VG -Isize 224 -clip_eval 0 -path_ae XX -nW 1
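
The mapping from a results folder name to XX can be sketched in a few lines of Python. This is only an illustrative helper, not part of the repository: the function names `results_folder_number` and `grounding_eval_command` are hypothetical, and the flags and paths are copied verbatim from the commands above. The script only prints the commands; it does not run them.

```python
import re
import shlex

# Dataset name -> validation path, copied from the commands above.
VAL_PATHS = {
    "refit": "/path_to/RefIt",
    "flicker": "/path_to/flicker",
    "vg": "/path_to/VG",
}


def results_folder_number(folder_name):
    """Return the trailing integer of a results folder name, e.g. 'gpu22' -> 22."""
    match = re.search(r"(\d+)$", folder_name)
    if match is None:
        raise ValueError(f"no trailing number in folder name: {folder_name!r}")
    return int(match.group(1))


def grounding_eval_command(dataset, folder_name):
    """Build the argument list for one grounding evaluation run (hypothetical helper)."""
    return [
        "python", "inference_grounding.py",
        "-task", "grounding",
        "-dataset", dataset,
        "-val_path", VAL_PATHS[dataset],
        "-Isize", "224",
        "-clip_eval", "0",
        "-path_ae", str(results_folder_number(folder_name)),
        "-nW", "1",
    ]


if __name__ == "__main__":
    # Print one evaluation command per dataset for the folder 'gpu22'.
    for name in VAL_PATHS:
        print(shlex.join(grounding_eval_command(name, "gpu22")))
```

The printed lines can be pasted into a shell, or the argument lists can be passed to `subprocess.run` if the runs should be launched from Python.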

To evaluate grounding with the CLIP model:

python inference_grounding.py -task grounding -dataset refit -val_path /path_to/RefIt -Isize 224 -clip_eval 1 -nW 1
python inference_grounding.py -task grounding -dataset flicker -val_path /path_to/flicker -Isize 224 -clip_eval 1 -nW 1
python inference_grounding.py -task grounding -dataset vg -val_path /path_to/VG -Isize 224 -clip_eval 1 -nW 1

To run the WWbL evaluation with our model:

python inference_grounding.py -task app -dataset refit -val_path /path_to/RefIt -Isize 224 -clip_eval 0 -path_ae XX -nW 1 --start 0 --end 9983
python wwbl_algo1_point_metric.py -nW 1 -predictions_path YY -val_path /path_to/RefIt --dataset refit

python inference_grounding.py -task app -dataset flicker -val_path /path_to/flicker -Isize 224 -clip_eval 0 -path_ae XX -nW 1 -start 0 -end 1000
python wwbl_algo1_point_metric.py -nW 1 -predictions_path YY -val_path /path_to/flicker --dataset flicker

python inference_grounding.py -task app -dataset vg -val_path /path_to/VG -Isize 224 -clip_eval 0 -path_ae XX -nW 1 -start 0 -end 17478
python wwbl_algo1_point_metric.py -nW 1 -predictions_path YY -val_path /path_to/VG --dataset VG

Phrase Grounding Results - Point Accuracy Metric

COCO weights

VG weights

Method	Backbone	VG (VG-trained/COCO-trained)	Flickr30K (VG-trained/COCO-trained)	ReferIt (VG-trained/COCO-trained)
Baseline	Random	11.15	27.24	24.30
Baseline	Center	20.55	47.40	30.30
GAE	CLIP	54.72	72.47	56.76
FCVC	VGG	-/14.03	-/29.03	-/33.52
VGLS	VGG	24.40/-	-/-	-/-
TD	Inception-2	19.31/-	42.40/-	31.97/-
SSS	VGG	30.03/-	49.10/-	39.98/-
MG	BiLSTM+VGG	50.18/46.99	57.91/53.29	62.76/47.89
MG	ELMo+VGG	48.76/47.94	60.08/61.66	60.01/47.52
GbS	VGG	53.40/52.00	70.48/72.60	59.44/56.10
ours	CLIP+VGG	62.31/59.09	75.63/75.43	65.95/61.03
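
For readers unfamiliar with the metric in the table above, here is a minimal sketch of point accuracy as it is commonly defined in the phrase-grounding literature (the "pointing game"): a prediction counts as a hit when the peak of the predicted heatmap falls inside the ground-truth bounding box. This is an assumption about the metric, not a copy of the repository's evaluation code; the function names, the inclusive `(x_min, y_min, x_max, y_max)` box convention, and the toy data are all illustrative.

```python
def point_hit(heatmap, box):
    """Return True if the heatmap peak lies inside the box.

    heatmap: list of rows (H x W) of floats.
    box: (x_min, y_min, x_max, y_max) in pixel coordinates, inclusive (assumed convention).
    """
    # Locate the (row, column) of the maximum heatmap value.
    best_y, best_x = max(
        ((y, x) for y, row in enumerate(heatmap) for x in range(len(row))),
        key=lambda yx: heatmap[yx[0]][yx[1]],
    )
    x_min, y_min, x_max, y_max = box
    return x_min <= best_x <= x_max and y_min <= best_y <= y_max


def point_accuracy(samples):
    """Percentage of (heatmap, box) pairs whose peak falls inside the box."""
    hits = sum(point_hit(heatmap, box) for heatmap, box in samples)
    return 100.0 * hits / len(samples)


if __name__ == "__main__":
    # Toy 3x3 heatmap whose peak is at row 1, column 1.
    toy_heatmap = [
        [0.1, 0.2, 0.1],
        [0.0, 0.9, 0.3],
        [0.2, 0.1, 0.0],
    ]
    samples = [
        (toy_heatmap, (1, 1, 2, 2)),  # peak inside the box: hit
        (toy_heatmap, (0, 0, 0, 0)),  # peak outside the box: miss
    ]
    print(point_accuracy(samples))
```

Evaluation protocols differ in details such as how multiple ground-truth boxes per phrase are handled, so numbers from this sketch should not be expected to match the table exactly.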
