iterative / ldb-resources

Apache License 2.0

Add embedding projector. #41

Closed daavoo closed 2 years ago

daavoo commented 2 years ago

Adds an isolated version of https://github.com/tensorflow/embedding-projector-standalone. It also includes a utility script that converts an ldb dataset in pairs format into the files expected by the projector.

The usual workflow is to add an embedding with --apply when instantiating the dataset, then run the dataset_to_projector script:

$ ldb instantiate ds:chihuahua-muffin \
--apply python ~/Desktop/iterative/ldb-resources/apply-plugins/clip_embed.py \
-t chihuahua-muffin
$ python embedding_projector/dataset_to_projector.py muffin \
embedding_projector \
"label" \
"clip-embedding"
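To give a feel for the conversion step, here is a minimal sketch in Python. It is not the dataset_to_projector.py script from this PR. The function name, the output file names (tensors.tsv and metadata.tsv), and the assumption that every annotation is a flat JSON file holding the label and the embedding as top-level keys are all hypothetical. It only shows the general idea of turning per-item annotations into tab-separated vector and metadata files, which is one of the input formats the Embedding Projector can load.

```python
import json
from pathlib import Path


def export_for_projector(dataset_dir, output_dir, label_key, embedding_key):
    """Write tensors.tsv and metadata.tsv from JSON annotation files.

    Hypothetical layout: every *.json file in `dataset_dir` is a flat object
    with the label under `label_key` and a list of floats under
    `embedding_key`. Returns the number of items exported.
    """
    dataset_dir, output_dir = Path(dataset_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    vectors, labels = [], []
    for annotation_path in sorted(dataset_dir.glob("*.json")):
        annotation = json.loads(annotation_path.read_text())
        if label_key not in annotation or embedding_key not in annotation:
            continue  # skip items that lack a label or an embedding
        vectors.append([float(x) for x in annotation[embedding_key]])
        labels.append(str(annotation[label_key]))

    # One tab-separated vector per line.
    (output_dir / "tensors.tsv").write_text(
        "".join("\t".join(repr(x) for x in vec) + "\n" for vec in vectors)
    )
    # One label per line, in the same order as the vectors.
    # A single-column metadata file carries no header row.
    (output_dir / "metadata.tsv").write_text(
        "".join(label + "\n" for label in labels)
    )
    return len(vectors)
```

Calling export_for_projector("muffin", "embedding_projector", "label", "clip-embedding") would mirror the arguments in the command above, assuming the instantiated folder matches the layout described.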
[Screenshot 2022-08-26 at 16.32.25]

shcheklein commented 2 years ago

@daavoo does it help to see outliers, or are there other use cases?

also, does it cluster all the images in the dataset?

daavoo commented 2 years ago

@daavoo does it help to see outliers, or are there other use cases?

It all depends on:

The use cases, and the clusters that appear, will vary depending on that.

I have used embeddings from a pre-trained model (CLIP) in this example.

This is useful for getting an intuition of whether the task is feasible to address by fine-tuning: check whether clusters appear easily and clearly separate the labels, as they do in this case (with the option to color by label enabled):

[Screenshot 2022-08-30 at 12.48.35]
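The visual check could be complemented with a rough number. The sketch below is not part of this PR and is only one possible proxy: it computes, for each label, the mean embedding (centroid) and reports the fraction of points whose most similar centroid, by cosine similarity, belongs to their own label. The function names are hypothetical, and a high score only suggests that the labels are separable in the embedding space.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_centroid_accuracy(vectors, labels):
    """Fraction of points whose most similar label centroid is their own."""
    if not vectors:
        raise ValueError("need at least one vector")

    # Group the vectors by label.
    groups = defaultdict(list)
    for vec, label in zip(vectors, labels):
        groups[label].append(vec)

    # Centroid of a label: component-wise mean of its vectors.
    centroids = {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }

    hits = sum(
        max(centroids, key=lambda name: cosine(vec, centroids[name])) == label
        for vec, label in zip(vectors, labels)
    )
    return hits / len(vectors)
```

A value close to 1.0 would match the picture of cleanly separated clusters, while a value near chance level would suggest that the pre-trained embeddings do not separate the labels well.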

It can also help detect clusters of interest (or even outliers). In the projection shown above, one of the small clusters contains only blueberry muffins (plus one outlier that shows only blueberries and no muffin):

[Screenshot 2022-08-30 at 12.49.06]

also, does it cluster all the images in the dataset?

The utility script I added takes the whole instantiated ldb dataset and adds every item to the viewer, so yes, all images in the folder are used.

I would say the viewer becomes impractical to navigate once there are more than about 1,000 images.
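One way to stay under that size, sketched here as an assumption rather than something the script in this PR does, is to randomly subsample the items before exporting them. The helper name and the default limit of 1000 are hypothetical; the fixed seed just makes the selection reproducible.

```python
import random


def subsample(items, max_items=1000, seed=0):
    """Return at most `max_items` items, chosen uniformly at random.

    The input is returned unchanged (as a list) when it is already
    small enough; otherwise a seeded random sample is drawn.
    """
    items = list(items)
    if len(items) <= max_items:
        return items
    return random.Random(seed).sample(items, max_items)
```

For example, applying subsample to the list of annotation files before conversion would cap the number of points shown in the viewer, at the cost of possibly dropping rare outliers.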