mwoodson1 closed this pull request 3 months ago
Adding @jacobmarks @griffbr to review the ML aspects of this work
Added documentation here. I was running into issues actually getting the docs to build, so if someone could double-check those, that would be great.
Representativeness using resnet18-imagenet-torch:
import IPython

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# `dataset` is an existing fiftyone.Dataset loaded earlier
model_name = "resnet18-imagenet-torch"
model = foz.load_zoo_model(model_name)
embeddings = dataset.compute_embeddings(model)

results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    num_dims=2,
    brain_key=model_name.split("-")[0],
    verbose=True,
    seed=51,
)

fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster",
    method="cluster-center-downweight",
    embeddings=embeddings,
    model=model,
)

session = fo.launch_app(dataset)
IPython.embed()
It seems to bias towards the automobile class for whatever reason. Also interesting that the algorithm keeps selecting relatively interior examples in the UMAP embedding rather than exterior ones (though UMAP space is not the same as the raw embedding space). Is this behavior expected?
Another example using CLIP:
model_name = "clip-vit-base32-torch"
model = foz.load_zoo_model(model_name)
embeddings = dataset.compute_embeddings(model)

results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    num_dims=2,
    brain_key=model_name.split("-")[0],
    verbose=True,
    seed=51,
)

fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster",
    method="cluster-center-downweight",
    embeddings=embeddings,
    model=model,
)
In this case it looks like the rep score weighs heavily on the right UMAP cluster (top image, filtered by rep score) and leaves out a bunch of automobiles in the left cluster (bottom image, lassoed samples). It seems like the model used to produce the embeddings makes a big difference (no surprise). I would again be curious whether the above results are more or less what you expect.
Seems to bias towards automobile class for whatever reason. Also interesting that the algorithm continues to select relatively interior examples in UMAP embedding vs exterior (but UMAP is not the same thing as raw embedding space). Is this behavior expected?
Yes, in fact that is the algorithm: it up-weights embeddings close to cluster centers, which should be more interior than exterior.
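A tiny numpy sketch of that up-weighting (illustrative only, not the PR's actual code; the scoring function here is an assumed stand-in): points are scored by inverse distance to their cluster's mean, so interior points outrank exterior ones.

```python
import numpy as np

def representativeness(points, labels):
    """Score each point by inverse distance to its cluster center."""
    points = np.asarray(points, dtype=float)
    scores = np.empty(len(points))
    for label in np.unique(labels):
        mask = labels == label
        center = points[mask].mean(axis=0)           # cluster center
        dists = np.linalg.norm(points[mask] - center, axis=1)
        scores[mask] = 1.0 / (1.0 + dists)           # closer => higher score
    return scores

# One cluster: three interior points near the mean and one exterior outlier
pts = np.array([[0.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [3.0, 0.0]])
labels = np.array([0, 0, 0, 0])
scores = representativeness(pts, labels)
# The interior point [0, 0] outranks the exterior point [3, 0]
```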
In this case it looks like the rep score is weighing heavily on the right UMAP cluster (top image, filtered by rep score) and leaving out a bunch of automobiles in the left cluster (bottom image, lassoed samples).
Maybe not expected or unexpected, but right now the clustering algorithm uses MeanShift, so it's opaque how many clusters there are and where they end up. I am adding some controllability for this to make it easier to pick up these clusters, e.g. if you see clusters being ignored in UMAP space, you can increase K in K-means to increase the chance of finding them.
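To illustrate that point with a minimal Lloyd's-iteration K-means in plain numpy (a sketch, not the PR's API; the helper and its parameters are invented here): with K=1 the lone center is dragged toward the dense blob and sits far from the small side cluster, while K=2 gives the small cluster its own center.

```python
import numpy as np

def kmeans(points, init_centers, n_iter=10):
    """Minimal Lloyd's algorithm: returns (labels, centers)."""
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest center
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        for k in range(len(centers)):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels, centers

# Dense blob near the origin plus a small, easily-ignored cluster at (10, 10)
rng = np.random.default_rng(0)
dense = rng.normal(0, 0.5, size=(50, 2))
small = rng.normal(10, 0.5, size=(5, 2))
points = np.vstack([dense, small])

# K=1: the single center sits near the dense blob, far from the small cluster
_, c1 = kmeans(points, init_centers=[[0.0, 0.0]])

# K=2: the second center latches onto the small cluster
labels2, c2 = kmeans(points, init_centers=[[0.0, 0.0], [1.0, 1.0]])
```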
After discussing with @griffbr, I changed how the ranking scores are normalized, so now the highest-ranked samples should be more evenly distributed across the embedding space. As an example, here is what it looks like after running the example Brent gave above.
The highest-ranked samples are now seen in all clusters instead of just the densest ones.
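The PR's exact normalization isn't shown in this thread, but a per-cluster min-max rescaling is one way to get that behavior (a hedged sketch, with an invented helper name): each cluster's best sample gets a score of 1.0, so a global top-k can no longer be monopolized by the densest cluster.

```python
import numpy as np

def normalize_per_cluster(scores, labels):
    """Min-max rescale scores within each cluster to [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    out = np.empty_like(scores)
    for label in np.unique(labels):
        mask = labels == label
        lo, hi = scores[mask].min(), scores[mask].max()
        # Degenerate single-value clusters all get the top score
        out[mask] = (scores[mask] - lo) / (hi - lo) if hi > lo else 1.0
    return out

# Raw scores: dense cluster 0 dominates; sparse cluster 1 never ranks high
raw = np.array([0.95, 0.90, 0.85, 0.30, 0.20, 0.10])
labels = np.array([0, 0, 0, 1, 1, 1])
normed = normalize_per_cluster(raw, labels)
# After rescaling, both clusters contain a 1.0-scoring sample
```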
@mwoodson1, I'm finding slight variation between "identical" runs (see results and commands below). Just documenting that this is expected behavior:
fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster",
    method="cluster-center",
    embeddings=embeddings,
    model=model,
)

fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster3",
    embeddings=embeddings,
)

fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster4",
    method="cluster-center",
    embeddings=embeddings,
)
Score distribution looking much better with your recent changes, thank you!!!
model_name = "clip-vit-base32-torch"
model = foz.load_zoo_model(model_name)
embeddings = dataset.compute_embeddings(model)

results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    num_dims=2,
    brain_key=model_name.split("-")[0],
    verbose=True,
    seed=51,
)

fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster_dw",
    method="cluster-center-downweight",
    embeddings=embeddings,
)
This PR adds a method to compute "representativeness" for each sample in a dataset.
There are two basic implementations available (though there could be more). Both first compute clusters of the data using MeanShift and then compute representativeness based on how close each point is to its cluster center.
As a simple example:
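As a self-contained stand-in for the recipe described above, here is a sketch using scikit-learn's MeanShift on toy 2-D points playing the role of embeddings (illustrative only; this mirrors the two-step description, not the PR's actual implementation, and the inverse-distance scoring is an assumption):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Toy "embeddings": two well-separated blobs
rng = np.random.default_rng(7)
embeddings = np.vstack([
    rng.normal(0, 0.3, size=(30, 2)),
    rng.normal(5, 0.3, size=(30, 2)),
])

# Step 1: cluster the embeddings with MeanShift
ms = MeanShift(bandwidth=1.0).fit(embeddings)

# Step 2: score each point by closeness to its assigned cluster center
centers = ms.cluster_centers_[ms.labels_]
dists = np.linalg.norm(embeddings - centers, axis=1)
scores = 1.0 / (1.0 + dists)  # points near a center are most "representative"
```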