voxel51 / fiftyone-brain

Open source AI/ML capabilities for the FiftyOne ecosystem
https://fiftyone.ai/brain.html
Apache License 2.0

Adding Representativeness Method #182

Closed mwoodson1 closed 3 months ago

mwoodson1 commented 3 months ago

This PR adds a method to compute "representativeness" for each sample in a dataset.

There are two basic implementations available (though more could be added). Both first cluster the data using MeanShift and then compute representativeness based on how close each point is to its cluster center.
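The cluster-center idea described above can be sketched with scikit-learn. This is a hypothetical illustration, not the PR's actual implementation: the helper name and the `1 - d / d_max` scaling are assumptions.

```python
# Sketch of cluster-center representativeness (illustrative only):
# cluster with MeanShift, then score each point by proximity to its center.
import numpy as np
from sklearn.cluster import MeanShift

def cluster_center_representativeness(embeddings):
    """Score each sample by how close it is to its cluster center."""
    ms = MeanShift().fit(embeddings)
    centers = ms.cluster_centers_[ms.labels_]  # each sample's own center
    dists = np.linalg.norm(embeddings - centers, axis=1)
    # Closer to the center => more representative; map scores into [0, 1]
    return 1.0 - dists / (dists.max() + 1e-12)

# Toy data: two well-separated Gaussian blobs standing in for embeddings
rng = np.random.default_rng(51)
X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(5, 1, (50, 8))])
scores = cluster_center_representativeness(X)
print(scores.shape)  # (100,)
```

Points near a cluster center get scores near 1; fringe points get scores near 0.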

As a simple example:

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

dataset = foz.load_zoo_dataset('cifar10', split='train')

# "cluster-center"
fob.compute_representativeness(dataset, representativeness_field="rep_cluster", method="cluster-center")
brimoor commented 3 months ago

Adding @jacobmarks @griffbr to review the ML aspects of this work

mwoodson1 commented 3 months ago

Added documentation here. I was running into issues getting the docs to build, so if someone could double-check those, that would be great.

griffbr commented 3 months ago

Representativeness using resnet18-imagenet-torch:

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
import IPython

# dataset loaded as in the example above
dataset = foz.load_zoo_dataset("cifar10", split="train")

model_name = "resnet18-imagenet-torch"
model = foz.load_zoo_model(model_name)
embeddings = dataset.compute_embeddings(model)
results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    num_dims=2,
    brain_key=model_name.split("-")[0],
    verbose=True,
    seed=51,
)
fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster",
    method="cluster-center-downweight",
    embeddings=embeddings,
    model=model,
)
session = fo.launch_app(dataset)
IPython.embed()

[Screenshots: 2024-07-26 at 9:52 PM and 9:53 PM]

Seems to bias towards automobile class for whatever reason. Also interesting that the algorithm continues to select relatively interior examples in UMAP embedding vs exterior (but UMAP is not the same thing as raw embedding space). Is this behavior expected?

griffbr commented 3 months ago

Another example using clip:

model_name = "clip-vit-base32-torch"
model = foz.load_zoo_model(model_name)
embeddings = dataset.compute_embeddings(model)
results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    num_dims=2,
    brain_key=model_name.split("-")[0],
    verbose=True,
    seed=51,
)
fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster",
    method="cluster-center-downweight",
    embeddings=embeddings,
    model=model,
)

[Screenshots: 2024-07-26 at 10:08 PM and 10:09 PM]

In this case it looks like the rep score is weighted heavily toward the right UMAP cluster (top image, filtered by rep score) and is leaving out a bunch of automobiles in the left cluster (bottom image, lassoed samples). It seems like the model used to make the embeddings makes a big difference (no surprise). I would again be curious whether the above results are more or less what you expect.

mwoodson1 commented 3 months ago

Seems to bias towards automobile class for whatever reason. Also interesting that the algorithm continues to select relatively interior examples in UMAP embedding vs exterior (but UMAP is not the same thing as raw embedding space). Is this behavior expected?

Yes, in fact that is the algorithm. It up-weights embeddings close to cluster centers which should be more interior than exterior.
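The up-weighting behavior can be shown with a toy score. This is an illustration only (the closeness function below is an assumption, not the library's actual formula): a point near the cluster center scores higher than one on the fringe.

```python
# Illustration of why interior points score higher: any score that
# decreases with distance from the cluster center up-weights interior points.
import numpy as np

center = np.zeros(2)
interior = np.array([0.1, 0.0])   # near the cluster center
exterior = np.array([3.0, 0.0])   # on the cluster fringe

def score(x, c):
    # hypothetical closeness score; the real method may use a different form
    return 1.0 / (1.0 + np.linalg.norm(x - c))

print(score(interior, center) > score(exterior, center))  # True
```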

In this case it looks like the rep score is weighing heavily on the right UMAP cluster (top image, filtered by rep score) and leaving out a bunch of automobiles in the left cluster (bottom image, lassoed samples).

Maybe neither expected nor unexpected, but right now the clustering algorithm uses MeanShift, so it's opaque how many clusters there are and where they end up. I am adding some controllability for this to make it easier to pick up these clusters, e.g., if you see clusters being ignored in UMAP space, you can increase K in K-means to increase the chance of finding them.
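The controllability idea can be sketched as a clustering step where the user picks the algorithm and K. This is a hypothetical helper, not the PR's API; the function name and parameters are illustrative.

```python
# Sketch: swap data-driven MeanShift for KMeans with a user-chosen K,
# so small clusters visible in UMAP space are less likely to be missed.
import numpy as np
from sklearn.cluster import KMeans, MeanShift

def fit_clusters(embeddings, method="meanshift", n_clusters=10, seed=51):
    """Hypothetical clustering helper with a controllable cluster count."""
    if method == "kmeans":
        model = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    else:
        model = MeanShift()  # number of clusters is data-driven (opaque)
    labels = model.fit_predict(embeddings)
    return labels, model.cluster_centers_

rng = np.random.default_rng(51)
X = rng.normal(size=(200, 16))
labels, centers = fit_clusters(X, method="kmeans", n_clusters=8)
print(len(centers))  # 8
```

With `method="kmeans"`, increasing `n_clusters` raises the chance that a small, isolated group gets its own center.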

mwoodson1 commented 3 months ago

After discussing with @griffbr, I changed how the ranking scores are normalized, so now the highest-ranked samples should be more evenly distributed across the embedding space. As an example, here is what it looks like after running the example Brent gave above:

[Screenshot: 2024-07-30 at 3:14 PM]

The highest ranked samples are now seen in all clusters instead of just the densest ones.
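One way to get this behavior is to normalize scores within each cluster rather than globally. The sketch below is illustrative only (the min-max-per-cluster scheme is an assumption, not necessarily the PR's exact change).

```python
# Sketch of per-cluster normalization: rescale ranking scores within each
# cluster so every cluster contributes top-ranked samples, not just the
# densest one.
import numpy as np

def normalize_per_cluster(scores, labels):
    """Min-max normalize scores separately within each cluster."""
    out = np.zeros_like(scores, dtype=float)
    for k in np.unique(labels):
        mask = labels == k
        s = scores[mask]
        spread = s.max() - s.min()
        out[mask] = (s - s.min()) / spread if spread > 0 else 1.0
    return out

# Two clusters: one with globally high raw scores, one with low raw scores
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
labels = np.array([0, 0, 0, 1, 1, 1])
norm = normalize_per_cluster(scores, labels)
print(norm)  # both clusters now have a top score of 1.0
```

Under global normalization, all of cluster 1 would rank below cluster 0; per-cluster normalization surfaces the best sample from each cluster.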

griffbr commented 3 months ago

@mwoodson1, I'm finding slight variation between "identical" runs (see results and commands below). Just documenting this to confirm it is expected behavior:

[Screenshot: 2024-07-31 at 11:09 AM]

fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster",
    method="cluster-center",
    embeddings=embeddings,
    model=model,
)
fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster3",
    embeddings=embeddings,
)
fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster4",
    method="cluster-center",
    embeddings=embeddings,
)
griffbr commented 3 months ago

Score distribution looking much better with your recent changes, thank you!!!

[Screenshots: 2024-07-31 at 11:12 AM and 11:13 AM]

model_name = "clip-vit-base32-torch"
model = foz.load_zoo_model(model_name)
embeddings = dataset.compute_embeddings(model)
results = fob.compute_visualization(
    dataset,
    embeddings=embeddings,
    num_dims=2,
    brain_key=model_name.split("-")[0],
    verbose=True,
    seed=51,
)
fob.compute_representativeness(
    dataset,
    representativeness_field="rep_cluster_dw",
    method="cluster-center-downweight",
    embeddings=embeddings,
)