voxel51 / fiftyone-brain

Open source AI/ML capabilities for the FiftyOne ecosystem
https://fiftyone.ai/brain.html
Apache License 2.0

Feature/leaky splits #203

Open · jacobsela opened this PR 2 weeks ago

jacobsela commented 2 weeks ago

Very WIP. Putting this PR up to get feedback.

General idea, decided with @jacobmarks today:

Input: user-provided tags, a field, or views corresponding to the splits

Output:

  1. List of all leaks
  2. Option to tag/remove leaks (rough sketch below)

Plan forward:

  1. Finalize "interface"
  2. Implement field and view as inputs
  3. Implement tagging + removal
  4. Implement duplicate finding with the new hash function from @mwoodson1
  5. Move the hash functionality out of LeakySplitsSKL and integrate the class with the interface
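
For concreteness, here is the rough sketch referenced above of what the tag/remove option could boil down to once a leaks view is available. This is not part of this PR; dataset and leaks are placeholders, and the calls shown are just standard FiftyOne sample operations:

# `dataset` is the full dataset; `leaks` is a view containing the leaked
# samples produced by the leaky-splits index (see the snippet further down)

# option 1: tag the leaked samples so they can be reviewed/filtered later
leaks.tag_samples("leak")

# option 2: remove the leaked samples from the dataset entirely
dataset.delete_samples(leaks)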
mwoodson1 commented 1 week ago

@jacobsela when finalized could you provide a code snippet for how you imagine this feature working?

jacobsela commented 1 week ago

@mwoodson1 Basic snippet I used in the demo:

import fiftyone as fo
import fiftyone.brain.internal.core.leaky_splits as ls

config = ls.LeakySplitsSKLConfig(
    split_tags=['train', 'test'],
    model="resnet18-imagenet-torch"
)

# skl backend
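# note: `dataset` is assumed to be an existing fo.Dataset whose samples
# carry the 'train'/'test' tags referenced by `split_tags` above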
index = ls.LeakySplitsSKL(config).initialize(dataset, "foo")

index.set_threshold(0.1)
leaks = index.leaks

session = fo.launch_app(leaks, auto=False)

# hash backend
config = ls.LeakySplitsHashConfig(
    split_tags=['train', 'test'],
    method='image',
    hash_field='hash'
)

index = ls.LeakySplitsHash(config).initialize(dataset, "foo")

session = fo.launch_app(index.leaks, auto=False)
mwoodson1 commented 1 week ago

The interface seems a bit messy to me. I was hoping for something like:

dataset = foz.load_zoo_dataset(...)

leaks = fob.compute_data_leaks(
    dataset,
    method, # use hash or embedding soft similarity
    brain_key, # which similarity index / embeddings to use,
    model, # which model to use to compute embeddings
    ...
)

This would follow similar patterns to fob.compute_visualization and fob.compute_uniqueness. For example, see the work happening in #201.
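
Filled in with concrete (purely illustrative) values, such a call might look like the sketch below; compute_data_leaks and its keyword arguments don't exist yet, so everything here is an assumption about the proposed interface rather than working code:

import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob

# any dataset whose samples are tagged with their split
dataset = foz.load_zoo_dataset("quickstart")

# hypothetical signature mirroring fob.compute_uniqueness / fob.compute_visualization
leaks = fob.compute_data_leaks(
    dataset,
    split_tags=["train", "test"],     # assumed way to declare the splits
    method="similarity",              # or "hash"
    model="resnet18-imagenet-torch",  # used if embeddings must be computed
    brain_key="leaks",
)

session = fo.launch_app(leaks)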

jacobsela commented 1 week ago

@mwoodson1 Thanks for the feedback; I agree that this isn't ideal. I'm holding off on creating the final compute_leaks (or compute_leaky_splits, as it is currently named in the code) until we finalize what we want the behavior to look like (e.g., how thresholds are handled). Putting together a final, easy-to-use function at the end should be quick, so I'd rather do it once.